11. V for ...
Volume
Scale
Sources
Variety
Relational
NoSQL
Velocity
Operational
Analytical
13. How Big is our Data?
M mega million 10^6
G giga billion 10^9
T tera trillion 10^12
P peta quadrillion 10^15
E exa quintillion 10^18
Z zetta sextillion 10^21
Y yotta septillion 10^24
Watch Powers of Ten (1977) on YouTube
14. Big Data Sources
Millions of servers (logs)
Billions of users (social networks)
Billions of devices (smartphones)
+ Time/Space = Big Data
15. Big Data Examples
Facebook collects 500 TB per day (1)
Google processes 24 PB per day (2)
We create 2.5 EB per day (3)
(1) http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/
(2) http://en.wikipedia.org/wiki/Petabyte (2009)
(3) http://www-01.ibm.com/software/data/bigdata/
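To make these daily volumes easier to compare, the arithmetic can be sketched as a conversion to bytes per second. This assumes decimal (SI) prefixes as listed on the "How Big is our Data?" slide and a constant rate over 24 hours; the names `PREFIX`, `bytes_per_second`, and `daily_volumes` are illustrative only.

```python
# Decimal (SI) prefixes, as on the "How Big is our Data?" slide.
PREFIX = {"M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18}

SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def bytes_per_second(amount, prefix):
    """Convert a daily volume (e.g. 500 TB/day) into bytes per second.

    Assumption: the volume is spread evenly over the day.
    """
    return amount * PREFIX[prefix] / SECONDS_PER_DAY


# Figures quoted on the slide (amount, prefix), all per day.
daily_volumes = {
    "Facebook": (500, "T"),
    "Google": (24, "P"),
    "Everyone": (2.5, "E"),
}

rates = {name: bytes_per_second(a, p) for name, (a, p) in daily_volumes.items()}

for name, rate in rates.items():
    # Print each rate in gigabytes per second for easy comparison.
    print(f"{name}: {rate / PREFIX['G']:.1f} GB/s")
```

Under these assumptions, 500 TB per day is a sustained rate of roughly 5.8 GB every second.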
16. How Small is our Wisdom?
Wisdom
Knowledge
Information
Big Data
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
T. S. Eliot, The Rock
17. V for ...
Volume
Scale
Sources
Variety
Relational
NoSQL
Velocity
Operational
Analytical
18. Scalability
Scaling up and Scaling out
Partitioning and Sharding
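Scaling out relies on splitting the data across machines. A minimal sketch of hash-based sharding follows, assuming a fixed number of shards and plain in-memory dictionaries standing in for servers; `shard_for`, `put`, and `get` are hypothetical helper names, not the API of any particular database.

```python
import hashlib

NUM_SHARDS = 4

# Each dict stands in for one server (shard) in a scaled-out cluster.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard index using a stable hash.

    hashlib is used instead of the built-in hash() so the mapping is
    the same across processes and restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key: str, value) -> None:
    """Store the value on the shard that owns the key."""
    shards[shard_for(key)][key] = value


def get(key: str):
    """Read the value from the shard that owns the key (None if absent)."""
    return shards[shard_for(key)].get(key)


for user_id in range(1000):
    put(f"user:{user_id}", {"id": user_id})

# Show how evenly the keys were spread over the shards.
print([len(s) for s in shards])
```

A plain modulo scheme like this moves most keys when the number of shards changes, which is one reason systems such as Dynamo use consistent hashing instead.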
24. Key-Value Stores
(Key:string) => Value
fast reads, low-latency writes
used for sessions, carts
Dynamo: Amazon’s Highly Available Key-value Store (2007)
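The `(Key:string) => Value` model can be sketched with an in-memory stand-in. The class below is a hypothetical illustration, not Dynamo's interface: the store only sees a string key and an opaque blob, so a shopping cart is serialized by the application and can only be fetched by its key.

```python
import json


class KeyValueStore:
    """In-memory stand-in for a key-value store: string keys, opaque byte values."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        # Deleting a missing key is not an error in this sketch.
        self._data.pop(key, None)


store = KeyValueStore()

# The application, not the store, decides how a cart is encoded.
cart = {"items": [{"sku": "book-123", "qty": 2}]}
store.put("cart:session-abc", json.dumps(cart).encode("utf-8"))

# The only access path is the key; the store cannot query inside the value.
loaded = json.loads(store.get("cart:session-abc"))
print(loaded["items"][0]["qty"])
```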
25. Bigtable Clones
Google's Distributed Storage System
(row:string, col:string, ts:int64) => string
used by Google; open-source clones such as Apache HBase are used by many other companies
Bigtable: A Distributed Storage System for Structured Data (2006)
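The `(row, col, ts) => string` mapping can be sketched as a small versioned table. This is only an in-memory illustration under simplifying assumptions (no column families, no persistence, no distribution); `TinyTable` and its methods are hypothetical names, and the row and column keys are sample values.

```python
class TinyTable:
    """Toy model of the Bigtable data model: (row, column, timestamp) -> string."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row: str, col: str, ts: int, value: str) -> None:
        """Write one version of a cell."""
        self._cells.setdefault((row, col), {})[ts] = value

    def get(self, row: str, col: str, ts=None):
        """Return the newest version, or the newest one at or before ts."""
        versions = self._cells.get((row, col), {})
        candidates = [t for t in versions if ts is None or t <= ts]
        return versions[max(candidates)] if candidates else None

    def scan_row(self, row: str) -> dict:
        """Return the latest value of every column in a row, sorted by column."""
        cols = sorted(c for (r, c) in self._cells if r == row)
        return {c: self.get(row, c) for c in cols}


table = TinyTable()
table.put("com.example.www", "contents:", 3, "<html>v1</html>")
table.put("com.example.www", "contents:", 5, "<html>v2</html>")
table.put("com.example.www", "anchor:example.org", 4, "Example")

print(table.get("com.example.www", "contents:"))        # newest version
print(table.get("com.example.www", "contents:", ts=4))  # version as of ts=4
```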
26. Document Databases
document-oriented (content query)
semi-structured data (JSON)
used for web apps
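Unlike a key-value store, a document database can query on the content of the stored documents. The sketch below imitates that idea over a Python list of JSON-like dictionaries; `find` and `get_path` are hypothetical helpers with equality matching on dotted paths only, not the query API of any real product.

```python
_MISSING = object()  # sentinel: the path does not exist in the document


def get_path(doc, path: str):
    """Follow a dotted path such as 'author.name' into nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return _MISSING
        current = current[part]
    return current


def find(collection, query: dict):
    """Return documents whose fields equal every (path, value) pair in query."""
    return [
        doc
        for doc in collection
        if all(get_path(doc, path) == expected for path, expected in query.items())
    ]


# Semi-structured data: documents in one collection need not share a schema.
posts = [
    {"_id": 1, "title": "Hello", "author": {"name": "ann"}, "tags": ["intro"]},
    {"_id": 2, "title": "Graphs", "author": {"name": "bob"}},
    {"_id": 3, "title": "More graphs", "author": {"name": "bob"}, "draft": True},
]

by_bob = find(posts, {"author.name": "bob"})
print([doc["_id"] for doc in by_bob])
```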
27. Graph Databases
property graph
index-free adjacency
used for recommendations, social networks
29. Property Graph
A property graph is a directed, labeled, attributed graph
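One way to picture this definition is as a pair of record types. The sketch below is an assumption-laden illustration (the `Vertex`, `Edge`, and `add_edge` names are hypothetical): edges are directed and labeled, both vertices and edges carry attributes, and each vertex holds direct references to its edges, which is the idea behind index-free adjacency.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # identity-based equality avoids recursing through cycles
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)  # attributed
    out_edges: list = field(default_factory=list)   # direct references,
    in_edges: list = field(default_factory=list)    # no global index needed


@dataclass(eq=False)
class Edge:
    label: str    # labeled
    tail: Vertex  # directed: the edge goes from tail ...
    head: Vertex  # ... to head
    properties: dict = field(default_factory=dict)  # attributed


def add_edge(tail: Vertex, label: str, head: Vertex, **properties) -> Edge:
    """Create a directed edge and register it on both endpoints."""
    edge = Edge(label, tail, head, properties)
    tail.out_edges.append(edge)
    head.in_edges.append(edge)
    return edge


alice = Vertex("alice", {"age": 30})
bob = Vertex("bob", {"age": 27})
knows = add_edge(alice, "knows", bob, since=2010)

# Following an edge is a pointer hop from the vertex itself.
print([e.head.id for e in alice.out_edges if e.label == "knows"])
```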
30. Graph Traversal
Gremlin is jumping
- from vertex to vertex
- from vertex to edge
- from edge to vertex
https://github.com/tinkerpop/gremlin/wiki
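The three kinds of jumps can be imitated in a few lines of Python. This is a sketch of the idea only, not Gremlin itself: the function names loosely mirror the Gremlin steps `outE`, `inE`, `inV`, `outV`, `out`, and `in`, and the small edge list is made-up sample data.

```python
# Each edge is (tail vertex, label, head vertex); sample data for illustration.
EDGES = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("bob", "created", "project"),
    ("carol", "created", "project"),
]


def out_e(vertex, label=None):
    """Vertex -> outgoing edges, optionally filtered by label."""
    return [e for e in EDGES if e[0] == vertex and label in (None, e[1])]


def in_e(vertex, label=None):
    """Vertex -> incoming edges, optionally filtered by label."""
    return [e for e in EDGES if e[2] == vertex and label in (None, e[1])]


def in_v(edge):
    """Edge -> the vertex it points to (its head)."""
    return edge[2]


def out_v(edge):
    """Edge -> the vertex it starts from (its tail)."""
    return edge[0]


def out(vertices, label=None):
    """Vertex -> vertex along outgoing edges (out_e followed by in_v)."""
    return [in_v(e) for v in vertices for e in out_e(v, label)]


def in_(vertices, label=None):
    """Vertex -> vertex along incoming edges (in_e followed by out_v)."""
    return [out_v(e) for v in vertices for e in in_e(v, label)]


# Two chained hops: what did the people alice knows create?
print(out(out(["alice"], "knows"), "created"))
```

Chaining steps this way is what makes a traversal read like a path through the graph.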
31. DBpedia Traversal
gremlin> g = new SparqlRepositorySailGraph("http://dbpedia.org/sparql")
gremlin> r = g.v('http://dbpedia.org/resource/Tim_Berners-Lee')
gremlin> r.out('http://www.w3.org/2000/01/rdf-schema#comment').has('lang','fr').value
==>Sir Timothy John Berners-Lee est un citoyen britannique surtout connu comme le principal inventeur
du World Wide Web. En juillet 2004, il est anobli par la reine Elizabeth II pour ce travail et son nom
officiel devient Sir Timothy John Berners-Lee. Depuis 1994, il préside le World Wide Web Consortium
(W3C), organisme qu'il a fondé.
gremlin> r.in('http://dbpedia.org/ontology/influenced')
==>v[http://dbpedia.org/resource/Paul_Otlet]
gremlin> r.in('http://dbpedia.org/ontology/influenced').out('http://dbpedia.org/ontology/influenced')
==>v[http://dbpedia.org/resource/Douglas_Engelbart]
==>v[http://dbpedia.org/resource/Ted_Nelson]
==>v[http://dbpedia.org/resource/Vannevar_Bush]
==>v[http://dbpedia.org/resource/Tim_Berners-Lee]
...
32. Triple/RDF Stores
Subject-Predicate-Object
SPARQL as query language
AllegroGraph, OpenLink Virtuoso, ...
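The subject-predicate-object model can be sketched as a set of 3-tuples queried by pattern, which is roughly what a basic SPARQL triple pattern does. The triples below reuse the "influenced" relations from the DBpedia traversal slide with shortened names instead of full URIs; `match` is a hypothetical helper in which `None` acts as a wildcard, and this is not a SPARQL engine.

```python
# (subject, predicate, object) triples; names shortened from the DBpedia URIs.
TRIPLES = {
    ("Paul_Otlet", "influenced", "Douglas_Engelbart"),
    ("Paul_Otlet", "influenced", "Ted_Nelson"),
    ("Paul_Otlet", "influenced", "Vannevar_Bush"),
    ("Paul_Otlet", "influenced", "Tim_Berners-Lee"),
}


def match(s=None, p=None, o=None):
    """Return all triples matching the pattern; None matches anything."""
    return sorted(
        t
        for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Pattern "?who influenced Tim_Berners-Lee": who influenced him?
influencers = [s for (s, _, _) in match(p="influenced", o="Tim_Berners-Lee")]
print(influencers)
```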
33. V for ...
Volume
Scale
Sources
Variety
Relational
NoSQL
Velocity
Operational
Analytical
34. Big Data Processing
Batch Processing
MapReduce
Interactive Analysis
BigQuery
35. MapReduce
MapReduce: Simplified Data Processing on Large Clusters (2004)
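The programming model from that paper can be sketched with the classic word-count example. The version below runs in a single process purely to show the map, shuffle, and reduce phases; the function names are illustrative, and a real framework would distribute each phase across a cluster.

```python
from collections import defaultdict


def map_fn(document: str):
    """Map phase: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_fn(key: str, values):
    """Reduce phase: combine all values for one key into a single result."""
    return (key, sum(values))


def map_reduce(documents):
    """Run map over every document, group by key, then reduce each group."""
    pairs = (pair for doc in documents for pair in map_fn(doc))
    return dict(reduce_fn(key, values) for key, values in shuffle(pairs).items())


counts = map_reduce(["big data", "big graph", "Big Data sources"])
print(counts)
```

Because `map_fn` looks at one document at a time and `reduce_fn` at one key at a time, both phases can be run in parallel on separate machines.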
36. Apache Hadoop
Distributed Data + MapReduce
http://hadoop.apache.org/
37. Latest Trends
http://www.google.com/trends/explore#q=hadoop%2C%20mongodb%2C%20neo4j
38. NoSQL issues
No Distributed Transactions
No SQL as query language