Small, Medium and Big Data

Pierre De Wilde
23 November 2012
ULB - MASTIC
http://mastic.ulb.ac.be

Sir Tim Berners-Lee

http://www.w3.org/People/Berners-Lee/

Semantic Web Trends

http://www.google.com/trends/explore#q=semantic%20web

Linked Data Trends

http://www.google.com/trends/explore#q=semantic%20web%2C%20linked%20data

Linked Data Cloud

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Semantic Web

Semantic
URI, RDF(S), OWL, SPARQL

Web
Scale?

Web Scale

Millions of servers
Billions of users
Billions of objects

=> it's really Big

Big Data Trends

http://www.google.com/trends/explore#q=semantic%20web%2C%20big%20data

Big Data 3 V's

It's not only about big volume of data...

V for ...

Source: Anonymous

V for ...
Volume
- Scale
- Sources

Variety
- Relational
- NoSQL

Velocity
- Operational
- Analytical

How Big is our Data?

M  mega   million      10^6
G  giga   billion      10^9
T  tera   trillion     10^12
P  peta   quadrillion  10^15
E  exa    quintillion  10^18
Z  zetta  sextillion   10^21
Y  yotta  septillion   10^24

Watch Powers of Ten (1977) on YouTube

Big Data Sources

Millions of servers (logs)

Billions of users (social networks)

Billions of devices (smartphones)

+ Time/Space = Big Data

Big Data Examples

Facebook collects 500 TB per day (1)

Google processes 24 PB per day (2)

We create 2.5 EB per day (3)

(1) http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/
(2) http://en.wikipedia.org/wiki/Petabyte (2009)
(3) http://www-01.ibm.com/software/data/bigdata/
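
To compare the three figures on one scale, here is a minimal arithmetic sketch. It assumes decimal (SI) prefixes as in the table above; the helper name `to_bytes` is hypothetical and only serves the illustration.

```python
# Decimal (SI) prefixes, as listed on the "How Big is our Data?" slide.
PREFIX_EXPONENT = {"M": 6, "G": 9, "T": 12, "P": 15, "E": 18, "Z": 21, "Y": 24}


def to_bytes(value, prefix):
    """Convert a quantity such as (500, "T") into bytes (hypothetical helper)."""
    return value * 10 ** PREFIX_EXPONENT[prefix]


facebook_per_day = to_bytes(500, "T")  # 500 TB per day
google_per_day = to_bytes(24, "P")     # 24 PB per day
world_per_day = to_bytes(2.5, "E")     # 2.5 EB per day

# Ratios between the quoted daily volumes.
google_vs_facebook = google_per_day / facebook_per_day
world_vs_google = world_per_day / google_per_day

print(f"Google / Facebook: {google_vs_facebook:.0f}x")
print(f"World / Google:    {world_vs_google:.1f}x")
```

Taking the quoted numbers at face value, the daily volume processed by Google is about 48 times what Facebook collects, and the world creates roughly 100 times what Google processes.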

How Small is our Wisdom?

Wisdom

Knowledge

Information

Big Data

Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?

T. S. Eliot, The Rock

Scalability

Scaling up and Scaling out

Partitioning and Sharding
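
As an illustration of scaling out, here is a minimal sketch of hash-based sharding. The four in-memory dictionaries stand in for four servers, and the function names (`shard_for`, `put`, `get`) are assumptions made for this example, not an API from any particular product.

```python
import hashlib

NUM_SHARDS = 4
# Each dict stands in for one commodity server (assumption for illustration).
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index with a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def put(key, value):
    """Store the value on the shard that owns the key."""
    shards[shard_for(key)][key] = value


def get(key):
    """Read the value back from the shard that owns the key."""
    return shards[shard_for(key)].get(key)


put("user:42", "Alice")
put("user:43", "Bob")
print(get("user:42"), "lives on shard", shard_for("user:42"))
```

A plain modulo scheme like this one moves most keys when the number of shards changes; consistent hashing is the usual remedy, but it is left out here to keep the sketch short.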

RDBMS

Row Store

B-tree indexing

SQL as query language
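
The three traits above can be seen in a few lines with SQLite, which ships with Python and stores rows and indexes in B-trees. This is only a sketch: the table and column names are invented, and the single fact it stores comes from the DBpedia example later in the deck.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE influenced (source_id INTEGER, target_id INTEGER);
    -- Secondary index used to answer "who influenced X?" without a full scan.
    CREATE INDEX idx_influenced_target ON influenced (target_id);
    """
)
conn.executemany(
    "INSERT INTO person (id, name) VALUES (?, ?)",
    [(1, "Paul Otlet"), (2, "Tim Berners-Lee")],
)
conn.execute("INSERT INTO influenced (source_id, target_id) VALUES (1, 2)")

# A relationship is resolved with a join between two tables.
rows = conn.execute(
    """
    SELECT p.name
    FROM influenced AS i
    JOIN person AS p ON p.id = i.source_id
    WHERE i.target_id = ?
    """,
    (2,),
).fetchall()
print(rows)
```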

RDBMS issues

Scale up (big servers)

Schemaful (structured)

Index-intensive (join)

NoSQL

Scale out (commodity servers)

Schemaless (semi-structured)

Index-free adjacency (graph)

NoSQL databases

Credit: Neo Technology

Key-Value Stores

(Key:string) => Value

fast read, low write latency

used for sessions, carts

Dynamo: Amazon’s Highly Available Key-value Store (2007)
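
A minimal in-memory sketch of the key-value model, assuming a single node and no persistence or replication. The class `KeyValueStore` and the session key are hypothetical; Dynamo itself is a distributed system and is not reproduced here.

```python
class KeyValueStore:
    """Toy key-value store: an opaque value looked up by a string key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
# Typical use: a web session or shopping cart addressed by its identifier.
store.put("session:abc123", {"user": "alice", "cart": ["book"]})
print(store.get("session:abc123"))
```

The store never looks inside the value, which is why lookups stay simple and why queries on the content of a value are not available in this model.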

Bigtable Clones

Google's Distributed Storage System

(row:string, col:string, ts:int64) => string

used by Google; open-source clones such as Apache HBase are used by many other companies

Bigtable: A Distributed Storage System for Structured Data (2006)
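
The `(row, col, ts) => string` mapping can be sketched as a sparse, versioned map. This toy version keeps everything in one Python dictionary; the class name `SparseTable` and the row and column names are illustrative assumptions, and nothing here reflects how Bigtable stores or distributes data internally.

```python
class SparseTable:
    """Toy model of the (row, column, timestamp) => string mapping."""

    def __init__(self):
        self._cells = {}

    def put(self, row, col, ts, value):
        self._cells[(row, col, ts)] = value

    def get(self, row, col, ts=None):
        """Return the value at a timestamp, or the latest version if ts is None."""
        versions = {
            t: v for (r, c, t), v in self._cells.items() if r == row and c == col
        }
        if not versions:
            return None
        if ts is None:
            return versions[max(versions)]
        return versions.get(ts)


table = SparseTable()
table.put("com.example.www", "contents:", 1, "<html>v1</html>")
table.put("com.example.www", "contents:", 2, "<html>v2</html>")
print(table.get("com.example.www", "contents:"))     # latest version
print(table.get("com.example.www", "contents:", 1))  # explicit timestamp
```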

Document Databases

document-oriented (content query)

semi-structured data (JSON)

used for web apps
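
A minimal sketch of a content query over semi-structured JSON documents. The documents, field names and the `find` helper are invented for illustration; real document databases add indexes and a richer query language.

```python
import json

# Documents in the same collection need not share the same fields.
raw_documents = [
    '{"type": "talk", "title": "Small, Medium and Big Data", "year": 2012}',
    '{"type": "paper", "title": "MapReduce", "year": 2004, "pages": 13}',
    '{"type": "paper", "title": "Bigtable", "year": 2006}',
]
collection = [json.loads(raw) for raw in raw_documents]


def find(documents, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [
        doc
        for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


papers = find(collection, type="paper")
print([doc["title"] for doc in papers])
```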

Graph Databases

property graph

index-free adjacency

used for recommendations, social networks

Property Graph

A property graph is a directed, labeled, attributed graph
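
The definition can be made concrete with two small data classes: edges are directed (tail to head), labeled, and both vertices and edges carry key-value properties. The class and field names below are assumptions chosen for this sketch, not the API of a specific graph database.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    id: str
    properties: dict = field(default_factory=dict)  # attributed


@dataclass
class Edge:
    out_v: Vertex  # tail of the edge (directed)
    label: str     # labeled
    in_v: Vertex   # head of the edge
    properties: dict = field(default_factory=dict)  # attributed


otlet = Vertex("Paul_Otlet", {"name": "Paul Otlet"})
tbl = Vertex("Tim_Berners-Lee", {"name": "Tim Berners-Lee"})
edge = Edge(otlet, "influenced", tbl, {"source": "DBpedia"})

print(edge.out_v.properties["name"], edge.label, edge.in_v.properties["name"])
```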

Graph Traversal

Gremlin is jumping

- from vertex to vertex
- from vertex to edge
- from edge to vertex

https://github.com/tinkerpop/gremlin/wiki

DBpedia Traversal

gremlin> g = new SparqlRepositorySailGraph("http://dbpedia.org/sparql")

gremlin> r = g.v('http://dbpedia.org/resource/Tim_Berners-Lee')

gremlin> r.out('http://www.w3.org/2000/01/rdf-schema#comment').has('lang','fr').value
==>Sir Timothy John Berners-Lee est un citoyen britannique surtout connu comme le principal inventeur
du World Wide Web. En juillet 2004, il est anobli par la reine Elizabeth II pour ce travail et son nom
officiel devient Sir Timothy John Berners-Lee. Depuis 1994, il préside le World Wide Web Consortium
(W3C), organisme qu'il a fondé.

gremlin> r.in('http://dbpedia.org/ontology/influenced')
==>v[http://dbpedia.org/resource/Paul_Otlet]

gremlin> r.in('http://dbpedia.org/ontology/influenced').out('http://dbpedia.org/ontology/influenced')
==>v[http://dbpedia.org/resource/Douglas_Engelbart]
==>v[http://dbpedia.org/resource/Ted_Nelson]
==>v[http://dbpedia.org/resource/Vannevar_Bush]
==>v[http://dbpedia.org/resource/Tim_Berners-Lee]
...
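
The same walk can be mimicked in plain Python to show the three kinds of jump (vertex to edge, edge to vertex, vertex to vertex). The edge list reuses only the results printed in the session above; the function names are hypothetical and this is not Gremlin's API.

```python
# Directed, labeled edges: (tail, label, head), taken from the session output.
edges = [
    ("Paul_Otlet", "influenced", "Douglas_Engelbart"),
    ("Paul_Otlet", "influenced", "Ted_Nelson"),
    ("Paul_Otlet", "influenced", "Vannevar_Bush"),
    ("Paul_Otlet", "influenced", "Tim_Berners-Lee"),
]


def out_edges(vertex, label):
    """Jump from a vertex to its outgoing edges."""
    return [e for e in edges if e[0] == vertex and e[1] == label]


def head(edge):
    """Jump from an edge to the vertex it points to."""
    return edge[2]


def out(vertex, label):
    """Jump from vertex to vertex along outgoing edges."""
    return [head(e) for e in out_edges(vertex, label)]


def in_(vertex, label):
    """Jump from vertex to vertex against the edge direction."""
    return [e[0] for e in edges if e[2] == vertex and e[1] == label]


influencers = in_("Tim_Berners-Lee", "influenced")
peers = [w for v in influencers for w in out(v, "influenced")]
print(influencers)
print(peers)
```

A graph database with index-free adjacency would follow pointers stored on each vertex instead of scanning the whole edge list as this sketch does.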

Triple/RDF Stores

Subject-Predicate-Object

SPARQL as query language

AllegroGraph, OpenLink Virtuoso, ...
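
A toy illustration of the subject-predicate-object model, with a pattern matcher in which `None` plays the role of a SPARQL variable. The shortened names are assumptions made for readability (real RDF uses full URIs), and the `match` function is hypothetical, not part of any triple store.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("Paul_Otlet", "influenced", "Tim_Berners-Lee"),
    ("Paul_Otlet", "influenced", "Ted_Nelson"),
    ("Tim_Berners-Lee", "label", "Tim Berners-Lee"),
}


def match(s=None, p=None, o=None):
    """Return the triples matching the pattern; None matches anything."""
    return sorted(
        t
        for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Roughly: SELECT ?s WHERE { ?s influenced Tim_Berners-Lee }
who = [s for (s, _, _) in match(p="influenced", o="Tim_Berners-Lee")]
print(who)
```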

Big Data Processing

Batch Processing
MapReduce

Interactive Analysis
BigQuery

MapReduce

MapReduce: Simplified Data Processing on Large Clusters (2004)
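
The programming model can be sketched on a single machine with the classic word count. The three phases (map, shuffle, reduce) run sequentially here; a real framework distributes them over a cluster and handles failures. Function names and input documents are assumptions for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["big data", "small data", "big medium data"])
print(counts)
```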

Apache Hadoop

Distributed Data + MapReduce

http://hadoop.apache.org/

Latest Trends

http://www.google.com/trends/explore#q=hadoop%2C%20mongodb%2C%20neo4j

NoSQL issues

No Distributed Transactions

No SQL as query language

NewSQL

NoSQL + Distributed Transactions + SQL

Spanner: Google's Globally-Distributed Database (2012)

Thank you

Credit: Most images created by Flickr Creative Commons artists or Wikimedia Commons artists
