Small, Medium & Big Data
Pierre De Wilde
23 November 2012
ULB - MASTIC
http://mastic.ulb.ac.be
Sir Tim Berners-Lee




             http://www.w3.org/People/Berners-Lee/
Semantic Web Trends




        http://www.google.com/trends/explore#q=semantic%20web
Linked Data Trends




   http://www.google.com/trends/explore#q=semantic%20web%2C%20linked%20data
Linked Data Cloud




 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
Semantic Web


               Semantic
                 URI, RDF(S), OWL, SPARQL



               Web
                 Scale?
Web Scale


            Millions of servers
            Billions of users
            Billions of objects


            => it's really Big
Big Data Trends




    http://www.google.com/trends/explore#q=semantic%20web%2C%20big%20data
Big Data 3 V's




    It's not only about a big volume of data...
V for ...




            Source: Anonymous
V for ...
            Volume
              Scale
              Sources

            Variety
              Relational
              NoSQL

            Velocity
              Operational
              Analytical
V for ...
            Volume
              Scale
              Sources

            Variety
              Relational
              NoSQL

            Velocity
              Operational
              Analytical
How Big is our Data?


        M     mega            million             10^6
        G     giga            billion             10^9
        T     tera            trillion            10^12
        P     peta            quadrillion         10^15
        E     exa             quintillion         10^18
        Z     zetta           sextillion          10^21
        Y     yotta           septillion          10^24



            Watch "Powers of Ten" (1977) on YouTube
Big Data Sources


       Millions of servers (logs)

       Billions of users (social networks)

       Billions of devices (smartphones)

       + Time/Space = Big Data
Big Data Examples


            Facebook collects 500 TB per day (1)

            Google processes 24 PB per day (2)

            We create 2.5 EB per day (3)




    (1) http://gigaom.com/data/facebook-is-collecting-your-data-500-terabytes-a-day/
                       (2) http://en.wikipedia.org/wiki/Petabyte (2009)
                     (3) http://www-01.ibm.com/software/data/bigdata/
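To put the three daily volumes on one scale, here is a minimal sketch that converts them to bytes using the decimal prefixes from the "How Big is our Data?" slide. The figures are the ones cited on the slide; the names `UNITS` and `to_bytes` are illustrative, not from the original deck.

```python
# Decimal (SI) byte units, matching the prefix table in the slides.
UNITS = {
    "MB": 10**6, "GB": 10**9, "TB": 10**12, "PB": 10**15,
    "EB": 10**18, "ZB": 10**21, "YB": 10**24,
}

def to_bytes(value, unit):
    """Convert a quantity such as (500, "TB") into a number of bytes."""
    return value * UNITS[unit]

# Daily volumes as quoted on the slide.
facebook_per_day = to_bytes(500, "TB")
google_per_day = to_bytes(24, "PB")
world_per_day = to_bytes(2.5, "EB")

# Ratios relative to Facebook's daily intake.
google_vs_facebook = google_per_day / facebook_per_day
world_vs_facebook = world_per_day / facebook_per_day

print(f"Google / Facebook: {google_vs_facebook:.0f}x")
print(f"World  / Facebook: {world_vs_facebook:.0f}x")
```

With these figures, Google's daily processing is 48 times Facebook's daily intake, and the worldwide figure is 5,000 times it.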
How Small is our Wisdom?

                           Wisdom




                        Knowledge



                      Information


                   Big Data

            Where is the wisdom we have lost in knowledge?
          Where is the knowledge we have lost in information?

                                        T. S. Eliot, The Rock
V for ...
            Volume
              Scale
              Sources

            Variety
              Relational
              NoSQL

            Velocity
              Operational
              Analytical
Scalability


        Scaling up and Scaling out

        Partitioning and Sharding
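Sharding can be sketched as a function that maps each key to one of N servers. The example below assumes simple hash-based sharding with a fixed shard count; the user names and the count of 4 are hypothetical, and real systems often use consistent hashing or range partitioning instead.

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash.

    Python's built-in hash() is randomized per process for strings,
    so a digest is used to keep the mapping stable across runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Hypothetical user keys spread over 4 hypothetical shards (servers).
NUM_SHARDS = 4
shards = {i: [] for i in range(NUM_SHARDS)}
for user in ["alice", "bob", "carol", "dave", "erin"]:
    shards[shard_for(user, NUM_SHARDS)].append(user)

for index, users in shards.items():
    print(index, users)
```

One consequence of the modulo scheme is that changing the number of shards remaps most keys, which is one reason consistent hashing is often preferred when scaling out.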
Relational Databases
RDBMS


        Row Store

        B-tree indexing

        SQL as query language
RDBMS issues


      Scale up (big servers)

      Schemaful (structured)

      Index-intensive (join)
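The three traits above (fixed schema, B-tree indexes, joins expressed in SQL) can be seen in a small sketch using Python's built-in sqlite3 module. The table and column names are hypothetical; the single "influenced" fact is taken from the DBpedia example later in the deck.

```python
import sqlite3

# In-memory database: the schema must be declared before any row is stored.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE influenced (
        src INTEGER NOT NULL REFERENCES person(id),
        dst INTEGER NOT NULL REFERENCES person(id)
    );
    -- SQLite stores indexes as B-trees; this one speeds up lookups by target.
    CREATE INDEX idx_influenced_dst ON influenced(dst);
""")

conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "Paul Otlet"), (2, "Tim Berners-Lee")])
conn.execute("INSERT INTO influenced VALUES (?, ?)", (1, 2))

# "Who influenced Tim Berners-Lee?" needs two joins through the link table.
rows = conn.execute("""
    SELECT p.name
    FROM influenced AS i
    JOIN person AS p ON p.id = i.src
    JOIN person AS q ON q.id = i.dst
    WHERE q.name = ?
""", ("Tim Berners-Lee",)).fetchall()

print(rows)
```

Each hop in a relationship costs another join and another index lookup, which is the cost the graph-database slides contrast with index-free adjacency.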
NoSQL


        Scale out (commodity servers)

        Schemaless (semi-structured)

        Index-free adjacency (graph)
NoSQL databases




              Credit: Neo Technology
Key-Value Stores


       (Key:string) => Value

       fast read, low write latency

       used for sessions, carts




        Dynamo: Amazon’s Highly Available Key-value Store (2007)
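The key-value model can be sketched as a map from string keys to opaque values. This is a toy in-memory illustration, not Dynamo's design: the class name, the key format, and the session contents are all hypothetical, and replication and partitioning are left out.

```python
class KeyValueStore:
    """Toy in-memory key-value store: string keys, opaque values."""

    def __init__(self):
        self._data = {}

    def put(self, key: str, value) -> None:
        self._data[key] = value

    def get(self, key: str, default=None):
        return self._data.get(key, default)

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op.
        self._data.pop(key, None)

# Typical use from the slide: a web session holding a shopping cart.
store = KeyValueStore()
store.put("session:42", {"user": "alice", "cart": ["book", "pen"]})

cart = store.get("session:42")["cart"]
print(cart)
```

The store never inspects the value, so lookups are only possible by key; that restriction is what makes the model easy to distribute.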
Bigtable Clones


        Google's Distributed Storage System

        (row:string, col:string, ts:int64) => string

        used by Google; open-source clones such as Apache HBase are widely used elsewhere




       Bigtable: A Distributed Storage System for Structured Data (2006)
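The (row, col, ts) => string mapping above can be sketched as a dictionary of versioned cells. This is a toy illustration of the data model only; the class name `TinyTable`, the sample row key, and the cell values are assumptions, and nothing here reflects how Bigtable stores or distributes data.

```python
class TinyTable:
    """Toy model of the mapping (row, column, timestamp) -> string."""

    def __init__(self):
        # (row, column) -> {timestamp: value}
        self._cells = {}

    def put(self, row: str, col: str, ts: int, value: str) -> None:
        self._cells.setdefault((row, col), {})[ts] = value

    def get(self, row: str, col: str, ts=None):
        """Return the newest version, or the newest one at or before ts."""
        versions = self._cells.get((row, col), {})
        candidates = [t for t in versions if ts is None or t <= ts]
        if not candidates:
            return None
        return versions[max(candidates)]

table = TinyTable()
# Hypothetical page contents stored under a reversed-domain row key.
table.put("com.example.www", "contents:", 3, "<html>v1</html>")
table.put("com.example.www", "contents:", 5, "<html>v2</html>")

latest = table.get("com.example.www", "contents:")
as_of_4 = table.get("com.example.www", "contents:", ts=4)
print(latest, as_of_4)
```

Keeping the timestamp in the key is what lets one cell hold several versions of the same value.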
Document Databases


       document-oriented (content query)

       semi-structured data (JSON)

       used for web apps
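A content query over semi-structured JSON can be sketched with plain Python. The documents and the `find` helper below are hypothetical; real document databases add indexes and richer query operators on top of the same idea.

```python
import json

# Documents in one collection need not share the same fields (schemaless).
raw_documents = [
    '{"type": "post", "title": "Big Data", "tags": ["nosql", "hadoop"]}',
    '{"type": "post", "title": "Linked Data", "author": {"name": "alice"}}',
    '{"type": "comment", "text": "Nice slides"}',
]
collection = [json.loads(raw) for raw in raw_documents]

def find(documents, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [
        doc for doc in documents
        if all(doc.get(field) == value for field, value in criteria.items())
    ]

posts = find(collection, type="post")
titles = [doc["title"] for doc in posts]
print(titles)
```

Unlike a key-value store, the query looks inside the stored value, which is what "content query" refers to.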
Graph Databases


       property graph

       index-free adjacency

       used for recommendations, social networks
Graph




        G = (V, E)
Property Graph




     A property graph is a directed, labeled, attributed graph
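The definition can be sketched as two small classes: edges are directed and labeled, and both vertices and edges carry attributes. The class names and the `source` edge property are hypothetical; the "influenced" relation comes from the DBpedia example in this deck.

```python
class Vertex:
    def __init__(self, vertex_id, **properties):
        self.id = vertex_id
        self.properties = properties      # attributed: key/value pairs
        self.out_edges = []               # direct references to edges,
        self.in_edges = []                # i.e. index-free adjacency

class Edge:
    def __init__(self, tail, label, head, **properties):
        self.tail = tail                  # directed: tail -> head
        self.label = label                # labeled: the relationship type
        self.head = head
        self.properties = properties      # attributed
        tail.out_edges.append(self)
        head.in_edges.append(self)

otlet = Vertex("Paul_Otlet", name="Paul Otlet")
tbl = Vertex("Tim_Berners-Lee", name="Tim Berners-Lee")
edge = Edge(otlet, "influenced", tbl, source="dbpedia")  # hypothetical property

# Following in-edges needs no global index: the vertex already holds them.
influencers = [e.tail.properties["name"]
               for e in tbl.in_edges if e.label == "influenced"]
print(influencers)
```

Because each vertex points directly at its incident edges, the cost of a hop depends on the vertex's degree rather than on the size of the whole graph.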
Graph Traversal


                              Gremlin is jumping

                              - from vertex to vertex
                              - from vertex to edge
                              - from edge to vertex




            https://github.com/tinkerpop/gremlin/wiki
DBpedia Traversal


gremlin> g = new SparqlRepositorySailGraph("http://dbpedia.org/sparql")

gremlin> r = g.v('http://dbpedia.org/resource/Tim_Berners-Lee')

gremlin> r.out('http://www.w3.org/2000/01/rdf-schema#comment').has('lang','fr').value
==>Sir Timothy John Berners-Lee est un citoyen britannique surtout connu comme le principal inventeur
du World Wide Web. En juillet 2004, il est anobli par la reine Elizabeth II pour ce travail et son nom
officiel devient Sir Timothy John Berners-Lee. Depuis 1994, il préside le World Wide Web Consortium
(W3C), organisme qu'il a fondé.

gremlin> r.in('http://dbpedia.org/ontology/influenced')
==>v[http://dbpedia.org/resource/Paul_Otlet]

gremlin> r.in('http://dbpedia.org/ontology/influenced').out('http://dbpedia.org/ontology/influenced')
==>v[http://dbpedia.org/resource/Douglas_Engelbart]
==>v[http://dbpedia.org/resource/Ted_Nelson]
==>v[http://dbpedia.org/resource/Vannevar_Bush]
==>v[http://dbpedia.org/resource/Tim_Berners-Lee]
...
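The last two steps of the session chain an "in" step with an "out" step. A minimal Python analogue over a plain edge list is sketched below; it is not the Gremlin API. The edge list holds only the four results shown above (the real output continues past the "..."), and the function names `out` and `in_` are illustrative.

```python
# (tail, label, head) edges, limited to the results printed in the session.
EDGES = [
    ("Paul_Otlet", "influenced", "Douglas_Engelbart"),
    ("Paul_Otlet", "influenced", "Ted_Nelson"),
    ("Paul_Otlet", "influenced", "Vannevar_Bush"),
    ("Paul_Otlet", "influenced", "Tim_Berners-Lee"),
]

def out(vertices, label):
    """Follow outgoing edges with the given label from each vertex."""
    return [head for v in vertices
            for (tail, edge_label, head) in EDGES
            if tail == v and edge_label == label]

def in_(vertices, label):
    """Follow incoming edges with the given label to each vertex."""
    return [tail for v in vertices
            for (tail, edge_label, head) in EDGES
            if head == v and edge_label == label]

r = ["Tim_Berners-Lee"]
influencers = in_(r, "influenced")            # who influenced r?
peers = out(influencers, "influenced")        # whom else did they influence?
print(influencers)
print(peers)
```

Each step takes a list of vertices and returns a list of vertices, which is why steps can be chained in any order.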
Triple/RDF Stores


        Subject-Predicate-Object

        SPARQL as query language

        AllegroGraph, OpenLink Virtuoso, ...
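A triple store can be sketched as a set of (subject, predicate, object) tuples queried by pattern, where an unbound position plays the role of a SPARQL variable. The `match` helper and the `rdfs:label` triple are illustrative assumptions; `dbr:` and `dbo:` stand for the DBpedia resource and ontology namespaces used in the traversal example.

```python
TRIPLES = {
    ("dbr:Paul_Otlet", "dbo:influenced", "dbr:Tim_Berners-Lee"),
    ("dbr:Paul_Otlet", "dbo:influenced", "dbr:Ted_Nelson"),
    ("dbr:Tim_Berners-Lee", "rdfs:label", "Tim Berners-Lee"),  # illustrative
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None means 'any value'."""
    return sorted(
        t for t in TRIPLES
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Roughly: SELECT ?who WHERE { ?who dbo:influenced dbr:Tim_Berners-Lee }
who = [s for (s, _, _) in match(p="dbo:influenced", o="dbr:Tim_Berners-Lee")]
print(who)
```

A full SPARQL engine joins several such patterns on their shared variables; this sketch covers a single pattern only.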
V for ...
            Volume
              Scale
              Sources

            Variety
              Relational
              NoSQL

            Velocity
              Operational
              Analytical
Big Data Processing



        Batch Processing
          MapReduce


        Interactive Analysis
          BigQuery
MapReduce




      MapReduce: Simplified Data Processing on Large Clusters (2004)
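The programming model from the paper can be sketched in a single process with the classic word-count example: a map function emits (key, value) pairs, the pairs are grouped by key, and a reduce function folds each group. The input documents and function names are hypothetical, and everything a real cluster adds (distribution, fault tolerance) is left out.

```python
from collections import defaultdict

def map_phase(document):
    """Emit (word, 1) for every word in one document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Group all emitted values by key, as done between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts for one word."""
    return key, sum(values)

documents = ["big data is big", "small data is small"]

pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(key, values)
              for key, values in shuffle(pairs).items())
print(counts)
```

Because each map call sees one document and each reduce call sees one key, both phases can run in parallel on separate machines.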
Apache Hadoop




        Distributed Data + MapReduce




                http://hadoop.apache.org/
Latest Trends




   http://www.google.com/trends/explore#q=hadoop%2C%20mongodb%2C%20neo4j
NoSQL issues


       No Distributed Transactions

       No SQL as query language
NewSQL




    NoSQL + Distributed Transactions + SQL




         Spanner: Google's Globally-Distributed Database (2012)
Thank you




Credit: Most images created by Flickr Creative Commons Artists or Wikimedia Commons Artists
