Hadoop and Graph Data Management: Challenges and Opportunities

                 Daniel Abadi
     Assistant Professor at Yale University
           Chief Scientist at Hadapt
            Tuesday, November 8th

      (Collaboration with Jiewen Huang)
Two Major Trends
●  Hadoop
   ●  Becoming the de facto standard for large-scale data processing
   ●  Becoming more than just MapReduce
   ●  Ecosystem growing rapidly --- lots of great tools around it
●  Graph data is proliferating; huge value if analyzed and processed
   ●  Social graphs
   ●  Linked data
   ●  Telecom
   ●  Advertising
   ●  Defense
Use Hadoop to Process Graph Data?
●  Of course!
●  BUT:
   ●  Little automatic optimization (non-declarative interface)
   ●  It’s possible to do graphs with Hadoop:
      ●  Really badly
      ●  Suboptimally
      ●  Really well
   ●  Case in point: a subgraph pattern-matching benchmark on three different Hadoop-centered solutions:
      ●  ~1000s,
      ●  ~100s,
      ●  < ~10s
Case study: Linked Data
Example Linked Data
●  Entities (or “resources”) are nodes in the graph
●  Relationships between entities are labeled directed edges in the graph
●  (Resources typically have unique URI identifiers to allow for global references --- not shown in the example to the right)
●  Any resource can connect to any other resource
Graph Representation
●  A linked data graph can be parsed into a series of vertex-edge-vertex triples
●  The first entity is referred to as the subject, the second as the object, and the edge connecting them as the predicate
●  We will call these “RDF triples” (a small sketch follows below)
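
As an editor-added illustration of this representation (a minimal sketch with made-up entities that echo the footballer query used later in the deck), a labeled directed graph flattens into subject-predicate-object triples like so:

# Minimal sketch (hypothetical data): a labeled, directed graph stored as
# adjacency lists, flattened into (subject, predicate, object) "RDF triples".
graph = {
    "Lionel_Messi": [("playsFor", "FC_Barcelona"), ("position", "striker")],
    "FC_Barcelona": [("region", "Catalonia")],
}

triples = [(subject, predicate, obj)
           for subject, out_edges in graph.items()
           for predicate, obj in out_edges]
# e.g. ("Lionel_Messi", "playsFor", "FC_Barcelona")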
Querying Linked Data
●  Linked Data is typically queried in SPARQL
●  Processing a SPARQL query is basically equivalent to the subgraph pattern matching problem
Scalable SPARQL Processing
●  Single-node options are abundant, e.g.,
   ●  Sesame
   ●  Jena
   ●  RDF-3X
   ●  3store
●  Fewer options can scale across multiple machines
   ●  This is where Hadoop comes in!
      ●  One cool solution: SHARD (presented by Kurt Rohloff at HadoopWorld 2010)
         ●  Uses HDFS to store the graph, MapReduce jobs for subgraph pattern matching
         ●  Much nicer than a naïve Hadoop solution
Another Example: Twitter Data
[Figure: Twitter graph with “follows” edges among @joe_hellerstein, @daniel_abadi, and @mikeolson, and “retweeted” edges from @hadoop_is_my_life, @super_hadooper, and @hadoop_is_the_answer to @daniel_abadi and @mikeolson.]
Example Query over Twitter Graph

Who has retweeted both @daniel_abadi and @mikeolson?

[Query graph: an unknown vertex “???” with a “retweeted” edge to @daniel_abadi and a “retweeted” edge to @mikeolson]
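
To make the join explicit (an editor-added sketch with made-up retweet edges, not the SHARD code), each “retweeted” clause selects a set of retweeters and the answer is their intersection on the shared subject vertex:

# Minimal sketch (hypothetical data): "who retweeted both @daniel_abadi
# and @mikeolson?" as an intersection, i.e. a join on the subject vertex.
triples = [
    ("@hadoop_is_my_life",    "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@daniel_abadi"),
    ("@super_hadooper",       "retweeted", "@mikeolson"),
    ("@hadoop_is_the_answer", "retweeted", "@mikeolson"),
]

def retweeters_of(user):
    return {s for (s, p, o) in triples if p == "retweeted" and o == user}

print(retweeters_of("@daniel_abadi") & retweeters_of("@mikeolson"))
# -> {'@super_hadooper'} on this made-up data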
Issues With Hadoop Defaults
●  Hadoop does not default to pipelined algorithms between MapReduce jobs
   ●  The SHARD algorithm matches each clause of the SPARQL subgraph in a separate MapReduce job
      ●  Needs full barrier synchronization between jobs, which is unnecessary
●  Hadoop does not default to pipelined algorithms between Map and Reduce
   ●  Each job performs a join by vertex, where the object of one triple is joined with the subject of another triple
   ●  Joins work much faster if you can pipeline between Map and Reduce (e.g. a pipelined hash join; see the sketch below)
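
For intuition, here is a minimal single-machine sketch of a pipelined hash join (an illustration of the general technique, not Hadoop internals): the build side is hashed once on the join vertex, and each probe-side triple can emit output as soon as it arrives, with no barrier in between.

# Minimal sketch: pipelined hash join on a shared vertex, joining the
# object of one triple pattern with the subject of another.
from collections import defaultdict

def pipelined_hash_join(build_triples, probe_triples):
    # Build phase: hash the build side on its object vertex.
    table = defaultdict(list)
    for s, p, o in build_triples:
        table[o].append((s, p, o))
    # Probe phase: stream the other side and emit matches immediately,
    # rather than waiting for a full barrier between phases.
    for s, p, o in probe_triples:
        for match in table.get(s, ()):   # join: build.object == probe.subject
            yield match, (s, p, o)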
Issues with Hadoop Defaults (cont.)
●  Hadoop hash partitions data across nodes
   ●  Data in the graph is randomly partitioned across nodes
   ●  Data close in the graph could be on a physically distant node
   ●  Therefore each join requires a complete redistribution of the entire data set across the network (and there are many joins)
●  Hadoop defaults to replicating all data 3 times, with no preferences for what is replicated or where it is replicated to
●  Hadoop defaults to using the Hadoop Distributed File System (HDFS) for storage, which is not optimized for graph data
All is not lost! Don’t throw away Hadoop!
●  All we have to do is change the defaults and add a little to it
System Architecture
Partitioning
●  Graphs can be represented as vertex1-edge-vertex2 triples
●  Hash partitioning by vertex1 is straightforward (sketched below)
●  Great for queries like:

Query: Find the names of the strikers that play for FC Barcelona.

SELECT ?name
WHERE { ?player type     footballer   .
        ?player name     ?name        .
        ?player position striker      .
        ?player playsFor FC_Barcelona . }
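
A sketch of what subject-based hash partitioning amounts to (simplified; the constant and helper below are made up for illustration): every triple about a given subject lands on the same machine, so all four ?player patterns above can be matched locally.

# Minimal sketch: hash partitioning triples by their first vertex (subject).
NUM_MACHINES = 20   # matches the 20-machine cluster used later in the talk

def machine_for(triple):
    subject = triple[0]
    # A real system would use a stable hash; Python's hash() is per-process.
    return hash(subject) % NUM_MACHINES

# All four triple patterns in the query above share the subject ?player,
# so every binding of ?player can be matched entirely on one machine.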
The problem with hash partitioning
●  …

Find football players playing for clubs in a populous region where they were born.
Graph Partitioning
●  Data close together in the graph should be physically close together (preferably on the same machine)
●  Subgraph pattern matching can be done without requiring huge amounts of communication via joins across the cluster
Graph Partitioning

[Figure: the example graph partitioned across Machine 1, Machine 2, and Machine 3]
Edge/Triple Placement
●   Minimizing data exchange
    ●   Allowing data overlap
●   N-hop guarantee
    ●   The extent of data overlap
    ●   If a vertex is assigned to a machine, any vertex that is within n hops of this vertex is also stored on this machine (see the sketch below)
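
A minimal sketch of that guarantee (assuming an undirected adjacency map and a “core” vertex set already assigned to the machine by the partitioner):

# Minimal sketch: the vertex set a machine stores under an n-hop guarantee.
def n_hop_closure(core, neighbors, n):
    # core: vertices the partitioner assigned to this machine
    # neighbors: dict mapping each vertex to its adjacent vertices
    stored = set(core)
    frontier = set(core)
    for _ in range(n):
        frontier = {v for u in frontier for v in neighbors.get(u, ())} - stored
        stored |= frontier
    return stored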
0, 1, and 2 Hops of Machine 1; 0, 1, and 2 Hops of Machine 3

[Figures: the partitioned example graph across Machine 1, Machine 2, and Machine 3, showing how the set of triples stored on Machine 1, and then on Machine 3, grows as the hop guarantee increases from 0 to 2]
High Degree Vertexes
●   Problem: High-degree vertexes make the graph well-connected and difficult to partition
●   Solution: Ignore them during graph partitioning

●   Problem: High-degree vertexes cause data explosion with an n-hop guarantee
●   Solution: Selectively weaken the n-hop guarantee (see the sketch below)
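
One plausible reading of “selectively weaken” (an editor-added sketch with an assumed degree threshold, not the talk’s exact rule): stop expanding through vertices whose degree exceeds a cutoff, so hubs do not drag the whole graph onto every machine.

# Minimal sketch: n-hop expansion that does not expand *through*
# high-degree (hub) vertices, selectively weakening the guarantee.
def n_hop_closure_capped(core, neighbors, degree, n, max_degree=1000):
    stored = set(core)
    frontier = set(core)
    for _ in range(n):
        nxt = set()
        for u in frontier:
            if degree.get(u, 0) > max_degree:   # assumed cutoff: skip hubs
                continue
            nxt.update(neighbors.get(u, ()))
        frontier = nxt - stored
        stored |= frontier
    return stored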
Query Processing
●   Query execution is more efficient if pushed to optimized storage (RDF-stores)
    ●   Minimizing the number of Hadoop jobs
    ●   The larger the hop guarantee, the more work is done in RDF-stores
To Exchange, or not to Exchange?
●   Given a query and an n-hop guarantee, is data exchange (a Hadoop job) between nodes needed?
    ●   Choose the “center” of the query graph
    ●   Calculate the distance from the “center” to the furthest edge
    ●   If distance > n, data exchange is needed; otherwise it is not needed (see the sketch below)
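
A sketch of that decision rule (assuming the query pattern is small and given as an adjacency dict over its variables and constants; “furthest edge” is approximated here by the furthest pattern vertex):

# Minimal sketch: decide whether a data-exchange (Hadoop) job is needed
# for a query pattern under an n-hop guarantee.
def bfs_distances(adj, start):
    dist, frontier = {start: 0}, [start]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    nxt.append(v)
        frontier = nxt
    return dist

def needs_data_exchange(query_adj, n):
    # Radius of the query graph: the "center" is the vertex whose furthest
    # query vertex is closest; exchange is needed only if that distance > n.
    radius = min(max(bfs_distances(query_adj, v).values()) for v in query_adj)
    return radius > n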
Data Exchange

Find football players playing for clubs in a populous region where they were born.
Experimental Setup
●   20-machine cluster
●   Lehigh University Benchmark (LUBM): 270 million triples
●   Things to compare:
    ●   Single-node RDF-3X
    ●   SHARD
    ●   Graph partitioning (the proposed system)
    ●   Hash partitioning on subjects
Performance Comparison
Speedup
●   Better than linear speedup
Analysis
●  The difference between the fastest implementation and the slowest implementation was a factor of 1340!
   ●  Using Hadoop does not mean that performance is fixed
●  More improvements are possible
   ●  The experiments used MapReduce whenever data communication was necessary
      ●  NextGen Hadoop allows other programming paradigms besides MR
         ●  MPI is a good candidate
   ●  Still need to fix the data pipelining problem
●  The factor of 1340 was possible by focusing on storage --- similar in theme to Hadapt
How this fits with Hadapt

[Architecture diagram: a Flexible Query Interface (Full SQL Support, MR, JDBC) on top of Split Query Execution (patent pending), running over Hadoop and the Hadapt Storage Engine (Relational + HDFS)]

●  Full SQL interface, MapReduce, and JDBC connector
●  10x-50x faster than Hadoop and Hive
   ●  Queries go from hours to minutes, and minutes to seconds
●  Analytics across structured and unstructured data in one platform
●  3.5 patents pending
●  $9.5M Series A financing, led by Norwest Venture Partners and Bessemer Venture Partners
Optimized Storage Matters
●  HDFS appropriate for unstructured data
●  Relational storage appropriate for relational data
●  Graph storage appropriate for graph data
●  Hadapt allows for pluggable storage inside Hadoop (amongst other things)

●  Bottom line: Hadoop can be used for scalable graph processing, but it might need some Hadapting ;)
