Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.
Fishing Graphs in a Hadoop Data
Lake
Max Neunhöffer
Munich, 6 April 2017
www.arangodb.com
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
Social networks (edges are ...
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
Social networks (edges are ...
What is a graph?
E
A
C
D
F
B
pq
-1
-0.8
-0.6
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
0 2 4 6 8 10
sin(x)
Social networks (edges are ...
Usual approach: data in HDFS, use Spark/GraphFrames
v = spark.read.option("header",true).csv("hdfs://...")
e = spark.read....
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Wa...
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Wa...
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Wa...
Limitations/missed opportunities
Ad hoc queries
Often, one would like to perform smallish ad hoc queries on graph data.
Wa...
Graph Databases
Graph Databases
Can store and persist graphs.
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their...
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their...
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their...
Graph Databases
Graph Databases
Can store and persist graphs. However, the crucial ingredient of a graph
database is their...
Graph Traversals
A
B
C
D
J
E
H
F
G
A
Graph Traversals
A
B
C
D
J
E
H
F
G
AB
Graph Traversals
A
B
C
D
J
E
H
F
G
ABC
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCE
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCED
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJ
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJ
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJF
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJFG
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJFG
Graph Traversals
A
B
C
D
J
E
H
F
G
ABCEDJFGH
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and i...
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and i...
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and i...
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and i...
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and i...
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with trans...
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with trans...
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with trans...
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with trans...
Powerful query language
AQL
The built in Arango Query Language allows
complex, powerful and convenient queries,
with trans...
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed a...
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed a...
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed a...
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed a...
is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed a...
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deplo...
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deplo...
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deplo...
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deplo...
Import data into ArangoDB
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv
hdfs dfs -get hdfs://name-1-node.hd...
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 r...
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 r...
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal ...
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal ...
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store...
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store...
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store...
Links
http://hadoop.apache.org/
http://spark.apache.org/
https://graphframes.github.io/
https://www.arangodb.com
https://g...
Prochain SlideShare
Chargement dans…5
×

Fishing Graphs in a Hadoop Data Lake

597 vues

Publié le

Hadoop clusters can store nearly everything in a cheap and blazingly fast way to your data lake. Answering questions and gaining insights out of this ever growing stream becomes the decisive part for many businesses. Increasingly data has a natural structure as a graph, with vertices linked by edges, and many questions arising about the data involve graph traversals or other complex queries, for which one does not have an a priori given bound on the length of paths.

Spark with GraphX is great for answering relatively simple graph questions which are worth starting a Spark job for, because they essentially involve the whole graph. But does it make sense to start one for every ad-hoc query or is it suitable for complex real-time queries?

In this talk I will introduce an alternative solution that adds those features to an existing Hadoop/Spark setup and enables real-time insights. I will address the following topics:

* Challenges in gaining deeper insights from large amounts of graph data
* Benefits and limitations of graph analysis with Spark
* Introduction to ArangoDB SmartGraphs
* Deployment of Hadoop, Spark and ArangoDB using DC/OS
* Performing complex queries on billions of nodes and vertices leveraging ArangoDB SmartGraphs (Live Demo)

Publié dans : Technologie
  • Soyez le premier à commenter

Fishing Graphs in a Hadoop Data Lake

  1. 1. Fishing Graphs in a Hadoop Data Lake Max Neunhöffer Munich, 6 April 2017 www.arangodb.com
  2. 2. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x)
  3. 3. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x) Social networks (edges are friendship) Dependency chains Computer networks Citations Hierarchies
  4. 4. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x) Social networks (edges are friendship) Dependency chains Computer networks Citations Hierarchies Indeed any relation
  5. 5. What is a graph? E A C D F B pq -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 0 2 4 6 8 10 sin(x) Social networks (edges are friendship) Dependency chains Computer networks Citations Hierarchies Indeed any relation Sometimes directed, sometimes undirected.
  6. 6. Usual approach: data in HDFS, use Spark/GraphFrames v = spark.read.option("header",true).csv("hdfs://...") e = spark.read.option("header",true).csv("hdfs://...") g = GraphFrame(v,e) g.inDegrees.show() g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000) g.vertices.groupBy("GYEAR").count().sort("GYEAR").show() g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count() results = g.pageRank(resetProbability=0.01, maxIter=3)
  7. 7. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data.
  8. 8. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds.
  9. 9. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds. Usually, we would like to run many of them.
  10. 10. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds. Usually, we would like to run many of them. Examples: friends of friends of one person find all immediate dependencies of one item find all direct and indirect citations of one article find all descendants of one member of a hierarchy
  11. 11. Limitations/missed opportunities Ad hoc queries Often, one would like to perform smallish ad hoc queries on graph data. Want to bring down latency from minutes to seconds or from seconds to milliseconds. Usually, we would like to run many of them. Examples: friends of friends of one person find all immediate dependencies of one item find all direct and indirect citations of one article find all descendants of one member of a hierarchy IDEA: Use a Graph Database
  12. 12. Graph Databases Graph Databases Can store and persist graphs.
  13. 13. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries.
  14. 14. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries. Graph queries: Find paths in graphs according to a pattern. Find everything reachable from a vertex. Find shortest paths between two given vertices.
  15. 15. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries. Graph queries: Find paths in graphs according to a pattern. Find everything reachable from a vertex. Find shortest paths between two given vertices. =⇒ Graph Traversals
  16. 16. Graph Databases Graph Databases Can store and persist graphs. However, the crucial ingredient of a graph database is their ability to do graph queries. Graph queries: Find paths in graphs according to a pattern. Find everything reachable from a vertex. Find shortest paths between two given vertices. =⇒ Graph Traversals Crucial: Number of steps a priori unknown!
  17. 17. Graph Traversals A B C D J E H F G A
  18. 18. Graph Traversals A B C D J E H F G AB
  19. 19. Graph Traversals A B C D J E H F G ABC
  20. 20. Graph Traversals A B C D J E H F G ABCE
  21. 21. Graph Traversals A B C D J E H F G ABCED
  22. 22. Graph Traversals A B C D J E H F G ABCEDJ
  23. 23. Graph Traversals A B C D J E H F G ABCEDJ
  24. 24. Graph Traversals A B C D J E H F G ABCEDJF
  25. 25. Graph Traversals A B C D J E H F G ABCEDJFG
  26. 26. Graph Traversals A B C D J E H F G ABCEDJFG
  27. 27. Graph Traversals A B C D J E H F G ABCEDJFGH
  28. 28. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store,
  29. 29. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models.
  30. 30. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models. Important: Is able to compete with specialised products on their turf.
  31. 31. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models. Important: Is able to compete with specialised products on their turf. Allows for polyglot persistence using a single database technology.
  32. 32. The Multi-Model Approach Multi-model database A multi-model database combines a document store with a graph database and is at the same time a key/value store, with a common query language for all three data models. Important: Is able to compete with specialised products on their turf. Allows for polyglot persistence using a single database technology. In a microservice architecture, there will be several different deployments.
  33. 33. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries,
  34. 34. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics,
  35. 35. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins,
  36. 36. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins, and to do graph queries,
  37. 37. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins, and to do graph queries, AQL is independent of the driver used and
  38. 38. Powerful query language AQL The built in Arango Query Language allows complex, powerful and convenient queries, with transaction semantics, allowing to do joins, and to do graph queries, AQL is independent of the driver used and offers protection against injections by design.
  39. 39. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems.
  40. 40. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone.
  41. 41. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic.
  42. 42. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic. This leads to significantly better resource utilization.
  43. 43. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic. This leads to significantly better resource utilization. Fault tolerance, self-healing and automatic failover is guaranteed.
  44. 44. is a Data Center Operating System App These days, computing clusters run Data Center Operating Systems. Idea Distributed applications can be deployed as easily as one installs a mobile app on a phone. Cluster resource management is automatic. This leads to significantly better resource utilization. Fault tolerance, self-healing and automatic failover is guaranteed. runs on Apache Mesos and Mesosphere DC/OS clusters.
  45. 45. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery
  46. 46. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery Allows to plug things together!
  47. 47. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery Allows to plug things together! Consequence: We can easily deploy multiple systems alongside each other.
  48. 48. Back to topic: DC/OS as infrastructure DC/OS is the perfect environment for our needs DC/OS manages for us: Software deployment Resource management (increased utilization) Service discovery Allows to plug things together! Consequence: We can easily deploy multiple systems alongside each other. Example: HDFS, Spark and ArangoDB
  49. 49. Import data into ArangoDB hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv dcos package install arangodb3 arangosh --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos var g = require("@arangodb/general-graph"); var G = g._create("G",[g._relation("citations",["patents"],["patents"])]); arangoimp --collection patents --file patents.csv --type csv --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos arangoimp --collection citations --file citations.csv --type csv --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
  50. 50. Run a graph traversal This query finds patents cited by patents/6009503 (depth ≤ 3) recursively: Recursive traversal, 500 results, 317 ms FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G" RETURN v
  51. 51. Run a graph traversal This query finds patents cited by patents/6009503 (depth ≤ 3) recursively: Recursive traversal, 500 results, 317 ms FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G" RETURN v This one finds all patents that cite any of those cited by patents/6009503: One step forward and one back, 35 results, 59 ms FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G" FOR w IN 1..1 INBOUND v._id GRAPH "G" FILTER w._id != v._id RETURN w
  52. 52. Run a graph traversal This query finds all patents that cite patents/3541687 directly or in two steps: Recursive traversal backwards, 22 results, 15 ms FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G" RETURN v._key
  53. 53. Run a graph traversal This query finds all patents that cite patents/3541687 directly or in two steps: Recursive traversal backwards, 22 results, 15 ms FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G" RETURN v._key This one counts all patents that cite patents/3541687 recursively: Deep recursion backwards, count 398, 311 ms FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G" COLLECT WITH COUNT INTO c RETURN c
  54. 54. Yet another approach If your graph data changes rapidly in a transactional fashion...
  55. 55. Yet another approach If your graph data changes rapidly in a transactional fashion... Graph database as primary data store You can turn things around: Keep and maintain the graph data in a graph database.
  56. 56. Yet another approach If your graph data changes rapidly in a transactional fashion... Graph database as primary data store You can turn things around: Keep and maintain the graph data in a graph database. Regularly dump to HDFS and run larger analysis jobs there.
  57. 57. Yet another approach If your graph data changes rapidly in a transactional fashion... Graph database as primary data store You can turn things around: Keep and maintain the graph data in a graph database. Regularly dump to HDFS and run larger analysis jobs there. Or: Use ArangoDB’s Spark Connector: https://github.com/arangodb/arangodb-spark-connector
  58. 58. Links http://hadoop.apache.org/ http://spark.apache.org/ https://graphframes.github.io/ https://www.arangodb.com https://github.com/arangodb/arangodb-spark-connector https://docs.arangodb.com/cookbook/index.html http://mesos.apache.org/ https://mesosphere.com/ https://github.com/dcos/demos

×