Fishing Graphs in a Hadoop Data Lake
Max Neunhöffer
Munich, 6 April 2017
www.arangodb.com
What is a graph?
[Figure: a small example graph with vertices A–F, shown next to a plot of sin(x)]
Social networks (edges are friendship)
Dependency chains
Computer networks
Citations
Hierarchies
Indeed any relation
Sometimes directed, sometimes undirected.
Usual approach: data in HDFS, use Spark/GraphFrames
# PySpark with the GraphFrames package installed
from graphframes import GraphFrame

v = spark.read.option("header", True).csv("hdfs://...")   # vertex table (patents)
e = spark.read.option("header", True).csv("hdfs://...")   # edge table (citations)
g = GraphFrame(v, e)

g.inDegrees.show()
g.outDegrees.groupBy("outDegree").count().sort("outDegree").show(1000)
g.vertices.groupBy("GYEAR").count().sort("GYEAR").show()
g.find("(a)-[e]->(b);(b)-[ee]->(c)").filter("a.id = 6009536").count()
results = g.pageRank(resetProbability=0.01, maxIter=3)
Limitations/missed opportunities
Ad hoc queries
Often, one would like to run small ad hoc queries on graph data.
We want to bring latency down from minutes to seconds, or from seconds
to milliseconds. Usually, we would also like to run many such queries.
Examples:
friends of friends of one person
find all immediate dependencies of one item
find all direct and indirect citations of one article
find all descendants of one member of a hierarchy
IDEA: Use a Graph Database
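To make this concrete, a friends-of-friends lookup becomes a fixed two-step traversal in a graph query language. The following AQL sketch is only an illustration and assumes a hypothetical graph "social" with a vertex collection "persons" and undirected friendship edges:

FOR v IN 2..2 ANY "persons/alice" GRAPH "social"
  RETURN DISTINCT v.name

Because such a query starts from a single vertex and follows edges via an index, it touches only a small part of the data, which is what makes sub-second answers plausible.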
Graph Databases
Graph databases can store and persist graphs. However, the crucial ingredient of a graph
database is its ability to do graph queries.
Graph queries:
Find paths in graphs according to a pattern.
Find everything reachable from a vertex.
Find shortest paths between two given vertices.
=⇒ Graph Traversals. Crucial: the number of steps is a priori unknown!
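In AQL, the query language used later in this talk, the depth range of a traversal is explicit and may be open-ended. A minimal sketch, assuming the citation graph "G" over the "patents" collection that is set up later in the talk; the start key is just a placeholder:

FOR v IN 1..10 OUTBOUND "patents/6009503" GRAPH "G"
  RETURN DISTINCT v._key

This returns everything reachable within ten steps; shortest paths use a dedicated SHORTEST_PATH construct, shown after the traversal examples below.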
Graph Traversals
[Figure: step-by-step traversal of an example graph with vertices A, B, C, D, E, F, G, H, J, visited in the order A, B, C, E, D, J, F, G, H]
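Written as a query, the traversal in the figure enumerates paths whose length is not known in advance. A minimal AQL sketch (again assuming the citation graph "G" defined later; v, e and p are the current vertex, edge and path):

FOR v, e, p IN 1..10 OUTBOUND "patents/6009503" GRAPH "G"
  RETURN p.vertices[*]._key   // the vertex keys along one path from the start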
The Multi-Model Approach
Multi-model database
A multi-model database combines a document store with a graph
database and is at the same time a key/value store,
with a common query language for all three data models.
Important:
Is able to compete with specialised products on their turf.
Allows for polyglot persistence using a single database technology.
In a microservice architecture, there will be several different deployments.
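Concretely, one AQL query can mix all three models: a key/value style lookup, a document filter and a graph traversal. A minimal sketch against the patents data set used later in this talk; GYEAR is the grant-year column from the CSV shown in the Spark example, and treating it as a number is an assumption about the import:

LET start = DOCUMENT("patents/6009503")        // key/value lookup by _id
FOR v IN 1..2 OUTBOUND start GRAPH "G"         // graph traversal
  FILTER TO_NUMBER(v.GYEAR) >= 1990            // document-style filter
  RETURN { patent: v._key, year: v.GYEAR }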
Powerful query language
AQL
The built-in Arango Query Language (AQL) allows
complex, powerful and convenient queries,
with transaction semantics,
including joins
and graph queries.
AQL is independent of the driver used and
offers protection against injections by design.
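The injection protection comes from bind parameters: the query text stays fixed, and user-supplied values such as @start and @year are passed separately by whichever driver is used. A minimal sketch (bind values for example {"start": "patents/6009503", "year": 1990}; GYEAR as in the earlier Spark example):

FOR v IN 1..3 OUTBOUND @start GRAPH "G"
  FILTER TO_NUMBER(v.GYEAR) >= @year
  RETURN v._key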
ArangoDB is a Data Center Operating System App
These days, computing clusters run Data Center Operating Systems.
Idea
Distributed applications can be deployed as easily as one installs a mobile
app on a phone.
Cluster resource management is automatic.
This leads to significantly better resource utilization.
Fault tolerance, self-healing and automatic failover are guaranteed.
ArangoDB runs on Apache Mesos and Mesosphere DC/OS clusters.
Back to topic: DC/OS as infrastructure
DC/OS is the perfect environment for our needs
DC/OS manages for us:
Software deployment
Resource management (increased utilization)
Service discovery
Allows us to plug things together!
Consequence: We can easily deploy multiple systems alongside each other.
Example: HDFS, Spark and ArangoDB
Import data into ArangoDB
# Fetch the raw CSV files from HDFS:
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/patents.csv
hdfs dfs -get hdfs://name-1-node.hdfs.mesos:9001/citations.csv

# Deploy ArangoDB as a DC/OS package and connect with the ArangoDB shell:
dcos package install arangodb3
arangosh \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos

// In arangosh: create graph "G" with vertex collection "patents" and edge collection "citations"
var g = require("@arangodb/general-graph");
var G = g._create("G", [g._relation("citations", ["patents"], ["patents"])]);

# Bulk-import both CSV files:
arangoimp --collection patents --file patents.csv --type csv \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
arangoimp --collection citations --file citations.csv --type csv \
  --server.endpoint srv://_arangodb3-coordinator1._tcp.arangodb3.mesos
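Once the import has finished, a quick sanity check can be run from arangosh or the web interface; a minimal sketch using the two collections created above:

RETURN [ LENGTH(patents), LENGTH(citations) ]   // document counts of both collections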
Run a graph traversal
This query finds patents cited by patents/6009503 (depth ≤ 3) recursively:
Recursive traversal, 500 results, 317 ms
FOR v IN 1..3 OUTBOUND "patents/6009503" GRAPH "G"
RETURN v
This one finds all patents that cite any of those cited by patents/6009503:
One step forward and one back, 35 results, 59 ms
FOR v IN 1..1 OUTBOUND "patents/6009503" GRAPH "G"
FOR w IN 1..1 INBOUND v._id GRAPH "G"
FILTER w._id != v._id
RETURN w
Run a graph traversal
This query finds all patents that cite patents/3541687 directly or in two steps:
Recursive traversal backwards, 22 results, 15 ms
FOR v IN 1..2 INBOUND "patents/3541687" GRAPH "G"
RETURN v._key
This one counts all patents that cite patents/3541687 recursively:
Deep recursion backwards, count 398, 311 ms
FOR v IN 1..10 INBOUND "patents/3541687" GRAPH "G"
COLLECT WITH COUNT INTO c
RETURN c
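The same graph also answers shortest-path questions, the third query type mentioned earlier. A minimal sketch using the two patent keys from the slides above purely as placeholders; whether a citation path between these two particular patents exists in the dataset is not claimed here:

FOR v IN OUTBOUND SHORTEST_PATH "patents/6009503" TO "patents/3541687" GRAPH "G"
  RETURN v._key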
Yet another approach
If your graph data changes rapidly in a transactional fashion...
Graph database as primary data store
You can turn things around:
Keep and maintain the graph data in a graph database.
Regularly dump to HDFS and run larger analysis jobs there.
Or: Use ArangoDB’s Spark Connector:
https://github.com/arangodb/arangodb-spark-connector
Links
http://hadoop.apache.org/
http://spark.apache.org/
https://graphframes.github.io/
https://www.arangodb.com
https://github.com/arangodb/arangodb-spark-connector
https://docs.arangodb.com/cookbook/index.html
http://mesos.apache.org/
https://mesosphere.com/
https://github.com/dcos/demos
