SlideShare une entreprise Scribd logo
1  sur  33
Télécharger pour lire hors ligne
UC 
BERKELEY 
GraphX 
Graph Analytics in Spark 
Ankur Dave 
Graduate Student, UC Berkeley AMPLab 
Joint work with Joseph Gonzalez, Reynold Xin, Daniel 
Crankshaw, Michael Franklin, and Ion Stoica
Graphs are Everywhere
Social Networks
Web Graphs
Web Graphs 
“The political blogosphere and the 2004 U.S. election: divided they blog.” 
Adamic and Glance, 2005.
User-Item Graphs 
⋆⋆⋆⋆⋆ 
⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆⋆
Graph Algorithms
PageRank
Triangle Counting
Collaborative Filtering 
⋆⋆⋆⋆⋆ 
⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆ 
⋆⋆⋆⋆⋆ 
⚙ 
♫ 
⚙ 
♫
The Graph-Parallel Pattern
The Graph-Parallel Pattern
The Graph-Parallel Pattern
Many Graph-Parallel Algorithms 
Collaborative Filtering 
» Alternating Least Squares 
» Stochastic Gradient Descent 
» Tensor Factorization 
Structured Prediction 
» Loopy Belief Propagation 
» Max-Product Linear 
Programs 
» Gibbs Sampling 
Semi-supervised ML 
» Graph SSL 
» CoEM 
Community Detection 
» Triangle-Counting 
» K-core Decomposition 
» K-Truss 
Graph Analytics 
» PageRank 
» Personalized PageRank 
» Shortest Path 
» Graph Coloring 
Classification 
» Neural Networks
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Editor 
Table 
Editor Title 
Top Communities 
Com. PR.. 
Modern Analytics
Tables 
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Top Communities 
Com. PR.. 
Editor 
Table 
Editor Title
Editor 
Table 
Editor Title 
Raw 
Wikipedia 
 / /   /  
XML 
Hyperlinks PageRank Top 20 Pages 
Title PR 
Link Table 
Title Link 
Editor Graph 
Community 
Detection 
User 
Community 
User Com. 
Top Communities 
Com. PR.. 
Graphs
The GraphX API
Property Graphs 
Vertex Property: 
• User Profile 
• Current PageRank Value 
Edge Property: 
• Weights 
• Relationships 
• Timestamps
type 
Creating a Graph (Scala) 
VertexId 
= 
Long 
Graph 
val 
vertices: 
RDD[(VertexId, 
String)] 
= 
sc.parallelize(List( 
(1L, 
“Alice”), 
(2L, 
“Bob”), 
(3L, 
“Charlie”))) 
class 
Edge[ED]( 
val 
srcId: 
VertexId, 
val 
dstId: 
VertexId, 
val 
attr: 
ED) 
val 
edges: 
RDD[Edge[String]] 
= 
sc.parallelize(List( 
Edge(1L, 
2L, 
“coworker”), 
Edge(2L, 
3L, 
“friend”))) 
val 
graph 
= 
Graph(vertices, 
edges) 
1 
2 
3 
Alice 
Bob 
Charlie 
coworker 
friend
Graph Operations (Scala) 
class 
Graph[VD, 
ED] 
{ 
// 
Table 
Views 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
vertices: 
RDD[(VertexId, 
VD)] 
def 
edges: 
RDD[Edge[ED]] 
def 
triplets: 
RDD[EdgeTriplet[VD, 
ED]] 
// 
Transformations 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
mapVertices[VD2](f: 
(VertexId, 
VD) 
= 
VD2): 
Graph[VD2, 
ED] 
def 
mapEdges[ED2](f: 
Edge[ED] 
= 
ED2): 
Graph[VD2, 
ED] 
def 
reverse: 
Graph[VD, 
ED] 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
// 
Joins 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
outerJoinVertices[U, 
VD2] 
(tbl: 
RDD[(VertexId, 
U)]) 
(f: 
(VertexId, 
VD, 
Option[U]) 
= 
VD2): 
Graph[VD2, 
ED] 
// 
Computation 
-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ 
def 
mapReduceTriplets[A]( 
sendMsg: 
EdgeTriplet[VD, 
ED] 
= 
Iterator[(VertexId, 
A)], 
mergeMsg: 
(A, 
A) 
= 
A): 
RDD[(VertexId, 
A)]
// 
Continued 
from 
previous 
slide 
def 
pageRank(tol: 
Double): 
Graph[Double, 
Double] 
def 
triangleCount(): 
Graph[Int, 
ED] 
def 
connectedComponents(): 
Graph[VertexId, 
ED] 
// 
...and 
more: 
org.apache.spark.graphx.lib 
} 
Built-in Algorithms (Scala) 
PageRank Triangle Count Connected 
Components
The triplets view 
} 
class 
EdgeTriplet[VD, 
ED]( 
val 
srcId: 
VertexId, 
val 
dstId: 
VertexId, 
val 
attr: 
ED, 
val 
srcAttr: 
VD, 
val 
dstAttr: 
VD) 
RDD 
class 
Graph[VD, 
ED] 
{ 
def 
triplets: 
RDD[EdgeTriplet[VD, 
ED]] 
Graph 
1 
2 
3 
Alice 
Bob 
Charlie 
coworker 
friend 
srcAttr dstAttr attr 
Alice coworker Bob 
Bob friend Charlie 
triplets
The subgraph transformation 
class 
Graph[VD, 
ED] 
{ 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
} 
graph.subgraph(epred 
= 
(edge) 
= 
edge.attr 
!= 
“relative”) 
subgraph 
Graph 
Alice Bob 
relative 
Charlie 
coworker 
friend 
relative David 
Graph 
Alice Bob 
Charlie 
coworker 
friend 
David
The subgraph transformation 
class 
Graph[VD, 
ED] 
{ 
def 
subgraph(epred: 
EdgeTriplet[VD, 
ED] 
= 
Boolean, 
vpred: 
(VertexId, 
VD) 
= 
Boolean): 
Graph[VD, 
ED] 
} 
graph.subgraph(vpred 
= 
(id, 
name) 
= 
name 
!= 
“Bob”) 
subgraph 
Graph 
Alice Bob 
relative 
Charlie 
coworker 
friend 
relative David 
Graph 
Alice 
relative 
Charlie 
relative David
Computation with mapReduceTriplets 
class 
Graph[VD, 
ED] 
{ 
def 
mapReduceTriplets[A]( 
upgrade to aggregateMessages 
sendMsg: 
EdgeTriplet[VD, 
ED] 
= 
Iterator[(VertexId, 
A)], 
mergeMsg: 
(A, 
A) 
= 
A): 
RDD[(VertexId, 
A)] 
} 
graph.mapReduceTriplets( 
edge 
= 
Iterator( 
(edge.srcId, 
1), 
(edge.dstId, 
1)), 
_ 
+ 
_) 
Graph 
Alice Bob 
relative 
Charlie 
friend 
coworker 
relative David 
mapReduceTriplets 
RDD 
vertex id degree 
Alice 2 
Bob 2 
Charlie 3 
David 1 
in Spark 1.2.0
How GraphX Works
Encoding Property Graphs as RDDs 
Part. 1 
A D 
Part. 2 
Vertex 
Table 
(RDD) 
C 
A D 
F 
E 
D 
Property Graph 
B A 
Machine 1 Machine 2 
Edge Table 
(RDD) 
A B 
A C 
B C 
C D 
A E 
A F 
E D 
E F 
A 
B 
C 
D 
E 
F 
Routing 
Table 
(RDD) 
A 
B 
C 
D 
E 
F 
1 
2 
1 
1 
1 
2 
2 
2 
Vertex Cut
Graph System Optimizations 
Specialized 
Vertex-Cuts 
Remote 
Data-Structures 
Partitioning 
Caching / Mirroring 
Message Combiners Active Set Tracking
3500 
3000 
EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE 
Seconds) 
2500 
2000 
(Runtime 1500 
1000 
500 
0 
state-of-the-art graph processing systems. PageRank Benchmark 
Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) 
9000 
8000 
7000 
6000 
5000 
7x 18x 
4000 
3000 
2000 
1000 
0 
GraphX performs comparably to
Future of GraphX 
1. Language support 
a) Java API: PR #3234 
b) Python API: collaborating with Intel, SPARK-3789 
2. More algorithms 
a) LDA (topic modeling): PR #2388 
b) Correlation clustering 
c) Your algorithm here? 
3. Speculative 
a) Streaming/time-varying graphs 
b) Graph database–like queries
Raw 
Wikipedia 
 / /   /  
Try GraphX Today 
XML Hyperlinks 
PageRank 
Top 20 Pages 
Title PR 
Text 
Table 
Title Body 
Berkeley 
subgraph
Thanks! 
http://spark.apache.org/graphx 
ankurd@eecs.berkeley.edu 
jegonzal@eecs.berkeley.edu 
rxin@eecs.berkeley.edu 
crankshaw@eecs.berkeley.edu

Contenu connexe

Tendances

Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks DataWorks Summit/Hadoop Summit
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQLjeykottalam
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in SparkPaco Nathan
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengDatabricks
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingPetr Zapletal
 
Graph computation
Graph computationGraph computation
Graph computationSigmoid
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Spark Summit
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer spaceGraphAware
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and GiraphDoug Needham
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™Databricks
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Spark Summit
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...MLconf
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big dataSigmoid
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Spark Summit
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiDatabricks
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Introjeykottalam
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Databricks
 

Tendances (20)

Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
Exploring Titan and Spark GraphX for Analyzing Time-Varying Electrical Networks
 
Intro to Spark and Spark SQL
Intro to Spark and Spark SQLIntro to Spark and Spark SQL
Intro to Spark and Spark SQL
 
Graph Analytics in Spark
Graph Analytics in SparkGraph Analytics in Spark
Graph Analytics in Spark
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Spark graphx
Spark graphxSpark graphx
Spark graphx
 
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui MengChallenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
Challenging Web-Scale Graph Analytics with Apache Spark with Xiangrui Meng
 
Spark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, StreamingSpark Concepts - Spark SQL, Graphx, Streaming
Spark Concepts - Spark SQL, Graphx, Streaming
 
Graph computation
Graph computationGraph computation
Graph computation
 
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
Time-evolving Graph Processing on Commodity Clusters: Spark Summit East talk ...
 
Signals from outer space
Signals from outer spaceSignals from outer space
Signals from outer space
 
Gephi, Graphx, and Giraph
Gephi, Graphx, and GiraphGephi, Graphx, and Giraph
Gephi, Graphx, and Giraph
 
GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™GraphFrames: DataFrame-based graphs for Apache® Spark™
GraphFrames: DataFrame-based graphs for Apache® Spark™
 
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
Sketching Data with T-Digest In Apache Spark: Spark Summit East talk by Erik ...
 
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
Ehtsham Elahi, Senior Research Engineer, Personalization Science and Engineer...
 
Graph Analytics for big data
Graph Analytics for big dataGraph Analytics for big data
Graph Analytics for big data
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
Interactive Graph Analytics with Spark-(Daniel Darabos, Lynx Analytics)
 
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin HuaiA Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
A Deep Dive into Spark SQL's Catalyst Optimizer with Yin Huai
 
AMP Camp 5 Intro
AMP Camp 5 IntroAMP Camp 5 Intro
AMP Camp 5 Intro
 
Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™Web-Scale Graph Analytics with Apache® Spark™
Web-Scale Graph Analytics with Apache® Spark™
 

Similaire à GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...bhargavi804095
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide trainingSpark Summit
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingNesreen K. Ahmed
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Spark Summit
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsFlink Forward
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkDB Tsai
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Databricks
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with GoJames Tan
 
BUILDING WHILE FLYING
BUILDING WHILE FLYINGBUILDING WHILE FLYING
BUILDING WHILE FLYINGKamal Shannak
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingDatabricks
 
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19MoscowJS
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleDataWorks Summit
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADtab0ris_1
 
OrientDB - The 2nd generation of (multi-model) NoSQL
OrientDB - The 2nd generation of  (multi-model) NoSQLOrientDB - The 2nd generation of  (multi-model) NoSQL
OrientDB - The 2nd generation of (multi-model) NoSQLRoberto Franchini
 

Similaire à GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20) (20)

Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...Graphs in data structures are non-linear data structures made up of a finite ...
Graphs in data structures are non-linear data structures made up of a finite ...
 
Transformations and actions a visual guide training
Transformations and actions a visual guide trainingTransformations and actions a visual guide training
Transformations and actions a visual guide training
 
High-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and ModelingHigh-Performance Graph Analysis and Modeling
High-Performance Graph Analysis and Modeling
 
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
Advanced Data Science with Apache Spark-(Reza Zadeh, Stanford)
 
DEX: Seminar Tutorial
DEX: Seminar TutorialDEX: Seminar Tutorial
DEX: Seminar Tutorial
 
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the CloudsGreg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
Greg Hogan – To Petascale and Beyond- Apache Flink in the Clouds
 
Large-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache SparkLarge-Scale Machine Learning with Apache Spark
Large-Scale Machine Learning with Apache Spark
 
3DRepo
3DRepo3DRepo
3DRepo
 
SVGD3Angular2React
SVGD3Angular2ReactSVGD3Angular2React
SVGD3Angular2React
 
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
Neo4j Morpheus: Interweaving Table and Graph Data with SQL and Cypher in Apac...
 
GraphQL & DGraph with Go
GraphQL & DGraph with GoGraphQL & DGraph with Go
GraphQL & DGraph with Go
 
BUILDING WHILE FLYING
BUILDING WHILE FLYINGBUILDING WHILE FLYING
BUILDING WHILE FLYING
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Structuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and StreamingStructuring Spark: DataFrames, Datasets, and Streaming
Structuring Spark: DataFrames, Datasets, and Streaming
 
Apache spark core
Apache spark coreApache spark core
Apache spark core
 
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
"Визуализация данных с помощью d3.js", Михаил Дунаев, MoscowJS 19
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
Scalding big ADta
Scalding big ADtaScalding big ADta
Scalding big ADta
 
OrientDB - The 2nd generation of (multi-model) NoSQL
OrientDB - The 2nd generation of  (multi-model) NoSQLOrientDB - The 2nd generation of  (multi-model) NoSQL
OrientDB - The 2nd generation of (multi-model) NoSQL
 
The D3 Toolbox
The D3 ToolboxThe D3 Toolbox
The D3 Toolbox
 

Dernier

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 

Dernier (20)

Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 

GraphX: Graph Analytics in Apache Spark (AMPCamp 5, 2014-11-20)

  • 1. UC BERKELEY GraphX Graph Analytics in Spark Ankur Dave Graduate Student, UC Berkeley AMPLab Joint work with Joseph Gonzalez, Reynold Xin, Daniel Crankshaw, Michael Franklin, and Ion Stoica
  • 5. Web Graphs “The political blogosphere and the 2004 U.S. election: divided they blog.” Adamic and Glance, 2005.
  • 6. User-Item Graphs ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆
  • 10. Collaborative Filtering ⋆⋆⋆⋆⋆ ⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆ ⋆⋆⋆⋆⋆ ⚙ ♫ ⚙ ♫
  • 14. Many Graph-Parallel Algorithms Collaborative Filtering » Alternating Least Squares » Stochastic Gradient Descent » Tensor Factorization Structured Prediction » Loopy Belief Propagation » Max-Product Linear Programs » Gibbs Sampling Semi-supervised ML » Graph SSL » CoEM Community Detection » Triangle-Counting » K-core Decomposition » K-Truss Graph Analytics » PageRank » Personalized PageRank » Shortest Path » Graph Coloring Classification » Neural Networks
  • 15. Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Editor Table Editor Title Top Communities Com. PR.. Modern Analytics
  • 16. Tables Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Editor Table Editor Title
  • 17. Editor Table Editor Title Raw Wikipedia / / / XML Hyperlinks PageRank Top 20 Pages Title PR Link Table Title Link Editor Graph Community Detection User Community User Com. Top Communities Com. PR.. Graphs
  • 19. Property Graphs Vertex Property: • User Profile • Current PageRank Value Edge Property: • Weights • Relationships • Timestamps
  • 20. type Creating a Graph (Scala) VertexId = Long Graph val vertices: RDD[(VertexId, String)] = sc.parallelize(List( (1L, “Alice”), (2L, “Bob”), (3L, “Charlie”))) class Edge[ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED) val edges: RDD[Edge[String]] = sc.parallelize(List( Edge(1L, 2L, “coworker”), Edge(2L, 3L, “friend”))) val graph = Graph(vertices, edges) 1 2 3 Alice Bob Charlie coworker friend
  • 21. Graph Operations (Scala) class Graph[VD, ED] { // Table Views -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def vertices: RDD[(VertexId, VD)] def edges: RDD[Edge[ED]] def triplets: RDD[EdgeTriplet[VD, ED]] // Transformations -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def mapVertices[VD2](f: (VertexId, VD) = VD2): Graph[VD2, ED] def mapEdges[ED2](f: Edge[ED] = ED2): Graph[VD2, ED] def reverse: Graph[VD, ED] def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] // Joins -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def outerJoinVertices[U, VD2] (tbl: RDD[(VertexId, U)]) (f: (VertexId, VD, Option[U]) = VD2): Graph[VD2, ED] // Computation -­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐-­‐ def mapReduceTriplets[A]( sendMsg: EdgeTriplet[VD, ED] = Iterator[(VertexId, A)], mergeMsg: (A, A) = A): RDD[(VertexId, A)]
  • 22. // Continued from previous slide def pageRank(tol: Double): Graph[Double, Double] def triangleCount(): Graph[Int, ED] def connectedComponents(): Graph[VertexId, ED] // ...and more: org.apache.spark.graphx.lib } Built-in Algorithms (Scala) PageRank Triangle Count Connected Components
  • 23. The triplets view } class EdgeTriplet[VD, ED]( val srcId: VertexId, val dstId: VertexId, val attr: ED, val srcAttr: VD, val dstAttr: VD) RDD class Graph[VD, ED] { def triplets: RDD[EdgeTriplet[VD, ED]] Graph 1 2 3 Alice Bob Charlie coworker friend srcAttr dstAttr attr Alice coworker Bob Bob friend Charlie triplets
  • 24. The subgraph transformation class Graph[VD, ED] { def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] } graph.subgraph(epred = (edge) = edge.attr != “relative”) subgraph Graph Alice Bob relative Charlie coworker friend relative David Graph Alice Bob Charlie coworker friend David
  • 25. The subgraph transformation class Graph[VD, ED] { def subgraph(epred: EdgeTriplet[VD, ED] = Boolean, vpred: (VertexId, VD) = Boolean): Graph[VD, ED] } graph.subgraph(vpred = (id, name) = name != “Bob”) subgraph Graph Alice Bob relative Charlie coworker friend relative David Graph Alice relative Charlie relative David
  • 26. Computation with mapReduceTriplets class Graph[VD, ED] { def mapReduceTriplets[A]( upgrade to aggregateMessages sendMsg: EdgeTriplet[VD, ED] = Iterator[(VertexId, A)], mergeMsg: (A, A) = A): RDD[(VertexId, A)] } graph.mapReduceTriplets( edge = Iterator( (edge.srcId, 1), (edge.dstId, 1)), _ + _) Graph Alice Bob relative Charlie friend coworker relative David mapReduceTriplets RDD vertex id degree Alice 2 Bob 2 Charlie 3 David 1 in Spark 1.2.0
  • 28. Encoding Property Graphs as RDDs Part. 1 A D Part. 2 Vertex Table (RDD) C A D F E D Property Graph B A Machine 1 Machine 2 Edge Table (RDD) A B A C B C C D A E A F E D E F A B C D E F Routing Table (RDD) A B C D E F 1 2 1 1 1 2 2 2 Vertex Cut
  • 29. Graph System Optimizations Specialized Vertex-Cuts Remote Data-Structures Partitioning Caching / Mirroring Message Combiners Active Set Tracking
  • 30. 3500 3000 EC2 Cluster of 16 x m2.4xLarge (8 cores) + 1GigE Seconds) 2500 2000 (Runtime 1500 1000 500 0 state-of-the-art graph processing systems. PageRank Benchmark Twitter Graph (42M Vertices,1.5B Edges) UK-Graph (106M Vertices, 3.7B Edges) 9000 8000 7000 6000 5000 7x 18x 4000 3000 2000 1000 0 GraphX performs comparably to
  • 31. Future of GraphX 1. Language support a) Java API: PR #3234 b) Python API: collaborating with Intel, SPARK-3789 2. More algorithms a) LDA (topic modeling): PR #2388 b) Correlation clustering c) Your algorithm here? 3. Speculative a) Streaming/time-varying graphs b) Graph database–like queries
  • 32. Raw Wikipedia / / / Try GraphX Today XML Hyperlinks PageRank Top 20 Pages Title PR Text Table Title Body Berkeley subgraph
  • 33. Thanks! http://spark.apache.org/graphx ankurd@eecs.berkeley.edu jegonzal@eecs.berkeley.edu rxin@eecs.berkeley.edu crankshaw@eecs.berkeley.edu