Tegra: Time-evolving Graph Processing
on Commodity Clusters
Anand Iyer, Ion Stoica
Presented by Ankur Dave
Graphs are everywhere…
• Social networks
• Gnutella network subgraph
• Metabolic network of a single-cell organism (Tuberculosis)
Plenty of interest in processing them
• Graph DBMS: in use at 25% of all enterprises by 2017¹
• Many open-source and research prototypes on distributed
graph processing frameworks: Giraph, Pregel, GraphLab,
Chaos, GraphX, …
¹ Forrester Research
Real-world Graphs are Dynamic
Many interesting business and research insights
possible by processing them…
…but little work on incrementally updating
computation results or on window-based operations
A Motivating Example…
Retrieve the network state when a disruption happened
Analyze the evolution of hotspot groups in the last day by hour
Track regions of heavy load in the last 10 minutes
What factors explain the performance in the network?
Tegra
How do we perform efficient computations on
time-evolving, dynamically changing graphs?
Goals
• Create and manage time-evolving graphs
  • Retrieve the network state when a disruption happened
• Temporal analytics on windows
  • Analyze the evolution of hotspot groups in the last day by hour
• Sliding window computations
  • Track regions of heavy load in the last 10 minutes
• Mix graph and data parallel computing
  • What factors explain the performance in the network?
Tegra
• Graph Snapshot Index
• Timelapse Abstraction
• Lightweight Incremental Computations
• Pause-Shift-Resume Computations
Goals
• Create and manage time-evolving graphs using Graph Snapshot Index
  • Retrieve the network state when a disruption happened
• Temporal analytics on windows
  • Analyze the evolution of hotspot groups in the last day by hour
• Sliding window computations
  • Track regions of heavy load in the last 10 minutes
• Mix graph and data parallel computing
  • What factors explain the performance in the network?
Representing Evolving Graphs
Snapshots over time: G1 (A, B, C), G2 (A, B, C, D), G3 (A, B, C, D, E)
Updates: δg1 (A, B, C), δg2 (A, D), δg3 (C, D, E)
[Diagram: the updates are shown as a sequence of one remove and four add operations]
Graph Composition
Updating graphs depends on application semantics
G2 (A, B, C, D) + δg (A, C, D, E) = G3’ (A, B, C, D, E)
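The composition on this slide can be sketched in a few lines of Scala. This is only an illustration under assumed semantics (apply additions, then removals, then drop edges whose endpoints are gone). `Graph`, `Delta` and `compose` are hypothetical names, and the edges in the example are invented, since the slide shows only vertex labels.

```scala
object CompositionSketch {
  type Edge = (String, String)

  final case class Graph(vertices: Set[String], edges: Set[Edge])

  // Hypothetical delta: additions and removals of vertices and edges.
  final case class Delta(
      addV: Set[String] = Set.empty,
      addE: Set[Edge] = Set.empty,
      removeV: Set[String] = Set.empty,
      removeE: Set[Edge] = Set.empty)

  // Assumed semantics: additions first, then removals, then drop dangling edges.
  // A different application could pick a different rule here.
  def compose(g: Graph, d: Delta): Graph = {
    val vs = (g.vertices ++ d.addV) -- d.removeV
    val es = ((g.edges ++ d.addE) -- d.removeE).filter { case (s, t) => vs(s) && vs(t) }
    Graph(vs, es)
  }

  // Vertex sets follow the slide; the edges are made up for the example.
  val g2 = Graph(Set("A", "B", "C", "D"), Set("A" -> "B", "A" -> "C", "A" -> "D"))
  val dg = Delta(addV = Set("A", "C", "D", "E"), addE = Set("C" -> "D", "D" -> "E"))
  val g3 = compose(g2, dg)
}
```

Because `compose` is a plain function, swapping in another policy (for example, letting removals win over additions in the same delta) only changes this one definition.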
Maintaining Multiple Snapshots
Store entire snapshots (G1, G2, G3)
+ Efficient retrieval
- Storage overhead
Store only deltas (δg1, δg2, δg3)
+ Efficient storage
- Retrieval overhead
Maintaining Multiple Snapshots
[Diagram: Snapshot 1 (t1) and Snapshot 2 (t2) sharing structure]
Use a structure-sharing persistent data structure.
Persistent Adaptive Radix Tree (PART) is one solution available for Spark.
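To make the idea of structure sharing concrete, here is a minimal sketch that uses Scala's immutable `HashMap` as a stand-in for PART (it is not PART, and not Tegra's index). An immutable `HashMap` is a persistent hash trie: an updated map reuses the unchanged subtrees of the map it was derived from, so keeping every version costs far less than copying each snapshot. `Store`, `commit` and `at` are hypothetical names.

```scala
import scala.collection.immutable.HashMap

object SnapshotStoreSketch {
  type VertexId = Long
  type Snapshot = HashMap[VertexId, String] // vertex id -> property

  // Every commit appends a new version; older versions stay readable.
  final case class Store(snapshots: Vector[Snapshot]) {
    def latest: Snapshot =
      snapshots.lastOption.getOrElse(HashMap.empty[VertexId, String])

    // Derive the next snapshot from the latest one without copying it.
    def commit(update: Snapshot => Snapshot): Store =
      Store(snapshots :+ update(latest))

    // Retrieve a snapshot by its position (a hypothetical snapshot id).
    def at(sid: Int): Snapshot = snapshots(sid)
  }

  val s0 = Store(Vector.empty).commit(_ + (1L -> "A") + (2L -> "B"))
  val s1 = s0.commit(_ + (3L -> "C"))
  val s2 = s1.commit(_ - 2L)
}
```

Retrieval of any snapshot is a direct lookup, as with storing entire snapshots, while storage grows roughly with the size of each change, as with storing only deltas.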
Graph Snapshot Index
Vertex: Snapshot 1 (t1), Snapshot 2 (t2)
Edge: Snapshot 1 (t1), Snapshot 2 (t2)
Partition
Snapshot ID Management
Graph Windowing
Snapshots: G1 (A, B, C), G2 (A, B, C, D), G3 (A, B, C, D, E), G4 (A, C, D, E)
G’1 = G3
G’2 = G4 - G1
Updates: δg1 (A, B, C), δg2 (A, D), δg3 (C, D, E), δg4 (B)
G’1 = δg1 + δg2 + δg3
G’2 = G’1 + δg4 - δg1
Equivalent to GraphReduce() in GraphLINQ, GraphReduceByWindow() in CellIQ
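The update rule G’2 = G’1 + δg4 - δg1 can be sketched as follows. This is a simplification that treats every delta as a set of added edges (on the slide, δg4 appears to remove a vertex, which this sketch does not model). The window keeps a count per edge so that subtracting an old delta does not drop an edge that a newer delta in the window also contains. All names and the example edges are hypothetical.

```scala
object WindowSketch {
  type Edge = (String, String)
  type Delta = Set[Edge]
  // edge -> number of deltas inside the window that contain it
  type Window = Map[Edge, Int]

  def add(w: Window, d: Delta): Window =
    d.foldLeft(w)((acc, e) => acc.updated(e, acc.getOrElse(e, 0) + 1))

  def subtract(w: Window, d: Delta): Window =
    d.foldLeft(w) { (acc, e) =>
      acc.get(e) match {
        case Some(n) if n > 1 => acc.updated(e, n - 1)
        case Some(_)          => acc - e
        case None             => acc
      }
    }

  // Slide the window: add the incoming delta, drop the outgoing one.
  def slide(w: Window, incoming: Delta, outgoing: Delta): Window =
    subtract(add(w, incoming), outgoing)

  def edges(w: Window): Set[Edge] = w.keySet
}
```

Sliding touches only the two deltas involved, instead of rebuilding the window from every delta it contains.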
Goals
• Create and manage time-evolving graphs using Distributed Graph Snapshot Index
  • Retrieve the network state when a disruption happened
• Temporal analytics on windows using Timelapse Abstraction
  • Analyze the evolution of hotspot groups in the last day by hour
• Sliding window computations
  • Track regions of heavy load in the last 10 minutes
• Mix graph and data parallel computing
  • What factors explain the performance in the network?
Graph Parallel Computation
Many approaches
• Vertex centric, edge centric, subgraph centric, …
Timelapse
Operations on windows of snapshots result in redundant
computation
[Diagram: snapshots over time: G1 (A, B, C), G2 (A, B, C, D), G3 (A, B, C, D, E)]
Timelapse
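Instead, expose the temporal evolution of a node
[Diagram: snapshots G1 (A, B, C), G2 (A, B, C, D) and G3 (A, B, C, D, E) over time, next to a single graph in which each vertex carries its state for every snapshot]
• Significant reduction in messages exchanged
between graph nodes
• Avoids redundant computations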
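One way to read the message reduction is sketched below. It compares running each snapshot separately (one message per edge per snapshot) with a batched run in which each distinct edge carries a single message tagged with the snapshots it belongs to. This is an illustrative counting model with hypothetical names and made-up edges, not Tegra's implementation.

```scala
object TimelapseSketch {
  type Edge = (String, String)

  // Per-snapshot execution: every snapshot sends one message along each of its edges.
  def perSnapshotMessages(snaps: Map[Int, Set[Edge]]): Int =
    snaps.values.toSeq.map(_.size).sum

  // Batched execution: one message per distinct edge, tagged with the
  // snapshot ids in which that edge exists.
  def batchedMessages(snaps: Map[Int, Set[Edge]]): Map[Edge, Set[Int]] =
    snaps.toSeq
      .flatMap { case (sid, es) => es.toSeq.map(e => e -> sid) }
      .groupBy(_._1)
      .map { case (e, tagged) => e -> tagged.map(_._2).toSet }
}
```

When consecutive snapshots share most of their edges, the batched count stays close to the size of one snapshot, while the per-snapshot count grows with the number of snapshots in the window.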
Timelapse API
[Diagram: two graph snapshots over vertices A to F, with the state of vertices A and D shown across snapshots]
[Figure: PageRank values on graph 1 and, after the transition, on graph 2. (X, Y): X is the 10-iteration PageRank, Y is the 23-iteration PageRank. After 11 iterations on graph 2, both converge to 3-digit precision.]
Figure 8: Example showing the benefit of PSR computation.
…ing master program. For each iteration of Pregel, we check for the availability of a new graph. When it is available, we stop the iterations on the current graph and resume on the new graph after copying over the computed results. The new computation will only have vertices in the new active set continue message passing. The new active set is a function of the old active set and the changes between the new graph and the old graph. For a large class of algorithms (e.g. incremental PageRank [19]), the new active set includes vertices from the old active set, any new vertices and vertices with edge additions and deletions. Listing 2 shows a simple…
// Signatures only. Id, Collection and Triplet are defined elsewhere
// and are not part of this excerpt.
abstract class Graph[V, E] {
  // Collection views of the snapshot with id `sid`
  def vertices(sid: Int): Collection[(Id, V)]
  def edges(sid: Int): Collection[(Id, Id, E)]
  def triplets(sid: Int): Collection[Triplet]

  // Graph-parallel computation over the snapshots listed in `sids`
  def mrTriplets[M](f: (Triplet) => M,
                    sum: (M, M) => M,
                    sids: Array[Int]): Collection[(Int, Id, M)]

  // Convenience functions, each applied to the snapshots in `sids`
  def mapV(f: (Id, V) => V,
           sids: Array[Int]): Graph[V, E]
  def mapE(f: (Id, Id, E) => E,
           sids: Array[Int]): Graph[V, E]
  def leftJoinV(v: Collection[(Id, V)],
                f: (Id, V, V) => V,
                sids: Array[Int]): Graph[V, E]
  def leftJoinE(e: Collection[(Id, Id, E)],
                f: (Id, Id, E, E) => E,
                sids: Array[Int]): Graph[V, E]
  def subgraph(vPred: (Id, V) => Boolean,
               ePred: (Triplet) => Boolean,
               sids: Array[Int]): Graph[V, E]
  def reverse(sids: Array[Int]): Graph[V, E]
}
Listing 3: GraphX [24] operators modified to support Tegra’s
Temporal Operations
Evolution analysis with no state keeping requirements
Bulk Transformations
[Diagram: three snapshots with vertices (A, B, C), (A, B, C, D) and (A, B, C, D, E), shown before and after the transformation]
Bulk Iterative Computations
[Diagram: the same three snapshots, with the state of vertex A shown across snapshots and iterations]
How did the hotspots change over this window?
Incremental Computation
• If results from a previous snapshot are available, how can we
reuse them?
• Three approaches in the past:
  • Restart the algorithm
    • Redundant computations
  • Memoization (GraphInc¹)
    • Too much state
  • Operator-wise state (Naiad²,³)
    • Too much overhead
¹ Facilitating real-time graph mining, CloudDB ’12
² Naiad: A timely dataflow system, SOSP ’13
³ Differential dataflow, CIDR ’13
Incremental Computation
[Diagram: snapshots G1 (A, B, C, D) and G2 (A, B, C, D, E) over time, with per-vertex computation state carried from G1 to G2]
• Can keep state as an efficient time-evolving graph
• Not limited to vertex-centric computations
Incremental Computation
• Some iterative graph algorithms are robust to
graph changes
• Allow them to proceed without keeping any state
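[Diagram: PageRank values on a graph with vertices A to E before a new vertex is added (0.557, 0.557, 0.557, 0.557, 2.37), right after it is added with value 0, and after the computation continues on the updated graph (0.847, 0.507, 0.847, 0.507, 2.07, and 1.2 for the new vertex)]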
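The idea of letting an iterative algorithm continue across a graph change can be sketched with PageRank. The code below is an assumption-laden illustration, not Tegra's implementation: it uses the unnormalized update rank = 0.15 + 0.85 × Σ rank(src) / outDegree(src), starts vertices missing from the initial ranks at 0.0 (an arbitrary choice for this sketch), and stops when no rank moves by more than a tolerance. `step` and `run` are hypothetical names.

```scala
object ResumeSketch {
  type Edge = (String, String)

  // One PageRank sweep: 0.15 + 0.85 * sum over in-edges of rank(src) / outDegree(src).
  def step(vertices: Set[String], edges: Set[Edge],
           ranks: Map[String, Double]): Map[String, Double] = {
    val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
    val incoming = edges.toSeq
      .map { case (src, dst) => dst -> ranks(src) / outDeg(src) }
      .groupBy(_._1)
      .map { case (v, ms) => v -> ms.map(_._2).sum }
    vertices.map(v => v -> (0.15 + 0.85 * incoming.getOrElse(v, 0.0))).toMap
  }

  // Iterate from `init` until no rank changes by more than `tol`.
  // Returns the ranks and the number of sweeps used.
  def run(vertices: Set[String], edges: Set[Edge], init: Map[String, Double],
          tol: Double = 1e-9, maxSweeps: Int = 1000): (Map[String, Double], Int) = {
    var ranks = vertices.map(v => v -> init.getOrElse(v, 0.0)).toMap
    var sweeps = 0
    var delta = Double.MaxValue
    while (delta > tol && sweeps < maxSweeps) {
      val next = step(vertices, edges, ranks)
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
      sweeps += 1
    }
    (ranks, sweeps)
  }
}
```

Calling `run` on the updated graph with the ranks computed on the previous graph as `init` reaches the same fixed point as starting from scratch. When the change is small, the previous ranks are usually already close to that fixed point, so fewer sweeps tend to be needed.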
Implementation & Evaluation
• Implemented as a major enhancement to GraphX
• Evaluated on two open source real-world graphs
• Twitter: 41,652,230 vertices, 1,468,365,182 edges
• uk-2007: 105,896,555 vertices, 3,738,733,648 edges
Preliminary Evaluation
Tegra can pack more snapshots in memory
Linear reduction
Graph difference significant
Preliminary Evaluation
Timelapse results in run time reduction
Preliminary Evaluation
Effectiveness of incremental computation
Snapshot #   Convergence time (s)
1            220
2            45
3            34
4            35
5            37
Summary
• Processing time-evolving graphs efficiently can be useful.
• Efficient storage of multiple snapshots and reduced
communication between graph nodes are key to evolving graph
analysis.
Ongoing Work
• Expand timelapse and its incremental computation model to other
graph-parallel paradigms
• Other interesting graph algorithms:
  • Fraud detection/prediction/incremental pattern matching
• Add graph querying support
  • Graph queries and analytics in a single system
• Stay tuned for code release!

Time-Evolving Graph Processing On Commodity Clusters

  • 1. Tegra: Time-evolving Graph Processing on Commodity Clusters Anand Iyer, Ion Stoica Presented by Ankur Dave
  • 5. Graphs are everywhere… Metabolic network of a single cell organism Tuberculosis
  • 6. Plenty of interest in processing them • Graph DBMS: 25% of all enterprises by 2017 [1] • Many open-source and research prototypes of distributed graph processing frameworks: Giraph, Pregel, GraphLab, Chaos, GraphX, … [1] Forrester Research
  • 7. Real-world Graphs are Dynamic Many interesting business and research insights possible by processing them… …but little work on incrementally updating computation results or on window-based operations
  • 10. A Motivating Example… • Retrieve the network state when a disruption happened • Analyze the evolution of hotspot groups in the last day by hour • Track regions of heavy load in the last 10 minutes • What factors explain the performance in the network?
  • 11. Tegra How do we perform efficient computations on time-evolving, dynamically changing graphs?
  • 12. Goals • Create and manage time-evolving graphs (e.g., retrieve the network state when a disruption happened) • Temporal analytics on windows (e.g., analyze the evolution of hotspot groups in the last day, by hour) • Sliding window computations (e.g., track regions of heavy load in the last 10 minutes) • Mix graph and data parallel computing (e.g., what factors explain the performance in the network?)
  • 14. Goals • Create and manage time-evolving graphs using the Graph Snapshot Index (e.g., retrieve the network state when a disruption happened) • Temporal analytics on windows (e.g., analyze the evolution of hotspot groups in the last day, by hour) • Sliding window computations (e.g., track regions of heavy load in the last 10 minutes) • Mix graph and data parallel computing (e.g., what factors explain the performance in the network?)
  • 15. Representing Evolving Graphs [Figure: along a time axis, snapshots G1 {A, B, C}, G2 {A, B, C, D}, G3 {A, B, C, D, E} and the corresponding updates δg1 {A, B, C}, δg2 {A, D}, δg3 {C, D, E}]
  • 16. Graph Composition: how a graph is updated depends on application semantics [Figure: G2 {A, B, C, D} + δg {A, C, D, E} = G3′ {A, B, C, D, E}; δg lists five edge operations, one "remove" and four "add"]
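The composition G2 + δg = G3′ can be sketched in plain Python as follows. This is a minimal illustration, not Tegra's implementation: the edge lists, the `compose` helper, and the rule that removing a missing edge is a no-op are all assumptions chosen for the example, since the slide only says that the semantics are application-dependent.

```python
from typing import FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]
Op = Tuple[str, Edge]  # ("add" | "remove", (src, dst))


def compose(snapshot: FrozenSet[Edge], delta: Iterable[Op]) -> FrozenSet[Edge]:
    """Apply a delta to a snapshot and return a new snapshot.

    The input snapshot is never mutated, so older snapshots stay valid.
    """
    edges = set(snapshot)
    for op, edge in delta:
        if op == "add":
            edges.add(edge)
        elif op == "remove":
            # Assumed semantics: removing an edge that is absent is a no-op.
            edges.discard(edge)
        else:
            raise ValueError(f"unknown operation: {op}")
    return frozenset(edges)


# Hypothetical data, loosely following the vertex names on the slide.
g2 = frozenset({("A", "B"), ("A", "C"), ("A", "D")})
delta = [
    ("remove", ("A", "D")),
    ("add", ("D", "C")),
    ("add", ("D", "E")),
    ("add", ("E", "C")),
]
g3 = compose(g2, delta)
```

An application with different semantics (for example, treating a repeated "add" as a weight increment) would only need to change the body of the loop.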
  • 17. Maintaining Multiple Snapshots • Store entire snapshots (G1 {A, B, C}, G2 {A, B, C, D}, G3 {A, B, C, D, E}): efficient retrieval, but storage overhead • Store only deltas (δg1 {A, B, C}, δg2 {A, D}, δg3 {C, D, E}): efficient storage, but retrieval overhead
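The trade-off between the two storage options can be made concrete with a small sketch. Both classes below are hypothetical and assume addition-only deltas for brevity: the full-snapshot store answers `get` with one lookup but stores every edge repeatedly, while the delta store keeps each edge once but must replay deltas on every retrieval.

```python
from typing import FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]


class FullSnapshotStore:
    """Stores every snapshot in full: O(1) retrieval, duplicated storage."""

    def __init__(self) -> None:
        self.snapshots: List[FrozenSet[Edge]] = []

    def append(self, edges: Iterable[Edge]) -> None:
        self.snapshots.append(frozenset(edges))

    def get(self, i: int) -> FrozenSet[Edge]:
        return self.snapshots[i]

    def stored_edges(self) -> int:
        return sum(len(s) for s in self.snapshots)


class DeltaStore:
    """Stores only additions per step: compact, but retrieval replays deltas."""

    def __init__(self) -> None:
        self.deltas: List[FrozenSet[Edge]] = []

    def append(self, added: Iterable[Edge]) -> None:
        self.deltas.append(frozenset(added))

    def get(self, i: int) -> FrozenSet[Edge]:
        edges = set()
        for delta in self.deltas[: i + 1]:  # replay cost grows with i
            edges |= delta
        return frozenset(edges)

    def stored_edges(self) -> int:
        return sum(len(d) for d in self.deltas)


# Hypothetical deltas for three time steps.
steps = [
    {("A", "B"), ("B", "C")},
    {("A", "D")},
    {("C", "D"), ("D", "E")},
]
full, delta_store = FullSnapshotStore(), DeltaStore()
current: set = set()
for step in steps:
    current |= step
    full.append(current)
    delta_store.append(step)
```

Here the full store holds 2 + 3 + 5 = 10 edges against 5 for the delta store, and both return the same third snapshot.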
  • 18. Maintaining Multiple Snapshots • Use a structure-sharing persistent data structure • Persistent Adaptive Radix Tree (PART) is one solution available for Spark [Figure: Snapshot 1 at t1 and Snapshot 2 at t2]
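Structure sharing can be illustrated with a tiny path-copying trie. Note that this is a fixed-fanout trie written for illustration, not an adaptive radix tree and not the PART library; the constants and function names are assumptions. An insert copies only the nodes on the path to the key, so the new snapshot shares every untouched subtree with the old one.

```python
from typing import Any, Optional, Tuple

BITS = 2                 # bits of the key consumed per level
FANOUT = 1 << BITS       # children per node
DEPTH = 4                # 4 levels x 2 bits: keys in [0, 256)

Node = Tuple[Any, ...]   # immutable tuple of FANOUT slots
EMPTY: Node = (None,) * FANOUT


def insert(root: Node, key: int, value: Any, level: int = DEPTH - 1) -> Node:
    """Return a new root containing key -> value; `root` is left unchanged."""
    idx = (key >> (level * BITS)) & (FANOUT - 1)
    if level == 0:
        child = value
    else:
        child = insert(root[idx] or EMPTY, key, value, level - 1)
    # Copy only this node; sibling subtrees are reused by reference.
    return root[:idx] + (child,) + root[idx + 1:]


def lookup(root: Node, key: int) -> Optional[Any]:
    node: Any = root
    for level in range(DEPTH - 1, -1, -1):
        if node is None:
            return None
        node = node[(key >> (level * BITS)) & (FANOUT - 1)]
    return node


# Snapshot 1 holds keys 1 and 200; snapshot 2 adds key 2.
snapshot1 = insert(insert(EMPTY, 1, "A"), 200, "B")
snapshot2 = insert(snapshot1, 2, "C")
```

Keys 1 and 2 live under top-level slot 0 and key 200 under slot 3, so `snapshot2[3] is snapshot1[3]` holds: the subtree for key 200 is stored once and referenced by both snapshots.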
  • 19. Graph Snapshot Index [Figure labels: Vertex and Edge structures, each holding Snapshot 1 (t1) and Snapshot 2 (t2); Partition; Snapshot ID Management]
  • 20. Graph Windowing • From snapshots: G′1 = G3, G′2 = G4 - G1 • From updates: G′1 = δg1 + δg2 + δg3, G′2 = G′1 + δg4 - δg1 • Equivalent to GraphReduce() in GraphLINQ, GraphReduceByWindow() in CellIQ [Figure: snapshots G1 {A, B, C}, G2 {A, B, C, D}, G3 {A, B, C, D, E}, G4 {A, C, D, E}; updates δg1 {A, B, C}, δg2 {A, D}, δg3 {C, D, E}, δg4 {B}]
  • 21. Goals • Create and manage time-evolving graphs using the Distributed Graph Snapshot Index (e.g., retrieve the network state when a disruption happened) • Temporal analytics on windows using the Timelapse Abstraction (e.g., analyze the evolution of hotspot groups in the last day, by hour) • Sliding window computations (e.g., track regions of heavy load in the last 10 minutes) • Mix graph and data parallel computing (e.g., what factors explain the performance in the network?)
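The update rule G′2 = G′1 + δg4 - δg1 can be sketched with a multiset of edges. The class below is a hypothetical illustration that assumes addition-only deltas; it counts how many deltas inside the window contributed each edge, so that expiring δg1 does not drop an edge that a later delta added again.

```python
from collections import Counter, deque
from typing import Deque, FrozenSet, Iterable, Tuple

Edge = Tuple[str, str]


class SlidingGraphWindow:
    """Maintains the graph formed by the last `size` deltas."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.deltas: Deque[FrozenSet[Edge]] = deque()
        self.counts: Counter = Counter()

    def push(self, delta: Iterable[Edge]) -> None:
        delta = frozenset(delta)
        self.deltas.append(delta)
        self.counts.update(delta)            # "+ δg_new"
        if len(self.deltas) > self.size:
            expired = self.deltas.popleft()
            self.counts.subtract(expired)    # "- δg_oldest"
            self.counts = +self.counts       # drop edges whose count hit zero

    def edges(self) -> FrozenSet[Edge]:
        return frozenset(self.counts)


# Hypothetical deltas; (B, C) appears in both the first and the fourth.
d1 = {("A", "B"), ("B", "C")}
d2 = {("A", "D")}
d3 = {("C", "D"), ("D", "E")}
d4 = {("B", "C"), ("A", "E")}

window = SlidingGraphWindow(size=3)
for d in (d1, d2, d3):
    window.push(d)
first_window = window.edges()    # corresponds to δg1 + δg2 + δg3
window.push(d4)
second_window = window.edges()   # corresponds to G'1 + δg4 - δg1
```

The second window equals the union of the last three deltas, computed without rescanning δg2 and δg3.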
  • 22. Graph Parallel Computation Many approaches • Vertex centric, edge centric, subgraph centric, …
  • 23. Timelapse: operations on windows of snapshots result in redundant computation [Figure: snapshots G1 {A, B, C}, G2 {A, B, C, D}, G3 {A, B, C, D, E} along a time axis]
  • 24. Timelapse: instead, expose the temporal evolution of a node • Significant reduction in messages exchanged between graph nodes • Avoids redundant computations [Figure: snapshots G1 {A, B, C}, G2 {A, B, C, D}, G3 {A, B, C, D, E} along a time axis]
  • 25. Timelapse API
    [Figure 8: Example showing the benefit of PSR computation. Vertex labels are (X, Y), where X is the 10-iteration PageRank and Y is the 23-iteration PageRank. After 11 iterations on graph 2, both converge to 3-digit precision.]
    [Cropped excerpt from the paper; bracketed text is reconstructed] …ing master program. For each iteration of Pregel, [we che]ck for the availability of a new graph. When it [is avail]able, we stop the iterations on the current graph, […] resume it on the new graph after copying [th]e computed results. The new computation will [ha]ve vertices in the new active set continue message [passing]. The new active set is a function of the old active [set and] the changes between the new graph and the old [graph]. For a large class of algorithms (e.g. incremental [PageRa]nk [19]), the new active set includes vertices from [the old] active set, any new vertices and vertices with […a]dditions and deletions. Listing 2 shows a simple …
    class Graph[V, E] {
      // Collection views
      def vertices(sid: Int): Collection[(Id, V)]
      def edges(sid: Int): Collection[(Id, Id, E)]
      def triplets(sid: Int): Collection[Triplet]
      // Graph-parallel computation
      def mrTriplets(f: (Triplet) => M,
          sum: (M, M) => M,
          sids: Array[Int]): Collection[(Int, Id, M)]
      // Convenience functions
      def mapV(f: (Id, V) => V,
          sids: Array[Int]): Graph[V, E]
      def mapE(f: (Id, Id, E) => E,
          sids: Array[Int]): Graph[V, E]
      def leftJoinV(v: Collection[(Id, V)],
          f: (Id, V, V) => V,
          sids: Array[Int]): Graph[V, E]
      def leftJoinE(e: Collection[(Id, Id, E)],
          f: (Id, Id, E, E) => E,
          sids: Array[Int]): Graph[V, E]
      def subgraph(vPred: (Id, V) => Boolean,
          ePred: (Triplet) => Boolean,
          sids: Array[Int]): Graph[V, E]
      def reverse(sids: Array[Int]): Graph[V, E]
    }
    Listing 3: GraphX [24] operators modified to support Tegra’s …
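The shape of `mrTriplets` in the listing above (a list of snapshot ids in, results keyed by snapshot id and vertex id out) can be mimicked in a few lines of Python. This is only a sketch of the idea under stated assumptions, not Tegra's or GraphX's implementation: the `mr_triplets` function, the edge representation (each edge stored once with the set of snapshots in which it exists), and the `send`/`merge` callbacks are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Tuple

# (src, dst, ids of the snapshots in which the edge exists)
TimelapseEdge = Tuple[str, str, FrozenSet[int]]
Message = int


def mr_triplets(
    edges: Iterable[TimelapseEdge],
    send: Callable[[str, str], List[Tuple[str, Message]]],
    merge: Callable[[Message, Message], Message],
    sids: Iterable[int],
) -> Dict[Tuple[int, str], Message]:
    """Aggregate messages per (snapshot id, vertex id).

    `send` runs once per stored edge, and its output is reused for every
    requested snapshot that contains the edge.
    """
    wanted = set(sids)
    out: Dict[Tuple[int, str], Message] = {}
    for src, dst, alive in edges:
        live = alive & wanted
        if not live:
            continue
        for target, msg in send(src, dst):
            for sid in live:
                key = (sid, target)
                out[key] = merge(out[key], msg) if key in out else msg
    return out


# Hypothetical timelapse with three snapshots.
edges = [
    ("A", "B", frozenset({1, 2, 3})),
    ("B", "C", frozenset({1, 2, 3})),
    ("D", "C", frozenset({2, 3})),
    ("E", "C", frozenset({3})),
]

# In-degree of every vertex in every snapshot, in one pass over the edges.
in_degree = mr_triplets(
    edges,
    send=lambda src, dst: [(dst, 1)],
    merge=lambda a, b: a + b,
    sids=[1, 2, 3],
)
```

In this toy input, vertex C has in-degree 1, 2 and 3 in snapshots 1, 2 and 3, while each of the four edges is visited only once.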
  • 26. Temporal Operations: evolution analysis with no state-keeping requirements • Bulk Transformations • Bulk Iterative Computations • How did the hotspots change over this window? [Figure: each operation applied across the snapshots {A, B, C}, {A, B, C, D}, {A, B, C, D, E}]
  • 27. Goals • Create and manage time-evolving graphs using the Distributed Graph Snapshot Index (e.g., retrieve the network state when a disruption happened) • Temporal analytics on windows using the Timelapse Abstraction (e.g., analyze the evolution of hotspot groups in the last day, by hour) • Sliding window computations (e.g., track regions of heavy load in the last 10 minutes) • Mix graph and data parallel computing (e.g., what factors explain the performance in the network?)
  • 28. Incremental Computation • If results from a previous snapshot are available, how can we reuse them? • Three approaches in the past: • Restart the algorithm: redundant computations • Memoization (GraphInc [1]): too much state • Operator-wise state (Naiad [2, 3]): too much overhead [1] Facilitating real-time graph mining, CloudDB ’12 [2] Naiad: A timely dataflow system, SOSP ’13 [3] Differential dataflow, CIDR ’13
  • 29. Incremental Computation • Can keep state as an efficient time-evolving graph • Not limited to vertex-centric computations [Figure: computation on snapshots G1 {A, B, C, D} and G2 {A, B, C, D, E} along a time axis]
  • 30. Incremental Computation • Some iterative graph algorithms are robust to graph changes • Allow them to proceed without keeping any state [Figure: PageRank example; the values computed on the old graph (0.557, 0.557, 0.557, 0.557, 2.37) are kept, the added vertex starts at 0, and the computation continues to the new values (0.847, 0.507, 0.847, 0.507, 2.07, 1.2)]
  • 31. Goals • Create and manage time-evolving graphs using the Distributed Graph Snapshot Index (e.g., retrieve the network state when a disruption happened) • Temporal analytics on windows using the Timelapse Abstraction (e.g., analyze the evolution of hotspot groups in the last day, by hour) • Sliding window computations (e.g., track regions of heavy load in the last 10 minutes) • Mix graph and data parallel computing (e.g., what factors explain the performance in the network?)
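The idea of letting an iterative algorithm simply continue on the changed graph can be sketched with a warm-started PageRank. The code below is a plain-Python illustration, not Tegra's pause-shift-resume implementation. It assumes the unnormalized update rank(v) = 0.15 + 0.85 · Σ rank(u)/outdeg(u) and, following the figure, starts new vertices at 0; the function name, graphs, and tolerance are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[str, str]


def pagerank(
    edges: List[Edge],
    ranks: Optional[Dict[str, float]] = None,
    reset: float = 0.15,
    tol: float = 1e-10,
    max_iter: int = 1000,
) -> Tuple[Dict[str, float], int]:
    """Iterate until the total change drops below `tol`.

    If `ranks` is given, known vertices resume from their previous values
    and unseen vertices start at 0.0 (an assumption taken from the figure).
    Returns the ranks and the number of iterations used.
    """
    vertices = sorted({v for edge in edges for v in edge})
    out_deg = {v: 0 for v in vertices}
    for src, _ in edges:
        out_deg[src] += 1
    current = {v: (ranks.get(v, 0.0) if ranks else 1.0) for v in vertices}
    for iteration in range(1, max_iter + 1):
        nxt = {v: reset for v in vertices}
        for src, dst in edges:
            nxt[dst] += (1 - reset) * current[src] / out_deg[src]
        change = sum(abs(nxt[v] - current[v]) for v in vertices)
        current = nxt
        if change < tol:
            return current, iteration
    return current, max_iter


# Hypothetical graphs: g2 is g1 plus a new vertex D.
g1 = [("A", "E"), ("B", "E"), ("C", "E"), ("E", "A")]
g2 = g1 + [("D", "E"), ("E", "D")]

previous, _ = pagerank(g1)                     # converged state on old graph
cold, cold_iters = pagerank(g2)                # restart from scratch
warm, warm_iters = pagerank(g2, ranks=previous)  # continue from old results
```

Both runs reach the same ranks for `g2` up to the tolerance. Comparing `warm_iters` with `cold_iters` shows how much work the reused state saves on a given input; the saving depends on how large the change is relative to the graph.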
  • 32. Implementation & Evaluation • Implemented as a major enhancement to GraphX • Evaluated on two open source real-world graphs • Twitter: 41,652,230 vertices, 1,468,365,182 edges • uk-2007: 105,896,555 vertices, 3,738,733,648 edges
  • 33. Preliminary Evaluation [Chart not recoverable] • Tegra can pack more snapshots in memory • Linear reduction • Graph difference significant
  • 34. Preliminary Evaluation [Chart not recoverable] • Timelapse results in run time reduction
  • 35. Preliminary Evaluation: effectiveness of incremental computation [Chart: convergence time (s) by snapshot #: 1: 220, 2: 45, 3: 34, 4: 35, 5: 37]
  • 36. Summary • Processing time-evolving graphs efficiently can be useful. • Efficient storage of multiple snapshots and reduced communication between graph nodes are key to evolving graph analysis.
  • 37. Ongoing Work • Expand timelapse and its incremental computation model to other graph-parallel paradigms • Other interesting graph algorithms: • Fraud detection/prediction/incremental pattern matching • Add graph querying support • Graph queries and analytics in a single system • Stay tuned for code release!