Ling Liu
Distributed Data Intensive Systems Lab (DiSL)
School of Computer Science
Georgia Institute of Technology
Joint work with Yang Zhou, Kisung Lee, Qi Zhang
Why Graph Processing?
Graphs are everywhere!
[Figure: scale example — a human-connectome graph would require roughly 2.84 PB as an adjacency list or edge list (Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011); for scale, Avogadro's number N_A = 6.022 × 10^23 mol^-1. Source: Paul Burkhardt, Chris Waring, "An NSA Big Graph experiment"]
Graphs
•  Real information networks are represented
as graphs and analyzed using graph theory
– Internet
– Road networks
– Utility Grids
– Protein Interactions
•  A graph is
–  a collection of binary relationships, or
–  a network of pairwise interactions, e.g., social networks, digital networks
Why Distributed Graph
Processing?
They are getting bigger!
Scale of Big Graphs
•  Social Scale: 1 billion vertices, 100 billion edges
–  adjacency matrix: >10^8 GB
–  adjacency list: >10^3 GB
–  edge list: >10^3 GB
•  Web Scale: 50 billion vertices, 1 trillion edges
–  adjacency matrix: >10^11 GB
–  adjacency list: >10^4 GB
–  edge list: >10^4 GB
•  Brain Scale: 100 billion vertices, 100 trillion edges
–  adjacency matrix: >10^20 GB
–  adjacency list: >10^6 GB
–  edge list: >10^6 GB
1 terabyte (TB) = 1,024 GB ≈ 10^3 GB
1 petabyte (PB) = 1,024 TB ≈ 10^6 GB
1 exabyte (EB) = 1,024 PB ≈ 10^9 GB
1 zettabyte (ZB) = 1,024 EB ≈ 10^12 GB
1 yottabyte (YB) = 1,024 ZB ≈ 10^15 GB
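The storage orders of magnitude above can be checked with a back-of-envelope calculation. The per-entry byte sizes here (64-bit vertex IDs, one bit per matrix cell) are illustrative assumptions, not figures from the slides:

```python
# Back-of-envelope storage estimates for the social-scale graph above
# (1e9 vertices, 1e11 edges), assuming 8-byte vertex IDs.
vertices = 10**9
edges = 10**11
id_bytes = 8           # assumed 64-bit vertex IDs
GB = 10**9             # decimal gigabyte

# Edge list: one (src, dst) pair per edge.
edge_list_gb = edges * 2 * id_bytes / GB

# Adjacency list: one ID per vertex plus one per edge.
adj_list_gb = (vertices + edges) * id_bytes / GB

# Dense adjacency matrix: one bit per vertex pair.
adj_matrix_gb = vertices**2 / 8 / GB

print(f"edge list:        ~{edge_list_gb:.0f} GB")    # order 10^3 GB
print(f"adjacency list:   ~{adj_list_gb:.0f} GB")     # order 10^3 GB
print(f"adjacency matrix: ~{adj_matrix_gb:.1e} GB")   # order 10^8 GB
```

The results land on the same orders of magnitude as the slide's >10^3 GB and >10^8 GB figures.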
[Paul Burkhardt, Chris Waring 2013]
Classifications of Graphs
•  Simple Graphs vs. Multigraphs
–  Simple graph: allows only one edge per pair of vertices
–  Multigraph: allows multiple parallel edges between a pair of vertices
•  Small Graphs vs. Big Graphs
–  Small graph: the whole graph and its processing fit in main memory
–  Big graph: the whole graph and its processing cannot fit in main memory
How hard can it be?
•  Difficult to parallelize (data/computation)
–  irregular data access increases latency
–  skewed data distribution creates bottlenecks
•  giant component
•  high degree vertices, highly skewed edge weight distribution
•  Increased size imposes greater challenges
–  latency
–  resource contention (e.g., hot-spotting)
•  Algorithm complexity really matters!
–  A run time of O(n^2) on a trillion-node graph is not practical!
How do we store and process graphs?
•  Conventional approach is to store and compute
in-memory.
–  SHARED-MEMORY
•  Parallel Random Access Machine (PRAM)
•  data in globally-shared memory
•  implicit communication by updating memory
•  fast-random access
–  DISTRIBUTED-MEMORY
•  Bulk Synchronous Parallel (BSP)
•  data distributed to local, private memory
•  explicit communication by sending messages
•  easier to scale by adding more machines
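The BSP model above (private per-worker state, explicit messages, a barrier between supersteps) can be sketched in a few lines. This is an illustrative toy, not the API of Pregel or any real system; `bsp_run` and `cc_compute` are made-up names, and the example propagates minimum vertex IDs (connected components):

```python
# Toy BSP loop: compute() sees only its vertex's private value and the
# messages delivered at the last barrier; all communication is explicit.
from collections import defaultdict

def bsp_run(graph, init, compute, max_supersteps=30):
    """graph: {vertex: [neighbors]}; compute returns (new_value, [(dst, msg), ...])."""
    values = {v: init(v) for v in graph}
    inbox = defaultdict(list)
    for step in range(max_supersteps):
        outbox = defaultdict(list)
        changed = False
        for v in graph:
            new_val, msgs = compute(v, values[v], inbox[v], graph[v], step)
            if new_val != values[v]:
                values[v] = new_val
                changed = True
            for dst, m in msgs:
                outbox[dst].append(m)
        inbox = outbox                 # barrier: messages visible next superstep
        if not changed and not inbox:  # quiesced: no updates, no pending messages
            break
    return values

def cc_compute(v, val, msgs, neighbors, step):
    # propagate the minimum ID seen so far; send on superstep 0 or on improvement
    best = min([val] + msgs)
    send = step == 0 or best < val
    return best, ([(n, best) for n in neighbors] if send else [])

graph = {1: [2], 2: [1, 3], 3: [2], 4: []}
print(bsp_run(graph, lambda v: v, cc_compute))  # → {1: 1, 2: 1, 3: 1, 4: 4}
```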
[Paul Burkhardt, Chris Waring 2013]
Top Super Computer
Installations
[Paul Burkhardt, Chris Waring 2013]
Big Data Makes Bigger
Challenges to Graph Problems
•  Increasing volume, velocity, and variety of Big Data pose significant challenges to scalable graph algorithms.
•  Big Graph challenges:
–  How will graph applications adapt to big data at petabyte scale?
–  The ability to store and process Big Graphs impacts typical data structures.
•  Big graphs challenge our conventional thinking on both algorithms and computing architectures.
HPDC’15
Existing Graph Processing Systems
•  Single PC Systems
–  GraphLab [Low et al., UAI'10]
–  GraphChi [Kyrola et al., OSDI'12]
–  X-Stream [Roy et al., SOSP'13]
–  TurboGraph [Han et al., KDD‘13]
–  GraphLego [Zhou, Liu et al., ACM HPDC‘15]
•  Distributed Shared Memory Systems
–  Pregel [Malewicz et al., SIGMOD'10]
–  Giraph/Hama – Apache Software Foundation
–  Distributed GraphLab [Low et al., VLDB'12]
–  PowerGraph [Gonzalez et al., OSDI'12]
–  SPARK-GraphX [Gonzalez et al., OSDI'14]
–  PathGraph [Yuan et al., IEEE SC 2014]
–  GraphMap [Lee, Liu et al., IEEE SC 2015]
Parallel Graph Processing:
Challenges
•  Structure driven computation
–  Storage and Data Transfer Issues
•  Irregular Graph Structure and
Computation Model
–  Storage and Data/Computation Partitioning Issues
–  Partitioning vs. Load/Resource Balancing
Parallel Graph Processing:
Opportunities
•  Extend Existing Paradigms
–  Vertex centric
–  Edge centric
•  BUILD NEW FRAMEWORKS!
–  Centralized Approaches
•  GraphLego [ACM HPDC 2015] / GraphTwist [VLDB 2015]
–  Distributed Approaches
•  GraphMap [IEEE SC 2015], PathGraph [IEEE SC 2014]
Parallel Graph Processing
•  Part I: Single-Machine Approaches
•  Can big graphs be processed on a single
machine?
•  How to effectively maximize sequential access and
minimize random access?
•  Part II: Distributed Approaches
•  How to effectively partition a big graph for graph
computation?
•  Are large clusters always better for big graphs?
Part I:
Parallel Graph Processing
(single machine)
Vertex-centric Computation Model
•  Think like a vertex
•  vertex_scatter(vertex v)
–  send updates over outgoing edges of v
•  vertex_gather(vertex v)
–  apply updates from inbound edges of v
•  repeat the computation iterations
–  for all vertices v
•  vertex_scatter(v)
–  for all vertices v
•  vertex_gather(v)
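The scatter/gather loop above can be made concrete with a small sketch. The names follow the slide's `vertex_scatter` / `vertex_gather` pattern, but this is an illustrative PageRank-style example, not the implementation of any particular system:

```python
# Vertex-centric iteration: each vertex scatters updates over its
# outgoing edges, then gathers the updates that arrived on its inbound edges.
def vertex_scatter(v, rank, out_edges, updates):
    # send this vertex's rank share over its outgoing edges
    share = rank / max(len(out_edges), 1)
    for dst in out_edges:
        updates.setdefault(dst, []).append(share)

def vertex_gather(v, updates, damping=0.85):
    # apply the updates from inbound edges of v (PageRank update rule)
    return (1 - damping) + damping * sum(updates.get(v, []))

def pagerank(graph, iterations=20):
    """graph: {vertex: [out-neighbors]}."""
    ranks = {v: 1.0 for v in graph}
    for _ in range(iterations):
        updates = {}
        for v in graph:                                        # scatter phase
            vertex_scatter(v, ranks[v], graph[v], updates)
        ranks = {v: vertex_gather(v, updates) for v in graph}  # gather phase
    return ranks

print(pagerank({"a": ["b"], "b": ["a"]}))  # → {'a': 1.0, 'b': 1.0}
```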
Edge-centric
Computation Model (X-Stream)
•  Think like an edge (source vertex and destination vertex)
•  edge_scatter(edge e)
–  send update over e (from source vertex to destination vertex)
•  update_gather(update u)
–  apply update u to u.destination
•  repeat the computation iterations
–  for all edges e
•  edge_scatter(e)
–  for all updates u
•  update_gather(u)
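In the edge-centric style, both phases are sequential streams: one pass over the unordered edge list, one pass over the generated updates. The sketch below illustrates the idea with a BFS-style distance relaxation; the function name and the in-memory representation are assumptions for illustration (X-Stream itself streams byte-level partitions from disk):

```python
# Edge-centric iteration in the X-Stream style: stream edges sequentially
# (edge_scatter), then stream the resulting updates (update_gather).
def edge_centric_iteration(edges, state):
    """edges: iterable of (src, dst); state: {vertex: distance}."""
    updates = []
    for src, dst in edges:            # edge_scatter: one sequential pass over edges
        updates.append((dst, state[src]))
    new_state = dict(state)
    for dst, value in updates:        # update_gather: apply each update to its destination
        new_state[dst] = min(new_state[dst], value + 1)
    return new_state

INF = float("inf")
state = {0: 0, 1: INF, 2: INF}
state = edge_centric_iteration([(0, 1), (1, 2)], state)  # frontier reaches vertex 1
state = edge_centric_iteration([(0, 1), (1, 2)], state)  # then vertex 2
print(state)  # → {0: 0, 1: 1, 2: 2}
```

Note that no per-vertex edge index is needed: sequential streaming over an unordered edge list is exactly what makes this model disk-friendly.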
Challenges of Big Graphs
•  Graph size vs. limited resources
–  Handling big graphs with billions of vertices and edges in memory may
require hundreds of gigabytes of DRAM
•  High-degree vertices
–  In uk-union with 133.6M vertices: the maximum indegree is 6,366,525 and
the maximum outdegree is 22,429
•  Skewed vertex degree distribution
–  In Yahoo web with 1.4B vertices: the average vertex degree is 4.7, 49% of
the vertices have degree zero and the maximum indegree is 7,637,656
•  Skewed edge weight distribution
–  In DBLP with 0.96M vertices: of the 389 coauthors of Elisa Bertino, 198 have only one coauthored paper with her, 74 have two coauthored papers, 30 have three coauthored papers, and 87 have four or more coauthored papers
Real-world Big Graphs
Graph Processing Systems: Challenges
•  Diverse types of processed graphs
–  Simple graph: does not allow parallel edges (multiple edges) between a pair of vertices
–  Multigraph: allows parallel edges between a pair of vertices
•  Different kinds of graph applications
–  Matrix-vector multiplication and graph traversal with cost O(n^2)
–  Matrix-matrix multiplication with cost O(n^3)
•  Random access
–  Random access is inefficient for both access and storage; many random accesses are unavoidable, but they hurt the performance of graph processing systems
•  Workload imbalance
–  Computing on a vertex and its edges is much faster than accessing the vertex state and its edge data in memory or on disk
–  The computation workloads on different vertices are significantly imbalanced due to the highly skewed vertex degree distribution
GraphLego: Our Approach
•  Flexible multi-level hierarchical graph parallel abstractions
–  Model a large graph as a 3D cube with source vertex, destination vertex and
edge weight as the dimensions
–  Partition a big graph by slice-, strip- and dice-based graph partitioning
•  Access Locality Optimization
–  Dice-based data placement: store a large graph on disk by minimizing non-
sequential disk access and enabling more structured in-memory access
–  Construct partition-to-chunk index and vertex-to-partition index to facilitate fast
access to slices, strips and dices
–  Implement partition-level in-memory gzip compression to optimize disk I/O
•  Optimization for Partitioning Parameters
–  Build a regression-based learning model to discover the latent relationship
between the number of partitions and the runtime
Modeling a Graph as a 3D Cube
•  Model a directed graph G=(V,E,W) as a 3D cube I=(S,D,E,W)
with source vertices (S=V), destination vertices (D=V) and
edge weights (W) as the three dimensions
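In this view each edge is a point (source, destination, weight) in the cube, and a partition block is the set of edges falling inside a box of the cube. The sketch below illustrates the idea; the function name, partition boundaries, and edge data are assumptions for illustration:

```python
# GraphLego-style 3D-cube view: a "dice" is the subset of edges whose
# (source, destination, weight) coordinates fall in a given box.
def dice(edges, src_range, dst_range, w_range):
    s_lo, s_hi = src_range
    d_lo, d_hi = dst_range
    w_lo, w_hi = w_range
    return [(s, d, w) for (s, d, w) in edges
            if s_lo <= s < s_hi and d_lo <= d < d_hi and w_lo <= w < w_hi]

edges = [(1, 2, 0.5), (1, 3, 2.0), (4, 2, 1.5), (5, 6, 0.1)]
# one dice: sources in [1,4), destinations in [2,4), weights in [0,1)
print(dice(edges, (1, 4), (2, 4), (0.0, 1.0)))  # → [(1, 2, 0.5)]
```

A slice and a strip can be seen as degenerate dices: a slice constrains only one dimension (e.g., a weight range), a strip constrains two, and a dice constrains all three.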
Multi-level Hierarchical Graph Parallel
Abstractions
Slice Partitioning: DBLP Example
Strip Partitioning of DB Slice
Dice Partitioning: An Example
SVP: v1, v2, v3, v5, v6, v7, v11, v12, v15
DVP: v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16
Dice Partition Storage (OEDs)
Advantage: Multi-level Hierarchical
Graph Parallel Abstractions
•  Choose smaller subgraph blocks (dice or strip partitions) to balance parallel computation across partition blocks
•  Use larger subgraph blocks (slice or strip partitions) to maximize sequential access and minimize random access
Partial Aggregation in Parallel
•  Aggregation operation
–  Partially scatter vertex updates in parallel
–  Partially gather vertex updates in parallel
•  Two-level Graph Computation Parallelism
–  Parallel partial update at the subgraph partition level (slice, strip or dice)
–  Parallel partial update at the vertex level
[Figure: scatter and gather phases with multi-threaded, asynchronous updates]
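The two-level idea (parallelism across partition blocks, plus partial aggregation within each block) can be sketched as follows. The block contents, update rule, and thread-pool setup are illustrative assumptions:

```python
# Level 1: partition blocks processed in parallel by a thread pool.
# Level 2: within each block, updates are partially aggregated per vertex;
# the partial aggregates are then merged across blocks.
from concurrent.futures import ThreadPoolExecutor

def partial_gather(block):
    # partially gather: sum the updates destined to each vertex in this block
    acc = {}
    for dst, val in block:
        acc[dst] = acc.get(dst, 0.0) + val
    return acc

blocks = [[(1, 0.5), (2, 0.25)], [(1, 0.25), (3, 1.0)]]  # toy partition blocks
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = list(pool.map(partial_gather, blocks))

# merge the partial aggregates from all partition blocks
totals = {}
for p in partials:
    for v, s in p.items():
        totals[v] = totals.get(v, 0.0) + s
print(totals)  # → {1: 0.75, 2: 0.25, 3: 1.0}
```

Because each block produces only a partial aggregate, blocks never contend on shared vertex state during the parallel phase; contention is confined to the cheap merge step.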
Programmable Interface
•  Vertex-centric programming APIs, such as Scatter and Gather
•  Iterative algorithms are compiled into a sequence of internal function (routine) calls that understand the internal data structures for accessing the graph by different types of subgraph partition blocks
–  Gather: collect vertex updates from neighbor vertices and incoming edges
–  Scatter: send vertex updates to outgoing edges
Partition-to-chunk Index
Vertex-to-partition Index
•  The dice-level index is a dense index that maps a dice ID and its DVP (or SVP)
to the chunks on disk where the corresponding dice partition is stored physically
•  The strip-level index is a two-level sparse index, which maps a strip ID to the dice-level index blocks and then maps each dice ID to the dice partition chunks in physical storage
•  The slice level index is a three-level sparse index with slice index blocks at the
top, strip index blocks at the middle and dice index blocks at the bottom,
enabling fast retrieval of dices with a slice-specific condition
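The hierarchy can be pictured with plain dictionaries. All IDs, offsets, and the slice condition below are made-up values for illustration, not GraphLego's actual on-disk layout:

```python
# Toy three-level index: slice -> strips -> dices, with the dense
# dice-level index mapping each dice ID to its physical chunk location.
dice_index = {                      # dice ID -> (file offset, byte length)
    "d1": (0, 4096), "d2": (4096, 2048), "d3": (6144, 4096),
}
strip_index = {                     # strip ID -> dice IDs it contains
    "s1": ["d1", "d2"], "s2": ["d3"],
}
slice_index = {                     # slice ID (a slice-specific condition) -> strip IDs
    "weight<1.0": ["s1"], "weight>=1.0": ["s2"],
}

def chunks_for_slice(slice_id):
    """Resolve a slice-specific condition down to physical chunk locations."""
    return [dice_index[d]
            for strip in slice_index[slice_id]
            for d in strip_index[strip]]

print(chunks_for_slice("weight<1.0"))  # → [(0, 4096), (4096, 2048)]
```

Only the dice level is dense; the strip and slice levels stay small and sparse, which is what keeps lookups cheap while the dices themselves remain compact units of I/O.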
Partition-level Compression
•  Iterative computations on large graphs incur non-trivial I/O cost
–  For PageRank (5 iterations) on the Twitter dataset, I/O processing takes 50.2% of the total running time on a PC with 4 CPU cores and 16 GB memory
•  Apply in-memory gzip compression to transform
each graph partition block into a compressed format
before storing them on disk
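The idea is simply to trade CPU cycles for I/O volume. A minimal sketch, using Python's `zlib` (the exact codec and framing used by GraphLego are not shown in the slides, and the edge serialization below is an assumption):

```python
# Compress a partition block in memory before it would be written to disk:
# less data on disk means fewer bytes to read back on each iteration.
import zlib

edges = [(s, s + 1, 1.0) for s in range(10000)]        # a toy partition block
raw = ";".join(f"{s},{d},{w}" for s, d, w in edges).encode()

compressed = zlib.compress(raw, level=6)
assert zlib.decompress(compressed) == raw              # lossless round trip
print(f"{len(raw)} -> {len(compressed)} bytes")
```

Graph partition blocks tend to compress well because sorted vertex IDs and repeated weights are highly redundant, so the decompression cost is usually far smaller than the saved I/O time.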
Configuration of Partitioning Parameters
•  User definition
•  Simple estimation
•  Regression-based learning
–  Construct a polynomial regression model of the nonlinear relationship between independent variables p, q, r (partition parameters) and dependent variable T (runtime), with latent coefficients αijk and error term ε
–  The goal of regression-based learning is to determine the latent αijk and ε, yielding T as a function of p, q and r
–  Select m samples (pl, ql, rl, Tl) (1≤l≤m) from existing experiment results
–  Solve the m linear equations given by the selected samples to obtain concrete αijk and ε
–  Use a successive convex approximation (SCA) method to find the optimal solution (the minimum runtime T) of the polynomial function, and the optimal parameters p, q and r at which T is minimized
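The slide's regression equation did not survive extraction; a plausible reconstruction from the description above (the polynomial degree d and the exact set of cross terms included are assumptions) is:

```latex
T \;=\; \sum_{i=0}^{d}\sum_{j=0}^{d}\sum_{k=0}^{d} \alpha_{ijk}\, p^{i}\, q^{j}\, r^{k} \;+\; \varepsilon
```

Each sample $(p_l, q_l, r_l, T_l)$ then contributes one linear equation in the unknowns $\alpha_{ijk}$ and $\varepsilon$, which is why m samples suffice to solve for them.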
Experimental Evaluation
•  Computer server
–  Intel Core i5 2.66 GHz, 16 GB RAM, 1 TB hard drive, Linux 64-bit
•  Graph parallel systems
–  GraphLab [Low et al., UAI'10]
–  GraphChi [Kyrola et al., OSDI'12]
–  X-Stream [Roy et al., SOSP'13]
•  Graph applications
Execution Efficiency on Single Graph
Execution Efficiency on Multiple Graphs
Decision of #Partitions
Efficiency of Regression-based Learning
GraphLego: Resource Aware GPS
•  Flexible multi-level hierarchical graph parallel
abstractions
–  Model a large graph as a 3D cube with source vertex, destination vertex
and edge weight as the dimensions
–  Partition a big graph by slice-, strip- and dice-based graph partitioning
•  Access Locality Optimization
–  Dice-based data placement: store a large graph on disk by minimizing
non-sequential disk access and enabling more structured in-memory
access
–  Construct partition-to-chunk index and vertex-to-partition index to
facilitate fast access to slices, strips and dices
–  Implement partition-level in-memory gzip compression to optimize disk I/O
•  Optimization for Partitioning Parameters
–  Build a regression-based learning model to discover the latent
relationship between the number of partitions and the runtime
Questions
Open Source:
https://sites.google.com/site/git_GraphLego/

Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 

Ling liu part 01:big graph processing

  • 1. Ling Liu
Distributed Data Intensive Systems Lab (DiSL)
School of Computer Science
Georgia Institute of Technology
Joint work with Yang Zhou, Kisung Lee, Qi Zhang

  • 2. Why Graph Processing?
Graphs are everywhere!
[Figure: human connectome, 2.84 PB as adjacency list, 2.84 PB as edge list. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011; Paul Burkhardt, Chris Waring, "An NSA Big Graph experiment"]

  • 3. Graphs
• Real information networks are represented as graphs and analyzed using graph theory
– Internet
– Road networks
– Utility grids
– Protein interactions
• A graph is
– a collection of binary relationships, or
– a network of pairwise interactions, including social networks and digital networks

  • 5. Scale of Big Graphs
• Social scale: 1 billion vertices, 100 billion edges
– adjacency matrix: >10⁸ GB
– adjacency list: >10³ GB
– edge list: >10³ GB
• Web scale: 50 billion vertices, 1 trillion edges
– adjacency matrix: >10¹¹ GB
– adjacency list: >10⁴ GB
– edge list: >10⁴ GB
• Brain scale: 100 billion vertices, 100 trillion edges
– adjacency matrix: >10²⁰ GB
– adjacency list: >10⁶ GB
– edge list: >10⁶ GB
1 terabyte (TB) = 1,024 GB ≈ 10³ GB
1 petabyte (PB) = 1,024 TB ≈ 10⁶ GB
1 exabyte (EB) = 1,024 PB ≈ 10⁹ GB
1 zettabyte (ZB) = 1,024 EB ≈ 10¹² GB
1 yottabyte (YB) = 1,024 ZB ≈ 10¹⁵ GB
[Paul Burkhardt, Chris Waring 2013]
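The estimates above can be checked with back-of-the-envelope arithmetic. The sketch below assumes 1 bit per adjacency-matrix cell and 8-byte vertex IDs; these encoding choices are illustrative assumptions, not from the slides.

```python
# Storage estimates for the "social scale" graph: 10^9 vertices, 10^11 edges.
# Assumptions (illustrative): 1 bit per adjacency-matrix entry, 8-byte vertex IDs.
V = 10**9           # 1 billion vertices
E = 10**11          # 100 billion edges
GB = 2**30          # bytes per gigabyte

adj_matrix_gb = (V * V / 8) / GB    # V^2 bits -> bytes -> GB
adj_list_gb   = (E * 8) / GB        # one 8-byte neighbor ID per edge entry
edge_list_gb  = (E * 2 * 8) / GB    # two 8-byte endpoint IDs per edge

print(f"adjacency matrix: {adj_matrix_gb:.2e} GB")   # on the order of 10^8 GB
print(f"adjacency list:   {adj_list_gb:.2e} GB")     # on the order of 10^3 GB
print(f"edge list:        {edge_list_gb:.2e} GB")    # on the order of 10^3 GB
```

With modest per-entry overhead the list formats land at roughly the ~10³ GB the slide quotes, while the dense matrix is hopeless at >10⁸ GB.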
  • 6. Classifications of Graphs
• Simple graphs vs. multigraphs
– Simple graph: allows only one edge per pair of vertices
– Multigraph: allows multiple parallel edges between pairs of vertices
• Small graphs vs. big graphs
– Small graph: the whole graph and its processing fit in main memory
– Big graph: the whole graph and its processing cannot fit in main memory

  • 7. How Hard Can It Be?
• Difficult to parallelize (data/computation)
– irregular data access increases latency
– skewed data distribution creates bottlenecks
• giant component
• high-degree vertices, highly skewed edge weight distribution
• Increased size imposes greater challenges
– latency
– resource contention (i.e., hot-spotting)
• Algorithm complexity really matters!
– A run time of O(n²) on a trillion-node graph is not practical!

  • 8. How Do We Store and Process Graphs?
• The conventional approach is to store and compute in memory
– SHARED MEMORY
• Parallel Random Access Machine (PRAM)
• data in globally shared memory
• implicit communication by updating memory
• fast random access
– DISTRIBUTED MEMORY
• Bulk Synchronous Parallel (BSP)
• data distributed to local, private memory
• explicit communication by sending messages
• easier to scale by adding more machines
[Paul Burkhardt, Chris Waring 2013]

  • 9. Top Supercomputer Installations
[Paul Burkhardt, Chris Waring 2013]

  • 10. Big Data Makes Bigger Challenges for Graph Problems
• The increasing volume, velocity, and variety of Big Data pose significant challenges to scalable graph algorithms
• Big Graph challenges:
– How will graph applications adapt to big data at petabyte scale?
– The ability to store and process Big Graphs impacts typical data structures
• Big graphs challenge our conventional thinking on both algorithms and computing architectures

  • 11. Existing Graph Processing Systems (HPDC'15)
• Single-PC systems
– GraphLab [Low et al., UAI'10]
– GraphChi [Kyrola et al., OSDI'12]
– X-Stream [Roy et al., SOSP'13]
– TurboGraph [Han et al., KDD'13]
– GraphLego [Zhou, Liu et al., ACM HPDC'15]
• Distributed shared-memory systems
– Pregel [Malewicz et al., SIGMOD'10]
– Giraph/Hama (Apache Software Foundation)
– Distributed GraphLab [Low et al., VLDB'12]
– PowerGraph [Gonzalez et al., OSDI'12]
– Spark GraphX [Gonzalez et al., OSDI'14]
– PathGraph [Yuan et al., IEEE SC 2014]
– GraphMap [Lee, Liu et al., IEEE SC 2015]

  • 12. Parallel Graph Processing: Challenges
• Structure-driven computation
– storage and data transfer issues
• Irregular graph structure and computation model
– storage and data/computation partitioning issues
– partitioning vs. load/resource balancing

  • 13. Parallel Graph Processing: Opportunities
• Extend existing paradigms
– vertex-centric
– edge-centric
• BUILD NEW FRAMEWORKS!
– Centralized approaches
• GraphLego [ACM HPDC 2015] / GraphTwist [VLDB 2015]
– Distributed approaches
• GraphMap [IEEE SC 2015], PathGraph [IEEE SC 2014]

  • 14. Parallel Graph Processing
• Part I: single-machine approaches
– Can big graphs be processed on a single machine?
– How to effectively maximize sequential access and minimize random access?
• Part II: distributed approaches
– How to effectively partition a big graph for graph computation?
– Are large clusters always better for big graphs?

  • 15. Part I: Parallel Graph Processing (single machine)
[Figure: human connectome. Gerhard et al., Frontiers in Neuroinformatics 5(3), 2011; Paul Burkhardt, Chris Waring, "An NSA Big Graph experiment"]

  • 16. Vertex-centric Computation Model (7/28/15, HPDC'15)
• Think like a vertex
• vertex_scatter(vertex v)
– send updates over outgoing edges of v
• vertex_gather(vertex v)
– apply updates from inbound edges of v
• repeat the computation iterations
– for all vertices v: vertex_scatter(v)
– for all vertices v: vertex_gather(v)
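The scatter/gather loop above can be sketched in a few lines of Python. This is a minimal illustration using a PageRank-style update on a toy graph; the graph, the damping constant, and all function names are illustrative assumptions, not GraphLego's actual API.

```python
# Vertex-centric scatter/gather sketch (illustrative, PageRank-style update).
graph = {               # adjacency list: vertex -> outgoing neighbors
    'a': ['b', 'c'],
    'b': ['c'],
    'c': ['a'],
}
rank = {v: 1.0 / len(graph) for v in graph}
updates = {v: [] for v in graph}   # inbound update buffers

def vertex_scatter(v):
    # send this vertex's update over each of its outgoing edges
    out = graph[v]
    for dst in out:
        updates[dst].append(rank[v] / len(out))

def vertex_gather(v):
    # apply the updates received on inbound edges, then clear the buffer
    rank[v] = 0.15 / len(graph) + 0.85 * sum(updates[v])
    updates[v] = []

for _ in range(10):     # one iteration = scatter all vertices, then gather all
    for v in graph:
        vertex_scatter(v)
    for v in graph:
        vertex_gather(v)

print(rank)
```

The key property to notice is the per-vertex access pattern: each `vertex_scatter` walks one adjacency list, which on disk translates into the random accesses that the partition-based layouts later in the talk try to avoid.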
  • 17. Edge-centric Computation Model (X-Stream)
• Think like an edge (source vertex and destination vertex)
• edge_scatter(edge e)
– send an update over e (from source vertex to destination vertex)
• update_gather(update u)
– apply update u to u.destination
• repeat the computation iterations
– for all edges e: edge_scatter(e)
– for all updates u: update_gather(u)
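A minimal sketch of the same computation in X-Stream's edge-centric style, on an illustrative toy graph (names and values are assumptions, not X-Stream's code). The point of the model is visible in the structure: the scatter phase is one sequential pass over the edge list, producing an update stream, rather than per-vertex adjacency-list lookups.

```python
# Edge-centric scatter/gather sketch in the style of X-Stream (illustrative).
edges = [('a', 'b'), ('a', 'c'), ('b', 'c'), ('c', 'a')]
value = {'a': 1.0, 'b': 1.0, 'c': 1.0}

out_degree = {}
for src, _ in edges:
    out_degree[src] = out_degree.get(src, 0) + 1

for _ in range(5):
    updates = []                        # update stream produced by scatter
    for src, dst in edges:              # edge_scatter: sequential edge scan
        updates.append((dst, value[src] / out_degree[src]))
    new_value = {v: 0.0 for v in value}
    for dst, u in updates:              # update_gather: apply u to u.destination
        new_value[dst] += u
    value = new_value

print(value)
```

Trading the random per-vertex accesses for two sequential streams (edges, then updates) is what makes this model friendly to disk-resident graphs.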
  • 18. Challenges of Big Graphs
• Graph size vs. limited resources
– Handling big graphs with billions of vertices and edges in memory may require hundreds of gigabytes of DRAM
• High-degree vertices
– In uk-union, with 133.6M vertices, the maximum indegree is 6,366,525 and the maximum outdegree is 22,429
• Skewed vertex degree distribution
– In the Yahoo web graph, with 1.4B vertices, the average vertex degree is 4.7, 49% of the vertices have degree zero, and the maximum indegree is 7,637,656
• Skewed edge weight distribution
– In DBLP, with 0.96M vertices: among the 389 coauthors of Elisa Bertino, she has only one coauthored paper with 198 coauthors, two coauthored papers with 74 coauthors, three coauthored papers with 30 coauthors, and four or more coauthored papers with 87 coauthors

  • 20. Graph Processing Systems: Challenges
• Diverse types of processed graphs
– Simple graph: does not allow parallel (multiple) edges between a pair of vertices
– Multigraph: allows parallel edges between a pair of vertices
• Different kinds of graph applications
– matrix–vector multiplication and graph traversal with a cost of O(n²)
– matrix–matrix multiplication with a cost of O(n³)
• Random access
– inefficient for both access and storage; some random accesses are unavoidable, but they hurt the performance of graph processing systems
• Workload imbalance
– Computing on a vertex and its edges is much faster than accessing the vertex state and its edge data in memory or on disk
– The computation workloads on different vertices are significantly imbalanced due to the highly skewed vertex degree distribution

  • 21. GraphLego: Our Approach
• Flexible multi-level hierarchical graph-parallel abstractions
– Model a large graph as a 3D cube with source vertex, destination vertex, and edge weight as the dimensions
– Partition a big graph by slice-, strip-, and dice-based graph partitioning
• Access locality optimization
– Dice-based data placement: store a large graph on disk by minimizing non-sequential disk access and enabling more structured in-memory access
– Construct a partition-to-chunk index and a vertex-to-partition index to facilitate fast access to slices, strips, and dices
– Implement partition-level in-memory gzip compression to optimize disk I/Os
• Optimization of partitioning parameters
– Build a regression-based learning model to discover the latent relationship between the number of partitions and the runtime

  • 22. Modeling a Graph as a 3D Cube
• Model a directed graph G=(V,E,W) as a 3D cube I=(S,D,E,W) with source vertices (S=V), destination vertices (D=V), and edge weights (W) as the three dimensions
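One way to picture the cube-cutting step is to range-partition the source and destination dimensions into p × q blocks ("dices"), each holding the edges whose endpoints fall in its vertex ranges. The partition counts and bucketing rule below are illustrative assumptions for intuition, not GraphLego's exact scheme.

```python
# Sketch: bucket (source, destination, weight) cells of the 3D cube into
# p x q dices by range-partitioning the two vertex dimensions (illustrative).
edges = [(1, 2, 0.5), (1, 3, 0.2), (4, 2, 0.9), (4, 16, 0.1), (15, 8, 0.7)]
num_vertices = 16
p, q = 2, 2                                 # source / destination partition counts

def dice_id(src, dst):
    # map a vertex pair to its (source-range, destination-range) block
    rows_per_part = num_vertices // p
    cols_per_part = num_vertices // q
    return ((src - 1) // rows_per_part, (dst - 1) // cols_per_part)

dices = {}
for src, dst, w in edges:
    dices.setdefault(dice_id(src, dst), []).append((src, dst, w))

# Each dice's edges can now be laid out contiguously on disk, so a pass over
# one dice touches one bounded range of sources and destinations.
for did, es in sorted(dices.items()):
    print(did, es)
```

Adding a third cut along the edge-weight dimension would give the slice level of the hierarchy in the same fashion.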
  • 26. Dice Partitioning: An Example
SVP: v1, v2, v3, v5, v6, v7, v11, v12, v15
DVP: v2, v3, v4, v5, v6, v7, v8, v9, v10, v11, v12, v13, v14, v15, v16

  • 28. Advantage: Multi-level Hierarchical Graph-Parallel Abstractions
• Choose smaller subgraph blocks, such as dice or strip partitions, to balance parallel computation efficiency among partition blocks
• Use larger subgraph blocks, such as slice or strip partitions, to maximize sequential access and minimize random access

  • 29. Partial Aggregation in Parallel
• Aggregation operation
– partially scatter vertex updates in parallel
– partially gather vertex updates in parallel
• Two-level graph computation parallelism
– parallel partial update at the subgraph partition level (slice, strip, or dice)
– parallel partial update at the vertex level

  • 31. Programmable Interface
• Vertex-centric programming API, such as Scatter and Gather
– Gather: collect vertex updates from neighbor vertices and incoming edges
– Scatter: send vertex updates to outgoing edges
• Compile iterative algorithms into a sequence of internal function (routine) calls that understand the internal data structures for accessing the graph by different types of subgraph partition blocks

  • 32. Partition-to-chunk Index and Vertex-to-partition Index
• The dice-level index is a dense index that maps a dice ID and its DVP (or SVP) to the chunks on disk where the corresponding dice partition is physically stored
• The strip-level index is a two-level sparse index that maps a strip ID to dice-level index blocks, and then maps each dice ID to the dice partition chunks in physical storage
• The slice-level index is a three-level sparse index with slice index blocks at the top, strip index blocks in the middle, and dice index blocks at the bottom, enabling fast retrieval of dices matching a slice-specific condition

  • 33. Partition-level Compression
• Iterative computations on large graphs incur non-trivial I/O cost
– The I/O processing of the Twitter dataset on a PC with 4 CPU cores and 16 GB memory takes 50.2% of the total running time for PageRank (5 iterations)
• Apply in-memory gzip compression to transform each graph partition block into a compressed format before storing it on disk

  • 34. Configuration of Partitioning Parameters
• User definition
• Simple estimation
• Regression-based learning
– Construct a polynomial regression model of the nonlinear relationship between the independent variables p, q, r (partition parameters) and the dependent variable T (runtime), with latent coefficients αijk and an error term ε
– The goal of regression-based learning is to determine the latent αijk and ε, yielding the function from p, q, r to T
– Select m samples (pl, ql, rl, Tl) (1≤l≤m) from existing experimental results
– Solve the m linear equations formed by the m selected samples to obtain concrete αijk and ε
– Use a successive convex approximation (SCA) method to find the optimal solution of the polynomial function (the minimum runtime T) and the optimal parameters p, q, r at that minimum
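The regression step above can be sketched as follows. The runtime samples and the degree bound (at most 1 per variable) are made up for illustration, and a brute-force grid scan stands in for the SCA solver.

```python
# Sketch: fit T ≈ sum_{i,j,k<=1} alpha_ijk * p^i * q^j * r^k from a few
# (p, q, r, T) runtime samples, then pick the best candidate configuration.
# Samples and grid below are illustrative assumptions.
import itertools
import numpy as np

samples = [(2, 2, 2, 40.0), (4, 2, 2, 31.0), (2, 4, 2, 33.0),
           (2, 2, 4, 35.0), (4, 4, 2, 27.0), (4, 2, 4, 30.0),
           (2, 4, 4, 32.0), (4, 4, 4, 26.0)]

terms = list(itertools.product(range(2), repeat=3))   # exponents (i, j, k)

def features(p, q, r):
    return [p**i * q**j * r**k for i, j, k in terms]

X = np.array([features(p, q, r) for p, q, r, _ in samples])
T = np.array([t for _, _, _, t in samples])
alpha, *_ = np.linalg.lstsq(X, T, rcond=None)         # latent coefficients

def predict(p, q, r):
    return float(np.dot(features(p, q, r), alpha))

# In place of the SCA solver, scan a small candidate grid for the minimum T:
best = min(itertools.product([2, 4, 8], repeat=3), key=lambda c: predict(*c))
print("best (p, q, r):", best, "predicted T:", predict(*best))
```

With m samples and m polynomial terms this reduces to the system of linear equations the slide describes; with more samples than terms, least squares absorbs the error term ε.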
  • 35. Experimental Evaluation
• Computer server
– Intel Core i5 2.66 GHz, 16 GB RAM, 1 TB hard drive, 64-bit Linux
• Graph-parallel systems
– GraphLab [Low et al., UAI'10]
– GraphChi [Kyrola et al., OSDI'12]
– X-Stream [Roy et al., SOSP'13]
• Graph applications

  • 40. GraphLego: Resource-Aware GPS
• Flexible multi-level hierarchical graph-parallel abstractions
– Model a large graph as a 3D cube with source vertex, destination vertex, and edge weight as the dimensions
– Partition a big graph by slice-, strip-, and dice-based graph partitioning
• Access locality optimization
– Dice-based data placement: store a large graph on disk by minimizing non-sequential disk access and enabling more structured in-memory access
– Construct a partition-to-chunk index and a vertex-to-partition index to facilitate fast access to slices, strips, and dices
– Implement partition-level in-memory gzip compression to optimize disk I/Os
• Optimization of partitioning parameters
– Build a regression-based learning model to discover the latent relationship between the number of partitions and the runtime