Pregel: A System for Large-Scale Graph Processing
Amir H. Payberah
amir@sics.se
Amirkabir University of Technology
(Tehran Polytechnic)
Amir H. Payberah (Tehran Polytechnic) Pregel 1393/9/3 1 / 40
Introduction
Graphs provide a flexible abstraction for describing relationships between
discrete objects.
Many problems can be modeled by graphs and solved with appropriate
graph algorithms.
Large Graph
Large-Scale Graph Processing
Large graphs need large-scale processing.
A large graph either cannot fit into the memory of a single computer, or
it fits only at huge cost.
Question
Can we use platforms like MapReduce or Spark, which are based on the data-parallel
model, for large-scale graph processing?
Data-Parallel Model for Large-Scale Graph Processing
The platforms that have worked well for developing parallel applications
are not necessarily effective for large-scale graph problems.
Why?
Graph Algorithms Characteristics (1/2)
Unstructured problems
• Difficult to extract parallelism based on partitioning of the data: the
irregular structure of graphs.
• Limited scalability: unbalanced computational loads resulting from
poorly partitioned data.
Data-driven computations
• Difficult to express parallelism based on partitioning of computation:
the structure of computations in the algorithm is not known a priori.
• The computations are dictated by nodes and links of the graph.
Graph Algorithms Characteristics (2/2)
Poor data locality
• The computations and data access patterns do not have much locality,
because of the irregular structure of graphs.
High data access to computation ratio
• Graph algorithms are often based on exploring the structure of a
graph to perform computations on the graph data.
• Runtime can be dominated by waiting on memory fetches, because of low locality.
Proposed Solution
Graph-Parallel Processing
Computation typically depends on the neighbors.
Graph-Parallel Processing
Restricts the types of computation.
New techniques to partition and distribute graphs.
Exploit graph structure.
Executes graph algorithms orders-of-magnitude faster than more
general data-parallel systems.
Data-Parallel vs. Graph-Parallel Computation
Data-parallel computation
• Record-centric view of data.
• Parallelism: processing independent data on separate resources.
Graph-parallel computation
• Vertex-centric view of graphs.
• Parallelism: partitioning graph (dependent) data across processing
resources, and resolving dependencies (along edges) through
iterative computation and communication.
Seven Bridges of Königsberg
Finding a walk through the city that would cross each bridge once
and only once.
Euler proved that the problem has no solution.
Map of Königsberg in Euler’s time, highlighting the river Pregel and the bridges.
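Euler's argument can be checked with a few lines of code. The sketch below assumes the usual modelling of the city as a multigraph with four land masses and seven bridges. The labels N, S, K and E (north bank, south bank, Kneiphof island, eastern island) are illustrative names, not taken from the slides. It relies on the known criterion that a connected multigraph has a walk using every edge exactly once only if zero or two vertices have odd degree.

```python
from collections import Counter

# Seven bridges between four land masses (labels are illustrative).
bridges = [
    ("N", "K"), ("N", "K"),   # two bridges: north bank <-> island
    ("S", "K"), ("S", "K"),   # two bridges: south bank <-> island
    ("K", "E"),               # island <-> eastern island
    ("N", "E"), ("S", "E"),   # each bank <-> eastern island
]

# Count how many bridge ends touch each land mass.
degree = Counter()
for u, v in bridges:
    degree[u] += 1
    degree[v] += 1

odd = [v for v, d in degree.items() if d % 2 == 1]

# The graph is connected, so only the parity condition needs checking.
has_euler_walk = len(odd) in (0, 2)
```

All four land masses have odd degree here, so the parity condition fails and no such walk exists.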
Pregel
Large-scale graph-parallel processing platform developed at Google.
Inspired by the bulk synchronous parallel (BSP) model.
Bulk Synchronous Parallel (1/2)
It is a parallel programming model.
The model consists of:
• A set of processor-memory pairs.
• A communications network that delivers messages in a point-to-point
manner.
• A mechanism for efficient barrier synchronization of all or
a subset of the processes.
• There are no special combining, replicating, or broadcasting facilities.
Bulk Synchronous Parallel (2/2)
All vertices update in parallel (at the same time).
Vertex-Centric Programs
Think like a vertex.
Each vertex computes its value individually, in parallel.
Each vertex can see its local context, and updates its value accordingly.
Data Model
A directed graph that stores the program state, e.g., the current
value.
Execution Model (1/3)
Applications run as a sequence of iterations, called supersteps.
During a superstep, a user-defined function (the Compute() method) is
invoked for each vertex, in parallel.
A vertex in superstep S can:
• read messages sent to it in superstep S-1.
• send messages to other vertices, which receive them in superstep S+1.
• modify its state.
Vertices communicate directly with one another by sending messages.
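The message timing described above can be illustrated with a small single-process sketch. It is not Pregel's API: `run_supersteps`, `record_arrival` and the `send` callback are hypothetical names, and the vertices are processed sequentially where a real system would run them in parallel. The point it shows is that a message sent in superstep S only becomes readable in superstep S+1.

```python
from collections import defaultdict

def run_supersteps(edges, compute, num_supersteps):
    """Run a fixed number of supersteps; messages sent in S arrive in S+1."""
    inbox = defaultdict(list)            # messages readable in the current superstep
    state = {v: None for v in edges}     # per-vertex state
    for superstep in range(num_supersteps):
        outbox = defaultdict(list)       # messages to deliver in the next superstep

        def send(dst, msg, _out=outbox):
            _out[dst].append(msg)

        for v in edges:                  # conceptually parallel, sequential here
            state[v] = compute(v, state[v], inbox[v], edges[v], superstep, send)
        inbox = outbox                   # barrier: next superstep sees these messages
    return state

def record_arrival(v, value, messages, out_neighbors, superstep, send):
    """Vertex 'a' emits a token in superstep 0; others forward it on receipt."""
    if (v == "a" and superstep == 0) or messages:
        for n in out_neighbors:
            send(n, "token")
        return superstep if value is None else value
    return value

# A chain a -> b -> c: the token advances one hop per superstep.
edges = {"a": ["b"], "b": ["c"], "c": []}
arrival = run_supersteps(edges, record_arrival, 4)
```

Each vertex stores the superstep in which it first handled the token, so `arrival` records one hop per superstep along the chain.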
Execution Model (2/3)
Superstep 0: all vertices are in the active state.
A vertex deactivates itself by voting to halt: no further work to do.
A halted vertex becomes active again if it receives a message.
The whole algorithm terminates when:
• All vertices are simultaneously inactive.
• There are no messages in transit.
Execution Model (3/3)
Aggregation: a mechanism for global communication, monitoring,
and data.
Runs after each superstep.
Each vertex can provide a value to an aggregator in superstep S.
The system combines those values and the resulting value is made
available to all vertices in superstep S + 1.
A number of predefined aggregators, e.g., min, max, sum.
Aggregation operators should be commutative and associative.
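A small sketch of why the operator should be commutative and associative: values can then be pre-aggregated per worker and merged in any order without changing the result. The helper `aggregate` and the sample values are hypothetical and are not part of Pregel's API.

```python
from functools import reduce

def aggregate(values, op, identity):
    """Fold values with a binary operator, starting from its identity element."""
    return reduce(op, values, identity)

# Hypothetical values contributed by the vertices of two workers in superstep S.
worker_values = [[3, 7, 1], [9, 4]]

# Each worker pre-aggregates locally, then the partial results are merged.
partial_max = [aggregate(vals, max, float("-inf")) for vals in worker_values]
global_max = aggregate(partial_max, max, float("-inf"))

partial_sum = [aggregate(vals, lambda a, b: a + b, 0) for vals in worker_values]
global_sum = aggregate(partial_sum, lambda a, b: a + b, 0)

# Same answer as aggregating every value in a single pass.
flat = [v for vals in worker_values for v in vals]
assert global_max == max(flat) and global_sum == sum(flat)
```

The merged values would then be made available to every vertex in superstep S + 1.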
Example: Max Value
In superstep 0 every vertex sends its value to its neighbors; afterwards a
vertex sends only when its value has changed.
i_val := val
for each message m
    if m > val then val := m
if superstep > 0 and i_val == val then
    vote_to_halt
else
    for each neighbor v
        send_message(v, val)
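The max-value example can be simulated in a single process. The sketch below assumes that every vertex sends its value in superstep 0, that a halted vertex is reactivated by an incoming message, and that the run ends once all vertices have halted and no messages are in transit. The function name `max_value` and the sample graph are illustrative.

```python
from collections import defaultdict

def max_value(edges, values):
    """Propagate the largest value to every vertex; returns (values, supersteps)."""
    values = dict(values)
    halted = set()
    inbox = {}
    superstep = 0
    while True:
        # A vertex runs if it has not halted or if a message reactivates it.
        runnable = [v for v in edges if v not in halted or inbox.get(v)]
        if not runnable:
            break                          # all halted, no messages in transit
        outbox = defaultdict(list)
        for v in runnable:
            halted.discard(v)
            old = values[v]
            values[v] = max([old] + inbox.get(v, []))
            if superstep > 0 and values[v] == old:
                halted.add(v)              # vote to halt: nothing new learned
            else:
                for n in edges[v]:
                    outbox[n].append(values[v])   # delivered next superstep
        inbox = outbox
        superstep += 1
    return values, superstep

# A bidirectional chain a - b - c - d with initial values 3, 6, 2, 1.
edges = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
result, steps = max_value(edges, {"a": 3, "b": 6, "c": 2, "d": 1})
```

On this chain every vertex ends with the value 6, and the loop stops by itself without a fixed iteration limit.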
Example: PageRank
Update ranks in parallel.
Iterate until convergence.
$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$
Pregel_PageRank(i, messages):
    // receive all the messages
    total = 0
    foreach(msg in messages):
        total = total + msg
    // update the rank of this vertex
    R[i] = 0.15 + total
    // send new messages to neighbors
    foreach(j in out_neighbors[i]):
        sendmsg(R[i] * wij) to vertex j
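A runnable version of the PageRank vertex program might look like the sketch below. Several details are assumptions, since the slides do not define them: the edge weight is taken as w_ij = 0.85 / out_degree(i), every rank starts at 1.0, and the loop runs a fixed number of supersteps instead of testing for convergence. The function name `pregel_pagerank` is illustrative.

```python
def pregel_pagerank(out_neighbors, num_supersteps=30, damping=0.85):
    """Vertex-centric PageRank: R[i] = (1 - damping) + sum of incoming messages."""
    rank = {v: 1.0 for v in out_neighbors}       # assumed initial rank
    inbox = {v: [] for v in out_neighbors}
    for superstep in range(num_supersteps):
        outbox = {v: [] for v in out_neighbors}
        for i, nbrs in out_neighbors.items():
            if superstep > 0:
                # Update the rank from messages sent in the previous superstep.
                rank[i] = (1 - damping) + sum(inbox[i])
            for j in nbrs:
                w_ij = damping / len(nbrs)       # assumed edge weight
                outbox[j].append(rank[i] * w_ij)
        inbox = outbox                           # barrier between supersteps
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pregel_pagerank(graph)
```

With these weights and no vertices lacking out-edges, the ranks keep summing to the number of vertices, which gives a quick sanity check on the message passing.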
Partitioning the Graph
The Pregel library divides a graph into a number of partitions,
each consisting of a set of vertices and all of those vertices’ outgoing
edges.
Vertices are assigned to partitions based on their vertex ID (e.g.,
hash(ID) mod N, where N is the number of partitions).
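Hash-based assignment can be sketched as follows. The helper names `partition_of` and `partition_graph` are hypothetical, and `zlib.crc32` is used here only as a stand-in for a stable hash function, since Python's built-in `hash()` for strings varies between processes.

```python
import zlib

def partition_of(vertex_id, num_partitions):
    """Map a vertex ID to a partition index with a process-independent hash."""
    digest = zlib.crc32(str(vertex_id).encode("utf-8"))
    return digest % num_partitions

def partition_graph(out_edges, num_partitions):
    """Split a graph so each partition holds vertices with their outgoing edges."""
    partitions = [dict() for _ in range(num_partitions)]
    for v, targets in out_edges.items():
        partitions[partition_of(v, num_partitions)][v] = list(targets)
    return partitions

graph = {"a": ["b"], "b": ["c"], "c": [], "d": ["a", "c"]}
parts = partition_graph(graph, 3)
```

Because the owner of a vertex depends only on its ID, any worker can compute where to send a message without consulting a lookup table.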
Implementation (1/4)
Master-worker model.
Copies of the user program are started on a cluster of machines.
One copy becomes the master.
Implementation (2/4)
The master is responsible for
• Coordinating worker activity.
• Determining the number of partitions.
Each worker is responsible for
• Maintaining the state of its partitions.
• Executing the user’s Compute() method on its vertices.
• Managing messages to and from other workers.
Implementation (3/4)
The master assigns one or more partitions to each worker.
The master assigns a portion of user input to each worker.
• Set of records containing an arbitrary number of vertices and edges.
• If a worker loads a vertex that belongs to that worker’s partitions,
the appropriate data structures are immediately updated.
• Otherwise the worker enqueues a message to the remote peer that
owns the vertex.
Implementation (4/4)
After the input has finished loading, all vertices are marked as active.
The master instructs each worker to perform a superstep.
After the computation halts, the master may instruct each worker
to save its portion of the graph.
Combiner
Sending a message between workers incurs some overhead.
This can be reduced in some cases with a combiner: sometimes a vertex only cares
about a summary value of the messages it is sent (e.g., min, max,
sum, avg).
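The effect of a combiner can be sketched as a worker-side step that merges messages bound for the same destination before they cross the network. The function `combine_outgoing` and the buffered messages are hypothetical. The example uses max, which is safe to apply pairwise in any order.

```python
def combine_outgoing(messages, combiner):
    """Merge buffered (destination, value) pairs into one value per destination."""
    combined = {}
    for dst, value in messages:
        if dst in combined:
            combined[dst] = combiner(combined[dst], value)
        else:
            combined[dst] = value
    return combined

# Four messages buffered on one worker, three of them for vertex v1.
buffered = [("v1", 5), ("v2", 1), ("v1", 9), ("v1", 2)]
to_send = combine_outgoing(buffered, max)
```

Only one message per destination leaves the worker. A summary such as an average cannot be merged pairwise directly, so it would need to travel as a (sum, count) pair.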
Fault Tolerance (1/2)
Fault tolerance is achieved through checkpointing.
At the start of a superstep, the master tells the workers to save their state:
• Vertex values, edge values, incoming messages
• Saved to persistent storage
The master saves aggregator values (if any).
This is not necessarily done at every superstep, because it is costly.
Fault Tolerance (2/2)
When the master detects one or more worker failures:
• All workers revert to last checkpoint.
• Continue from there.
• That is a lot of repeated work.
• At least it is better than redoing the whole job.
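The cost of rolling back can be made concrete with a toy simulation. Everything here is an assumption made for illustration: the whole computation is reduced to a single `state` value, `step` stands for one superstep, and a failure is injected at a chosen superstep. The function name `run_with_checkpoints` is hypothetical.

```python
import copy

def run_with_checkpoints(state, step, num_supersteps, checkpoint_every, fail_at=None):
    """Run supersteps, checkpointing periodically; roll back once on failure."""
    checkpoint = (0, copy.deepcopy(state))
    executed = 0                         # supersteps actually computed
    superstep = 0
    while superstep < num_supersteps:
        if superstep % checkpoint_every == 0:
            checkpoint = (superstep, copy.deepcopy(state))   # save state
        if superstep == fail_at:
            fail_at = None               # the failure happens only once
            superstep, state = checkpoint[0], copy.deepcopy(checkpoint[1])
            continue                     # resume from the last checkpoint
        state = step(state, superstep)
        executed += 1
        superstep += 1
    return state, executed

# Ten supersteps, a checkpoint every four, and a failure at superstep 6.
final, work = run_with_checkpoints(0, lambda s, i: s + 1, 10, 4, fail_at=6)
```

In this run supersteps 4 and 5 are executed twice, so twelve supersteps of work produce ten supersteps of progress. A longer checkpoint interval saves checkpointing cost but increases the work repeated after a failure.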
Pregel Limitations
Inefficient if different regions of the graph converge at different
speeds.
Can suffer if one task is more expensive than the others.
The runtime of each phase is determined by the slowest machine.
Pregel Summary
Bulk Synchronous Parallel model
Vertex-centric
Supersteps: a sequence of iterations
Master-worker model
Communication: message passing
Questions?
Amir H. Payberah (Tehran Polytechnic) Pregel 1393/9/3 40 / 40

Contenu connexe

Tendances

Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaDatabricks
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
 
Spark overview
Spark overviewSpark overview
Spark overviewLisa Hua
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksLegacy Typesafe (now Lightbend)
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergAnant Corporation
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageJulien Le Dem
 
Introdução à Neo4j
Introdução à Neo4j Introdução à Neo4j
Introdução à Neo4j Neo4j
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationDatabricks
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®confluent
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaAvinash Ramineni
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®confluent
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...Altinity Ltd
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageDatabricks
 

Tendances (20)

Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Spark overview
Spark overviewSpark overview
Spark overview
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache IcebergData Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
Data Engineer's Lunch #83: Strategies for Migration to Apache Iceberg
 
Open core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineageOpen core summit: Observability for data pipelines with OpenLineage
Open core summit: Observability for data pipelines with OpenLineage
 
Introdução à Neo4j
Introdução à Neo4j Introdução à Neo4j
Introdução à Neo4j
 
Spark SQL Beyond Official Documentation
Spark SQL Beyond Official DocumentationSpark SQL Beyond Official Documentation
Spark SQL Beyond Official Documentation
 
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
Best Practices for Streaming IoT Data with MQTT and Apache Kafka®
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Log analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and KibanaLog analysis using Logstash,ElasticSearch and Kibana
Log analysis using Logstash,ElasticSearch and Kibana
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®CDC patterns in Apache Kafka®
CDC patterns in Apache Kafka®
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
OSA Con 2022 - Arrow in Flight_ New Developments in Data Connectivity - David...
 
Observability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineageObservability for Data Pipelines With OpenLineage
Observability for Data Pipelines With OpenLineage
 

En vedette

Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEPAmir Payberah
 
Introduction to stream processing
Introduction to stream processingIntroduction to stream processing
Introduction to stream processingAmir Payberah
 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3Amir Payberah
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module ProgrammingAmir Payberah
 
Graph processing - Graphlab
Graph processing - GraphlabGraph processing - Graphlab
Graph processing - GraphlabAmir Payberah
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and SpannerAmir Payberah
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1Amir Payberah
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Amir Payberah
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2Amir Payberah
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Amir Payberah
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2Amir Payberah
 

En vedette (20)

Spark Stream and SEEP
Spark Stream and SEEPSpark Stream and SEEP
Spark Stream and SEEP
 
MapReduce
MapReduceMapReduce
MapReduce
 
Introduction to stream processing
Introduction to stream processingIntroduction to stream processing
Introduction to stream processing
 
Process Management - Part3
Process Management - Part3Process Management - Part3
Process Management - Part3
 
Linux Module Programming
Linux Module ProgrammingLinux Module Programming
Linux Module Programming
 
Graph processing - Graphlab
Graph processing - GraphlabGraph processing - Graphlab
Graph processing - Graphlab
 
MegaStore and Spanner
MegaStore and SpannerMegaStore and Spanner
MegaStore and Spanner
 
Process Management - Part1
Process Management - Part1Process Management - Part1
Process Management - Part1
 
Threads
ThreadsThreads
Threads
 
Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3Introduction to Operating Systems - Part3
Introduction to Operating Systems - Part3
 
Mesos and YARN
Mesos and YARNMesos and YARN
Mesos and YARN
 
Main Memory - Part2
Main Memory - Part2Main Memory - Part2
Main Memory - Part2
 
IO Systems
IO SystemsIO Systems
IO Systems
 
Security
SecuritySecurity
Security
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
Protection
ProtectionProtection
Protection
 
Process Management - Part2
Process Management - Part2Process Management - Part2
Process Management - Part2
 
Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2Introduction to Operating Systems - Part2
Introduction to Operating Systems - Part2
 
Storage
StorageStorage
Storage
 
CPU Scheduling - Part2
CPU Scheduling - Part2CPU Scheduling - Part2
CPU Scheduling - Part2
 

Similaire à Graph processing - Pregel

Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1Amir Payberah
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2Amir Payberah
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Amir Payberah
 
An Experiment-Driven Performance Model of Stream Processing Operators in Fog ...
An Experiment-Driven Performance Model of Stream Processing Operators in Fog ...An Experiment-Driven Performance Model of Stream Processing Operators in Fog ...
An Experiment-Driven Performance Model of Stream Processing Operators in Fog ...FogGuru MSCA Project
 
Virtual Memory - Part2
Virtual Memory - Part2Virtual Memory - Part2
Virtual Memory - Part2Amir Payberah
 
Reliability Study of wireless corba using Petri net and end to end instanteno...
Reliability Study of wireless corba using Petri net and end to end instanteno...Reliability Study of wireless corba using Petri net and end to end instanteno...
Reliability Study of wireless corba using Petri net and end to end instanteno...Ahmed Koriem
 
Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Introduction to cloud computing and big data - part1
Introduction to cloud computing and big data - part1Amir Payberah
 
Boetticher Presentation Promise 2008v2
Boetticher Presentation Promise 2008v2Boetticher Presentation Promise 2008v2
Boetticher Presentation Promise 2008v2gregoryg
 
MHH_20Feb_2012111111111111111111111111111.ppt
MHH_20Feb_2012111111111111111111111111111.pptMHH_20Feb_2012111111111111111111111111111.ppt
MHH_20Feb_2012111111111111111111111111111.pptBiHongPhc
 
Introduction to cloud computing and big data - part2
Introduction to cloud computing and big data - part2Introduction to cloud computing and big data - part2
Introduction to cloud computing and big data - part2Amir Payberah
 
Raminder kaur presentation_two
Raminder kaur presentation_twoRaminder kaur presentation_two
Raminder kaur presentation_tworamikaurraminder
 
HyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfHyperPrompt:Prompt-based Task-Conditioning of Transformerspdf
HyperPrompt:Prompt-based Task-Conditioning of TransformerspdfPo-Chuan Chen
 
Virtual Memory - Part1
Virtual Memory - Part1Virtual Memory - Part1
Virtual Memory - Part1Amir Payberah
 

Similaire à Graph processing - Pregel (20)

Spark
SparkSpark
Spark
 
Process Synchronization - Part1
Process Synchronization -  Part1Process Synchronization -  Part1
Process Synchronization - Part1
 
Process Synchronization - Part2
Process Synchronization -  Part2Process Synchronization -  Part2
Process Synchronization - Part2
 
Scala
ScalaScala
Scala
 
Chubby
ChubbyChubby
Chubby
 
Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1Introduction to Operating Systems - Part1
Introduction to Operating Systems - Part1
 
Graph processing - Pregel

  • 1. Pregel: A System for Large-Scale Graph Processing Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Pregel 1393/9/3 1 / 40
  • 3. Introduction
    Graphs provide a flexible abstraction for describing relationships between discrete objects.
    Many problems can be modeled by graphs and solved with appropriate graph algorithms.
  • 4. Large Graph
  • 5. Large-Scale Graph Processing
    Large graphs need large-scale processing.
    A large graph either cannot fit into the memory of a single computer, or fits only at huge cost.
  • 6. Question
    Can we use platforms like MapReduce or Spark, which are based on the data-parallel model, for large-scale graph processing?
  • 7. Data-Parallel Model for Large-Scale Graph Processing
    The platforms that have worked well for developing parallel applications are not necessarily effective for large-scale graph problems. Why?
  • 8-9. Graph Algorithms Characteristics (1/2)
    Unstructured problems
      • Difficult to extract parallelism based on partitioning of the data: the irregular structure of graphs.
      • Limited scalability: unbalanced computational loads resulting from poorly partitioned data.
    Data-driven computations
      • Difficult to express parallelism based on partitioning of computation: the structure of computations in the algorithm is not known a priori.
      • The computations are dictated by nodes and links of the graph.
  • 10-11. Graph Algorithms Characteristics (2/2)
    Poor data locality
      • The computations and data access patterns do not have much locality, because of the irregular structure of graphs.
    High data access to computation ratio
      • Graph algorithms are often based on exploring the structure of a graph to perform computations on the graph data.
      • Runtime can be dominated by waiting for memory fetches, because of the low locality.
  • 12-13. Proposed Solution
    Graph-Parallel Processing
    Computation typically depends on the neighbors.
  • 14. Graph-Parallel Processing
      • Restricts the types of computation.
      • Uses new techniques to partition and distribute graphs.
      • Exploits graph structure.
      • Can execute graph algorithms orders of magnitude faster than more general data-parallel systems.
  • 15-17. Data-Parallel vs. Graph-Parallel Computation
    Data-parallel computation
      • Record-centric view of data.
      • Parallelism: processing independent data on separate resources.
    Graph-parallel computation
      • Vertex-centric view of graphs.
      • Parallelism: partitioning graph (dependent) data across processing resources, and resolving dependencies (along edges) through iterative computation and communication.
  • 19. Seven Bridges of Königsberg
    The problem: find a walk through the city that crosses each bridge once and only once.
    Euler proved that the problem has no solution.
    [Figure: map of Königsberg in Euler's time, highlighting the river Pregel and the bridges.]
  • 20. Pregel
    A large-scale graph-parallel processing platform developed at Google.
    Inspired by the bulk synchronous parallel (BSP) model.
  • 21-25. Bulk Synchronous Parallel (1/2)
    BSP is a parallel programming model. The model consists of:
      • A set of processor-memory pairs.
      • A communications network that delivers messages in a point-to-point manner.
      • A mechanism for efficient barrier synchronization of all or a subset of the processes.
      • There are no special combining, replicating, or broadcasting facilities.
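The three BSP ingredients listed above (processor-memory pairs, point-to-point messages, barrier synchronization) can be illustrated with a small threaded sketch. This is only an illustration under assumptions, not how Pregel is implemented: the function name `run_bsp`, the ring topology, and the max-propagation task are all hypothetical choices made for the example. Each thread plays one processor-memory pair, and messages sent in one superstep become readable only after the barrier.

```python
import threading


def run_bsp(num_workers, num_supersteps):
    """Toy BSP run: every worker starts with its own rank and, in each
    superstep, forwards the largest value it has seen to its right neighbour."""
    # Double-buffered inboxes: "current" is read in this superstep,
    # "next" collects messages for the following one.
    inbox = {
        "current": [[] for _ in range(num_workers)],
        "next": [[] for _ in range(num_workers)],
    }
    locks = [threading.Lock() for _ in range(num_workers)]
    results = [None] * num_workers

    def deliver():
        # Runs in exactly one thread when all workers reached the barrier.
        inbox["current"] = inbox["next"]
        inbox["next"] = [[] for _ in range(num_workers)]

    barrier = threading.Barrier(num_workers, action=deliver)

    def worker(rank):
        value = rank  # local memory of this processor
        for _ in range(num_supersteps):
            # 1. Local computation on messages from the previous superstep.
            for message in inbox["current"][rank]:
                value = max(value, message)
            # 2. Point-to-point communication with one other worker.
            target = (rank + 1) % num_workers
            with locks[target]:
                inbox["next"][target].append(value)
            # 3. Barrier synchronization ends the superstep.
            barrier.wait()
        results[rank] = value

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

With a ring of four workers, four supersteps are enough for the largest rank to reach every worker, since a value travels one hop per superstep.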
  • 26. Bulk Synchronous Parallel (2/2)
    All vertices update in parallel (at the same time).
  • 27-29. Vertex-Centric Programs
    Think like a vertex.
    Each vertex computes its own value individually, in parallel.
    Each vertex can see its local context, and updates its value accordingly.
  • 30. Data Model
    A directed graph that stores the program state, e.g., the current value.
  • 31-34. Execution Model (1/3)
    Applications run in a sequence of iterations, called supersteps.
    During a superstep, a user-defined function (the Compute() method) is invoked for each vertex, in parallel.
    A vertex in superstep S can:
      • read messages sent to it in superstep S-1.
      • send messages to other vertices, which receive them in superstep S+1.
      • modify its state.
    Vertices communicate directly with one another by sending messages.
  • 35-38. Execution Model (2/3)
    Superstep 0: all vertices are in the active state.
    A vertex deactivates itself by voting to halt when it has no further work to do.
    A halted vertex becomes active again if it receives a message.
    The whole algorithm terminates when:
      • All vertices are simultaneously inactive.
      • There are no messages in transit.
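The activation rules above amount to a two-state machine per vertex plus a global termination test. A minimal sketch, assuming a hypothetical `VertexState` class and `should_terminate` helper (neither name comes from the slides), could look like this:

```python
from dataclasses import dataclass, field


@dataclass
class VertexState:
    active: bool = True                        # superstep 0: every vertex is active
    inbox: list = field(default_factory=list)  # messages not yet consumed

    def vote_to_halt(self):
        # The vertex declares that it has no further work to do.
        self.active = False

    def deliver(self, message):
        # Receiving a message reactivates a halted vertex.
        self.inbox.append(message)
        self.active = True


def should_terminate(vertices):
    """True when all vertices are inactive and no message is pending."""
    return all(not v.active and not v.inbox for v in vertices)
```

Here a non-empty inbox stands in for "messages in transit", which is a simplification of what a distributed runtime would have to track.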
  • 39-41. Execution Model (3/3)
    Aggregation: a mechanism for global communication, monitoring, and data.
    Runs after each superstep.
    Each vertex can provide a value to an aggregator in superstep S.
    The system combines those values and the resulting value is made available to all vertices in superstep S+1.
    A number of aggregators are predefined, e.g., min, max, sum.
    Aggregation operators should be commutative and associative.
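One way to see why aggregation operators should be commutative and associative: the system is then free to combine values per worker first and merge the partial results afterwards, in any order. The sketch below assumes a hypothetical two-level scheme and a hypothetical helper name `two_level_aggregate`; the slides do not describe how the values are actually combined.

```python
from functools import reduce
import operator


def two_level_aggregate(values_per_worker, op, identity):
    """Combine values locally on each worker, then combine the partial results.

    For a commutative and associative `op`, the result equals a single
    reduction over all values, regardless of how they are grouped.
    """
    partials = [reduce(op, values, identity) for values in values_per_worker]
    return reduce(op, partials, identity)


# Hypothetical values contributed by vertices in superstep S, grouped by worker.
contributed = [[3, 9, 1], [7], [4, 8]]
global_max = two_level_aggregate(contributed, max, float("-inf"))
global_sum = two_level_aggregate(contributed, operator.add, 0)
```

The resulting `global_max` and `global_sum` are the values that would be made visible to every vertex in superstep S+1.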
  • 42-45. Example: Max Value (1/4 to 4/4)
      i_val := val
      for each message m
        if m > val then val := m
      if i_val == val then
        vote_to_halt
      else
        for each neighbor v
          send_message(v, val)
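The max-value pseudocode can be tried out with a small single-process simulation of supersteps. This is a sketch under assumptions: the function name `max_value` and the dictionary-based graph are hypothetical, and the sketch makes explicit that every vertex sends its value in superstep 0, since otherwise no vertex would ever see a change and all of them would halt immediately.

```python
def max_value(graph, values):
    """Propagate the maximum value through the graph.

    graph:  dict mapping each vertex to a list of its out-neighbors
    values: dict mapping each vertex to its initial value
    Returns (final values, number of supersteps executed).
    """
    values = dict(values)
    inbox = {v: [] for v in graph}
    halted = set()
    superstep = 0
    while True:
        # A vertex runs if it has not halted, or if a message woke it up.
        runnable = [v for v in graph if v not in halted or inbox[v]]
        if not runnable:
            break  # all vertices inactive and no messages pending
        outbox = {v: [] for v in graph}
        for v in runnable:
            halted.discard(v)
            i_val = values[v]
            values[v] = max([i_val] + inbox[v])
            if superstep == 0 or values[v] != i_val:
                for neighbor in graph[v]:
                    outbox[neighbor].append(values[v])  # send_message
            else:
                halted.add(v)  # vote_to_halt
        inbox = outbox  # messages become visible in the next superstep
        superstep += 1
    return values, superstep


# A chain a - b - c - d with edges in both directions.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
final_values, steps = max_value(chain, {"a": 3, "b": 6, "c": 2, "d": 1})
```

On this chain every vertex ends with the value 6, the largest initial value.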
  • 46. Example: PageRank
    Update ranks in parallel.
    Iterate until convergence.
    $R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$
  • 47. Example: PageRank
      Pregel_PageRank(i, messages):
        // receive all the messages
        total = 0
        foreach(msg in messages):
          total = total + msg
        // update the rank of this vertex
        R[i] = 0.15 + total
        // send new messages to neighbors
        foreach(j in out_neighbors[i]):
          sendmsg(R[i] * w_ij) to vertex j
    $R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$
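The PageRank pseudocode translates into a short single-process sketch. Several things here are assumptions rather than statements from the slides: the function name `pagerank`, a fixed number of supersteps in place of a convergence test, an initial rank of 1.0, and edge weights chosen as w_ij = 0.85 / out_degree(i).

```python
def pagerank(out_neighbors, supersteps=30, base=0.15, damping=0.85):
    """Iterate R[i] = base + sum over in-neighbors j of w_ji * R[j].

    out_neighbors: dict mapping each vertex to a list of its out-neighbors;
                   every neighbor must also appear as a key.
    """
    ranks = {v: 1.0 for v in out_neighbors}  # assumed initial rank
    for _ in range(supersteps):
        # Messages sent in this superstep, summed per receiving vertex.
        total = {v: 0.0 for v in out_neighbors}
        for i, neighbors in out_neighbors.items():
            if not neighbors:
                continue  # a vertex without out-edges sends nothing
            w_ij = damping / len(neighbors)  # assumed edge weight
            for j in neighbors:
                total[j] += ranks[i] * w_ij
        # Every vertex updates its rank from the messages it received.
        ranks = {v: base + total[v] for v in out_neighbors}
    return ranks


ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

With these weights and no vertices lacking out-edges, the ranks keep summing to the number of vertices, which gives a quick sanity check.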
  • 48. Partitioning the Graph
    The Pregel library divides a graph into a number of partitions, each consisting of a set of vertices and all of those vertices' outgoing edges.
    Vertices are assigned to partitions based on their vertex ID (e.g., hash(ID)).
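A hash-based assignment of vertices to partitions can be sketched as follows. The helper names `partition_of` and `build_partitions` are hypothetical, and `zlib.crc32` is used only as a stable stand-in for hash(ID), because Python's built-in `hash()` for strings changes between processes.

```python
import zlib


def partition_of(vertex_id, num_partitions):
    """Map a vertex ID to a partition index via a stable hash."""
    return zlib.crc32(str(vertex_id).encode("utf-8")) % num_partitions


def build_partitions(edges, num_partitions):
    """Group vertices, together with all their outgoing edges, by partition.

    edges: iterable of (source, destination) pairs
    Returns a list of dicts, one per partition: vertex -> list of out-neighbors.
    """
    partitions = [dict() for _ in range(num_partitions)]
    for src, dst in edges:
        # The edge lives with its source vertex.
        partitions[partition_of(src, num_partitions)].setdefault(src, []).append(dst)
        # Make sure the destination exists in its own partition too.
        partitions[partition_of(dst, num_partitions)].setdefault(dst, [])
    return partitions


parts = build_partitions([("a", "b"), ("a", "c"), ("b", "c")], num_partitions=4)
```

Because the owner of a vertex depends only on its ID, any worker can compute where a message for that vertex has to go without a lookup table.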
  • 49. Implementation (1/4)
    Master-worker model.
    The user program is copied to the machines.
    One copy becomes the master.
  • 50. Implementation (2/4)
    The master is responsible for:
      • Coordinating the workers' activity.
      • Determining the number of partitions.
    Each worker is responsible for:
      • Maintaining the state of its partitions.
      • Executing the user's Compute() method on its vertices.
      • Managing messages to and from other workers.
  • 51-52. Implementation (3/4)
    The master assigns one or more partitions to each worker.
    The master assigns a portion of the user input to each worker.
      • The input is a set of records containing an arbitrary number of vertices and edges.
      • If a worker loads a vertex that belongs to that worker's partitions, the appropriate data structures are immediately updated.
      • Otherwise the worker enqueues a message to the remote peer that owns the vertex.
  • 53. Implementation (4/4)
    After the input has finished loading, all vertices are marked as active.
    The master instructs each worker to perform a superstep.
    After the computation halts, the master may instruct each worker to save its portion of the graph.
  • 54. Combiner
    Sending a message between workers incurs some overhead.
    A combiner can reduce this overhead in some cases: sometimes a vertex only cares about a summary value of the messages it is sent (e.g., min, max, sum, avg).
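The effect of a combiner can be shown with a short sketch: before messages leave a worker, all messages addressed to the same vertex are folded into a single value. The function name `combine_outgoing` and the tuple-based message format are assumptions made for this illustration.

```python
def combine_outgoing(messages, combine):
    """Fold all messages for the same target vertex into one value.

    messages: iterable of (target_vertex, value) pairs produced on one worker
    combine:  binary function such as max, min, or addition
    Returns a dict mapping each target vertex to its combined value.
    """
    combined = {}
    for target, value in messages:
        if target in combined:
            combined[target] = combine(combined[target], value)
        else:
            combined[target] = value
    return combined


# Four outgoing messages shrink to two when the receivers only need the maximum.
outgoing = [("x", 3), ("y", 1), ("x", 7), ("x", 5)]
to_send = combine_outgoing(outgoing, max)
```

This only pays off when the receiving vertex needs a summary; an algorithm that must see every individual message cannot use it.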
  • 55. Fault Tolerance (1/2)
    Fault tolerance is achieved through checkpointing.
    At the start of a superstep, the master tells the workers to save their state:
      • Vertex values, edge values, and incoming messages.
      • The state is saved to persistent storage.
    The master saves the aggregator values (if any).
    Checkpointing is not necessarily done at every superstep, because it is costly.
  • 56. Fault Tolerance (2/2)
    When the master detects one or more worker failures:
      • All workers revert to the last checkpoint.
      • The computation continues from there.
      • That is a lot of repeated work.
      • At least it is better than redoing the whole job.
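The checkpoint-and-revert behaviour can be sketched in a few lines. Everything named here is hypothetical: the class `CheckpointedRun`, the `checkpoint_every` parameter, and the in-memory copy that stands in for persistent storage.

```python
import copy


class CheckpointedRun:
    """Toy superstep loop that checkpoints every few supersteps."""

    def __init__(self, state, checkpoint_every):
        self.state = state
        self.superstep = 0
        self.checkpoint_every = checkpoint_every
        self._saved = None  # stand-in for persistent storage

    def step(self, compute):
        # Checkpoint at the start of the superstep, but not at every superstep.
        if self.superstep % self.checkpoint_every == 0:
            self._saved = (self.superstep, copy.deepcopy(self.state))
        self.state = compute(self.state)
        self.superstep += 1

    def recover(self):
        # After a failure, revert to the last checkpoint and continue from there.
        self.superstep, saved_state = self._saved
        self.state = copy.deepcopy(saved_state)


def increment(state):
    return {"v": state["v"] + 1}


run = CheckpointedRun({"v": 0}, checkpoint_every=3)
for _ in range(5):
    run.step(increment)
run.recover()  # simulated worker failure during superstep 5
```

After `recover()` the run is back at superstep 3, so supersteps 3 and 4 have to be repeated. That is the repeated work the slide mentions, and a smaller `checkpoint_every` trades it against more frequent saving.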
  • 57. Pregel Limitations
    Inefficient if different regions of the graph converge at different speeds.
    Can suffer if one task is more expensive than the others.
    The runtime of each phase is determined by the slowest machine.
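The last limitation can be made concrete with a little arithmetic: because of the barrier, a superstep takes as long as its slowest worker. The function names and the timing numbers below are made up for illustration only.

```python
def bsp_runtime(times_per_superstep):
    """Total runtime when every superstep waits for its slowest worker."""
    return sum(max(times) for times in times_per_superstep)


def balanced_runtime(times_per_superstep):
    """Runtime if the same work were spread perfectly evenly across workers."""
    return sum(sum(times) / len(times) for times in times_per_superstep)


# Hypothetical per-worker compute times (seconds) for two supersteps.
measured = [[1.0, 1.0, 4.0], [2.0, 2.0, 2.0]]
actual = bsp_runtime(measured)       # 4.0 + 2.0
ideal = balanced_runtime(measured)   # 2.0 + 2.0
```

In the first superstep two workers sit idle for three seconds while the third finishes, which is where the gap between `actual` and `ideal` comes from.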
  • 58. Pregel Summary
      • Bulk Synchronous Parallel model
      • Vertex-centric
      • Superstep: sequence of iterations
      • Master-worker model
      • Communication: message passing
  • 59. Questions?