Fast, Scalable Graph Processing:
Apache Giraph on YARN
Hello, I'm Eli Reisman!
Eli is...
•  Apache Giraph Committer and PMC Member
•  Apache Tajo Committer
•  Wrote initial port of Giraph to YARN
•  Collaborating with fellow Giraph committers on
Giraph in Action book for Manning publishing
•  Only able to do all this with the support of:
Eli is a software engineer at Etsy.
Etsy enables non-technical folks to sell
handmade and vintage stuff:
We have a great blog called Code As Craft:
...but enough about me, let's talk Giraph!
Key Topics
What is Apache Giraph?
Why do I need it?
Giraph + MapReduce
Giraph + YARN
Giraph Roadmap
What is Apache Giraph?
Giraph is a framework for performing offline
batch processing of semi-structured graph
data on a massive scale.
Giraph is loosely based upon Google's Pregel
graph processing framework.
Giraph performs iterative calculations on top of an
existing Hadoop cluster.
Giraph uses Apache ZooKeeper to enforce atomic
barrier waits and perform leader election.
(Illustration: two workers report "Done!" while a third is "...Still working...")
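
The barrier wait can be illustrated on a single machine. This is only a minimal sketch: Python's `threading.Barrier` stands in for the ZooKeeper-coordinated barrier, no ZooKeeper or Giraph API is used, and the names `run_supersteps`, `run_worker`, and `events` are hypothetical.

```python
import threading

def run_supersteps(num_workers=3, num_supersteps=2):
    """Run workers in lockstep; return the ordered log of (event, superstep, worker)."""
    barrier = threading.Barrier(num_workers)  # local stand-in for a distributed barrier
    events = []
    lock = threading.Lock()

    def run_worker(worker_id):
        for step in range(num_supersteps):
            with lock:
                events.append(("start", step, worker_id))
            # ... a real worker would compute its share of the graph here ...
            with lock:
                events.append(("done", step, worker_id))
            barrier.wait()  # block until every worker has finished this superstep

    threads = [threading.Thread(target=run_worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return events

events = run_supersteps()
```

In the returned log, no worker starts superstep 1 until every worker has logged "done" for superstep 0, which is the guarantee the barrier provides.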
Giraph benefits from a vibrant Apache community, and is
under active development:
Why do I need it?
Giraph makes graph algorithms easy to reason about
and implement by following the Bulk Synchronous
Parallel (BSP) programming model.
In BSP, all algorithms are implemented from the point
of view of a single vertex in the input graph
performing a single iteration of the computation.
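
To make the single-vertex point of view concrete, here is a minimal sketch in plain Python. It is not the Giraph API: `compute` and `run_bsp` are hypothetical names, and the algorithm (propagating the largest value through the graph) is chosen only because it is short.

```python
def compute(value, messages):
    """One vertex, one iteration: adopt the largest value seen so far.

    Returns (new_value, changed); `changed` says whether to notify neighbors.
    """
    new_value = max([value] + messages)
    return new_value, new_value != value

def run_bsp(values, out_edges, max_supersteps=50):
    """Drive `compute` over all vertices in synchronized supersteps."""
    values = dict(values)
    # Superstep 0: every vertex announces its initial value to its neighbors.
    inbox = {v: [] for v in values}
    for v, targets in out_edges.items():
        for t in targets:
            inbox[t].append(values[v])
    for _ in range(max_supersteps):
        if not any(inbox.values()):
            break  # no messages in flight: the computation has halted
        outbox = {v: [] for v in values}
        for v in values:
            values[v], changed = compute(values[v], inbox[v])
            if changed:
                for t in out_edges.get(v, []):
                    outbox[t].append(values[v])
        inbox = outbox  # messages become visible only in the next superstep
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
result = run_bsp({"a": 3, "b": 6, "c": 1}, graph)
```

All of the algorithm's logic lives in `compute`; the driver loop, message delivery, and halting check are the parts a framework would take care of.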
•  Giraph makes iterative data processing more
practical for Hadoop users.
•  Giraph can avoid costly disk and network
operations that are mandatory in MR.
•  No concept of message passing in MR.
Each cycle of an iterative calculation on
Hadoop means running a full MapReduce
job.
Let's use simple PageRank as a quick
example:
http://en.wikipedia.org/wiki/PageRank
1. All vertices start with the same PageRank: 1.0
2. Each vertex distributes an equal portion of
its PageRank to all neighbors: the vertex with two
out-edges sends 0.5 along each, and the two vertices
with one out-edge each send 1.
3. Each vertex sums its incoming values, multiplies the sum by a
weight factor (.85 here), and adds a small adjustment:
(1 - weight factor)/(# vertices in graph)
(.5*.85) + (.15/3)
(1.5*.85) + (.15/3)
(1*.85) + (.15/3)
4. This value becomes the vertex's PageRank
for the next iteration
.43
.21
.64
5. Repeat until convergence:
(change in PR per-iteration < epsilon)
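
The five steps can be sketched in a few lines of plain Python. This is an illustration, not Giraph code. The three-vertex edge list is an assumption, chosen so that the shares come out as 0.5, 0.5, 1, and 1 as in step 2; the names `pagerank_step` and `pagerank` and the value of `epsilon` are hypothetical.

```python
DAMPING = 0.85  # the "weight factor" from step 3

def pagerank_step(ranks, out_edges):
    """Apply steps 2-4 once and return the new rank of every vertex.

    Assumes every vertex has at least one out-edge.
    """
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    for v, targets in out_edges.items():
        share = ranks[v] / len(targets)  # step 2: equal portion per neighbor
        for t in targets:
            incoming[t] += share
    # step 3: weighted sum of incoming shares plus the small adjustment
    return {v: incoming[v] * DAMPING + (1 - DAMPING) / n for v in ranks}

def pagerank(out_edges, epsilon=1e-6, max_iterations=200):
    """Step 5: iterate until no rank changes by more than epsilon."""
    ranks = {v: 1.0 for v in out_edges}  # step 1: same starting PageRank
    for _ in range(max_iterations):
        new_ranks = pagerank_step(ranks, out_edges)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks  # step 4: new values feed the next iteration
        if delta < epsilon:
            break
    return ranks

# Assumed graph: A has two out-edges, B and C have one each.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
first = pagerank_step({v: 1.0 for v in graph}, graph)
final = pagerank(graph)
```

A single call to `pagerank_step` evaluates exactly the three expressions listed in step 3, one per vertex.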
Vertices with higher in-degree tend to converge to a
higher PageRank
Put another way:
PageRank on MapReduce
1. Load complete input graph from disk as
[K= Vertex ID, V = out-edges and PR]
(Diagram: Map → Sort/Shuffle → Reduce)
2. Emit all input records (full graph state), then
emit [K = edgeTarget, V = share of PR] for each out-edge
3. Sort and Shuffle this entire mess!
4. Sum incoming PR shares for each vertex,
update PR values in graph state records
5. Emit full graph state to disk...
6. ...and start over!
•  Awkward to reason about
•  I/O bound despite simple core business logic
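
One iteration of the MapReduce approach can be simulated in memory to show how much data moves for so little arithmetic. This is a sketch in plain Python, not Hadoop code: the function names, the `"graph"`/`"share"` tags, and the three-vertex input are all hypothetical, and the .85 weight factor is carried over from the PageRank example.

```python
from collections import defaultdict

DAMPING = 0.85  # weight factor, as in the PageRank example

def map_phase(records):
    """Step 2: re-emit the graph structure, plus one PR share per out-edge."""
    for vertex, (out_edges, rank) in records.items():
        yield vertex, ("graph", out_edges)  # full graph state travels too
        for target in out_edges:
            yield target, ("share", rank / len(out_edges))

def shuffle(pairs):
    """Step 3: group every emitted value by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, num_vertices):
    """Step 4: sum incoming shares and rebuild each graph state record."""
    records = {}
    for vertex, values in grouped.items():
        out_edges = next(v for tag, v in values if tag == "graph")
        total = sum(v for tag, v in values if tag == "share")
        rank = total * DAMPING + (1 - DAMPING) / num_vertices
        records[vertex] = (out_edges, rank)
    return records

def run_iteration(records):
    """One full job; steps 5 and 6 would write this result out and read it back."""
    return reduce_phase(shuffle(map_phase(records)), len(records))

records = {"A": (["B", "C"], 1.0), "B": (["C"], 1.0), "C": (["A"], 1.0)}
after_one = run_iteration(records)
```

The only real computation is the one-line rank update in `reduce_phase`; everything else exists to carry the unchanged graph structure through the shuffle so that the next job can start from it.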
PageRank on Giraph
1. Hadoop Mappers are "hijacked" to host
Giraph master and worker tasks.
2. Input graph is loaded once, maintaining
code-data locality when possible.
3. All iterations are performed on data in memory,
optionally spilled to disk. Disk access is linear/scan-based.
4. Output is written from the Mappers hosting
the calculation, and the job run ends.
This is all well and good, but must we
manipulate Hadoop this way?
Giraph + MapReduce
•  Heap and other resources are set once, globally, for all
Mappers in the computation.
•  No control of which cluster nodes host which tasks.
•  No control over how Mappers are scheduled.
•  The Mapper and Reducer slot abstraction is meaningless
for Giraph at best, and an artificial limit at worst.
YARN
•  YARN (Yet Another Resource Negotiator) is Hadoop's
next-gen job management platform.
•  Powers MapReduce v2, but is a general purpose
framework that is not tied to the MapReduce paradigm.
•  Offers fine-grained control over each task's resource
allocations and host placement for clients that need it.
YARN Architecture
Giraph + YARN
It's a natural fit!
•  Giraph has maintained compatibility with Hadoop since the
0.1 release by executing via the MapReduce interface.
•  Giraph has featured a "pure YARN" build profile since the
1.0 release. It supports Hadoop-2.0.3 and trunk.
*Patches to add 2.0.4 and 2.0.5 support are in review :)
•  Giraph's YARN component is easy to extend or use as
a template to port other projects!
Giraph + YARN: Roadmap
•  YARN Application Master allows for more natural and
stable bootstrapping of Giraph jobs.
•  ZooKeeper management can find a natural home in the
Application Master.
•  Giraph on YARN can stop borrowing from Hadoop and
have its own web interface.
•  Variable per-task resource allocation opens up the
possibility of Supertasks to manage graph supernodes.
•  Ability to spawn or retire tasks per-iteration enables
in-flight reassignment of data partitions.
•  AppMaster-managed utility tasks such as dedicated
sub-aggregators for tree-like aggregation, or data
pre-samplers.
Giraph New Developments
•  Decoupling of logic and graph data means tasks host
computations that are pluggable per-iteration.
•  Support for Giraph job scripting, starting with Jython.
More to follow...
•  New website, fresh docs, upcoming Manning book, and a
large, active community mean Giraph has never been
easier to use or contribute to!
Great! Where can I learn more?
http://giraph.apache.org
Mailing List:
user@giraph.apache.org

 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Plus de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Dernier

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 

Dernier (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 

Fast, Scalable Graph Processing with Apache Giraph on YARN

  • 1. Fast, Scalable Graph Processing: Apache Giraph on YARN
  • 2. Hello, I'm Eli Reisman!
  • 3. Eli is... • Apache Giraph Committer and PMC Member • Apache Tajo Committer • Wrote initial port of Giraph to YARN • Collaborating with fellow Giraph committers on Giraph in Action book for Manning publishing
  • 4. Eli is... • Only able to do all this with the support of:
  • 5. Eli is a software engineer at
  • 6. Etsy enables non-technical folks to sell handmade and vintage stuff: We have a great blog called Code As Craft:
  • 7. ...but enough about me, let's talk Giraph!
  • 8. Key Topics • What is Apache Giraph? • Why do I need it? • Giraph + MapReduce • Giraph + YARN • Giraph Roadmap
  • 9. What is Apache Giraph? Giraph is a framework for performing offline batch processing of semi-structured graph data on a massive scale. Giraph is loosely based upon Google's Pregel graph processing framework.
  • 10. What is Apache Giraph? Giraph performs iterative calculations on top of an existing Hadoop cluster.
  • 11. What is Apache Giraph? Giraph uses Apache Zookeeper to enforce atomic barrier waits and perform leader election. Done! Done! ...Still working...
  • 12. What is Apache Giraph? Giraph benefits from a vibrant Apache community, and is under active development:
  • 13. Why do I need it? Giraph makes graph algorithms easy to reason about and implement by following the Bulk Synchronous Parallel (BSP) programming model. In BSP, all algorithms are implemented from the point of view of a single vertex in the input graph performing a single iteration of the computation.
  • 14. Why do I need it? • Giraph makes iterative data processing more practical for Hadoop users. • Giraph can avoid costly disk and network operations that are mandatory in MR. • There is no concept of message passing in MR.
  • 15. Why do I need it? Each cycle of an iterative calculation on Hadoop means running a full MapReduce job.
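
The vertex-centric BSP model from the "Why do I need it?" slides can be illustrated with a small, single-process sketch. This is not Giraph's API: the function name `bsp_max_value`, the dictionaries, and the choice of max-value propagation as the algorithm are all assumptions made for illustration. Each pass of the loop plays the role of one superstep: every active vertex reads its messages, updates its own value, and sends messages that only become visible after the barrier.

```python
def bsp_max_value(values, neighbors, max_supersteps=50):
    """Propagate the largest vertex value through the graph, BSP style.

    values:    {vertex_id: number}            (hypothetical input format)
    neighbors: {vertex_id: [vertex_id, ...]}  (hypothetical input format)
    Returns the final values and the number of supersteps executed.
    """
    values = dict(values)
    inbox = {v: [] for v in values}   # messages delivered for this superstep
    active = set(values)              # every vertex is active in superstep 0
    superstep = 0

    while active and superstep < max_supersteps:
        outbox = {v: [] for v in values}
        for v in active:
            # Per-vertex logic: it sees only its own value and its messages.
            new_value = max([values[v]] + inbox[v])
            changed = new_value != values[v]
            values[v] = new_value
            if superstep == 0 or changed:
                for target in neighbors[v]:
                    outbox[target].append(new_value)
        # Barrier: messages sent in this superstep are only readable in the next.
        inbox = outbox
        # A vertex with no incoming messages stays halted.
        active = {v for v, msgs in inbox.items() if msgs}
        superstep += 1

    return values, superstep


if __name__ == "__main__":
    chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
    final, steps = bsp_max_value({"a": 3, "b": 6, "c": 2, "d": 1}, chain)
    print(final, steps)  # every vertex ends with the value 6
```

In a real deployment the barrier is a distributed one (the slides mention Zookeeper for this), and messages cross the network between workers rather than living in a local dictionary.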
  • 16. Let's use simple PageRank as a quick example: http://en.wikipedia.org/wiki/PageRank
  • 17. 1. All vertices start with the same PageRank: 1.0, 1.0, 1.0
  • 18. 2. Each vertex distributes an equal portion of its PageRank to all neighbors: 0.5, 0.5, 1, 1
  • 19. 3. Each vertex sums incoming values times a weight factor and adds in a small adjustment, 1/(# vertices in graph): (.5*.85) + (.15/3); (1.5*.85) + (.15/3); (1*.85) + (.15/3)
  • 20. 4. This value becomes the vertex's PageRank for the next iteration: .43, .21, .64
  • 21. 5. Repeat until convergence: (change in PR per iteration < epsilon)
  • 22. Vertices with higher in-degree converge to higher PageRank
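
The five PageRank steps above can be written out as a short sketch. The slide's diagram is not recoverable from the text, so the three-vertex graph used here (A links to B and C, B links to C, C links to A) is an assumption chosen because it reproduces the incoming sums shown in step 3 (0.5, 1.5, and 1). The function names and the `epsilon` default are also illustrative.

```python
def pagerank_step(ranks, out_edges, damping=0.85):
    """One iteration: distribute shares (step 2), then sum and adjust (step 3)."""
    n = len(ranks)
    incoming = {v: 0.0 for v in ranks}
    for v, targets in out_edges.items():
        share = ranks[v] / len(targets)   # equal portion per out-edge
        for t in targets:
            incoming[t] += share
    return {v: damping * incoming[v] + (1 - damping) / n for v in ranks}


def pagerank(out_edges, epsilon=1e-6, max_iterations=1000):
    """Repeat until the largest per-vertex change drops below epsilon (step 5)."""
    ranks = {v: 1.0 for v in out_edges}   # step 1: same starting value
    for _ in range(max_iterations):
        new_ranks = pagerank_step(ranks, out_edges)
        delta = max(abs(new_ranks[v] - ranks[v]) for v in ranks)
        ranks = new_ranks                 # step 4: becomes next iteration's input
        if delta < epsilon:
            break
    return ranks


if __name__ == "__main__":
    # Assumed example graph, not taken from the slide's diagram.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank_step({v: 1.0 for v in graph}, graph))
    print(pagerank(graph))
```

On this assumed graph, vertex C has two in-edges and ends with the highest rank, which is consistent with the observation on the last slide of the walkthrough.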
  • 23. Put another way:
  • 24. PageRank on MapReduce 1. Load complete input graph from disk as [K = Vertex ID, V = out-edges and PR] [Diagram: Map, Sort/Shuffle, Reduce]
  • 25. PageRank on MapReduce 2. Emit all input records (full graph state), Emit [K = edgeTarget, V = share of PR]
  • 26. PageRank on MapReduce 3. Sort and Shuffle this entire mess!
  • 27. PageRank on MapReduce 4. Sum incoming PR shares for each vertex, update PR values in graph state records
  • 28. PageRank on MapReduce 5. Emit full graph state to disk...
  • 29. PageRank on MapReduce 6. ...and start over!
  • 30. PageRank on MapReduce • Awkward to reason about • I/O bound despite simple core business logic
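
To make the cost of the MapReduce formulation concrete, here is a minimal in-memory imitation of a single iteration. It is plain Python rather than Hadoop code, and the record layout, the `"graph"`/`"share"` tags, and the function names are assumptions for illustration. Note how the mapper has to re-emit the whole graph structure alongside the PageRank shares, so that the reducer can rebuild the state records for the next job.

```python
from collections import defaultdict


def map_phase(records):
    """records: {vertex_id: (rank, [out-edge targets])} as loaded from disk."""
    for vertex, (rank, targets) in records.items():
        yield vertex, ("graph", targets)            # re-emit full graph state
        for target in targets:
            yield target, ("share", rank / len(targets))


def shuffle(pairs):
    """Group every emitted value by key, as the sort/shuffle phase would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped, num_vertices, damping=0.85):
    """Sum incoming shares and rebuild the state record for each vertex."""
    output = {}
    for vertex in sorted(grouped):
        targets, total = [], 0.0
        for kind, payload in grouped[vertex]:
            if kind == "graph":
                targets = payload
            else:
                total += payload
        rank = damping * total + (1 - damping) / num_vertices
        output[vertex] = (rank, targets)
    return output


def run_iteration(records):
    """One full 'job'; iterating means calling this again on its own output."""
    return reduce_phase(shuffle(map_phase(records)), len(records))


if __name__ == "__main__":
    # Assumed three-vertex example graph.
    state = {"A": (1.0, ["B", "C"]), "B": (1.0, ["C"]), "C": (1.0, ["A"])}
    print(run_iteration(state))
```

In an actual Hadoop job, the input and output of `run_iteration` would be files on disk, and the shuffle would move both the shares and the graph structure across the network on every iteration.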
  • 31. PageRank on Giraph 1. Hadoop Mappers are "hijacked" to host Giraph master and worker tasks.
  • 32. PageRank on Giraph 2. Input graph is loaded once, maintaining code-data locality when possible.
  • 33. PageRank on Giraph 3. All iterations are performed on data in memory, optionally spilled to disk. Disk access is linear/scan-based.
  • 34. PageRank on Giraph 4. Output is written from the Mappers hosting the calculation, and the job run ends.
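
By contrast, the Giraph-style flow can be sketched as a per-vertex `compute` method driven by a superstep loop over a graph that is loaded once and kept in memory. This is a Python illustration only, loosely modeled on the vertex-centric style and not on Giraph's actual Java API: the `Vertex` class, the `send` callback, the `run` driver, and the fixed superstep limit are all assumptions.

```python
class Vertex:
    """Hypothetical in-memory vertex holding its own value and out-edges."""

    def __init__(self, vertex_id, out_edges, value=1.0):
        self.vertex_id = vertex_id
        self.out_edges = out_edges
        self.value = value
        self.halted = False

    def compute(self, messages, superstep, num_vertices, max_supersteps, send):
        # From superstep 1 on, fold the incoming shares into a new PageRank.
        if superstep >= 1:
            self.value = 0.15 / num_vertices + 0.85 * sum(messages)
        if superstep < max_supersteps:
            share = self.value / len(self.out_edges)
            for target in self.out_edges:
                send(target, share)
        else:
            self.halted = True   # stop after a fixed number of supersteps


def run(vertices, max_supersteps):
    """Drive supersteps until every vertex has halted; only messages move."""
    inbox = {vid: [] for vid in vertices}
    superstep = 0
    while not all(v.halted for v in vertices.values()):
        outbox = {vid: [] for vid in vertices}
        for v in vertices.values():
            v.compute(inbox[v.vertex_id], superstep, len(vertices),
                      max_supersteps,
                      lambda target, msg: outbox[target].append(msg))
        inbox = outbox           # barrier between supersteps
        superstep += 1
    return {vid: v.value for vid, v in vertices.items()}


if __name__ == "__main__":
    # Assumed three-vertex example graph.
    graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    vertices = {vid: Vertex(vid, edges) for vid, edges in graph.items()}
    print(run(vertices, max_supersteps=30))
```

The graph structure never leaves the `Vertex` objects here; only the small PageRank shares travel between supersteps, which is the difference from the MapReduce formulation that the slides emphasize.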
  • 35. This is all well and good, but must we manipulate Hadoop this way?
  • 36. Giraph + MapReduce • Heap and other resources are set once, globally, for all Mappers in the computation. • No control of which cluster nodes host which tasks. • No control over how Mappers are scheduled. • The Mapper and Reducer slots abstraction is meaningless for Giraph at best, an artificial limit at worst.
  • 37. YARN • YARN (Yet Another Resource Negotiator) is Hadoop's next-gen job management platform. • Powers MapReduce v2, but is a general-purpose framework that is not tied to the MapReduce paradigm. • Offers fine-grained control over each task's resource allocations and host placement for clients that need it.
  • 38. YARN Architecture
  • 39. Giraph + YARN It's a natural fit!
  • 40. Giraph + YARN • Giraph has maintained compatibility with Hadoop since the 0.1 release by executing via the MapReduce interface. • Giraph has featured a "pure YARN" build profile since the 1.0 release. It supports Hadoop-2.0.3 and trunk. *Patches to add 2.0.4 and 2.0.5 support are in review :) • Giraph's YARN component is easy to extend or use as a template to port other projects!
  • 41. Giraph + YARN: Roadmap • YARN Application Master allows for more natural and stable bootstrapping of Giraph jobs. • Zookeeper management can find a natural home in the Application Master. • Giraph on YARN can stop borrowing from Hadoop and have its own web interface.
  • 42. Giraph + YARN: Roadmap • Variable per-task resource allocation opens up the possibility of Supertasks to manage graph supernodes. • Ability to spawn or retire tasks per iteration enables in-flight reassignment of data partitions. • AppMaster-managed utility tasks such as dedicated sub-aggregators for tree-like aggregation, or data pre-samplers.
  • 43. Giraph New Developments • Decoupling of logic and graph data means tasks host computations that are pluggable per iteration. • Support for Giraph job scripting, starting with Jython. More to follow... • New website, fresh docs, upcoming Manning book, and a large, active community mean Giraph has never been easier to use or contribute to!
  • 44. Great! Where can I learn more? http://giraph.apache.org Mailing List: user@giraph.apache.org