SlideShare une entreprise Scribd logo
1  sur  52
Télécharger pour lire hors ligne
Turning NoSQL data into Graph

Playing with Apache Giraph and Apache Gora
Team
Renato Marroquín
!
• PhD student:
• Interested in:
Information retrieval.
Distributed and scalable data management.
• Apache Gora:
PPMC Member
Committer.
• rmarroquin [at] apache [dot] org
Claudio Martella
• PhD student: LSDS @VU University Amsterdam.
• Interested in
Complex Networks.
Distributed and scalable infrastructures.
• Apache Girapher:
PPMC Member
Committer.
• claudio [at] apache [dot] org
Lewis McGibbney
• Scottish expat fae Glasgow
• Post Doc @Stanford University: Engineering Informatics
• Quantity Surveyor/Cost Consultant by 

profession
• Cycling mad
• Keen OSS enthusiast @TheASF 

and beyond
• lewismc [at] apacher [dot] org
Apache Gora
What is Apache Gora?
● Data Persistence : Persisting objects to Column stores, key-value
stores, SQL databases and to flat files in local file system of Hadoop
HDFS.
● Data Access : An easy to use Java-friendly common API for accessing
the data regardless of its location.
● Indexing : Persisting objects to Lucene and Solr indexes, accessing/
querying the data with Gora API.
● Analysis : Accesing the data and making analysis through adapters for
Apache Pig, Apache Hive and Cascading
● MapReduce support : Out-of-the-box and extensive MapReduce
(Apache Hadoop) support for data in the data store.
What is Apache Gora?
● Provides an in-memory data model and persistence for big data.
● Gora supports:
How does Gora work?
!
1.Define your schema using Apache AVRO.
2.Compile your schemas using Gora's Compiler.
3.Create a mapping between logical and physical layout.
4.Update gora.properties file to set back-end properties.



Rock the NoSQL world!!!
How does Gora work?
1.Define your schema using Apache AVRO.
How does Gora work?
2.Compile your schemas using Gora's Compiler.
	 java -jar gora-core-XYZ-.jar

" " o.a.gora.compiler.GoraCompiler.class 

" " " employee.avsc

" " " gora-app/src/main/java/
How does Gora work?
2.Compile your schemas using Gora's Compiler.
How does Gora work?
3.Create a mapping between logical and physical layout.
How does Gora work?
4.Update gora.properties file to set back-end properties.
How does Gora work?
Rock the NoSQL world!
Apache Giraph
MapReduce and Graphs
• Plain MapReduce is not well suited for graph
algorithms because:
• Graph algorithms are iterative.
• Not intuitive in MapReduce.
• Unnecessarily slow
• Each iteration is a single MapReduce job with too much
overhead
• Separately scheduled
• The graph structure is read from disk
• The intermediate results are read from disks
• Hard to implement
Google's Pregel
• Introduced on 2010
• Based on Valiant's BSP
• “Think like a vertex” that can send messages to any vertex in the
graph using the bulk synchronous parallel programming model.
• Computation complete when all components complete.
• Batch-oriented processing
• Computation happens in-memory
• Master/slave architecture
Bulk synchronous parallel
Time
Processors
Barrier
Computation + 

Communication
Superstep
Open source implementations
• There are some such as:
• Apache Giraph
• Apache Hama
• GoldenOrb
• Signal/Collect
Apache Giraph
• Incubated since summer 2011
• Written in Java
• Implements Pregel's API
• Runs on existing MapReduce infrastructure
• Active community from Yahoo!, Facebook, LinkedIn, Twitter, and
more.
• It's a single Map-only job
• It runs on Hadoop in-memory.
• Fault tolerant
• Zookeeper for state, No SPOF
During execution time
Setup
● Load graph
● Assign vertices to workers
● Validate workers' health
Teardown
● Write results back
● Write aggregators back
Computer
● Assign messages to workers
● Iterate on active vertices
● Call vertices compute()
Synchronize
● Send messages to workers
● Compute aggregators
● Checkpoint
Giraph's components
• Master
• Application coordinator
• One active master at a time
• Assigns partition owners to workers prior to each superstep
• Synchronizes supersteps
• Worker – Computation & messaging
• Loads the graph from input splits
• Performs computation/messaging of its assigned partitions
• Zookeeper
• Maintains global application state
What is needed then?
• Your algorithm in the Pregel model.
• A VertexInputFormat to read your graph.

e.g. <vertex><neighbor1><neighbor2>
• A VertexOutputFormat to write back the results.

e.g. <vertex> <pageRank>
• You could define:
• A Combiner (for reducing number of messages sent/received)
• An Aggregator (for enabling global computation)
Running a Giraph job
• It is just like running Hadoop
!
$HADOOP_HOME/bin/hadoop jar
giraph-examples-1.1.0-XXX-jar-with-dependencies.jar
o.a.g.GiraphRunner o.a.g.examples.SimpleShortestPathsComputation
-vif o.a.g.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat
-vip /user/hduser/input/tiny_graph.txt
-vof o.a.g.io.formats.IdWithValueTextOutputFormat
-op /user/hduser/output/shortestpaths
-w 1
Apache Giraph + Apache Gora
The project idea
• Integrating Apache Gora with other cool projects.
• Provide access to different data stores out-of-the-
box for Apache Giraph.
• Give users more flexibility when deciding how to run graph
algorithms.
• Make the Hadoop Env bigger.
• Apply to for the Google Summer of Code Project.
The big picture
Integration hooks
• Vertices
Integration hooks
• Vertices
Integration hooks
• Edges
Integration hooks
• Edges
Integration hooks
• Key factory
Parameters offered
Label Description
giraph.gora.datastore.class Gora DataStore class to access to data from - required.
!giraph.gora.key.class Gora Key class to query the datastore - required.
giraph.gora.persistent.class Gora Persistent class to read objects from Gora - required.
giraph.gora.keys.factory.class Keys factory to convert strings into desired keys - required.
giraph.gora.output.datastore.class Gora DataStore class to write data to - required.
giraph.gora.output.key.class Gora Key class to write to datastore - required.
giraph.gora.output.persistent.class Gora Persistent class to write to Gora - required.
giraph.gora.start.key Gora start key to query the datastore.
giraph.gora.end.key Gora end key to query the datastore.
Rocks in the way
• Dependency issues.
• Supported versions by each project.
• Maven war for handling cyclic dependencies.
• Hadoop issues.
• Not all data stores support MapReduce out of the box.
• Finding what it is necessary to be in the classpath.
• Providing an API between both projects that is:
• Flexible.
• Simple.
• Pluggable.
So now what?
1.Create your data beans with Gora.
So now what?
2. Compile them.
java -jar gora-core-XYZ.jar o.a.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
So now what?
3. Get your Gora files set up for passing them to Giraph.
	 Gora.properties
	 Gora-mapping-{datastore}.xml.
So now what?
4. Get your hooks in place.
	 GVertexInputFormat
So now what?
4. Get your hooks in place.
	 GVertexOutputFormat
So now what?
4. Get your hooks in place.
	 GVertexOutputFormat
So now what?
4. Get your hooks in place.
	 KeyFactory
So now what?
5. Run Giraph!
	 hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner "
-files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml"
-Dio.serializations=o.a.h.io.serializer.WritableSerialization,o.a.h.io.serializer.JavaSerialization"
-Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore"
-Dgiraph.gora.key.class=java.lang.String"
-Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge"
-Dgiraph.gora.start.key=0 -Dgiraph.gora.end.key=10"
-Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory"
-Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore "
-Dgiraph.gora.output.key.class=java.lang.String "
-Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult "
-libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR"
org.apache.giraph.examples.SimpleShortestPathsComputation "
-eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat "
-eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat "
-w 1"
Future work
More complex schemas
Adding more data stores
Send us an email on the mailing lists
New serialization formats
• Different serialization formats beside Apache Avro.
!
!
!
!
!
!
• And others that could be interesting for handling different use
cases.
Thanks!
Q&A
References
• http://prezi.com/9ake_klzwrga/apache-giraph-distributed-graph-processing-in-the-cloud/
• http://de.slideshare.net/sscdotopen/large-scale
• http://www.slideshare.net/Hadoop_Summit/processing-edges-on-apache-giraph
Bulk synchronous parallel model
• Computation consists of a series of “supersteps”
• Supersteps are an atomic unit of computation where operations can
happen in parallel
• During a superstep, components are assigned to tasks and receive
unordered messages from previous supersteps.
• Point-to-point messages
• Sent during a superstep from one component to another and then
delivered in the following supersteps.
• Computation completes when all components complete

Contenu connexe

Tendances

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleEvan Chan
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomyDongmin Yu
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2DataWorks Summit
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLDatabricks
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersRahul Jain
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoHyunsik Choi
 
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Lightbend
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...CloudxLab
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopGruter
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupRafal Kwasny
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Richard Seymour
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersDataWorks Summit
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failingSandy Ryza
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterDon Drake
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsGuy Harrison
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Databricks
 

Tendances (20)

Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza SeattleBuilding Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines - 2016 DataPalooza Seattle
 
Spark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve LoughranSpark Summit EU talk by Steve Loughran
Spark Summit EU talk by Steve Loughran
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2New Data Transfer Tools for Hadoop: Sqoop 2
New Data Transfer Tools for Hadoop: Sqoop 2
 
Keeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETLKeeping Spark on Track: Productionizing Spark for ETL
Keeping Spark on Track: Productionizing Spark for ETL
 
Sqoop
SqoopSqoop
Sqoop
 
Hadoop & HDFS for Beginners
Hadoop & HDFS for BeginnersHadoop & HDFS for Beginners
Hadoop & HDFS for Beginners
 
Efficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajoEfficient in situ processing of various storage types on apache tajo
Efficient in situ processing of various storage types on apache tajo
 
spark-kafka_mod
spark-kafka_modspark-kafka_mod
spark-kafka_mod
 
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
Modernizing Infrastructures for Fast Data with Spark, Kafka, Cassandra, React...
 
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
Introduction to Spark Streaming & Apache Kafka | Big Data Hadoop Spark Tutori...
 
SQL and Search with Spark in your browser
SQL and Search with Spark in your browserSQL and Search with Spark in your browser
SQL and Search with Spark in your browser
 
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on HadoopBig Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
Big Data Camp LA 2014 - Apache Tajo: A Big Data Warehouse System on Hadoop
 
ETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetupETL with SPARK - First Spark London meetup
ETL with SPARK - First Spark London meetup
 
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
Rapid Prototyping in PySpark Streaming: The Thermodynamics of Docker Containe...
 
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared ClustersMercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
Mercury: Hybrid Centralized and Distributed Scheduling in Large Shared Clusters
 
Why your Spark job is failing
Why your Spark job is failingWhy your Spark job is failing
Why your Spark job is failing
 
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball RosterSpark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
Spark ETL Techniques - Creating An Optimal Fantasy Baseball Roster
 
From oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other toolsFrom oracle to hadoop with Sqoop and other tools
From oracle to hadoop with Sqoop and other tools
 
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
Spark DataFrames: Simple and Fast Analytics on Structured Data at Spark Summi...
 

Similaire à Giraph+Gora in ApacheCon14

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark FundamentalsZahra Eskandari
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesAlexis Seigneurin
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraDataStax Academy
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoMapR Technologies
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scalesamthemonad
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance IssuesAntonios Katsarakis
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014Claudiu Barbura
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache SparkSimon Lia-Jonassen
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkSlim Baltagi
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internalDavid Lauzon
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformTsuyoshi OZAWA
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingDemi Ben-Ari
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML ConferenceDB Tsai
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionRUHULAMINHAZARIKA
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerEvan Chan
 

Similaire à Giraph+Gora in ApacheCon14 (20)

Apache Spark Fundamentals
Apache Spark FundamentalsApache Spark Fundamentals
Apache Spark Fundamentals
 
Lessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and MicroservicesLessons Learned: Using Spark and Microservices
Lessons Learned: Using Spark and Microservices
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Apache Spark Core
Apache Spark CoreApache Spark Core
Apache Spark Core
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Intro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of TwingoIntro to Apache Spark by CTO of Twingo
Intro to Apache Spark by CTO of Twingo
 
Architecting and productionising data science applications at scale
Architecting and productionising data science applications at scaleArchitecting and productionising data science applications at scale
Architecting and productionising data science applications at scale
 
Spark Overview and Performance Issues
Spark Overview and Performance IssuesSpark Overview and Performance Issues
Spark Overview and Performance Issues
 
xPatterns - Spark Summit 2014
xPatterns - Spark Summit   2014xPatterns - Spark Summit   2014
xPatterns - Spark Summit 2014
 
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和SparkEtu Solution Day 2014 Track-D: 掌握Impala和Spark
Etu Solution Day 2014 Track-D: 掌握Impala和Spark
 
APACHE SPARK.pptx
APACHE SPARK.pptxAPACHE SPARK.pptx
APACHE SPARK.pptx
 
Yet another intro to Apache Spark
Yet another intro to Apache SparkYet another intro to Apache Spark
Yet another intro to Apache Spark
 
Transitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to SparkTransitioning Compute Models: Hadoop MapReduce to Spark
Transitioning Compute Models: Hadoop MapReduce to Spark
 
BDM25 - Spark runtime internal
BDM25 - Spark runtime internalBDM25 - Spark runtime internal
BDM25 - Spark runtime internal
 
YARN: a resource manager for analytic platform
YARN: a resource manager for analytic platformYARN: a resource manager for analytic platform
YARN: a resource manager for analytic platform
 
Spark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computingSpark 101 - First steps to distributed computing
Spark 101 - First steps to distributed computing
 
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
2015 01-17 Lambda Architecture with Apache Spark, NextML Conference
 
Big_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_SessionBig_data_analytics_NoSql_Module-4_Session
Big_data_analytics_NoSql_Module-4_Session
 
Productionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job ServerProductionizing Spark and the Spark Job Server
Productionizing Spark and the Spark Job Server
 

Dernier

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystSamantha Rae Coolbeth
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdfHuman37
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts ServiceSapana Sha
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改atducpo
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Callshivangimorya083
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/managementakshesh doshi
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptSonatrach
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...Suhani Kapoor
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Sapana Sha
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...Suhani Kapoor
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...Suhani Kapoor
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxStephen266013
 

Dernier (20)

VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfKantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf
 
Unveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data AnalystUnveiling Insights: The Role of a Data Analyst
Unveiling Insights: The Role of a Data Analyst
 
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
Russian Call Girls Dwarka Sector 15 💓 Delhi 9999965857 @Sabina Modi VVIP MODE...
 
20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
Call Girls In Mahipalpur O9654467111 Escorts Service
Call Girls In Mahipalpur O9654467111  Escorts ServiceCall Girls In Mahipalpur O9654467111  Escorts Service
Call Girls In Mahipalpur O9654467111 Escorts Service
 
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
代办国外大学文凭《原版美国UCLA文凭证书》加州大学洛杉矶分校毕业证制作成绩单修改
 
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
꧁❤ Greater Noida Call Girls Delhi ❤꧂ 9711199171 ☎️ Hard And Sexy Vip Call
 
Spark3's new memory model/management
Spark3's new memory model/managementSpark3's new memory model/management
Spark3's new memory model/management
 
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.pptdokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
dokumen.tips_chapter-4-transient-heat-conduction-mehmet-kanoglu.ppt
 
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
VIP High Class Call Girls Jamshedpur Anushka 8250192130 Independent Escort Se...
 
Customer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptxCustomer Service Analytics - Make Sense of All Your Data.pptx
Customer Service Analytics - Make Sense of All Your Data.pptx
 
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
Saket, (-DELHI )+91-9654467111-(=)CHEAP Call Girls in Escorts Service Saket C...
 
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
VIP High Class Call Girls Bikaner Anushka 8250192130 Independent Escort Servi...
 
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
VIP High Profile Call Girls Amravati Aarushi 8250192130 Independent Escort Se...
 
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in  KishangarhDelhi 99530 vip 56974 Genuine Escort Service Call Girls in  Kishangarh
Delhi 99530 vip 56974 Genuine Escort Service Call Girls in Kishangarh
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Defence Colony Delhi 💯Call Us 🔝8264348440🔝
 
B2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docxB2 Creative Industry Response Evaluation.docx
B2 Creative Industry Response Evaluation.docx
 

Giraph+Gora in ApacheCon14

  • 1. Turning NoSQL data into Graph
 Playing with Apache Giraph and Apache Gora
  • 3. Renato Marroquín ! • PhD student: • Interested in: Information retrieval. Distributed and scalable data management. • Apache Gora: PPMC Member Committer. • rmarroquin [at] apache [dot] org
  • 4. Claudio Martella • PhD student: LSDS @VU University Amsterdam. • Interested in Complex Networks. Distributed and scalable infrastructures. • Apache Girapher: PPMC Member Committer. • claudio [at] apache [dot] org
  • 5. Lewis McGibbney • Scottish expat fae Glasgow • Post Doc @Stanford University: Engineering Informatics • Quantity Surveyor/Cost Consultant by 
 profession • Cycling mad • Keen OSS enthusiast @TheASF 
 and beyond • lewismc [at] apacher [dot] org
  • 7. What is Apache Gora? ● Data Persistence : Persisting objects to Column stores, key-value stores, SQL databases and to flat files in local file system of Hadoop HDFS. ● Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location. ● Indexing : Persisting objects to Lucene and Solr indexes, accessing/ querying the data with Gora API. ● Analysis : Accesing the data and making analysis through adapters for Apache Pig, Apache Hive and Cascading ● MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
  • 8. What is Apache Gora? ● Provides an in-memory data model and persistence for big data. ● Gora supports:
  • 9. How does Gora work? ! 1.Define your schema using Apache AVRO. 2.Compile your schemas using Gora's Compiler. 3.Create a mapping between logical and physical layout. 4.Update gora.properties file to set back-end properties.
 
 Rock the NoSQL world!!!
  • 10. How does Gora work? 1.Define your schema using Apache AVRO.
  • 11. How does Gora work? 2.Compile your schemas using Gora's Compiler. java -jar gora-core-XYZ-.jar
 " " o.a.gora.compiler.GoraCompiler.class 
 " " " employee.avsc
 " " " gora-app/src/main/java/
  • 12. How does Gora work? 2.Compile your schemas using Gora's Compiler.
  • 13. How does Gora work? 3.Create a mapping between logical and physical layout.
  • 14. How does Gora work? 4.Update gora.properties file to set back-end properties.
  • 15. How does Gora work? Rock the NoSQL world!
  • 17. MapReduce and Graphs • Plain MapReduce is not well suited for graph algorithms because: • Graph algorithms are iterative. • Not intuitive in MapReduce. • Unnecessarily slow • Each iteration is a single MapReduce job with too much overhead • Separately scheduled • The graph structure is read from disk • The intermediate results are read from disks • Hard to implement
  • 18. Google's Pregel • Introduced on 2010 • Based on Valiant's BSP • “Think like a vertex” that can send messages to any vertex in the graph using the bulk synchronous parallel programming model. • Computation complete when all components complete. • Batch-oriented processing • Computation happens in-memory • Master/slave architecture
  • 20. Open source implementations • There are some such as: • Apache Giraph • Apache Hama • GoldenOrb • Signal/Collect
  • 21. Apache Giraph • Incubated since summer 2011 • Written in Java • Implements Pregel's API • Runs on existing MapReduce infrastructure • Active community from Yahoo!, Facebook, LinkedIn, Twitter, and more. • It's a single Map-only job • It runs on Hadoop in-memory. • Fault tolerant • Zookeeper for state, No SPOF
  • 22. During execution time Setup ● Load graph ● Assign vertices to workers ● Validate workers' health Teardown ● Write results back ● Write aggregators back Computer ● Assign messages to workers ● Iterate on active vertices ● Call vertices compute() Synchronize ● Send messages to workers ● Compute aggregators ● Checkpoint
  • 23. Giraph's components • Master • Application coordinator • One active master at a time • Assigns partition owners to workers prior to each superstep • Synchronizes supersteps • Worker – Computation & messaging • Loads the graph from input splits • Performs computation/messaging of its assigned partitions • Zookeeper • Maintains global application state
  • 24. What is needed then? • Your algorithm in the Pregel model. • A VertexInputFormat to read your graph.
 e.g. <vertex><neighbor1><neighbor2> • A VertexOutputFormat to write back the results.
 e.g. <vertex> <pageRank> • You could define: • A Combiner (for reducing number of messages sent/received) • An Aggregator (for enabling global computation)
  • 25. Running a Giraph job • It is just like running Hadoop ! $HADOOP_HOME/bin/hadoop jar giraph-examples-1.1.0-XXX-jar-with-dependencies.jar o.a.g.GiraphRunner o.a.g.examples.SimpleShortestPathsComputation -vif o.a.g.io.formats.JsonLongDoubleFloatDoubleVertexInputFormat -vip /user/hduser/input/tiny_graph.txt -vof o.a.g.io.formats.IdWithValueTextOutputFormat -op /user/hduser/output/shortestpaths -w 1
  • 26. Apache Giraph + Apache Gora
  • 27. The project idea • Integrating Apache Gora with other cool projects. • Provide access to different data stores out-of-the- box for Apache Giraph. • Give users more flexibility when deciding how to run graph algorithms. • Make the Hadoop Env bigger. • Apply to for the Google Summer of Code Project.
  • 34. Parameters offered Label Description giraph.gora.datastore.class Gora DataStore class to access to data from - required. !giraph.gora.key.class Gora Key class to query the datastore - required. giraph.gora.persistent.class Gora Persistent class to read objects from Gora - required. giraph.gora.keys.factory.class Keys factory to convert strings into desired keys - required. giraph.gora.output.datastore.class Gora DataStore class to write data to - required. giraph.gora.output.key.class Gora Key class to write to datastore - required. giraph.gora.output.persistent.class Gora Persistent class to write to Gora - required. giraph.gora.start.key Gora start key to query the datastore. giraph.gora.end.key Gora end key to query the datastore.
  • 35. Rocks in the way • Dependency issues. • Supported versions by each project. • Maven war for handling cyclic dependencies. • Hadoop issues. • Not all data stores support MapReduce out of the box. • Finding what it is necessary to be in the classpath. • Providing an API between both projects that is: • Flexible. • Simple. • Pluggable.
  • 36. So now what? 1.Create your data beans with Gora.
  • 37. So now what? 2. Compile them. java -jar gora-core-XYZ.jar o.a.gora.compiler.GoraCompiler.class vertex.avsc gora-app/src/main/java/
  • 38. So now what? 3. Get your Gora files set up for passing them to Giraph. Gora.properties Gora-mapping-{datastore}.xml.
  • 39. So now what? 4. Get your hooks in place. GVertexInputFormat
  • 40.
  • 41. So now what? 4. Get your hooks in place. GVertexOutputFormat
  • 42. So now what? 4. Get your hooks in place. GVertexOutputFormat
  • 43. So now what? 4. Get your hooks in place. KeyFactory
  • 44. So now what? 5. Run Giraph! hadoop jar $GIRAPH_EXAMPLES_JAR org.apache.giraph.GiraphRunner " -files ../conf/gora.properties,../conf/gora-hbase-mapping.xml,../conf/hbase-site.xml" -Dio.serializations=o.a.h.io.serializer.WritableSerialization,o.a.h.io.serializer.JavaSerialization" -Dgiraph.gora.datastore.class=org.apache.gora.hbase.store.HBaseStore" -Dgiraph.gora.key.class=java.lang.String" -Dgiraph.gora.persistent.class=org.apache.giraph.io.gora.generated.GEdge" -Dgiraph.gora.start.key=0 -Dgiraph.gora.end.key=10" -Dgiraph.gora.keys.factory.class=org.apache.giraph.io.gora.utils.KeyFactory" -Dgiraph.gora.output.datastore.class=org.apache.gora.hbase.store.HBaseStore " -Dgiraph.gora.output.key.class=java.lang.String " -Dgiraph.gora.output.persistent.class=org.apache.giraph.io.gora.generated.GEdgeResult " -libjars $GIRAPH_GORA_JAR,$GORA_HBASE_JAR,$HBASE_JAR" org.apache.giraph.examples.SimpleShortestPathsComputation " -eif org.apache.giraph.io.gora.GoraGEdgeEdgeInputFormat " -eof org.apache.giraph.io.gora.GoraGEdgeEdgeOutputFormat " -w 1"
  • 47. Adding more data stores Send us an email on the mailing lists
  • 48. New serialization formats • Different serialization formats beside Apache Avro. ! ! ! ! ! ! • And others that could be interesting for handling different use cases.
  • 50. Q&A
  • 52. Bulk synchronous parallel model • Computation consists of a series of “supersteps” • Supersteps are an atomic unit of computation where operations can happen in parallel • During a superstep, components are assigned to tasks and receive unordered messages from previous supersteps. • Point-to-point messages • Sent during a superstep from one component to another and then delivered in the following supersteps. • Computation completes when all components complete