1 
RDF and the Hadoop Ecosystem 
Rob Vesse 
Twitter: @RobVesse 
Email: rvesse@apache.org
2 
 Software Engineer at YarcData (part of Cray Inc) 
 Working on big data analytics products 
 Active open source contributor primarily to RDF & SPARQL 
related projects 
 Apache Jena Committer and PMC Member 
 dotNetRDF Lead Developer 
 Primarily interested in RDF, SPARQL and Big Data Analytics 
technologies
3 
What's missing in the Hadoop ecosystem? 
What's needed to fill the gap? 
What's already available? 
 Jena Hadoop RDF Tools 
 GraphBuilder 
 Other Projects 
 Getting Involved 
 Questions
4
5 
Apache, the projects and their logos shown here are registered trademarks or trademarks of The Apache Software Foundation in the U.S. and/or other countries.
6 
 No first class projects 
 Some limited support in other projects 
 E.g. Giraph supports RDF by bridging through the Tinkerpop stack 
 Some existing external projects 
 Lots of academic proof of concepts 
 Some open source efforts but mostly task specific 
 E.g. Infovore targeted at creating curated Freebase and DBPedia datasets
7
8 
 Need to efficiently represent RDF concepts as Writable 
types 
 Nodes, Triples, Quads, Graphs, Datasets, Query Results etc 
What's the minimum viable subset?
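To make the Writable requirement concrete, here is a minimal sketch of what such a type has to provide: Hadoop's serialization hooks (write/readFields) plus an ordering so it can serve as a key. This is an illustrative skeleton only, storing the three terms as plain strings; it is not the RDF Thrift backed implementation described later and all names here are hypothetical.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

// Illustrative skeleton only - stores RDF terms as plain strings, which is far
// less compact than a binary encoding; a real key type would also override
// equals() and hashCode()
public class SimpleTripleWritable implements WritableComparable<SimpleTripleWritable> {
    private String subject = "", predicate = "", object = "";

    public SimpleTripleWritable() { }

    public SimpleTripleWritable(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(subject);
        out.writeUTF(predicate);
        out.writeUTF(object);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        subject = in.readUTF();
        predicate = in.readUTF();
        object = in.readUTF();
    }

    @Override
    public int compareTo(SimpleTripleWritable other) {
        // Order by subject, then predicate, then object
        int c = subject.compareTo(other.subject);
        if (c != 0) return c;
        c = predicate.compareTo(other.predicate);
        if (c != 0) return c;
        return object.compareTo(other.object);
    }
}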
9 
 Need to be able to get data in and out of RDF formats 
Without this we can't use the power of the Hadoop 
ecosystem to do useful work 
 Lots of serializations out there: 
 RDF/XML 
 Turtle 
 NTriples 
 NQuads 
 JSON-LD 
 etc 
 Also would like to be able to produce end results as RDF
10 
Map/Reduce building blocks 
 Common operations e.g. splitting 
 Enable developers to focus on their applications 
 User Friendly tooling 
 i.e. non-programmer tools
11
12 
CC BY-SA 3.0 Wikimedia Commons
13 
 Set of modules part of the Apache Jena project 
 Originally developed at Cray and donated to the project earlier this year 
 Experimental modules on the hadoop-rdf branch of the project's source repository 
 Currently only available as development SNAPSHOT 
releases 
 Group ID: org.apache.jena 
 Artifact IDs: 
 jena-hadoop-rdf-common 
 jena-hadoop-rdf-io 
 jena-hadoop-rdf-mapreduce 
 Latest Version: 0.9.0-SNAPSHOT 
 Aims to fulfill all the basic requirements for enabling RDF on 
Hadoop 
 Built against Hadoop Map/Reduce 2.x APIs
14 
 Provides the Writable types for RDF primitives 
 NodeWritable 
 TripleWritable 
 QuadWritable 
 NodeTupleWritable 
 All backed by RDF Thrift 
 A compact binary serialization for RDF using Apache Thrift 
 See http://afs.github.io/rdf-thrift/ 
 Extremely efficient to serialize and deserialize 
 Allows for efficient WritableComparator implementations that perform binary comparisons
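As a rough usage sketch, serializing and deserializing one of these types through the standard Hadoop buffers looks something like the following. The TripleWritable(Triple) constructor, the no-argument constructor and the exact package names (Jena 2.x com.hp.hpl.jena.* and org.apache.jena.hadoop.rdf.types) are assumptions inferred from the demo code later in this deck rather than confirmed API.

import org.apache.hadoop.io.DataInputBuffer;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.jena.hadoop.rdf.types.TripleWritable;

import com.hp.hpl.jena.graph.NodeFactory;
import com.hp.hpl.jena.graph.Triple;

public class TripleWritableRoundTrip {
    public static void main(String[] args) throws Exception {
        Triple t = Triple.create(
                NodeFactory.createURI("http://example.org/instances/1"),
                NodeFactory.createURI("http://xmlns.com/foaf/0.1/name"),
                NodeFactory.createLiteral("Alice"));

        // Serialize to the compact RDF Thrift binary form
        TripleWritable original = new TripleWritable(t);
        DataOutputBuffer out = new DataOutputBuffer();
        original.write(out);

        // Deserialize back into a Jena Triple
        TripleWritable copy = new TripleWritable();
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        copy.readFields(in);

        System.out.println(copy.get());
    }
}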
15 
 Provides InputFormat and OutputFormat implementations 
 Supports most formats that Jena supports 
 Designed to be extendable with new formats 
Will split and parallelize inputs where the RDF serialization 
is amenable to this 
 Also transparently handles compressed inputs and outputs 
 Note that compression blocks splitting 
 i.e. trade off between IO and parallelism
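Compressed output, for instance, is just standard Hadoop job configuration rather than anything RDF specific; a minimal sketch using the stock Hadoop FileOutputFormat and GzipCodec APIs might look like this:

import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionConfig {
    // Enables gzip compression for a job's output. Remember the caveat above:
    // gzip-compressed *inputs* are not splittable, so each compressed input
    // file is handled by a single mapper (IO saved, parallelism lost)
    public static void enableGzipOutput(Job job) {
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    }
}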
16 
 Various reusable building block Mapper and Reducer 
implementations: 
 Counting 
 Filtering 
 Grouping 
 Splitting 
 Transforming 
 Can be used as-is to do some basic Hadoop tasks or used as 
building blocks for more complex tasks
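As an indication of what a "filtering" building block looks like in practice, the sketch below passes through only triples with a given predicate. This is an illustration written against the types used in the demo code later in the deck, not the library's actual filter class, whose name and shape may differ.

import java.io.IOException;

import org.apache.hadoop.mapreduce.Mapper;
import org.apache.jena.hadoop.rdf.types.TripleWritable;

import com.hp.hpl.jena.graph.Node;
import com.hp.hpl.jena.graph.NodeFactory;

// Illustrative only - emits just the triples whose predicate is foaf:name
public class PredicateFilterMapper<TKey> extends Mapper<TKey, TripleWritable, TKey, TripleWritable> {
    private static final Node FOAF_NAME = NodeFactory.createURI("http://xmlns.com/foaf/0.1/name");

    @Override
    protected void map(TKey key, TripleWritable value, Context context)
            throws IOException, InterruptedException {
        if (FOAF_NAME.equals(value.get().getPredicate())) {
            context.write(key, value);
        }
    }
}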
17
18 
 For NTriples inputs, compared the performance of a text-based node count versus an RDF-based node count 
 Performance as good (within 10%) and sometimes 
significantly better 
 Heavily dataset dependent 
 Varies considerably with cluster setup 
 Also depends on how the input is processed 
 YMMV! 
 For other RDF formats you would struggle to implement 
this at all
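For context, the text-based baseline essentially treats each NTriples line as whitespace-separated tokens; a crude sketch of such a mapper is shown below. All names are illustrative, and this approach is exactly what breaks down for multi-line serializations like Turtle or RDF/XML.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Crude text-based node count - relies on NTriples being strictly line-based
public class TextNodeCountMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString().trim();
        if (line.isEmpty() || line.startsWith("#")) return; // skip blanks and comments

        // Drop the trailing '.' terminator, then split into at most three tokens
        // (the object may contain internal whitespace, e.g. in literals)
        String[] nodes = line.substring(0, line.length() - 1).trim().split("\\s+", 3);
        for (String node : nodes) {
            context.write(new Text(node), ONE);
        }
    }
}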
19 
 Originally developed by Intel 
 Some contributions by Cray - awaiting merging at time of writing 
 Open source under Apache License 
 https://github.com/01org/graphbuilder/tree/2.0.alpha 
 2.0.alpha is the Pig based branch 
 Allows data to be transformed into graphs using Pig scripts 
 Provides set of Pig UDFs for translating data to graph formats 
 Supports both property graphs and RDF graphs
20 
-- Declare our mappings 
propertyGraphWithMappings = FOREACH propertyGraph GENERATE (*, 
    [ 'idBase' # 'http://example.org/instances/', 
      'base' # 'http://example.org/ontology/', 
      'namespaces' # [ 'foaf' # 'http://xmlns.com/foaf/0.1/' ], 
      'propertyMap' # [ 'type' # 'a', 
                        'name' # 'foaf:name', 
                        'age' # 'foaf:age' ], 
      'idProperty' # 'id' ]); 
-- Convert to NTriples 
rdf_triples = FOREACH propertyGraphWithMappings GENERATE FLATTEN(RDF(*)); 
-- Write out NTriples 
STORE rdf_triples INTO '/tmp/rdf_triples' USING PigStorage();
21 
 Uses a declarative mapping based on Pig primitives 
 Maps and Tuples 
 Have to be explicitly joined to the data because Pig UDFs 
can only be called with String arguments 
 Has some benefits e.g. conditional mappings 
 RDF Mappings operate on Property Graphs 
 Requires original data to be mapped to a property graph first 
 Direct mapping to RDF is a future enhancement that has yet to be implemented
22
23 
 Infovore - Paul Houle 
 https://github.com/paulhoule/infovore/wiki 
 Cleaned and curated Freebase datasets processed with Hadoop 
 CumulusRDF - Institute of Applied Informatics and Formal 
Description Methods 
 https://code.google.com/p/cumulusrdf/ 
 RDF store backed by Apache Cassandra
24 
 Please start playing with these projects 
 Please interact with the community: 
 dev@jena.apache.org 
 What works? 
 What is broken? 
 What is missing? 
 Contribute 
 Apache projects are ultimately driven by the community 
 If there's a feature you want please suggest it 
 Or better still contribute it yourself!
25 
Questions? 
Personal Email: rvesse@apache.org 
Jena Mailing List: dev@jena.apache.org
26
27 
> bin/hadoop jar jena-hadoop-rdf-stats-0.9.0-SNAPSHOT-hadoop-job.jar 
org.apache.jena.hadoop.rdf.stats.RdfStats --node-count --output 
/user/output --input-type triples /user/input 
 --node-count requests the Node Count statistics be 
calculated 
 Assumes mixed quads and triples input if no --input-type 
specified 
 Using this for triples only data can skew statistics 
 e.g. can result in high node counts for default graph node 
 Hence we explicitly specify input as triples
28
29
30
31
32
33 
> ./pig -x local examples/property_graphs_and_rdf.pig 
> cat /tmp/rdf_triples/part-m-00000 
 Running in local mode for this demo 
 Output goes to /tmp/rdf_triples
34
35
36
37 
public abstract class AbstractNodeTupleNodeCountMapper<TKey, TValue, T extends AbstractNodeTupleWritable<TValue>> 
        extends Mapper<TKey, T, NodeWritable, LongWritable> { 
    private LongWritable initialCount = new LongWritable(1); 

    @Override 
    protected void map(TKey key, T value, Context context) throws IOException, InterruptedException { 
        NodeWritable[] ns = this.getNodes(value); 
        for (NodeWritable n : ns) { 
            context.write(n, this.initialCount); 
        } 
    } 

    protected abstract NodeWritable[] getNodes(T tuple); 
} 

public class TripleNodeCountMapper<TKey> extends AbstractNodeTupleNodeCountMapper<TKey, Triple, TripleWritable> { 
    @Override 
    protected NodeWritable[] getNodes(TripleWritable tuple) { 
        Triple t = tuple.get(); 
        return new NodeWritable[] { new NodeWritable(t.getSubject()), new NodeWritable(t.getPredicate()), 
                new NodeWritable(t.getObject()) }; 
    } 
}
38 
public class NodeCountReducer extends Reducer<NodeWritable, LongWritable, NodeWritable, LongWritable> { 
    @Override 
    protected void reduce(NodeWritable key, Iterable<LongWritable> values, Context context) 
            throws IOException, InterruptedException { 
        long count = 0; 
        Iterator<LongWritable> iter = values.iterator(); 
        while (iter.hasNext()) { 
            count += iter.next().get(); 
        } 
        context.write(key, new LongWritable(count)); 
    } 
}
39 
Job job = Job.getInstance(config); 
job.setJarByClass(JobFactory.class); 
job.setJobName("RDF Triples Node Usage Count"); 
// Map/Reduce classes 
job.setMapperClass(TripleNodeCountMapper.class); 
job.setMapOutputKeyClass(NodeWritable.class); 
job.setMapOutputValueClass(LongWritable.class); 
job.setReducerClass(NodeCountReducer.class); 
// Input and Output 
job.setInputFormatClass(TriplesInputFormat.class); 
job.setOutputFormatClass(NTriplesNodeOutputFormat.class); 
FileInputFormat.setInputPaths(job, StringUtils.arrayToString(inputPaths)); 
FileOutputFormat.setOutputPath(job, new Path(outputPath)); 
return job;
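One small addition not shown on the slide: because NodeCountReducer simply sums counts (an associative, commutative operation), it should also be safe to register it as a combiner and cut down the shuffle, e.g.:

job.setCombinerClass(NodeCountReducer.class);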
40 
 https://github.com/Cray/graphbuilder/blob/2.0.alpha/examples/property_graphs_and_rdf_example.pig
Editor's Notes
  1. Tons of active projects: Accumulo, Ambari, Avro, Cassandra, Chukwa, Giraph, Hama, HBase, Hive, Mahout, Pig, Spark, Tez, ZooKeeper. And those are just off the top of my head (and ignoring Incubating projects). However, these are mostly focused on traditional data sources e.g. logs, relational databases, unstructured data
  2. Highlight benefit of WritableComparator - significant speed up in reduce phase
  3. Project also provides a demo JAR which shows how to use the building blocks to perform common Hadoop tasks on RDF. So Node Count is essentially the Word Count "Hello World" of Hadoop programming
  4. Mention that Intel may not have yet merged our pull request that adds the declarative mapping approach
  5. ~20 lines of code (less if you remove unnecessary formatting)
  6. 11 lines of code