LODOP
Multi-Query Optimization for Linked Data Profiling Queries
Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann
Hasso Plattner Institute, Potsdam, Germany
1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014
2014/05/26
LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014.

OUTLINE
1. Challenges of Linked Data Profiling
2. Profiling Tasks
3. LODOP
4. Multi-Query Optimizations

LINKED DATA PROFILING
• Metadata often not available
  • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
  • Data registries, VoID, and Semantic Sitemaps provide only basic information, e.g. description, author & license information, estimated triple and link count
• Use cases requiring metadata
  • Query optimization
  • Data cleansing
  • Data integration
  • Schema induction
• Data profiling: methods for computing metrics / metadata for datasets

TRADITIONAL VS LINKED DATA PROFILING
• State of the art data profiling
  • Based on columns
  • Assumes well-defined semantics
  • Expects regular data
• Heterogeneity on the Web of Data
  • Diverse sources
  • Diverse structures
  • Diverse views

CHALLENGES OF LD PROFILING
• Heterogeneity
  • Nested graphs: makes reasoning difficult
  • Loose structure: things have different predicate sets
  • Incomplete: missing property definitions
  • Poorly formatted: property types used inconsistently
  • Inconsistent: multiple representations claim opposite things
• Existing (relational) data profiling tools don’t work
• Volume of data
  • Requires parallelization

LODOP - CONTRIBUTIONS
• Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
• System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop
• Development and evaluation of 3 multi-script optimization rules
• Apache Pig:
  • Platform for analyzing large datasets
  • High-level language: Pig Latin
  • Scripts executed on Hadoop / MapReduce

PROFILING TASKS
• Groupings
  • e.g. by resource, class, property type, language, vocabulary, …
• Tasks
  • Number of triples
  • Average number of triples per resource
  • Average number of triples per object URI
  • Average number of triples per context URL
  • Number of property types
  • Average number of property values
  • Number of resources
  • Number of inlinks / outlinks
  • Number of context URLs
  • Number of context PLDs
  • Property co-occurrence
  • Inverse properties
  • URI-Literal ratio
  • Property value ranges
  • Average value length
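
A few of the tasks listed above can be sketched in plain Python to make their definitions concrete. This is only an illustration under stated assumptions: LODOP implements the tasks as Apache Pig scripts, whereas the toy triples, the `profile` function and the result keys below are hypothetical names chosen for this sketch.

```python
from collections import defaultdict

# Toy triples as (subject, predicate, object, object_is_uri); the data is made up.
TRIPLES = [
    ("ex:Berlin", "rdf:type", "ex:City", True),
    ("ex:Berlin", "ex:population", "3500000", False),
    ("ex:Berlin", "ex:country", "ex:Germany", True),
    ("ex:Germany", "rdf:type", "ex:Country", True),
    ("ex:Germany", "ex:capital", "ex:Berlin", True),
    ("ex:Germany", "rdfs:label", "Germany", False),
]


def profile(triples):
    """Compute a handful of profiling tasks in one pass.

    Assumes at least one triple and at least one literal object.
    """
    triples_per_resource = defaultdict(int)
    property_types = set()
    uri_objects = 0
    literal_lengths = []
    for subject, predicate, obj, is_uri in triples:
        triples_per_resource[subject] += 1
        property_types.add(predicate)
        if is_uri:
            uri_objects += 1
        else:
            literal_lengths.append(len(obj))
    literal_objects = len(literal_lengths)
    return {
        "number_of_triples": len(triples),
        "number_of_resources": len(triples_per_resource),
        "avg_triples_per_resource": len(triples) / len(triples_per_resource),
        "number_of_property_types": len(property_types),
        "uri_literal_ratio": uri_objects / literal_objects,
        "avg_value_length": sum(literal_lengths) / literal_objects,
    }


stats = profile(TRIPLES)
for name, value in stats.items():
    print(name, value)
```

Grouped variants (per class, per vocabulary, and so on) would compute the same counters per group key instead of once for the whole dataset.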

DATASETS STATISTICS
Statistics for 1M triples           | DBpedia* | Freebase* | WDC RDFa** | EUNIS Species***
Number of resources                 | 169,035  | 226,834   | 168,736    | 65,843
Avg. number of triples per resource | 5.9      | 4.4       | 5.9        | 15.2
Number of classes                   | 19,585   | 1,928     | 61         | 1
Number of property types            | 7,844    | 2,748     | 477        | 16
Number of URIs                      | 519,692  | 642,183   | 174,317    | 407,418
Number of inlinks                   | 207,712  | 192,179   | 35,329     | 78,377
Number of literals                  | 480,279  | 357,817   | 825,564    | 592,582
Avg. number of property values      | 127.5    | 363.9     | 2,096.2    | 62,500.0
*   source: BTC 2012 dataset
**  WDC = Web Data Commons
*** EUNIS = European Nature Information System (European Environment Agency)

PERFORMANCE EVALUATION
• 10-15 s scheduling overhead per MapReduce job (~3.4 jobs per script)
• Earlier MapReduce jobs have longer runtimes
  • Earlier jobs handle more data → more HDFS activity
• Most scripts scale linearly
  • Most scripts reduce amount of data in workflow
  • Exceptions: e.g. property co-occurrence scripts
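
As a rough worked example, the two figures from the first bullet can be multiplied to estimate the scheduling overhead of a single script. The sketch assumes that the jobs of one script are scheduled one after another, which the slides do not state; the function name is hypothetical.

```python
# Figures quoted on the slide.
JOBS_PER_SCRIPT = 3.4          # average number of MapReduce jobs per script
OVERHEAD_SECONDS = (10, 15)    # scheduling overhead per job (low, high)


def overhead_per_script(jobs_per_script, overhead_range):
    """Estimate scheduling overhead per script, assuming jobs run sequentially."""
    low, high = overhead_range
    return jobs_per_script * low, jobs_per_script * high


low, high = overhead_per_script(JOBS_PER_SCRIPT, OVERHEAD_SECONDS)
print(f"{low:.0f}-{high:.0f} s of scheduling overhead per script")
```

Under that assumption a script spends roughly half a minute to a minute on scheduling alone, which is why reducing the number of MapReduce jobs matters.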

OPTIMIZATION GOALS
• Optimize concurrent execution of multiple scripts
• Reduce number of operators
• Reduce data flow between operators

NUMBER OF INSTANCES (PIG)
[Figure]

LODOP - SYSTEM OVERVIEW
[Figure]

MULTI-QUERY OPTIMIZATION
1. Merging identical operators
2. Combining FILTER operators
3. Combining FOREACH operators

STEP 0: MASTER PLAN
• Merging all logical plans into one master plan
  • Allows parallel execution
  • Reduces runtime to 25-30% of sequential execution

1. MERGING IDENTICAL OPERATORS
Example scripts: "Number of property types per class" and "URI-Literal ratio per class"
1. Identify and compare sibling operators
2. Merge matching siblings
• Number of operators reduced from 365 to 267
• Number of MapReduce jobs reduced from 176 to 140
• Frees up cluster resources
• Prerequisite step for other optimizations
• Restricts parallelism
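
The two steps of this rule (identify and compare sibling operators, merge matching siblings) can be sketched on a toy plan tree. This is a minimal illustration, not LODOP's implementation: the `Operator` class, the `MASTER` root and the operator chains of the two example scripts are assumptions made up for the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Operator:
    """A logical-plan node: operator kind, its arguments, and its consumers."""
    kind: str
    args: str = ""
    children: list = field(default_factory=list)

    def signature(self):
        return (self.kind, self.args)


def merge_identical_siblings(node):
    """Merge children of `node` that share a signature, then recurse."""
    merged = {}
    for child in node.children:
        twin = merged.get(child.signature())
        if twin is None:
            merged[child.signature()] = child
        else:
            # Same operator on the same input: keep one, adopt the other's consumers.
            twin.children.extend(child.children)
    node.children = list(merged.values())
    for child in node.children:
        merge_identical_siblings(child)
    return node


def count_operators(node):
    return 1 + sum(count_operators(child) for child in node.children)


# Hypothetical plans for the two example scripts.
property_types_per_class = Operator("LOAD", "triples", [
    Operator("FILTER", "predicate == rdf:type", [
        Operator("GROUP", "by class", [Operator("STORE", "property_types")]),
    ]),
])
uri_literal_ratio_per_class = Operator("LOAD", "triples", [
    Operator("FILTER", "predicate == rdf:type", [
        Operator("FOREACH", "is_uri(object)", [Operator("STORE", "uri_literal_ratio")]),
    ]),
])

# Step 0: one master plan that holds both scripts.
master = Operator("MASTER", children=[property_types_per_class, uri_literal_ratio_per_class])
before = count_operators(master)
merge_identical_siblings(master)
after = count_operators(master)
print(before, "operators before merging,", after, "after")
```

In this toy plan the shared LOAD and FILTER are executed once instead of twice; the downstream operators of both scripts now depend on the same intermediate result, which is one way to read the "restricts parallelism" bullet.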

2. COMBINING FILTER OPERATORS
1. Create combined FILTER operator
2. Rearrange original FILTER operators
3. Remove redundant operators
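
One plausible reading of the three steps above, sketched on in-memory rows: sibling FILTERs on the same input are preceded by a single combined FILTER (the OR of their conditions), and the original FILTERs then run on its smaller output. The slides do not spell out the exact rule, so the disjunction, the redundancy check and all names below are assumptions.

```python
# Toy (subject, predicate, object) triples; the data is made up.
ROWS = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
    ("ex:Berlin", "ex:population", "3500000"),
    ("ex:Germany", "rdf:type", "ex:Country"),
    ("ex:Germany", "ex:capital", "ex:Berlin"),
]

# Two sibling FILTERs that read the same input (hypothetical conditions).
FILTERS = {
    "types": lambda row: row[1] == "rdf:type",
    "labels": lambda row: row[1] == "rdfs:label",
}


def run_combined(rows, filters):
    # Step 1: combined FILTER, the disjunction (OR) of all sibling conditions.
    combined = [row for row in rows if any(cond(row) for cond in filters.values())]
    # Step 2: the original FILTERs are moved behind the combined one.
    outputs = {}
    for name, cond in filters.items():
        kept = [row for row in combined if cond(row)]
        # Step 3: a FILTER that drops nothing from the combined output is redundant.
        outputs[name] = (kept, len(kept) == len(combined))
    return combined, outputs


combined, outputs = run_combined(ROWS, FILTERS)
print(len(ROWS), "input rows,", len(combined), "rows after the combined FILTER")
```

Each original filter returns the same rows as before, but only the combined FILTER reads the full input, so less data flows to the downstream operators.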

3. COMBINING FOREACH OPERATORS
1. Create combined FOREACH operator
2. Replace with simple projections
3. Remove redundant projection
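
The same idea can be sketched for projections. In this reading, which is an assumption and not confirmed by the slides, the combined FOREACH keeps every column that any sibling FOREACH needs, each original FOREACH becomes a simple projection over that output, and a projection that keeps all combined columns is dropped as redundant. Column names and projection names are hypothetical.

```python
# Toy rows with four columns; the data is made up.
ROWS = [
    {"subject": "ex:Berlin", "predicate": "rdf:type", "object": "ex:City", "context": "ex:doc1"},
    {"subject": "ex:Germany", "predicate": "ex:capital", "object": "ex:Berlin", "context": "ex:doc2"},
]

# Three sibling FOREACH operators, each described by the columns it keeps.
PROJECTIONS = {
    "subject_predicate": ("subject", "predicate"),
    "predicate_object": ("predicate", "object"),
    "full_triple": ("subject", "predicate", "object"),
}


def combine_projections(rows, projections):
    # Step 1: combined FOREACH keeps every column that any sibling needs.
    combined_columns = sorted({col for cols in projections.values() for col in cols})
    combined = [{col: row[col] for col in combined_columns} for row in rows]
    outputs = {}
    for name, cols in projections.items():
        if sorted(cols) == combined_columns:
            # Step 3: this projection equals the combined one, so it is redundant.
            outputs[name] = combined
        else:
            # Step 2: the original FOREACH becomes a simple projection.
            outputs[name] = [{col: row[col] for col in cols} for row in combined]
    return combined_columns, outputs


combined_columns, outputs = combine_projections(ROWS, PROJECTIONS)
print("combined FOREACH keeps:", combined_columns)
```

Here the unused `context` column is dropped once, early in the plan, instead of being carried into three separate projections.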

ALL OPTIMIZATIONS
[Figure]

SUMMARY
• Optimizations reduce
  • Number of operations
  • Number of MapReduce jobs
  • Data flow between operators → less HDFS I/O
  → Improved execution time
• Reduces execution time by 70%
• … but rules should not be applied in all cases
  • More advanced (cost-based) approach is needed

FUTURE WORK
• Additional logical optimization rules
  • Ignore projections if it allows further merging of operators
• Advanced optimization strategies
  • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
• Materialization of intermediate results
  • Materialize common subsets, e.g. only triples with typed object values for later scripts

http://github.com/bforchhammer/lodop/
@anjeve
anja.jentzsch@hpi.uni-potsdam.de

LODOP - Multi-Query Optimization for Linked Data Profiling Queries

  • 1. LODOP Multi-Query Optimization for Linked Data Profiling Queries Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann Hasso Plattner Institute, Potsdam, Germany ! ! ! ! 1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014 2014/05/26
  • 2. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. 1. Challenges of Linked Data Profiling 2. ProfilingTasks 3. LODOP 4. Multi-Query Optimizations OUTLINE
  • 3. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. LINKED DATA PROFILING • Metadata often not available • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, … • Data registries,VoiD, and Semantic Sitemaps provide only basic information. e.g., description, author & license information, estimated triple and link count ! • Use cases requiring metadata • Query optimization • Data cleansing • Data integration • Schema induction ! • Data profiling: methods for computing metrics / metadata for datasets
  • 4. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. TRADITIONALVS LINKED DATA PROFILING • State of the art data profiling • Based on columns • Assumes well-defined semantics • Expects regular data ! • Heterogeneity on the Web of Data • Diverse sources • Diverse structures • Diverse views
  • 5. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. CHALLENGES OF LD PROFILING • Heterogeneity • Nested graphs Makes reasoning difficult • Loose structure Things have different predicate sets • Incomplete Missing property definitions • Poorly formatted Property types used inconsistently • Inconsistent Multiple representations claim opposite things ! • Existing (relational) data profiling tools don’t work ! • Volume of data • Requires parallelization
  • 6. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. LODOP - CONTRIBUTIONS • Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts) • System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop • Development and evaluation of 3 multi-script optimization rules ! • Apache Pig: • Platform for analyzing large datasets • High-level language: Pig Latin • Scripts executed on Hadoop / MapReduce
  • 7. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. PROFILINGTASKS • Groupings • e.g. by resource, class, property type, language, vocabulary, … ! • Tasks • Number of triples • Average number of triples per resource • Average number of triples per object URI • Average number of triples per context URL • Number of property types • Average number of property values • Number of resources • Number of inlinks / outlinks • Number of context URLs • Number of context PLDs • Property co-occurrence • Inverse Properties • URI-Literal ratio • Property value ranges • Average value length
  • 8. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. DATASETS STATISTICS ! ! ! ! ! ! ! ! ! ! ! * source: BTC 2012 dataset ** WDC = Web Data Commons *** EUNIS = European Environment Agency ! Statistics for 1M triples! DBpedia*! Freebase*! WDC RDFa**! EUNIS Species***! Number of resources! 169,035! 226,834! 168,736! 65,843! Avg. number of triples per resource! 5.9! 4.4! 5.9! 15.2! Number of classes! 19,585! 1,928! 61! 1! Number of property types! 7,844! 2,748! 477! 16! Number of URIs! 519,692! 642,183! 174,317! 407,418! Number of inlinks! 207,712! 192,179! 35,329! 78,377! Number of literals! 480,279! 357,817! 825,564! 592,582! Avg. number of property values! 127.5 363.9! 2096.2! 62,500.0!
  • 9. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. PERFORMANCE EVALUATION
  • 10. LODOP - Multi-Query Optimization for Linked Data Profiling Queries.A. Jentzsch. PROFILES2014, ESWC2014. PERFORMANCE EVALUATION • 10-15s scheduling overhead per MapReduce job (~3.4 jobs per script) • Earlier MapReduce jobs have longer runtimes • Earlier jobs handle more data more HDFS activity • Most scripts scale linearly • Most scripts reduce amount of data in workflow • Exceptions e.g. property co-occurrence scripts
  • 11. OPTIMIZATION GOALS • Optimize concurrent execution of multiple scripts • Reduce number of operators • Reduce data flow between operators
  • 12. NUMBER OF INSTANCES (PIG)
  • 13. LODOP - SYSTEM OVERVIEW
  • 14. MULTI-QUERY OPTIMIZATION 1. Merging identical operators 2. Combining FILTER operators 3. Combining FOREACH operators
  • 15. STEP 0: MASTER PLAN • Merging all logical plans into one master plan • Allows parallel execution • Reduces runtime to 25-30% of sequential execution
  • 16. 1. MERGING IDENTICAL OPERATORS • Number of property types per class • URI-literal ratio per class
  • 17. 1. MERGING IDENTICAL OPERATORS • Number of property types per class • URI-literal ratio per class 1. Identify and compare sibling operators 2. Merge matching siblings
  • 19. 1. MERGING IDENTICAL OPERATORS • Number of operators reduced from 365 to 267 • Number of MapReduce jobs reduced from 176 to 140 • Frees up cluster resources • Prerequisite step for other optimizations • Restricts parallelism
  • 20. 1. MERGING IDENTICAL OPERATORS
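The two steps of this rule (identify and compare sibling operators, then merge matching siblings) can be sketched on a toy plan. This is a minimal sketch, not LODOP's implementation: the `Op` class, the `script` helper, the textual comparison of operator arguments, and the operator chains of the two example scripts are hypothetical. Real Pig logical plans are DAGs (a JOIN has several inputs); a tree is used here to keep the example short.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """One operator in a logical plan (hypothetical, simplified model)."""
    kind: str                                  # e.g. "LOAD", "FILTER", "GROUP"
    args: str                                  # operator arguments, compared textually
    children: list = field(default_factory=list)

    def signature(self):
        return (self.kind, self.args)


def merge_identical_siblings(op):
    """Merge children of `op` that share a signature, then recurse top-down."""
    merged = {}
    for child in op.children:
        twin = merged.get(child.signature())
        if twin is None:
            merged[child.signature()] = child
        else:
            # Same operator on the same input: keep one copy and let it
            # feed the consumers of the dropped copy as well.
            twin.children.extend(child.children)
    op.children = list(merged.values())
    for child in op.children:
        merge_identical_siblings(child)
    return op


def count_ops(op):
    return 1 + sum(count_ops(child) for child in op.children)


def script(*steps):
    """Build a linear chain of operators from (kind, args) pairs."""
    ops = [Op(kind, args) for kind, args in steps]
    for parent, child in zip(ops, ops[1:]):
        parent.children.append(child)
    return ops[0]


# Two made-up scripts that share their first two operators.
plan_a = script(("LOAD", "quads"), ("FILTER", "p == rdf:type"),
                ("GROUP", "class"), ("STORE", "property_types_per_class"))
plan_b = script(("LOAD", "quads"), ("FILTER", "p == rdf:type"),
                ("FOREACH", "is_uri(o)"), ("STORE", "uri_literal_ratio_per_class"))

master = Op("PLAN", "master", [plan_a, plan_b])   # step 0: one master plan
before = count_ops(master)                        # 9 operators
merge_identical_siblings(master)
after = count_ops(master)                         # 7 operators
print(before, "->", after)
```

Merging the two LOAD operators turns their FILTER children into siblings, which is why the comparison has to be repeated level by level. It also shows why the rule restricts parallelism: both scripts now wait for the same shared operators.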
  • 21. 2. COMBINING FILTER OPERATORS
  • 22. 2. COMBINING FILTER OPERATORS 1. Create combined FILTER operator 2. Rearrange original FILTER operators 3. Remove redundant operators
  • 25. 2. COMBINING FILTER OPERATORS
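The idea behind the FILTER rule can be sketched as follows, assuming the combined operator is the disjunction (OR) of the sibling conditions and the original filters are then re-applied to its smaller output. The slides do not spell out the combined condition, so this is an assumption, and the row data, filter names and helper functions are hypothetical.

```python
def combine_filters(predicates):
    """Step 1: one predicate that keeps a row if any original filter would."""
    predicates = list(predicates)

    def combined(row):
        return any(pred(row) for pred in predicates)

    return combined


def run_with_combined_filter(rows, filters):
    """Scan the input once, then run the original filters on the reduced data."""
    combined = combine_filters(filters.values())
    reduced = [row for row in rows if combined(row)]
    # Step 2: the original filters now read the smaller intermediate result.
    outputs = {name: [row for row in reduced if pred(row)]
               for name, pred in filters.items()}
    return reduced, outputs


# Hypothetical input: (subject, predicate, object), literals in double quotes.
ROWS = [
    ("s1", "rdf:type", "ex:City"),
    ("s1", "ex:name", '"Berlin"'),
    ("s1", "ex:country", "ex:Germany"),
    ("s2", "rdf:type", "ex:Country"),
    ("s2", "ex:population", '"83000000"'),
]
FILTERS = {
    "type_triples": lambda row: row[1] == "rdf:type",
    "literal_objects": lambda row: row[2].startswith('"'),
}

reduced, outputs = run_with_combined_filter(ROWS, FILTERS)
print(len(ROWS), "rows in,", len(reduced), "rows after the combined filter")
```

Step 3 (removing redundant operators) is not shown. One case where it would apply: if all sibling filters have the same condition, the combined filter already produces their output and the originals can be dropped. Whether the rule pays off depends on how selective the combined condition is, which matches the remark in the summary that the rules should not be applied in all cases.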
  • 26. 3. COMBINING FOREACH OPERATORS
  • 27. 3. COMBINING FOREACH OPERATORS 1. Create combined FOREACH operator 2. Replace with simple projections 3. Remove redundant projection
  • 29. 3. COMBINING FOREACH OPERATORS
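The FOREACH rule follows the same pattern as the FILTER rule. In the sketch below, sibling FOREACH operators are modelled as dictionaries that map an output column to an expression; the combined operator evaluates every distinct column once, and the originals shrink to plain column selections. The data, the column names and the assumption that equal column names denote equal expressions are hypothetical simplifications.

```python
def combine_foreach(projections):
    """Step 1: collect every distinct output column of the sibling operators."""
    combined = {}
    for columns in projections.values():
        for column, expr in columns.items():
            # Assumption: the same column name always means the same expression.
            combined.setdefault(column, expr)
    return combined


def run_with_combined_foreach(rows, projections):
    combined = combine_foreach(projections)
    # The combined FOREACH evaluates each expression once per row.
    wide = [{col: expr(row) for col, expr in combined.items()} for row in rows]
    # Step 2: the original operators become simple projections on `wide`.
    return {name: [{col: record[col] for col in columns} for record in wide]
            for name, columns in projections.items()}


# Hypothetical input: (subject, predicate, object), literals in double quotes.
ROWS = [
    ("s1", "ex:name", '"Berlin"'),
    ("s1", "ex:country", "ex:Germany"),
]
PROJECTIONS = {
    "value_length": {"p": lambda r: r[1], "length": lambda r: len(r[2])},
    "is_literal": {"p": lambda r: r[1], "literal": lambda r: r[2].startswith('"')},
}

results = run_with_combined_foreach(ROWS, PROJECTIONS)
print(list(combine_foreach(PROJECTIONS)))   # columns of the combined operator
```

Here four expressions across two operators collapse into three evaluated once. Step 3 (removing a redundant projection) would apply when a follow-up projection selects exactly the columns the combined operator already emits.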
  • 30. ALL OPTIMIZATIONS
  • 31. SUMMARY • Optimizations reduce • Number of operations • Number of MapReduce jobs • Data flow between operators → less HDFS I/O → Improved execution time • Reduces execution time by 70% • … but rules should not be applied in all cases • More advanced (cost-based) approach is needed
  • 32. FUTURE WORK • Additional logical optimization rules • Ignore projections if it allows further merging of operators • Advanced optimization strategies • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go • Materialization of intermediate results • Materialize common subsets, e.g. only triples with typed object values for later scripts
  • 33. http://github.com/bforchhammer/lodop/ • @anjeve • anja.jentzsch@hpi.uni-potsdam.de