LODOP
Multi-Query Optimization for Linked Data Profiling Queries
Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann	

Hasso Plattner Institute, Potsdam, Germany	

1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014

2014/05/26
LODOP - Multi-Query Optimization for Linked Data Profiling Queries. A. Jentzsch. PROFILES2014, ESWC2014.
OUTLINE
1. Challenges of Linked Data Profiling
2. Profiling Tasks
3. LODOP
4. Multi-Query Optimizations
LINKED DATA PROFILING
• Metadata often not available	

• e.g. statistical information on predicates, classes, vocabularies, value
patterns, property co-occurrence, …	

• Data registries, VoID, and Semantic Sitemaps provide only basic
information, e.g. description, author & license information,
estimated triple and link count
• Use cases requiring metadata	

• Query optimization	

• Data cleansing	

• Data integration	

• Schema induction	

• Data profiling: methods for computing metrics / metadata for datasets
TRADITIONAL VS LINKED DATA PROFILING
• State of the art data profiling	

• Based on columns	

• Assumes well-defined semantics	

• Expects regular data	

• Heterogeneity on the Web of Data	

• Diverse sources	

• Diverse structures	

• Diverse views
CHALLENGES OF LD PROFILING
• Heterogeneity
• Nested graphs: makes reasoning difficult
• Loose structure: things have different predicate sets
• Incomplete: missing property definitions
• Poorly formatted: property types used inconsistently
• Inconsistent: multiple representations claim opposite things
• Existing (relational) data profiling tools don’t work
• Volume of data
• Requires parallelization
LODOP - CONTRIBUTIONS
• Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)	

• System for executing, benchmarking and optimizing data profiling
scripts with Apache Pig on Hadoop	

• Development and evaluation of 3 multi-script optimization rules	

• Apache Pig:	

• Platform for analyzing large datasets	

• High-level language: Pig Latin	

• Scripts executed on Hadoop / MapReduce
PROFILING TASKS
• Groupings	

• e.g. by resource, class, property type, language, vocabulary, …	

• Tasks	

• Number of triples	

• Average number of triples per resource	

• Average number of triples per object URI	

• Average number of triples per context URL	

• Number of property types	

• Average number of property values	

• Number of resources	

• Number of inlinks / outlinks
• Number of context URLs	

• Number of context PLDs	

• Property co-occurrence	

• Inverse Properties	

• URI-Literal ratio	

• Property value ranges	

• Average value length
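A few of the tasks listed above can be sketched in plain Python over a toy list of triples. This is only an illustration, not the Pig scripts from LODOP: the sample data, the `profile` function, and the metric definitions (a resource is a distinct subject, average property values is triples divided by property types) are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, predicate, object, object_is_uri)
TRIPLES = [
    ("ex:Berlin", "rdf:type", "ex:City", True),
    ("ex:Berlin", "ex:population", "3500000", False),
    ("ex:Berlin", "ex:country", "ex:Germany", True),
    ("ex:Germany", "rdf:type", "ex:Country", True),
    ("ex:Germany", "ex:capital", "ex:Berlin", True),
    ("ex:Germany", "rdfs:label", "Germany", False),
]


def profile(triples):
    """Compute several profiling metrics in a single pass over the triples."""
    triples_per_resource = defaultdict(int)  # grouping by resource (subject)
    property_types = set()
    uri_objects = 0
    for subject, predicate, _obj, object_is_uri in triples:
        triples_per_resource[subject] += 1
        property_types.add(predicate)
        uri_objects += 1 if object_is_uri else 0

    total = len(triples)
    literals = total - uri_objects
    return {
        "triples": total,
        "resources": len(triples_per_resource),
        "avg_triples_per_resource": total / len(triples_per_resource),
        "property_types": len(property_types),
        "avg_property_values": total / len(property_types),
        # Assumed definition: URI objects divided by literal objects.
        "uri_literal_ratio": uri_objects / literals if literals else float("inf"),
    }


stats = profile(TRIPLES)
print(stats)
```

For the six sample triples this yields 2 resources, 3.0 triples per resource, 5 property types, and a URI-literal ratio of 2.0. Each metric here would be a separate script in a script-per-task setup, which is why sharing the common scan and grouping work across scripts is attractive.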
DATASETS STATISTICS
Statistics for 1M triples             DBpedia*   Freebase*   WDC RDFa**   EUNIS Species***
Number of resources                   169,035    226,834     168,736      65,843
Avg. number of triples per resource   5.9        4.4         5.9          15.2
Number of classes                     19,585     1,928       61           1
Number of property types              7,844      2,748       477          16
Number of URIs                        519,692    642,183     174,317      407,418
Number of inlinks                     207,712    192,179     35,329       78,377
Number of literals                    480,279    357,817     825,564      592,582
Avg. number of property values        127.5      363.9       2096.2       62,500.0

*   source: BTC 2012 dataset
**  WDC = Web Data Commons
*** EUNIS = European Environment Agency
PERFORMANCE EVALUATION
• 10-15s scheduling overhead per MapReduce job (~3.4 jobs per script)	

• Earlier MapReduce jobs have longer runtimes	

• Earlier jobs handle more data → more HDFS activity

• Most scripts scale linearly	

• Most scripts reduce amount of data in workflow	

• Exceptions e.g. property co-occurrence scripts
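The scheduling overhead quoted above can be turned into a rough per-script figure. The snippet below is only back-of-the-envelope arithmetic on the two numbers from the slide; the function name is made up for this example, and the result is the overhead summed over a script's jobs, not a measured wall-clock time.

```python
JOBS_PER_SCRIPT = 3.4            # average MapReduce jobs per script (from the slide)
OVERHEAD_RANGE_S = (10, 15)      # scheduling overhead per job in seconds (from the slide)


def scheduling_overhead(jobs, overhead_range=OVERHEAD_RANGE_S):
    """Return the (low, high) scheduling overhead in seconds for `jobs` jobs."""
    low, high = overhead_range
    return round(jobs * low, 1), round(jobs * high, 1)


per_script = scheduling_overhead(JOBS_PER_SCRIPT)
print(per_script)
```

With these inputs an average script spends roughly 34 to 51 seconds on scheduling alone, which suggests why reducing the number of MapReduce jobs is worthwhile even before any data is processed.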
OPTIMIZATION GOALS
• Optimize concurrent execution of multiple scripts	

• Reduce number of operators	

• Reduce data flow between operators
NUMBER OF INSTANCES (PIG)
LODOP - SYSTEM OVERVIEW
MULTI-QUERY OPTIMIZATION
1. Merging identical operators	

2. Combining FILTER operators	

3. Combining FOREACH operators
STEP 0: MASTER PLAN
• Merging all logical plans into one master plan
• Allows parallel execution
• Reduces runtime to 25-30% of sequential execution
1. MERGING IDENTICAL OPERATORS
Example scripts: "Number of property types per class" and "URI-Literal ratio per class"
1. Identify and compare sibling operators
2. Merge matching siblings
• Number of operators reduced from 365 to 267	

• Number of MapReduce jobs reduced from 176 to 140	

• Frees up cluster resources	

• Prerequisite step for other optimizations

• Restricts parallelism	
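The two steps of this rule (identify and compare sibling operators, merge matching siblings) can be sketched on a simplified plan. The sketch assumes each plan is a tree of operators identified by a kind and an argument string; real Pig logical plans are DAGs with richer operator state, and the `Op` class, operator arguments, and output paths below are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    kind: str                      # e.g. LOAD, GROUP, FOREACH, STORE
    args: str                      # simplified stand-in for the operator's parameters
    children: list = field(default_factory=list)


def merge_siblings(ops):
    """Merge sibling operators with equal kind and args, then recurse into children."""
    merged = {}
    for op in ops:
        key = (op.kind, op.args)   # step 1: compare siblings by signature
        if key in merged:
            # step 2: matching sibling found, so attach its children to the survivor
            merged[key].children.extend(op.children)
        else:
            merged[key] = Op(op.kind, op.args, list(op.children))
    result = list(merged.values())
    for op in result:
        op.children = merge_siblings(op.children)
    return result


def count_ops(ops):
    """Count all operators in a list of plan trees."""
    return sum(1 + count_ops(op.children) for op in ops)


# Two hypothetical scripts that share their LOAD and GROUP steps.
property_types = Op("LOAD", "quads", [Op("GROUP", "by class", [
    Op("FOREACH", "count distinct predicates", [Op("STORE", "out/property_types")])])])
uri_literal = Op("LOAD", "quads", [Op("GROUP", "by class", [
    Op("FOREACH", "uri/literal ratio", [Op("STORE", "out/uri_literal_ratio")])])])

master_plan = merge_siblings([property_types, uri_literal])
print(count_ops([property_types, uri_literal]), "->", count_ops(master_plan))
```

In this toy case the eight operators of the two separate plans shrink to six, because the shared LOAD and GROUP are executed once and the plan only branches at the two FOREACH operators.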

2. COMBINING FILTER OPERATORS
1. Create combined FILTER operator
2. Rearrange original FILTER operators	

3. Remove redundant operators
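A minimal sketch of the first two steps, assuming the combined FILTER keeps a row if any of the sibling conditions accepts it (a disjunction) and the original filters then read the smaller combined output. The function name, the sample triples, and the filter conditions are made up for illustration; step 3 is not modelled, since deciding that a filter is redundant would be done on the filter conditions rather than on the data.

```python
def combine_filters(rows, predicates):
    """Run several sibling filters with a single shared pass over `rows`."""
    # Step 1: combined filter, true if at least one original condition holds.
    combined = [row for row in rows if any(p(row) for p in predicates.values())]
    # Step 2: the original filters are rearranged to read the combined output.
    outputs = {
        name: [row for row in combined if p(row)]
        for name, p in predicates.items()
    }
    return combined, outputs


# Hypothetical (subject, predicate, object) triples and sibling filters.
TRIPLES = [
    ("ex:Berlin", "rdf:type", "ex:City"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
    ("ex:Berlin", "ex:population", "3500000"),
    ("ex:Germany", "rdf:type", "ex:Country"),
    ("ex:Germany", "ex:capital", "ex:Berlin"),
]
FILTERS = {
    "type_triples": lambda t: t[1] == "rdf:type",
    "label_triples": lambda t: t[1] == "rdfs:label",
}

combined, outputs = combine_filters(TRIPLES, FILTERS)
print(len(TRIPLES), "rows in,", len(combined), "rows passed on")
```

Here the input is scanned once and only three of the five rows flow on to the original filters, which is the kind of reduced data flow between operators the optimization goals call for.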
3. COMBINING FOREACH OPERATORS
1. Create combined FOREACH operator
2. Replace with simple projections
3. Remove redundant projection
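The three steps can be sketched as follows, under the assumption that the combined FOREACH evaluates the union of the columns requested by the sibling FOREACH operators, so each expression is computed once. All names, the column expressions, and the sample rows are hypothetical, and this is not the LODOP implementation.

```python
def combine_foreach(rows, expressions, foreachs):
    """Evaluate sibling FOREACH operators through one combined FOREACH."""
    # Step 1: the combined FOREACH computes the union of all requested columns.
    columns = []
    for cols in foreachs.values():
        for col in cols:
            if col not in columns:
                columns.append(col)
    combined = [{col: expressions[col](row) for col in columns} for row in rows]

    outputs, redundant = {}, []
    for name, cols in foreachs.items():
        if set(cols) == set(columns):
            # Step 3: a projection of every combined column changes nothing,
            # so it can be removed and the combined output used directly.
            redundant.append(name)
            outputs[name] = combined
        else:
            # Step 2: the original FOREACH becomes a simple column projection.
            outputs[name] = [{col: r[col] for col in cols} for r in combined]
    return columns, outputs, redundant


# Hypothetical column expressions over (subject, predicate, object) triples.
EXPRESSIONS = {
    "subject": lambda t: t[0],
    "predicate": lambda t: t[1],
    "value_length": lambda t: len(t[2]),
}
FOREACHS = {
    "length_per_property": ["predicate", "value_length"],
    "length_per_resource": ["subject", "predicate", "value_length"],
}
ROWS = [("ex:a", "rdfs:label", "Alpha"), ("ex:b", "rdfs:label", "Be")]

columns, outputs, redundant = combine_foreach(ROWS, EXPRESSIONS, FOREACHS)
print(columns, redundant)
```

In this example `value_length` is computed once per row instead of twice, and `length_per_resource` turns out to be redundant because it asks for exactly the columns the combined operator already produces.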
ALL OPTIMIZATIONS
SUMMARY
• Optimizations reduce	

• Number of operations	

• Number of MapReduce jobs	

• Data flow between operators → less HDFS I/O	

→ Improved execution time 	

• Reduces execution time by 70%	

• … but rules should not be applied in all cases	

• More advanced (cost-based) approach is needed
FUTURE WORK
• Additional logical optimization rules	

• Ignore projections if it allows further merging of operators	

• Advanced optimization strategies	

• Cost-based approach could use previous profiling results (e.g.
cardinalities) → on-the-go	

• Materialization of intermediate results	

• Materialize common subsets, e.g. only triples with typed object
values for later scripts
http://github.com/bforchhammer/lodop/ 	

@anjeve	

anja.jentzsch@hpi.uni-potsdam.de

Q4-Mod-1c-Quiz-Projectile-333344444.pptxtuking87
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPRPirithiRaju
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxzeus70441
 
cybrids.pptx production_advanges_limitation
cybrids.pptx production_advanges_limitationcybrids.pptx production_advanges_limitation
cybrids.pptx production_advanges_limitationSanghamitraMohapatra5
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Sérgio Sacani
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfGABYFIORELAMALPARTID1
 
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyLAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyChayanika Das
 

Recently uploaded (20)

DETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptxDETECTION OF MUTATION BY CLB METHOD.pptx
DETECTION OF MUTATION BY CLB METHOD.pptx
 
PLASMODIUM. PPTX
PLASMODIUM. PPTXPLASMODIUM. PPTX
PLASMODIUM. PPTX
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
Total Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of CannabinoidsTotal Legal: A “Joint” Journey into the Chemistry of Cannabinoids
Total Legal: A “Joint” Journey into the Chemistry of Cannabinoids
 
Probability.pptx, Types of Probability, UG
Probability.pptx, Types of Probability, UGProbability.pptx, Types of Probability, UG
Probability.pptx, Types of Probability, UG
 
Introduction Classification Of Alkaloids
Introduction Classification Of AlkaloidsIntroduction Classification Of Alkaloids
Introduction Classification Of Alkaloids
 
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests GlycosidesGLYCOSIDES Classification Of GLYCOSIDES  Chemical Tests Glycosides
GLYCOSIDES Classification Of GLYCOSIDES Chemical Tests Glycosides
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
 
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
ESSENTIAL FEATURES REQUIRED FOR ESTABLISHING FOUR TYPES OF BIOSAFETY LABORATO...
 
Observational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive starsObservational constraints on mergers creating magnetism in massive stars
Observational constraints on mergers creating magnetism in massive stars
 
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024GenAI talk for Young at Wageningen University & Research (WUR) March 2024
GenAI talk for Young at Wageningen University & Research (WUR) March 2024
 
Introduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptxIntroduction of Human Body & Structure of cell.pptx
Introduction of Human Body & Structure of cell.pptx
 
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
Charateristics of the Angara-A5 spacecraft launched from the Vostochny Cosmod...
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Abnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptxAbnormal LFTs rate of deco and NAFLD.pptx
Abnormal LFTs rate of deco and NAFLD.pptx
 
cybrids.pptx production_advanges_limitation
cybrids.pptx production_advanges_limitationcybrids.pptx production_advanges_limitation
cybrids.pptx production_advanges_limitation
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyLAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
 

Multi-Query Optimization for Linked Data Profiling Queries

  • 1. LODOP: Multi-Query Optimization for Linked Data Profiling Queries
      Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann
      Hasso Plattner Institute, Potsdam, Germany
      1st International Workshop on Dataset Profiling & Federated Search for Linked Data (PROFILES2014), ESWC 2014
      2014/05/26
  • 2. OUTLINE
      LODOP - Multi-Query Optimization for Linked Data Profiling Queries. A. Jentzsch. PROFILES2014, ESWC2014.
      1. Challenges of Linked Data Profiling
      2. Profiling Tasks
      3. LODOP
      4. Multi-Query Optimizations
  • 3. LINKED DATA PROFILING
      • Metadata often not available
          • e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
          • Data registries, VoID, and Semantic Sitemaps provide only basic information, e.g. description, author & license information, estimated triple and link count
      • Use cases requiring metadata
          • Query optimization
          • Data cleansing
          • Data integration
          • Schema induction
      • Data profiling: methods for computing metrics / metadata for datasets
  • 4. TRADITIONAL VS LINKED DATA PROFILING
      • State of the art data profiling
          • Based on columns
          • Assumes well-defined semantics
          • Expects regular data
      • Heterogeneity on the Web of Data
          • Diverse sources
          • Diverse structures
          • Diverse views
  • 5. CHALLENGES OF LD PROFILING
      • Heterogeneity
          • Nested graphs: makes reasoning difficult
          • Loose structure: things have different predicate sets
          • Incomplete: missing property definitions
          • Poorly formatted: property types used inconsistently
          • Inconsistent: multiple representations claim opposite things
      • Existing (relational) data profiling tools don't work
      • Volume of data
          • Requires parallelization
  • 6. LODOP - CONTRIBUTIONS
      • Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
      • System for executing, benchmarking and optimizing data profiling scripts with Apache Pig on Hadoop
      • Development and evaluation of 3 multi-script optimization rules
      • Apache Pig:
          • Platform for analyzing large datasets
          • High-level language: Pig Latin
          • Scripts executed on Hadoop / MapReduce
  • 7. PROFILING TASKS
      • Groupings
          • e.g. by resource, class, property type, language, vocabulary, …
      • Tasks
          • Number of triples
          • Average number of triples per resource
          • Average number of triples per object URI
          • Average number of triples per context URL
          • Number of property types
          • Average number of property values
          • Number of resources
          • Number of inlinks / outlinks
          • Number of context URLs
          • Number of context PLDs
          • Property co-occurrence
          • Inverse properties
          • URI-literal ratio
          • Property value ranges
          • Average value length
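A minimal Python sketch of a few of the tasks listed above, computed in memory over a handful of toy triples. It is only an illustration of what the metrics mean: LODOP itself implements the tasks as Apache Pig scripts, and the sample data, the `is_uri` convention (URIs written in angle brackets), the decision to count only subjects as resources, and all function names are assumptions made here.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: (subject, predicate, object).
# Assumption: terms in <...> are URIs, everything else is a literal.
TRIPLES = [
    ("<ex:alice>", "<rdf:type>", "<ex:Person>"),
    ("<ex:alice>", "<ex:name>", "Alice"),
    ("<ex:alice>", "<ex:knows>", "<ex:bob>"),
    ("<ex:bob>", "<rdf:type>", "<ex:Person>"),
    ("<ex:bob>", "<ex:name>", "Bob"),
]


def is_uri(term):
    return term.startswith("<") and term.endswith(">")


def profile(triples):
    # Group the predicates by resource (here: by subject only).
    props_by_resource = defaultdict(set)
    for s, p, _ in triples:
        props_by_resource[s].add(p)

    uri_objects = sum(1 for _, _, o in triples if is_uri(o))
    literal_objects = len(triples) - uri_objects

    # Property co-occurrence: how many resources use both properties of a pair.
    cooccurrence = Counter()
    for props in props_by_resource.values():
        for pair in combinations(sorted(props), 2):
            cooccurrence[pair] += 1

    return {
        "triples": len(triples),
        "resources": len(props_by_resource),
        "avg_triples_per_resource": len(triples) / len(props_by_resource),
        "property_types": len({p for _, p, _ in triples}),
        "uri_literal_ratio": (
            uri_objects / literal_objects if literal_objects else float("inf")
        ),
        "cooccurrence": cooccurrence,
    }


stats = profile(TRIPLES)
print(stats["triples"], stats["resources"], stats["avg_triples_per_resource"])
```

Note that the co-occurrence step enumerates all property pairs per resource, so its output can be larger than its input, unlike the simple counts.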
  • 8. DATASETS STATISTICS

      Statistics for 1M triples              DBpedia*   Freebase*   WDC RDFa**   EUNIS Species***
      Number of resources                     169,035     226,834      168,736             65,843
      Avg. number of triples per resource         5.9         4.4          5.9               15.2
      Number of classes                        19,585       1,928           61                  1
      Number of property types                  7,844       2,748          477                 16
      Number of URIs                          519,692     642,183      174,317            407,418
      Number of inlinks                       207,712     192,179       35,329             78,377
      Number of literals                      480,279     357,817      825,564            592,582
      Avg. number of property values            127.5       363.9       2096.2           62,500.0

      * source: BTC 2012 dataset
      ** WDC = Web Data Commons
      *** EUNIS = European Environment Agency
  • 9. PERFORMANCE EVALUATION
  • 10. PERFORMANCE EVALUATION
      • 10-15s scheduling overhead per MapReduce job (~3.4 jobs per script)
      • Earlier MapReduce jobs have longer runtimes
          • Earlier jobs handle more data → more HDFS activity
      • Most scripts scale linearly
          • Most scripts reduce amount of data in workflow
          • Exceptions: e.g. property co-occurrence scripts
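To put the scheduling overhead into perspective, the figures on this slide can be multiplied out. The short Python sketch below does that arithmetic; it assumes, as a simplification not stated on the slide, that the jobs of one script run one after another so that their scheduling overheads add up.

```python
# Figures taken from the slide.
JOBS_PER_SCRIPT = 3.4            # average MapReduce jobs per script
OVERHEAD_PER_JOB_S = (10, 15)    # scheduling overhead per job, in seconds


def scheduling_overhead(jobs, per_job_range=OVERHEAD_PER_JOB_S):
    """Return the (low, high) scheduling overhead in seconds for `jobs` jobs.

    Assumption: jobs run sequentially, so overheads simply add up.
    """
    low, high = per_job_range
    return round(jobs * low, 1), round(jobs * high, 1)


per_script = scheduling_overhead(JOBS_PER_SCRIPT)
print(f"Scheduling overhead per script: {per_script[0]}-{per_script[1]} s")
```

Under that assumption an average script spends roughly half a minute to a minute on scheduling alone, which is why reducing the number of MapReduce jobs is an optimization target in its own right.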
  • 11. OPTIMIZATION GOALS
      • Optimize concurrent execution of multiple scripts
      • Reduce number of operators
      • Reduce data flow between operators
  • 12. NUMBER OF INSTANCES (PIG)
  • 13. LODOP - SYSTEM OVERVIEW
  • 14. MULTI-QUERY OPTIMIZATION
      1. Merging identical operators
      2. Combining FILTER operators
      3. Combining FOREACH operators
  • 15. STEP 0: MASTER PLAN
      • Merging all logical plans into one master plan
      • Allows parallel execution
      • Reduces runtime to 25-30% of sequential execution
  • 16. 1. MERGING IDENTICAL OPERATORS
      • Number of property types per class
      • URI-literal ratio per class
  • 17. 1. MERGING IDENTICAL OPERATORS
      • Number of property types per class
      • URI-literal ratio per class
      1. Identify and compare sibling operators
      2. Merge matching siblings
  • 19. 1. MERGING IDENTICAL OPERATORS
      • Number of operators reduced from 365 to 267
      • Number of MapReduce jobs reduced from 176 to 140
      • Frees up cluster resources
      • Prerequisite step for other optimizations
      • Restricts parallelism
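The two steps of this rule (identify and compare sibling operators, merge matching siblings) can be sketched on a toy operator tree in Python. This is an illustrative model only, not the LODOP implementation, which works on Pig's logical plans: the `Op` class, its `signature`, and the two example pipelines are hypothetical, and the plan is simplified to a tree in which every operator has exactly one parent.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    """A hypothetical plan operator: a kind (LOAD, GROUP, ...) plus its arguments."""
    kind: str
    args: str
    children: list = field(default_factory=list)

    def signature(self):
        # Two siblings are considered identical if kind and arguments match.
        return (self.kind, self.args)


def merge_identical_siblings(op):
    """Merge children of `op` that share a signature, then recurse downwards."""
    merged = {}
    for child in op.children:
        twin = merged.get(child.signature())
        if twin is None:
            merged[child.signature()] = child
        else:
            # The duplicate disappears; its consumers now read from the twin.
            twin.children.extend(child.children)
    op.children = list(merged.values())
    for child in op.children:
        merge_identical_siblings(child)
    return op


def count_ops(op):
    return 1 + sum(count_ops(c) for c in op.children)


def pipeline(final_foreach):
    # Assumed shape of a per-class script: LOAD -> GROUP BY class -> FOREACH.
    return Op("LOAD", "triples", [Op("GROUP", "class", [Op("FOREACH", final_foreach)])])


# Master plan holding two scripts that differ only in their last operator.
master = Op("ROOT", "", [pipeline("count property types"), pipeline("uri/literal ratio")])
before = count_ops(master)
after = count_ops(merge_identical_siblings(master))
print(before, "->", after)
```

Merging happens top-down, so the two GROUP operators only become siblings (and therefore mergeable) after their LOAD parents have been merged. The merged operator is now a single point that both scripts wait on, which is one way to read the "restricts parallelism" caveat above.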
  • 20. 1. MERGING IDENTICAL OPERATORS
  • 21. 2. COMBINING FILTER OPERATORS
  • 22. 2. COMBINING FILTER OPERATORS
      1. Create combined FILTER operator
      2. Rearrange original FILTER operators
      3. Remove redundant operators
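One plausible reading of these steps, sketched in Python: sibling filters over the same input are preceded by a single combined filter that keeps a row if any of the original conditions holds, and the original filters are then rearranged to read from that smaller intermediate result. The slides do not spell out the exact rewrite, so the disjunction, the function names, and the sample rows are assumptions; step 3 (removing operators that became redundant) is not modeled.

```python
# Hypothetical input rows: (subject, predicate, object); literals are quoted.
ROWS = [
    ("ex:a", "rdf:type", "ex:Person"),
    ("ex:a", "ex:name", '"Alice"'),
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:b", "ex:age", '"42"'),
]

# Two sibling FILTER conditions that originally both scanned the full input.
FILTERS = {
    "types": lambda row: row[1] == "rdf:type",
    "literals": lambda row: row[2].startswith('"'),
}


def combine_filters(rows, filters):
    # Step 1: one combined filter (logical OR of all conditions) scans the input once.
    shared = [row for row in rows if any(cond(row) for cond in filters.values())]
    # Step 2: the original filters now read from the reduced shared output.
    outputs = {
        name: [row for row in shared if cond(row)] for name, cond in filters.items()
    }
    return shared, outputs


shared, outputs = combine_filters(ROWS, FILTERS)
print(len(ROWS), "input rows ->", len(shared), "rows after the combined filter")
```

Each original filter still produces exactly the rows it produced before; the gain is that the full input is scanned once and less data flows into the downstream operators. When the combined condition keeps nearly every row, the extra operator buys little, which fits the later remark that the rules should not be applied in all cases.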
  • 25. 2. COMBINING FILTER OPERATORS
  • 26. 3. COMBINING FOREACH OPERATORS
  • 27. 3. COMBINING FOREACH OPERATORS
      1. Create combined FOREACH operator
      2. Replace with simple projections
      3. Remove redundant projection
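The FOREACH rule can be sketched along the same lines in Python: a combined FOREACH evaluates the union of all expressions once per row, and each original FOREACH is replaced by a simple projection that only picks its columns from that wider result. This is an interpretation of the steps above rather than the actual LODOP code; the expression names and sample rows are hypothetical, expressions sharing a name are assumed to be identical, and step 3 (dropping a projection that selects every column) is not modeled.

```python
# Hypothetical input rows: (subject, predicate, object).
ROWS = [
    ("ex:a", "ex:name", "Alice"),
    ("ex:b", "ex:name", "Bob"),
]

# Named expressions; names are assumed to identify an expression uniquely.
EXPRESSIONS = {
    "subject": lambda row: row[0],
    "predicate": lambda row: row[1],
    "value_length": lambda row: len(row[2]),
}

# Two sibling FOREACH operators, each generating a subset of the expressions.
FOREACH_OPS = {
    "properties_per_resource": ["subject", "predicate"],
    "value_lengths": ["subject", "value_length"],
}


def combine_foreach(rows, foreach_ops, expressions):
    # Step 1: the combined FOREACH computes every needed expression once per row.
    needed = sorted({name for cols in foreach_ops.values() for name in cols})
    wide = [{name: expressions[name](row) for name in needed} for row in rows]
    # Step 2: each original FOREACH becomes a plain projection of the wide rows.
    return {
        op: [{name: w[name] for name in cols} for w in wide]
        for op, cols in foreach_ops.items()
    }


projected = combine_foreach(ROWS, FOREACH_OPS, EXPRESSIONS)
print(projected["value_lengths"])
```

The shared `subject` expression is evaluated once instead of twice. The trade-off is that the intermediate rows become wider, so the rule pays off mainly when the sibling operators overlap heavily.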
  • 29. 3. COMBINING FOREACH OPERATORS
  • 30. ALL OPTIMIZATIONS
  • 31. SUMMARY
      • Optimizations reduce
          • Number of operations
          • Number of MapReduce jobs
          • Data flow between operators → less HDFS I/O → improved execution time
      • Reduces execution time by 70%
      • … but rules should not be applied in all cases
      • More advanced (cost-based) approach is needed
  • 32. FUTURE WORK
      • Additional logical optimization rules
          • Ignore projections if it allows further merging of operators
      • Advanced optimization strategies
          • Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
      • Materialization of intermediate results
          • Materialize common subsets, e.g. only triples with typed object values for later scripts
  • 33. http://github.com/bforchhammer/lodop/
      @anjeve
      anja.jentzsch@hpi.uni-potsdam.de