LODOP: Multi-Query Optimization for Linked Data Profiling Queries
Anja Jentzsch (@anjeve), Benedikt Forchhammer, Felix Naumann
Hasso Plattner Institute, Potsdam, Germany
1st International Workshop on Dataset Profiling &
Federated Search for Linked Data (PROFILES2014), ESWC 2014
2014/05/26
LODOP - Multi-Query Optimization for Linked Data Profiling Queries. A. Jentzsch. PROFILES2014, ESWC2014.
OUTLINE
1. Challenges of Linked Data Profiling
2. Profiling Tasks
3. LODOP
4. Multi-Query Optimizations
LINKED DATA PROFILING
• Metadata often not available
• e.g. statistical information on predicates, classes, vocabularies, value patterns, property co-occurrence, …
• Data registries, VoID, and Semantic Sitemaps provide only basic information, e.g. description, author and license information, estimated triple and link count
• Use cases requiring metadata
• Query optimization
• Data cleansing
• Data integration
• Schema induction
• Data profiling: methods for computing metrics / metadata for datasets
TRADITIONAL VS LINKED DATA PROFILING
• State-of-the-art data profiling
• Based on columns
• Assumes well-defined semantics
• Expects regular data
• Heterogeneity on the Web of Data
• Diverse sources
• Diverse structures
• Diverse views
CHALLENGES OF LD PROFILING
• Heterogeneity
• Nested graphs: make reasoning difficult
• Loose structure: resources have different predicate sets
• Incomplete: missing property definitions
• Poorly formatted: property types used inconsistently
• Inconsistent: multiple representations claim opposite things
• Existing (relational) data profiling tools don’t work
• Volume of data
• Requires parallelization
LODOP - CONTRIBUTIONS
• Implementation of 15 profiling tasks as Apache Pig scripts (56 scripts)
• System for executing, benchmarking, and optimizing data profiling scripts with Apache Pig on Hadoop
• Development and evaluation of 3 multi-script optimization rules
• Apache Pig:
• Platform for analyzing large datasets
• High-level language: Pig Latin
• Scripts executed on Hadoop / MapReduce
PROFILING TASKS
• Groupings
• e.g. by resource, class, property type, language, vocabulary, …
• Tasks
• Number of triples
• Average number of triples per resource
• Average number of triples per object URI
• Average number of triples per context URL
• Number of property types
• Average number of property values
• Number of resources
• Number of inlinks / outlinks
• Number of context URLs
• Number of context PLDs
• Property co-occurrence
• Inverse Properties
• URI-Literal ratio
• Property value ranges
• Average value length
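A few of the tasks listed above can be sketched in plain Python over an in-memory list of triples. This is only an illustration of what the metrics compute, not the Pig scripts from the talk; the toy triples, the `is_uri` heuristic, and the function name `profile` are assumptions made for this example.

```python
from collections import defaultdict

# Made-up toy data: (subject, predicate, object). Objects that look like
# HTTP URIs count as URIs, everything else as literals (an assumption).
TRIPLES = [
    ("http://ex.org/a", "http://ex.org/name", "Alice"),
    ("http://ex.org/a", "http://ex.org/knows", "http://ex.org/b"),
    ("http://ex.org/b", "http://ex.org/name", "Bob"),
    ("http://ex.org/b", "http://ex.org/age", "42"),
]

def is_uri(value):
    """Crude URI test used only for this sketch."""
    return value.startswith("http://") or value.startswith("https://")

def profile(triples):
    """Compute a handful of the listed metrics in one pass."""
    triples_per_resource = defaultdict(int)
    property_types = set()
    uri_objects = 0
    for subject, predicate, obj in triples:
        triples_per_resource[subject] += 1
        property_types.add(predicate)
        if is_uri(obj):
            uri_objects += 1
    literal_objects = len(triples) - uri_objects
    resources = len(triples_per_resource)
    return {
        "triples": len(triples),
        "resources": resources,
        "avg_triples_per_resource": len(triples) / resources if resources else 0.0,
        "property_types": len(property_types),
        # Undefined when there are no literals, so report None in that case.
        "uri_literal_ratio": uri_objects / literal_objects if literal_objects else None,
    }

stats = profile(TRIPLES)
print(stats)
```

Grouped variants (per class, per vocabulary, and so on) would apply the same counters after partitioning the triples by the grouping key.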
DATASET STATISTICS
Statistics for 1M triples           | DBpedia* | Freebase* | WDC RDFa** | EUNIS Species***
Number of resources                 | 169,035  | 226,834   | 168,736    | 65,843
Avg. number of triples per resource | 5.9      | 4.4       | 5.9        | 15.2
Number of classes                   | 19,585   | 1,928     | 61         | 1
Number of property types            | 7,844    | 2,748     | 477        | 16
Number of URIs                      | 519,692  | 642,183   | 174,317    | 407,418
Number of inlinks                   | 207,712  | 192,179   | 35,329     | 78,377
Number of literals                  | 480,279  | 357,817   | 825,564    | 592,582
Avg. number of property values      | 127.5    | 363.9     | 2,096.2    | 62,500.0
* source: BTC 2012 dataset
** WDC = Web Data Commons
*** EUNIS = European Environment Agency
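The slides do not define "Avg. number of property values". One reading that happens to reproduce the last row, offered here as an observation rather than the authors' definition, is (number of URIs + number of literals) divided by the number of property types. The check below uses only figures from the table; the dictionary layout and the function name are assumptions for this sketch.

```python
# Figures copied from the dataset statistics table.
TABLE = {
    "DBpedia":       {"uris": 519_692, "literals": 480_279, "property_types": 7_844, "reported": 127.5},
    "Freebase":      {"uris": 642_183, "literals": 357_817, "property_types": 2_748, "reported": 363.9},
    "WDC RDFa":      {"uris": 174_317, "literals": 825_564, "property_types": 477,   "reported": 2096.2},
    "EUNIS Species": {"uris": 407_418, "literals": 592_582, "property_types": 16,    "reported": 62500.0},
}

def avg_property_values(row):
    """Assumed definition: object values (URIs + literals) per property type."""
    return (row["uris"] + row["literals"]) / row["property_types"]

for name, row in TABLE.items():
    computed = avg_property_values(row)
    print(f"{name}: computed {computed:.1f}, reported {row['reported']}")
```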
PERFORMANCE EVALUATION
• 10-15s scheduling overhead per MapReduce job (~3.4 jobs per script)
• Earlier MapReduce jobs have longer runtimes
• Earlier jobs handle more data → more HDFS activity
• Most scripts scale linearly
• Most scripts reduce amount of data in workflow
• Exceptions e.g. property co-occurrence scripts
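The slides do not say why property co-occurrence scripts are an exception. One plausible reason, stated here as an assumption, is that pairing the properties of each resource can produce more intermediate records than there are input triples. A minimal Python sketch (the function name and toy data are made up for illustration):

```python
from collections import defaultdict
from itertools import combinations

def property_cooccurrence(triples):
    """Count how often two properties appear on the same subject."""
    properties_by_subject = defaultdict(set)
    for subject, predicate, _obj in triples:
        properties_by_subject[subject].add(predicate)
    pair_counts = defaultdict(int)
    for properties in properties_by_subject.values():
        # Every unordered pair of distinct properties on one subject.
        for pair in combinations(sorted(properties), 2):
            pair_counts[pair] += 1
    return dict(pair_counts)

# One resource with 5 distinct properties: 5 input triples yield 10 pairs.
triples = [("s1", f"p{i}", f"v{i}") for i in range(5)]
pairs = property_cooccurrence(triples)
print(len(triples), "triples ->", len(pairs), "property pairs")
```

With n distinct properties on a resource, the pairing step emits n(n-1)/2 pairs, so the data grows rather than shrinks at that point in the workflow.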
OPTIMIZATION GOALS
• Optimize concurrent execution of multiple scripts
• Reduce number of operators
• Reduce data flow between operators
NUMBER OF INSTANCES (PIG)
LODOP - SYSTEM OVERVIEW
STEP 0: MASTER PLAN
• Merging all logical plans into one master plan
• Allows parallel execution
• Reduces runtime to 25-30% of sequential execution
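As a rough sketch of the master-plan idea, the Python below hangs several per-script operator chains under one shared root so they can be submitted together. It is not LODOP's implementation: the adjacency-list representation, the `ROOT` node, the script names, and the operator sequences are all hypothetical.

```python
def build_master_plan(scripts):
    """Combine per-script operator chains into one plan with a common root.

    `scripts` maps a script name to its ordered list of operator names.
    The result maps each node id to the ids of its successors.
    """
    master = {"ROOT": []}
    for name, operators in scripts.items():
        previous = "ROOT"
        for position, operator in enumerate(operators):
            # Namespace node ids by script so nothing is merged at this stage.
            node_id = f"{name}#{position}:{operator}"
            master[previous].append(node_id)
            master[node_id] = []
            previous = node_id
    return master

# Hypothetical operator chains for two profiling scripts.
SCRIPTS = {
    "property_types_per_class": ["LOAD", "FILTER", "GROUP", "STORE"],
    "uri_literal_ratio_per_class": ["LOAD", "FILTER", "FOREACH", "GROUP", "STORE"],
}

plan = build_master_plan(SCRIPTS)
print(len(plan) - 1, "operators under one root")
print(plan["ROOT"])
```

At this stage no operators are shared; the plan only groups the scripts so they run as one unit.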
1. MERGING IDENTICAL OPERATORS
Example scripts: Number of property types per class, URI-literal ratio per class
1. Identify and compare sibling operators
2. Merge matching siblings
• Number of operators reduced from 365 to 267
• Number of MapReduce jobs reduced from 176 to 140
• Frees up cluster resources
• Prerequisite step for other optimizations
• Drawback: restricts parallelism
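The two steps of this rule (compare sibling operators, merge the matching ones) can be sketched on a toy operator tree. This is a simplified illustration under stated assumptions, not LODOP's code: the `Op` class, the use of a plain string signature to decide equality, and the operator chains of the two scripts are all hypothetical.

```python
class Op:
    """A logical-plan operator: a signature string plus child operators."""

    def __init__(self, signature, children=None):
        self.signature = signature
        self.children = children or []

def merge_identical_siblings(node):
    """Merge children with equal signatures, then repeat one level down."""
    merged = {}
    for child in node.children:
        if child.signature in merged:
            # Same operator on the same input: keep one copy and let it
            # adopt the duplicate's children.
            merged[child.signature].children.extend(child.children)
        else:
            merged[child.signature] = child
    node.children = list(merged.values())
    for child in node.children:
        merge_identical_siblings(child)
    return node

def count_ops(node):
    """Number of operators in the subtree rooted at `node`."""
    return 1 + sum(count_ops(child) for child in node.children)

# Two hypothetical scripts that share their first two operators.
root = Op("ROOT", [
    Op("LOAD quads", [Op("FILTER has rdf:type", [Op("GROUP: property types per class")])]),
    Op("LOAD quads", [Op("FILTER has rdf:type", [Op("GROUP: URI-literal ratio per class")])]),
])

before = count_ops(root)
merge_identical_siblings(root)
after = count_ops(root)
print(before, "->", after, "operators")
```

Merging top-down matters here: the two FILTER operators only become siblings once their LOAD parents have been merged.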
2. COMBINING FILTER OPERATORS
3. COMBINING FOREACH OPERATORS
ALL OPTIMIZATIONS
SUMMARY
• Optimizations reduce
• Number of operations
• Number of MapReduce jobs
• Data flow between operators → less HDFS I/O
→ Improved execution time
• Reduces execution time by 70%
• … but rules should not be applied in all cases
• More advanced (cost-based) approach is needed
FUTURE WORK
• Additional logical optimization rules
• Ignore projections if it allows further merging of operators
• Advanced optimization strategies
• Cost-based approach could use previous profiling results (e.g. cardinalities) → on-the-go
• Materialization of intermediate results
• Materialize common subsets, e.g. only triples with typed object values for later scripts
http://github.com/bforchhammer/lodop/
@anjeve
anja.jentzsch@hpi.uni-potsdam.de