1. Institute for Web Science and Technologies
University of Koblenz ▪ Landau, Germany
SPLENDID: SPARQL Endpoint Federation
Exploiting VOID Descriptions
Olaf Görlitz, Steffen Staab
2. Motivation
How to access a large number of linked data sources?
WeST Institute Olaf Görlitz
People and Knowledge Networks COLD 2011, Bonn, Germany Slide 2
3. Data Integration Approaches
Data Warehouse              | Link Traversal
Efficient query execution   | Live data access
Complete results            | Flexible / on demand
Data copies                 | Incomplete results
Inflexible                  | Biased by starting point
4. Our Approach
Data Federation
Live data access
Flexible source integration
Effective query planning
Complete results
Hypothesis:
Efficient query federation is possible using core Semantic
Web technology (i.e. SPARQL endpoints, VoiD descriptions)
5. VoiD: „Vocabulary of Interlinked Datasets“
General information
Basic statistics: triples = 732744
Type statistics: chebi:Compound = 50477
Predicate statistics: bio:formula = 39555
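The statistics above can feed simple cardinality estimates for triple patterns. The following is a minimal sketch, not SPLENDID's actual implementation: the `VoidStats` class and the `estimate_cardinality` rule are hypothetical names introduced for illustration, and only the three numbers come from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class VoidStats:
    """Subset of the statistics a VoiD description can expose (illustrative)."""
    triples: int
    class_counts: dict = field(default_factory=dict)      # entities per rdf:type
    predicate_counts: dict = field(default_factory=dict)  # triples per predicate


def estimate_cardinality(stats, predicate=None, rdf_type=None):
    """Estimate matches for a pattern with an unbound subject.

    Assumed rule: use the most specific statistic available and
    fall back to the total triple count when nothing is bound.
    """
    if predicate == "rdf:type" and rdf_type is not None:
        return stats.class_counts.get(rdf_type, 0)
    if predicate is not None:
        return stats.predicate_counts.get(predicate, 0)
    return stats.triples


# Numbers taken from the slide; the dataset they describe is not named there.
stats = VoidStats(
    triples=732744,
    class_counts={"chebi:Compound": 50477},
    predicate_counts={"bio:formula": 39555},
)

formula_estimate = estimate_cardinality(stats, predicate="bio:formula")
compound_estimate = estimate_cardinality(
    stats, predicate="rdf:type", rdf_type="chebi:Compound"
)
```

A predicate that is absent from the statistics yields an estimate of 0, which is also the signal a federator could use to skip the source for that pattern.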
6. Distributed Query Processing
Contribution:
Apply Best Practices of RDBMS for RDF Federation
http://code.google.com/p/rdffederator/
7. Query Example
Which drugs are categorized as micronutrients?
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
8. Query Processing
Source Selection → Join Optimization → Query Execution
SELECT ?drug ?title WHERE {
  ?drug drugbank:drugCategory category:micronutrient .
  ?drug drugbank:casRegistryNumber ?id .
  ?keggDrug rdf:type kegg:Drug .
  ?keggDrug bio2rdf:xRef ?id .
  ?keggDrug purl:title ?title .
}
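The first stage, source selection, can be sketched as a lookup of each triple pattern in an index built from VoiD predicate and type statistics, optionally refined with SPARQL ASK queries. This is a hedged illustration only: the function names, the endpoint labels (`drugbank`, `kegg`, `chebi`) and the index contents below are invented for the example and are not taken from the slides.

```python
def select_sources(patterns, predicate_index, type_index, all_sources):
    """Map every triple pattern to the set of sources that may answer it."""
    selected = {}
    for s, p, o in patterns:
        if p.startswith("?"):
            # Unbound predicate: statistics cannot prune, keep every source.
            sources = set(all_sources)
        elif p == "rdf:type" and not o.startswith("?"):
            # Bound class: use the type statistics.
            sources = set(type_index.get(o, ()))
        else:
            # Bound predicate: use the predicate statistics.
            sources = set(predicate_index.get(p, ()))
        selected[(s, p, o)] = sources
    return selected


def ask_query(pattern):
    """Build an ASK query that checks one pattern against one endpoint."""
    s, p, o = pattern
    return f"ASK {{ {s} {p} {o} }}"


# Hypothetical index contents, for illustration only.
all_sources = {"drugbank", "kegg", "chebi"}
predicate_index = {
    "drugbank:drugCategory": {"drugbank"},
    "drugbank:casRegistryNumber": {"drugbank"},
    "bio2rdf:xRef": {"kegg", "chebi"},
    "purl:title": {"kegg", "chebi"},
}
type_index = {"kegg:Drug": {"kegg"}}

patterns = [
    ("?drug", "drugbank:drugCategory", "category:micronutrient"),
    ("?drug", "drugbank:casRegistryNumber", "?id"),
    ("?keggDrug", "rdf:type", "kegg:Drug"),
    ("?keggDrug", "bio2rdf:xRef", "?id"),
    ("?keggDrug", "purl:title", "?title"),
]
selected = select_sources(patterns, predicate_index, type_index, all_sources)
```

Patterns with a bound subject or object, such as the first one, are the natural candidates for an additional ASK check, since predicate statistics alone cannot tell whether a specific value occurs at a source.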
12. Join Order Optimization
Dynamic Programming with statistics-based cost estimation
Join implementations: bind join / hash join
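Dynamic programming over join orders can be sketched as follows. This is a textbook-style illustration under stated assumptions, not SPLENDID's cost model: the cost of a plan is assumed to be the sum of its intermediate result sizes, join cardinalities are assumed to come from pairwise selectivities, and all names and numbers (`tp1`, `tp2`, `tp3`, the cardinalities and selectivities) are hypothetical.

```python
from itertools import combinations


def join_selectivity(left, right, selectivity):
    """Multiply the selectivities of all pattern pairs spanning both sides."""
    factor = 1.0
    for a in left:
        for b in right:
            # Pairs without an entry share no variable: cross product (1.0).
            factor *= selectivity.get(frozenset((a, b)), 1.0)
    return factor


def best_join_order(cards, selectivity):
    """Return (cost, plan) of the cheapest join tree over all patterns.

    cards:       pattern name -> estimated cardinality
    selectivity: frozenset({a, b}) -> estimated join selectivity
    """
    names = sorted(cards)
    # best[subset] = (cost, cardinality, plan); single patterns cost nothing.
    best = {frozenset([n]): (0.0, float(cards[n]), n) for n in names}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            # Try every split of the subset into two already solved parts.
            for k in range(1, size):
                for left_combo in combinations(combo, k):
                    left = frozenset(left_combo)
                    right = subset - left
                    l_cost, l_card, l_plan = best[left]
                    r_cost, r_card, r_plan = best[right]
                    card = l_card * r_card * join_selectivity(left, right, selectivity)
                    cost = l_cost + r_cost + card
                    if subset not in best or cost < best[subset][0]:
                        best[subset] = (cost, card, (l_plan, r_plan))
    cost, _, plan = best[frozenset(names)]
    return cost, plan


# Hypothetical estimates for three triple patterns.
cards = {"tp1": 50, "tp2": 2000, "tp3": 100000}
selectivity = {
    frozenset(("tp1", "tp2")): 1 / 2000,    # joined on ?drug
    frozenset(("tp2", "tp3")): 1 / 100000,  # joined on ?id
}
cost, plan = best_join_order(cards, selectivity)
```

With these numbers the cheapest plan joins the two small, selective patterns `tp1` and `tp2` first and only then joins the large pattern `tp3`.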
13. Evaluation
FedBench Evaluation Suite
• Life Science + Cross Domain Data
• different query characteristics
Measuring
• #data sources selected
• query execution time
Orthogonal State-of-the-Art approaches:
                     DARQ                     AliBaba       FedX                          SPLENDID
Statistics           ServiceDesc              –             –                             VoiD
Source Selection     Statistics (predicates)  All sources   ASK queries                   Statistics + ASK queries
Query Optimization   DynProg                  Heuristics    Heuristics                    DynProg
Query Execution      Bind join                Bind join     Bound Join + parallelization  Bind Join + Hash Join
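The difference between the two join operators listed for SPLENDID can be sketched in a few lines. This is a simplified in-memory illustration, not the system's code: the remote endpoint is simulated by a local function, and the bindings (`d1`, `cas-1`, `k1`, ...) are made-up placeholder values.

```python
def hash_join(left, right, var):
    """Fetch both inputs completely, then join them locally on `var`."""
    table = {}
    for row in left:
        table.setdefault(row[var], []).append(row)
    return [{**l, **r} for r in right for l in table.get(r[var], [])]


def bind_join(left, evaluate_right, var):
    """Send one request per left binding with `var` already bound."""
    results = []
    for l in left:
        for r in evaluate_right({var: l[var]}):
            results.append({**l, **r})
    return results


# Placeholder bindings standing in for results of two endpoints.
left_rows = [{"drug": "d1", "id": "cas-1"}, {"drug": "d2", "id": "cas-2"}]
right_rows = [{"keggDrug": "k1", "id": "cas-1"}, {"keggDrug": "k2", "id": "cas-3"}]

calls = []  # records every simulated remote request


def remote_endpoint(binding):
    """Simulated endpoint: returns only rows matching the bound value."""
    calls.append(binding)
    return [r for r in right_rows if r["id"] == binding["id"]]


hashed = hash_join(left_rows, right_rows, "id")
bound = bind_join(left_rows, remote_endpoint, "id")
```

Both operators return the same bindings. The trade-off is in the traffic: the hash join transfers all rows of both inputs once, while the bind join issues one request per left-hand binding but transfers only matching rows.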
14. Evaluation: Source Selection
[Chart: source selection results; labeled predicates: owl:sameAs, rdf:type]
15. Evaluation: Query Optimization
16. Conclusion
Publish more VoiD descriptions!
VoiD-based query federation is efficient
What next?
Combination with FedX
Improving estimation and cost model
Integrating SPARQL 1.1 features