SPLODGE: Systematic Generation of SPARQL Benchmark Queries for Linked Open Data

Institute for Web Science and Technologies
University of Koblenz ▪ Landau, Germany

Olaf Görlitz, Matthias Thimm, Steffen Staab

Linked Data Federation

SPARQL Queries on the Linked Data Cloud

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

ISWC'12, Boston, 11/15/2012 SPLODGE: Systematic LOD Benchmark Query Generation
Slide 2 Olaf Görlitz, Matthias Thimm, Steffen Staab

The Problem

Why not use benchmark queries?

• distributed queries
• federation implementation


RDF Benchmarks

LUBM, BSBM, SP²B, ...
• Synthetic datasets
• Domain-specific
• Highly structured
• Sophisticated queries
→ Centralized

FedBench (ISWC'11)
• 10 Linked Data sets (~170M triples)
• 25 handpicked distributed queries
→ Fixed

Scalable, Flexible, Expressive Linked Data Benchmark


Overview

• Benchmark Idea
• Methodology
• Evaluation


Linked Data Benchmark Features

• Scalability: Real Linked Data Sets
• Flexibility: Customization
• Expressiveness: Typical+Complex Queries

Systematic SPARQL Benchmark Query Generator for Linked Open Data


Requirements

What we want:

1. Define Query Characteristics → Customize Benchmark
2. Automatic Query Generation → Random Queries
3. Query Validation → #results > 0


Contribution

Methodology and toolset for systematic query generation

Linked Data + Config → Benchmark Queries

Parameterization → Query Generation → Query Validation


Overview

• Benchmark Idea
• Methodology
• Evaluation


SPLODGE Methodology

Query Parameterization → Query Generation → Query Validation

Define typical + challenging distributed queries

No federation query logs available → Analyze queries of benchmarks

SELECT ?drug ?keggUrl ?chebiImage WHERE {
  ?drug rdf:type drugbank:drugs .
  ?drug drugbank:keggCompoundId ?keggDrug .
  ?keggDrug bio2rdf:url ?keggUrl .
  ?drug drugbank:genericName ?drugBankName .
  ?chebiDrug purl:title ?drugBankName .
  ?chebiDrug chebi:image ?chebiImage . }
FedBench/LifeScience#5

SPLODGE Methodology

Query Algebra
• Query Form (Select, Construct, ...)
• Join Type (conj. / disj. / left-join)
• Result Modifiers (limit, offset, order by)

Query Structure
• Variable Patterns (s, o, s+o, ...)
• Join Patterns (star, path)
• Cross Product

Query Cardinality
• # Data Sources
• # Joins / Patterns
• # Results
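
The three parameter groups above could be collected in one configuration object. The following is a minimal sketch only: the class name `QuerySpec`, its field names, and its defaults are hypothetical and are not taken from the SPLODGE toolset. The check that the number of sources does not exceed the number of patterns follows the m ≤ n constraint stated for path-joins.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QuerySpec:
    """Hypothetical container for the three parameter groups on the slide."""

    # Query algebra
    query_form: str = "SELECT"               # Select, Construct, ...
    join_type: str = "conjunctive"           # conjunctive / disjunctive / left-join
    result_modifiers: Tuple[str, ...] = ()   # e.g. ("LIMIT", "ORDER BY")
    # Query structure
    variable_pattern: str = "s+o"            # s, o, s+o, ...
    join_pattern: str = "path"               # "path" or "star"
    allow_cross_product: bool = False
    # Query cardinality
    num_sources: int = 2                     # number of data sources (m)
    num_patterns: int = 3                    # number of triple patterns (n)
    min_results: int = 1                     # validation target: #results > 0

    def validate(self) -> None:
        """Reject parameter combinations that cannot describe a query."""
        if self.join_pattern not in ("path", "star"):
            raise ValueError("join_pattern must be 'path' or 'star'")
        if self.num_patterns < 1:
            raise ValueError("a query needs at least one triple pattern")
        if not 1 <= self.num_sources <= self.num_patterns:
            raise ValueError("need 1 <= num_sources <= num_patterns")
```

A frozen dataclass is used here so that one benchmark configuration cannot be changed by accident while a batch of queries is being generated.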


SPLODGE Methodology

Main query parameter: join structure
• path join
• star join

[Figure: FedBench queries as examples of path join and star join]


SPLODGE Methodology

Additional query parameters:
• # triple patterns
• # data sources
• result size
• ...

Path-join: n triple patterns, m sources (m ≤ n)
Star-join: n triple patterns, anchor node (s/o)
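
To make the two join shapes concrete, here is a small sketch that renders a list of bound predicates as a path-join or a star-join query. The function names and the `?anchor` variable are hypothetical. The `?varN` naming and the use of full predicate IRIs mirror the example query shown on the Evaluation Setup slide. Source selection (the m sources) is not modeled.

```python
from typing import Sequence


def path_join_query(predicates: Sequence[str]) -> str:
    """Chain patterns: the object of pattern i is the subject of pattern i+1."""
    patterns = [
        f"  ?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


def star_join_query(predicates: Sequence[str], anchor: str = "s") -> str:
    """Join all patterns on one shared anchor node (subject or object)."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"  ?anchor <{p}> ?var{i} .")  # anchor as subject
        else:
            patterns.append(f"  ?var{i} <{p}> ?anchor .")  # anchor as object
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


if __name__ == "__main__":
    preds = ["http://example.org/p1", "http://example.org/p2"]
    print(path_join_query(preds))
    print(star_join_query(preds, anchor="s"))
```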


SPLODGE Methodology

[Figure: example query graph with edges labeled owl:sameAs, rdf:type, rdfs:label, foaf:knows]

Iteratively add random triple pattern → #results > 0 ?
Need background knowledge → level of detail?
Predicate combinations → how provided?


SPLODGE Methodology

[Figure: example query graph with edges labeled owl:sameAs, rdf:type, rdfs:label, foaf:knows]

Linked Predicates (owl:sameAs → rdf:type)
• DBpedia → geonames (43, 58)
• freebase → DBpedia (86, 72)
• ...

Characteristic Sets* {rdfs:label, foaf:knows, …}
• DBpedia (322), rdfs:label (437), foaf:knows (322)
• ...

*[Neumann, Moerkotte, ICDE 2011]
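
One way to read "iteratively add random triple pattern" together with linked-predicate statistics is sketched below: a path is only extended with a predicate that is known to follow the previous one in the data. This is an illustrative assumption, not the SPLODGE implementation. The table `LINKED`, its counts, and the function `extend_path` are invented for the example.

```python
import random
from typing import Dict, List, Optional, Tuple

# Toy statistics, invented for illustration: (p1, p2) -> number of resources
# that occur as object of a p1 triple and as subject of a p2 triple.
LINKED: Dict[Tuple[str, str], int] = {
    ("owl:sameAs", "rdf:type"): 40,
    ("owl:sameAs", "rdfs:label"): 25,
    ("foaf:knows", "owl:sameAs"): 10,
    ("foaf:knows", "foaf:knows"): 30,
}


def extend_path(start: str, length: int, rng: random.Random) -> Optional[List[str]]:
    """Grow a predicate path, choosing only successors seen in the statistics."""
    path = [start]
    while len(path) < length:
        successors = sorted(
            p2 for (p1, p2), count in LINKED.items()
            if p1 == path[-1] and count > 0
        )
        if not successors:
            return None  # dead end: no known predicate follows the last one
        path.append(rng.choice(successors))
    return path


if __name__ == "__main__":
    print(extend_path("foaf:knows", 3, random.Random(42)))
```

Passing an explicit `random.Random` instance keeps a generated batch reproducible for a given seed.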

SPLODGE Methodology

[Figure: query graph with predicates p1, p2, p3, p4]

Linked Predicates: (p1 → p2) ⊗ (p2 → p3) ⊗ (p3 → pi)
Characteristic Sets: {p1, p4}, {p1, p4, ...}
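
The slide does not spell out what the ⊗ combination computes. As one possible reading, the sketch below chains pairwise linked-predicate counts into a result-size estimate for a path, under a plain independence assumption: each further join scales the estimate by links(p_i, p_i+1) / count(p_i). Both the formula and all names are assumptions made for illustration.

```python
from typing import Dict, Sequence, Tuple


def estimate_path_cardinality(
    path: Sequence[str],
    triple_count: Dict[str, int],
    link_count: Dict[Tuple[str, str], int],
) -> float:
    """Estimate the result size of a path join from pairwise statistics.

    triple_count[p]    : number of triples with predicate p
    link_count[(p, q)] : number of (p-triple, q-triple) pairs where the
                         object of the first is the subject of the second
    """
    if len(path) < 2:
        raise ValueError("a path join needs at least two predicates")
    estimate = float(link_count.get((path[0], path[1]), 0))
    for mid, nxt in zip(path[1:], path[2:]):
        if triple_count.get(mid, 0) == 0:
            return 0.0
        # Independence assumption: the fraction of mid-triples that continue
        # to nxt does not depend on how the mid-triple was reached.
        estimate *= link_count.get((mid, nxt), 0) / triple_count[mid]
    return estimate


if __name__ == "__main__":
    counts = {"p1": 100, "p2": 50, "p3": 200}
    links = {("p1", "p2"): 20, ("p2", "p3"): 10}
    print(estimate_path_cardinality(["p1", "p2", "p3"], counts, links))
```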


SPLODGE Methodology

Verify generated queries (#results > 0)

How to evaluate? → Compute confidence value

minimum join selectivity > ε
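
The slide does not define how join selectivity is computed. The sketch below assumes the textbook definition |R ⋈ S| / (|R| · |S|), evaluated on per-predicate statistics, and accepts a path query only if the smallest selectivity among its joins exceeds the threshold. Function names and the toy numbers are hypothetical.

```python
from typing import Dict, Sequence, Tuple


def join_selectivity(
    p1: str,
    p2: str,
    triple_count: Dict[str, int],
    link_count: Dict[Tuple[str, str], int],
) -> float:
    """Assumed definition: joined pairs divided by the size of the cross product."""
    denominator = triple_count.get(p1, 0) * triple_count.get(p2, 0)
    if denominator == 0:
        return 0.0
    return link_count.get((p1, p2), 0) / denominator


def passes_threshold(
    path: Sequence[str],
    triple_count: Dict[str, int],
    link_count: Dict[Tuple[str, str], int],
    epsilon: float,
) -> bool:
    """Accept the query only if every join in the path is selective enough."""
    selectivities = [
        join_selectivity(a, b, triple_count, link_count)
        for a, b in zip(path, path[1:])
    ]
    return bool(selectivities) and min(selectivities) > epsilon


if __name__ == "__main__":
    counts = {"p1": 100, "p2": 50, "p3": 200}
    links = {("p1", "p2"): 20, ("p2", "p3"): 10}
    for eps in (1e-4, 1e-2):
        print(eps, passes_threshold(["p1", "p2", "p3"], counts, links, eps))
```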


Overview

• Benchmark Idea
• Methodology
• Evaluation


Evaluation Objective

• Verify generation of valid queries (#results > 0)
• Compare variations of query generation algorithms:
  • Baseline: "random"
  • SPLODGElite: background predicate knowledge
  • SPLODGE: + minimum join selectivity (> 10⁻⁴ / 10⁻³ / 10⁻²)
• Metrics:
  • #queries with non-empty results
  • #results per query
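
Both metrics can be derived from the list of result counts of one query batch. The helper below is a hypothetical sketch: the function name, the dictionary keys, and the choice to average over non-empty queries only are assumptions, not part of the original evaluation scripts.

```python
from statistics import mean
from typing import Dict, Sequence


def summarize_batch(result_counts: Sequence[int]) -> Dict[str, float]:
    """Summarize one batch: result_counts[i] is the #results of query i."""
    non_empty = [n for n in result_counts if n > 0]
    return {
        "queries": len(result_counts),
        "non_empty_queries": len(non_empty),
        # Averaged over non-empty queries only (an assumption of this sketch).
        "avg_results_non_empty": mean(non_empty) if non_empty else 0.0,
    }


if __name__ == "__main__":
    print(summarize_batch([0, 3, 5, 0]))
```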


Evaluation Setup

• Real Linked Data: Billion Triple Challenge Dataset
• Triple Store: RDF3X
• Random queries:
  • Path-joins across data sources
  • 3-6 patterns, bound predicates
  • 100 queries per batch

SELECT * WHERE {
?var1 <http://dbpedia.org/property/description> ?var2 .
?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
}


Evaluation Results
[Plot: #queries vs. joined triple patterns]


Evaluation Results
[Plot: #results vs. joined triple patterns]


Estimated vs. actual result size

[Plot: actual result size vs. estimated result size]


Predicate Occurrence in Queries


Conclusion

SPLODGE provides
• Flexible query characterization + parameterization
• Methodology for Systematic & Scalable Query Generation
• Toolset as Open Source (http://code.google.com/p/splodge/)

Future Work:
• Create a LOD Federation Benchmark
• Interactive SPARQL query construction

Questions?


SPLODGE Evaluation Setup

BTC 2011 dataset in RDF3X
• pure triples, no context
• 160 GB repository file (14h loading, 200 GB tmp mem)


SPLODGE Pre-Processing for BTC data

• Identify common domains (e.g. jane08.lifejournal.com/home): 3.0 h [17 GB gzip]
• Replace quad context (reduce number of sources): 4.4 h
• Sort quads + remove duplicates: 8.5 h
• Build predicate/context dictionary: 1.0 h [<1 MB gzip]
• Create resource in/out-link index: 9.7 h [1.7 GB gzip]
• Create linked predicate stats, compute characteristic sets: 1.6 h
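
The last two statistics steps can be illustrated on a handful of quads held in memory. Characteristic sets follow the definition by Neumann and Moerkotte: the set of predicates emitted by each subject, counted over all subjects. The linked-predicate count is sketched here as the number of triple pairs where the object of one triple is the subject of the next, keyed by predicate and source. The toy quads, the source names, and the function names are invented, and a real run over BTC-scale data would need the sorted, index-based processing listed above rather than Python dictionaries.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, source context)

# Toy data, invented for illustration.
QUADS = [
    ("ex:a", "owl:sameAs", "ex:b", "src1"),
    ("ex:b", "rdf:type", "ex:T", "src2"),
    ("ex:b", "rdfs:label", '"b"', "src2"),
    ("ex:c", "rdf:type", "ex:T", "src2"),
]


def characteristic_sets(quads: Iterable[Quad]) -> Counter:
    """Count how many subjects share each set of outgoing predicates."""
    predicates_by_subject = defaultdict(set)
    for s, p, _o, _c in quads:
        predicates_by_subject[s].add(p)
    return Counter(frozenset(ps) for ps in predicates_by_subject.values())


def linked_predicates(quads: Iterable[Quad]) -> Counter:
    """Count (p1, source1, p2, source2) where a p1 object is a p2 subject."""
    quads = list(quads)
    out_edges = defaultdict(list)  # subject -> [(predicate, source), ...]
    for s, p, _o, c in quads:
        out_edges[s].append((p, c))
    links = Counter()
    for _s, p1, o, c1 in quads:
        for p2, c2 in out_edges.get(o, ()):
            links[(p1, c1, p2, c2)] += 1
    return links


if __name__ == "__main__":
    print(characteristic_sets(QUADS))
    print(linked_predicates(QUADS))
```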

