Institute for Web Science and Technologies
                       University of Koblenz ▪ Landau, Germany




                 Systematic Generation of
              SPARQL Benchmark Queries
                    for Linked Open Data



Olaf Görlitz, Matthias Thimm, Steffen Staab
Linked Data Federation


            SPARQL Queries on the Linked Data Cloud




                                                 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/



ISWC'12, Boston, 11/15/2012   SPLODGE: Systematic LOD Benchmark Query Generation
Slide 2                       Olaf Görlitz, Matthias Thimm, Steffen Staab
The Problem



              Why not use
              benchmark
              queries?




              distributed                                       federation
                queries                                       implementation


RDF Benchmarks


       LUBM, BSBM, SP²B, ...                                FedBench (ISWC'11)

       • Synthetic datasets                                  • 10 Linked Data sets
       • Domain-specific                                       (~170M triples)
       • Highly structured                                   • 25 handpicked
       • Sophisticated queries                                 distributed queries

                Centralized                                              Fixed



                              Scalable, Flexible, Expressive
                                Linked Data Benchmark


Overview

 Benchmark Idea
 Methodology
 Evaluation




Linked Data Benchmark Features



         Scalability                      Flexibility                      Expressiveness

  Real Linked Data Sets                  Customization               Typical+Complex Queries




       Systematic SPARQL Benchmark Query Generator
                     for Linked Open Data




Requirements



          What we want:

          1. Define Query                                   Customize Benchmark
             Characteristics
          2. Automatic Query                                Random Queries
             Generation
          3. Query Validation                               #results > 0




Contribution

                              Methodology and toolset for
                              systematic query generation


                                          Linked Data




                 Config                                                      Benchmark
                                                                              Queries




        Parameterization              Query Generation                 Query Validation



Overview

 Benchmark Idea
 Methodology
 Evaluation




SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



                Define typical + challenging distributed queries

               No federation query                                 Analyze queries
                 logs available                                    of benchmarks

                 SELECT ?drug ?keggUrl ?chebiImage WHERE {
                   ?drug rdf:type drugbank:drugs .
                   ?drug drugbank:keggCompoundId ?keggDrug .
                   ?keggDrug bio2rdf:url ?keggUrl .
                   ?drug drugbank:genericName ?drugBankName .
                   ?chebiDrug purl:title ?drugBankName .
                   ?chebiDrug chebi:image ?chebiImage . }
                FedBench/LifeScience#5
SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation




    Algebra                               Structure                          Cardinality
  • Query Form                          • Variable Patterns                 • # Data Sources
    (Select, Construct, ...)              (s, o, s+o, ...)
  • Join Type                           • Join Patterns                     • # Joins/ Patterns
    (conj. / disj. / left-join)           (star, path)
  • Result Modifiers                    • Cross Product                     • # Results
    (limit, offset, order by)
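The three parameter groups above could be captured in a single configuration record. This is only a sketch: `QueryConfig` and all of its field names are hypothetical and are not taken from the SPLODGE toolset. The one rule it checks, that a path join over n patterns spans at most n sources, comes from the path-join slide of this deck.

```python
from dataclasses import dataclass
from typing import Optional


# Hypothetical configuration record; the field names are illustrative
# and do not mirror the actual SPLODGE configuration format.
@dataclass(frozen=True)
class QueryConfig:
    # Algebra
    query_form: str = "SELECT"      # SELECT, CONSTRUCT, ...
    join_type: str = "conjunctive"  # conjunctive / disjunctive / left-join
    limit: Optional[int] = None     # result modifier
    # Structure
    join_pattern: str = "path"      # "path" or "star"
    # Cardinality
    num_patterns: int = 3
    num_sources: int = 2
    min_results: int = 1

    def validate(self) -> None:
        """Reject parameter combinations that cannot describe a query."""
        if self.join_pattern not in ("path", "star"):
            raise ValueError("join_pattern must be 'path' or 'star'")
        if self.num_patterns < 1 or self.num_sources < 1:
            raise ValueError("need at least one pattern and one source")
        if self.join_pattern == "path" and self.num_sources > self.num_patterns:
            raise ValueError("a path join over n patterns spans at most n sources")
```

A frozen dataclass keeps each benchmark configuration immutable, so a batch of generated queries can be traced back to the exact parameters that produced it.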




SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



      Main query parameter: join structure
                                                                  path join




    FedBench queries                                              star join


SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



      Additional query parameters: # triple patterns
                                   # data sources
                                   result size
                                   ...


     Path-join: n triple patterns,                      Star-join:        n triple patterns,
                m sources (m≤n)                                           anchor node (s/o)
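As a minimal sketch of the two join structures, the helpers below turn a list of predicate IRIs into a SPARQL string. The function names `path_join` and `star_join` and the variable naming scheme are assumptions made for illustration; source assignment (the m sources) is left out.

```python
from typing import List


def path_join(predicates: List[str]) -> str:
    """Chain the patterns: the object of pattern i is the subject of pattern i+1."""
    patterns = [
        f"  ?var{i} <{p}> ?var{i + 1} ."
        for i, p in enumerate(predicates, start=1)
    ]
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"


def star_join(predicates: List[str], anchor: str = "s") -> str:
    """Join all patterns on one shared anchor variable (subject or object)."""
    if anchor not in ("s", "o"):
        raise ValueError("anchor must be 's' or 'o'")
    patterns = []
    for i, p in enumerate(predicates, start=1):
        if anchor == "s":
            patterns.append(f"  ?anchor <{p}> ?var{i} .")
        else:
            patterns.append(f"  ?var{i} <{p}> ?anchor .")
    return "SELECT * WHERE {\n" + "\n".join(patterns) + "\n}"
```

Calling `path_join` with n predicates yields n chained patterns over n+1 variables, while `star_join` yields n patterns that all share `?anchor`.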




SPLODGE Methodology

                     Query                        Query                      Query
                 Parameterization               Generation                 Validation

                              [Diagram: example graph with edges labelled owl:sameAs, rdf:type, rdfs:label, foaf:knows]


        Iteratively add random triple pattern                                  #results > 0 ?

             Need background knowledge                                         level of detail?

                  Predicate combinations                                       how provided?
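The iterative step can be sketched as a random walk over predicate combinations. Everything here is an assumption for illustration: the `LinkedPredicates` table layout, the function name `generate_path`, and in particular the choice to weight successors by their co-occurrence count, which the slides do not specify.

```python
import random
from typing import Dict, List, Optional

# Hypothetical background knowledge: for each predicate, the predicates
# observed on resources it points to, with positive co-occurrence counts.
LinkedPredicates = Dict[str, Dict[str, int]]


def generate_path(stats: LinkedPredicates, length: int,
                  rng: random.Random) -> Optional[List[str]]:
    """Grow a predicate path one step at a time; return None on a dead end.

    Assumes `stats` is non-empty and `length` is at least 1.
    """
    path = [rng.choice(sorted(stats))]
    while len(path) < length:
        successors = stats.get(path[-1], {})
        if not successors:
            return None  # no known predicate can extend the path
        names = sorted(successors)
        weights = [successors[name] for name in names]
        path.append(rng.choices(names, weights=weights, k=1)[0])
    return path
```

Because only combinations present in the statistics are ever chosen, each adjacent pair of patterns is known to join on at least one resource, which is the point of using background knowledge instead of purely random predicates.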

SPLODGE Methodology

                     Query                        Query                        Query
                 Parameterization               Generation                   Validation

                              [Diagram: example graph with edges labelled owl:sameAs, rdf:type, rdfs:label, foaf:knows]


    Linked Predicates                                      Characteristics Sets*
    (owl:sameAs → rdf:type)                                {rdfs:label, foaf:knows, …}
    DBpedia → geonames (43, 58)                            DBpedia (322), rdfs:label (437)
    freebase → DBpedia (86, 72)                                               foaf:knows (322)
    ...                                                    ...
                                                           *[Neumann, Moerkotte, ICDE 2011]
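Both kinds of statistics can be derived in one pass over the triples. The sketch below is simplified: it counts subjects per characteristic set and linking resources per predicate pair, and it ignores the per-source breakdown shown on the slide. The function names are illustrative and not part of the SPLODGE toolset.

```python
from collections import Counter, defaultdict
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def characteristic_sets(triples: Iterable[Triple]) -> Counter:
    """Count how many subjects share each set of outgoing predicates."""
    preds_by_subject = defaultdict(set)
    for s, p, _ in triples:
        preds_by_subject[s].add(p)
    return Counter(frozenset(ps) for ps in preds_by_subject.values())


def linked_predicates(triples: Iterable[Triple]) -> Counter:
    """Count resources that connect an incoming predicate p1 to an outgoing p2."""
    incoming = defaultdict(set)  # resource -> predicates pointing at it
    outgoing = defaultdict(set)  # resource -> predicates leaving it
    for s, p, o in triples:
        outgoing[s].add(p)
        incoming[o].add(p)
    counts = Counter()
    for resource in incoming.keys() & outgoing.keys():
        for p1 in incoming[resource]:
            for p2 in outgoing[resource]:
                counts[(p1, p2)] += 1
    return counts
```

Characteristic sets describe star-shaped combinations around one subject, while linked predicates describe path-shaped combinations across one resource, so the two statistics serve the two join structures respectively.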
SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation




                               p1               p2              p3

                              p4


    Linked Predicates                                   Characteristics Sets

   (p1 → p2) ⊗ (p2 → p3)                                 {p1, p4}
                     ⊗ (p3 → pi )                        {p1, p4, ...}


SPLODGE Methodology

                     Query                       Query                    Query
                 Parameterization              Generation               Validation



                Verify generated queries (#results >0)

                How to evaluate?                                      Compute
                                                                   confidence value


                              minimum join selectivity > ε
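The threshold check can be sketched with the textbook definition of join selectivity, the join result size divided by the size of the cross product. Whether SPLODGE computes selectivity exactly this way is an assumption; the `JoinStats` tuple and both function names are hypothetical, and the counts are taken as given inputs.

```python
from typing import Sequence, Tuple

# Each join is described by (|R|, |S|, |R join S|). In practice the numbers
# would come from pre-computed statistics; here they are plain inputs.
JoinStats = Tuple[int, int, int]


def join_selectivity(left: int, right: int, joined: int) -> float:
    """Textbook join selectivity: join result size over cross-product size."""
    if left == 0 or right == 0:
        return 0.0
    return joined / (left * right)


def passes_threshold(joins: Sequence[JoinStats], epsilon: float) -> bool:
    """Accept a query only if every join's selectivity exceeds epsilon."""
    return all(join_selectivity(l, r, j) > epsilon for l, r, j in joins)
```

Requiring every join to exceed the threshold is the same as requiring the minimum selectivity to exceed it; a query with no joins passes trivially.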




Overview

 Benchmark Idea
 Methodology
 Evaluation




Evaluation Objective

 Verify generation of valid queries (#results >0)
 Compare variations of query generation algorithms


               Baseline                SPLODGElite                        SPLODGE
             “random”                    background                      + minimum
             predicate                    knowledge                     join selectivity
                                                                   (> 10^-4 / 10^-3 / 10^-2)

 Metrics:
   #queries with non-empty results
   #results per query
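Both metrics can be computed from the result counts of one query batch. This is a small sketch; the function name `summarize` and the dictionary keys are illustrative, and averaging over non-empty queries only is an assumption about how "#results per query" is reported.

```python
from statistics import mean
from typing import Dict, List


def summarize(result_counts: List[int]) -> Dict[str, float]:
    """Summarize a batch: how many queries returned results, and how many each."""
    non_empty = [count for count in result_counts if count > 0]
    return {
        "queries": len(result_counts),
        "non_empty_queries": len(non_empty),
        # Averaged over non-empty queries only (an assumption of this sketch).
        "mean_results_non_empty": mean(non_empty) if non_empty else 0.0,
    }
```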


Evaluation Setup

 Real Linked Data                                  Billion Triple Challenge Dataset
 Random queries
 Triple Store                                      • Path-joins across data sources
                                                    • 3-6 patterns, bound predicates
                                                    • 100 queries per batch
                                  RDF3X



    SELECT * WHERE {
       ?var1 <http://dbpedia.org/property/description> ?var2 .
       ?var2 <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?var3 .
       ?var3 <http://www.w3.org/2002/07/owl#disjointWith> ?var4 .
       ?var4 <http://www.w3.org/2002/07/owl#disjointWith> ?var5 .
       ?var5 <http://semantic-mediawiki.org/swivt/1.0#wikiPageModificationDate> ?var6
    }


Evaluation Results
[Plot: #queries (y-axis) against joined triple patterns (x-axis)]


Evaluation Results
[Plot: #results (y-axis) against joined triple patterns (x-axis)]


Estimated vs. actual result size
[Plot: actual result size (y-axis) against estimated result size (x-axis)]


Predicate Occurrence in Queries




Conclusion

SPLODGE provides
 Flexible query characterization + parameterization
 Methodology for Systematic & Scalable Query Generation
 Toolset as Open Source (http://code.google.com/p/splodge/)

Future Work:
 Create a LOD Federation Benchmark
 Interactive SPARQL query construction


                                    Questions?


SPLODGE Evaluation Setup

BTC 2011 dataset in RDF3X
 pure triples, no context
 160 GB repository file
  (14h loading, 200 GB tmp mem)




SPLODGE Pre-Processing for BTC data


Step                                                            Data size      Time
Identify common domains (e.g. jane08.lifejournal.com/home)      17 GB gzip     3.0 h
Replace quad context (reduce number of sources)                                4.4 h
Sort quads + remove duplicates                                                 8.5 h
Build predicate/context dictionary                              <1 MB gzip     1.0 h
Create resource in/out-link index                               1.7 GB gzip    9.7 h
Create linked predicate stats / Compute characteristic sets                    1.6 h
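The context replacement and deduplication steps might look like the sketch below. It assumes that a "common domain" is simply the host name of the context URI, which the slides do not confirm, and it sorts in memory, which would not scale to the full dataset, where an external sort would be needed. The type alias `Quad` and both function names are hypothetical.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlsplit

Quad = Tuple[str, str, str, str]  # (subject, predicate, object, context)


def context_to_domain(context: str) -> str:
    """Collapse a context URI to its host name; keep it unchanged if none is found."""
    host = urlsplit(context).hostname
    return host if host else context


def normalize_quads(quads: Iterable[Quad]) -> List[Quad]:
    """Replace each context by its domain, drop duplicates, and sort."""
    replaced = {(s, p, o, context_to_domain(c)) for s, p, o, c in quads}
    return sorted(replaced)
```

Mapping many document-level contexts to one domain is what reduces the number of sources, and it also turns quads that differed only in their context into duplicates that the set removes.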

