Summary Models for Routing Keywords
        to Linked Data Sources
        Thanh Tran, Lei Zhang, Rudi Studer
        AIFB Institute, KIT




    Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu   KIT – University of the State of Baden-Wuerttemberg and
1                                                            National Laboratory of the Helmholtz Association
Agenda

 Introduction
      Opportunities & challenges

      Contributions

 Problem Definition
      LOD Data

      Keyword Query Answer

      Keyword Query Routing

 Summary Models
      Keyword sets

      Element-level vs. schema-level vs.
         source-level Summary

      Validity of Results vs. complexity

 Theo. / Exp. Results

 Conclusions
Semantic Data




    - 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links
    - As of 09-2010, plus other data (e.g. LON, ontologies, RDFa), and increasing rapidly
Opportunities
         “Articles from awarded researchers at Stanford”

     Freebase contains data about people
     DBPedia contains information about awards
     DBLP contains bibliographic data

     More complex information needs
     More precise results
     More integrated results
Problems
         “Articles from awarded researchers at Stanford”


                                      Large number of unknown
                                       & irrelevant sources!
                                                What is in there?
                                                What is relevant?




    Formulating queries is a hard task! (USABILITY)
    • Which data sources?
    • Which schema elements?

    Processing queries is expensive! (SCALABILITY)
    • Process against all data sources?

    (z). ∃x, y. prizes(x, Turing Award) ∧ worksAt(x, y) ∧ name(y, Stanford) ∧ publication(x, z)


Keyword Query Routing

     Given the needs expressed as sets of keywords,
               are there “corresponding answers” in linked data?
               and what combination of data sources can be used to
                produce them?



                  Identify valid combinations of sources using keywords
                  Present schema elements for the user to formulate the query

                  Let the user choose a combination of sources
                  Process only relevant combinations of sources




Contributions
     Introduce the novel problem of keyword query routing

     Propose the multi-level relationship graph to capture its
      search space.
     Introduce various summary models, which aim to
      compactly represent the search space.

     Investigate the resulting trade-offs between result quality
      and efficiency through theoretical analysis and practical
      experiments using publicly available linked data sources.




LOD Element-level Graph
     Web data is modeled as a set of interlinked data graphs
     Each data graph represents a source
     Element-level graph vs. schema-level graph vs. source-level graph

     [Figure: element-level graph over the sources Freebase, DBLP and DBPedia. Entity nodes: uni1, pub1, pub2, pub3, per1, per2, per3, per4, prize1, prize2. Edge labels: employ, author, prizes, sameAs, name, title, label. Literal values include "Stanford University", "John McCarthy", "John Smith", "Turing Award" and "Music Award".]
LOD Schema-level Graph
  Web data is modeled as a set of interlinked data graphs
  Each data graph represents a source
  Element-level graph vs. schema-level graph vs. source-level graph

  [Figure: schema-level graph over the sources Freebase, DBLP and DBPedia. Class nodes: University, Written Work, Article, Person, Author, Prize. Edge labels: employ, author, prizes, sameAs.]
LOD Source-level Graph
 Web data is modeled as a set of interlinked data graphs
 Each data graph represents a source
 Element-level graph vs. schema-level graph vs. source-level graph

 [Figure: source-level graph with the nodes Freebase, DBLP and DBPedia. Edge labels: author, sameAs.]
“Corresponding” Answers
          User information need: “stanford article award”

          [Figure: element-level graph over Freebase, DBLP and DBPedia (entities uni1, pub1, pub2, pub3, per1, per2, per3, per4, prize1, prize2; edges employ, author, prizes, sameAs, name, title, label, type), including an Article type node.]
Problem Definition

      A keyword query result (also called a Steiner graph) is a
       subgraph of the union of the data- and schema-level graphs
       that contains a matching element for every keyword, where
       these keyword elements are pairwise connected by a path.

      A d-max Steiner graph is a Steiner graph in which every path
       between keyword elements has length d-max or less.

      Keyword query routing: compute valid sets of data sources,
       called keyword routing plans. A plan is valid if its sources
       produce non-empty keyword query results.
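
The d-max condition can be illustrated with a small sketch. It is not the algorithm from the paper: it is a brute-force check that, under the assumption of an undirected, unlabeled graph, tests whether one matching element per keyword can be picked so that all picked elements are pairwise within d-max hops. The toy edges and keyword matches are hypothetical and only loosely follow the running example; the function names are invented for illustration.

```python
from collections import deque
from itertools import product

def bfs_distances(adj, start):
    """Hop distances from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def has_dmax_answer(edges, matches, dmax):
    """True if one matching element per keyword can be chosen so that
    all chosen elements are pairwise within dmax hops of each other."""
    adj = {}
    for a, b in edges:  # assumption: edges are treated as undirected
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {node: bfs_distances(adj, node) for node in adj}

    def hops(a, b):
        return 0 if a == b else dist.get(a, {}).get(b, float("inf"))

    for combo in product(*matches.values()):
        if all(hops(a, b) <= dmax for a in combo for b in combo):
            return True
    return False

# Hypothetical toy data: edge labels and directions are dropped.
EDGES = [("uni1", "per2"), ("per2", "per1"), ("pub1", "per1"),
         ("per1", "per3"), ("per3", "prize1"), ("per4", "prize2")]
MATCHES = {"stanford": ["uni1"], "article": ["pub1"],
           "award": ["prize1", "prize2"]}

print(has_dmax_answer(EDGES, MATCHES, 4))
print(has_dmax_answer(EDGES, MATCHES, 3))
```

In this toy graph uni1 and prize1 are four hops apart, so an answer exists for d-max = 4 but not for d-max = 3.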


A Valid Keyword Routing Plan
          User information need: “stanford article award”

          [Figure: element-level graph over Freebase, DBLP and DBPedia (entities uni1, pub1, pub2, pub3, per1, per2, per3, per4, prize1, prize2; edges employ, author, prizes, sameAs, name, title, label, type), including an Article type node.]
The Search Space
           Multi-level inter-relationship graphs capture the entire search space:
           relationships between elements and between different levels

      Search space is too large!
      Naïve solution is not applicable: applying existing keyword
       search approaches to compute Steiner graphs
         Steiner graphs might span several linked sources
         Search space grows exponentially with the number of
          sources and their associated links




Keyword Sets
      One keyword set for every data source
      Elements stand for distinct keywords mentioned in a source


      [Figure: element-level graph over Freebase, DBLP and DBPedia in which the literal values are split into single keywords, e.g. Stanford, University, John, McCarthy, Smith, Turing, Music, Award.]
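
A minimal sketch of the keyword-set (KS) model, assuming a simple whitespace tokenizer and made-up labels per source. A combination of sources is returned when it covers all query keywords and every source in it contributes at least one keyword; that selection rule is an assumption for illustration, not taken from the slides. Relationships between keywords are ignored entirely at this level.

```python
from itertools import combinations

def build_keyword_sets(source_labels):
    """One set of lower-cased keywords per source."""
    return {
        source: {token.lower() for label in labels for token in label.split()}
        for source, labels in source_labels.items()
    }

def ks_plans(keyword_sets, query):
    """Source combinations that jointly cover every query keyword."""
    wanted = {keyword.lower() for keyword in query}
    sources = sorted(keyword_sets)
    plans = []
    for size in range(1, len(sources) + 1):
        for combo in combinations(sources, size):
            covered = set().union(*(keyword_sets[s] for s in combo))
            contributes = all(keyword_sets[s] & wanted for s in combo)
            if wanted <= covered and contributes:
                plans.append(set(combo))
    return plans

# Hypothetical labels per source.
SOURCE_LABELS = {
    "Freebase": ["Stanford University", "John McCarthy"],
    "DBLP": ["John McCarthy", "John Smith", "Article"],
    "DBPedia": ["John McCarthy", "Turing Award", "Music Award"],
}

keyword_sets = build_keyword_sets(SOURCE_LABELS)
plans = ks_plans(keyword_sets, ["stanford", "article", "award"])
print(plans)
```

Because the model only records which keywords occur in which source, it cannot tell whether the matching elements are actually connected.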
Element-level Keyword-Element Relationship Graph (E-KERG)
           A keyword-element captures a keyword k and the data element mentioning k
           A relationship between two keyword-elements exists iff there is a path between
            their associated data elements
           In a d-max KERG, only paths of length d-max or less are considered
           [Figure: element-level graph over Freebase, DBLP and DBPedia with keyword-elements, i.e. single keywords (e.g. Stanford, University, John, McCarthy, Smith, Turing, Music, Award) attached to the data elements uni1, per1, per2, per3, per4, pub4, prize1, prize2 that mention them.]
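
The construction of a d-max E-KERG can be sketched as follows. This is a simplified illustration, assuming an undirected graph and a precomputed map from data elements to the keywords they mention; the function names and the toy data are hypothetical.

```python
from collections import deque

def within_dmax(adj, start, dmax):
    """Nodes reachable from start in at most dmax hops (start included)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == dmax:
            continue  # do not expand beyond dmax hops
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return set(dist)

def build_e_kerg(edges, element_keywords, dmax):
    """Return (keyword-elements, relationships) of a d-max E-KERG."""
    adj = {}
    for a, b in edges:  # assumption: edges are treated as undirected
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = {(kw, el) for el, kws in element_keywords.items() for kw in kws}
    rels = set()
    for el, kws in element_keywords.items():
        for other in within_dmax(adj, el, dmax):
            for kw in kws:
                for kw2 in element_keywords.get(other, ()):
                    if (kw, el) != (kw2, other):
                        rels.add(frozenset({(kw, el), (kw2, other)}))
    return nodes, rels

# Hypothetical toy data.
EDGES = [("uni1", "per2"), ("per2", "per1"), ("per1", "per3"), ("per3", "prize1")]
ELEMENT_KEYWORDS = {
    "uni1": {"stanford", "university"},
    "per2": {"john", "mccarthy"},
    "per1": {"john", "mccarthy"},
    "per3": {"john", "mccarthy"},
    "prize1": {"turing", "award"},
}

nodes, rels_d1 = build_e_kerg(EDGES, ELEMENT_KEYWORDS, 1)
_, rels_d4 = build_e_kerg(EDGES, ELEMENT_KEYWORDS, 4)
print(len(nodes), len(rels_d1), len(rels_d4))
```

With d-max = 1 only keyword-elements on the same or on adjacent data elements are related; raising d-max to 4 also relates (stanford, uni1) and (award, prize1).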
Schema-level Keyword-Element Relationship Graph (S-KERG)
           A keyword-element captures a keyword k and the schema element that has
            some instances (data elements) mentioning k
           A relationship between two keyword-elements exists if there is a path between some
            instances of their associated schema elements
           Groups elements (relationships) that capture the same pair of keywords in the
            same class (the same keyword relationships between the same pair of classes)
           [Figure: element-level graph over Freebase, DBLP and DBPedia with keyword-elements grouped by class: University (uni1), Person (per2, per4), Author (per1), Article (pub4), Prize (prize2).]
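
The grouping step can be sketched as an aggregation over an existing E-KERG. The sketch assumes the E-KERG is given as plain sets of (keyword, element) pairs and that each element has exactly one class; the class assignments and the function name are hypothetical.

```python
def aggregate_kerg(e_nodes, e_rels, group_of):
    """Map each (keyword, element) to (keyword, group) and merge duplicates."""
    nodes = {(kw, group_of[el]) for kw, el in e_nodes}
    rels = set()
    for (kw1, el1), (kw2, el2) in e_rels:
        a, b = (kw1, group_of[el1]), (kw2, group_of[el2])
        if a != b:  # relationships collapsing onto one node are dropped
            rels.add(frozenset({a, b}))
    return nodes, rels

# Hypothetical element-level input and class assignment.
E_NODES = {
    ("john", "per3"), ("mccarthy", "per3"),
    ("john", "per4"), ("smith", "per4"),
    ("turing", "prize1"), ("award", "prize1"),
    ("music", "prize2"), ("award", "prize2"),
}
E_RELS = [
    (("john", "per3"), ("award", "prize1")),
    (("mccarthy", "per3"), ("award", "prize1")),
    (("john", "per4"), ("award", "prize2")),
    (("smith", "per4"), ("award", "prize2")),
]
CLASS_OF = {"per3": "Person", "per4": "Person",
            "prize1": "Prize", "prize2": "Prize"}

s_nodes, s_rels = aggregate_kerg(E_NODES, E_RELS, CLASS_OF)
print(len(E_NODES), "->", len(s_nodes), "keyword-elements")
print(len(E_RELS), "->", len(s_rels), "relationships")
```

Here the two (john, ...) keyword-elements collapse into (john, Person), and the two john-award relationships collapse into one, which is where the summary becomes smaller than the element-level graph.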
Data-Source-level Keyword-Element Relationship Graph (D-KERG)
           A keyword-element captures a keyword k and the source that contains some
            instances (data elements) mentioning k
           A relationship between two keyword-elements exists if there is a path between some
            instances of their associated sources
           Groups elements (relationships) that capture the same pair of keywords in the
            same source (the same keyword relationships between the same pair of sources)
           [Figure: element-level graph over Freebase, DBLP and DBPedia with keyword-elements grouped by data source.]
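
A rough sketch of how routing plans might be read off a D-KERG. It assumes element-level relationships are already computed for some d-max, lifts them to the source level, and then accepts a combination of keyword-elements (one per query keyword) when all of them are pairwise related. This pairwise rule is a simplification chosen for illustration, not the plan computation from the paper; all names and data are hypothetical.

```python
from itertools import combinations, product

def to_d_kerg(e_rels, source_of):
    """Lift element-level relationships to (keyword, source) pairs."""
    return {
        frozenset({(kw1, source_of[el1]), (kw2, source_of[el2])})
        for (kw1, el1), (kw2, el2) in e_rels
    }

def routing_plans(d_rels, query):
    """Sets of sources whose keyword-elements are pairwise related."""
    candidates = {kw: set() for kw in query}
    for rel in d_rels:
        for kw, src in rel:
            if kw in candidates:
                candidates[kw].add((kw, src))
    plans = set()
    for combo in product(*(sorted(candidates[kw]) for kw in query)):
        if all(frozenset({a, b}) in d_rels for a, b in combinations(combo, 2)):
            plans.add(frozenset(src for _, src in combo))
    return plans

# Hypothetical element-level relationships (e.g. for d-max = 4).
E_RELS = [
    (("stanford", "uni1"), ("article", "pub1")),
    (("article", "pub1"), ("award", "prize1")),
    (("stanford", "uni1"), ("award", "prize1")),
]
SOURCE_OF = {"uni1": "Freebase", "pub1": "DBLP", "prize1": "DBPedia"}

d_rels = to_d_kerg(E_RELS, SOURCE_OF)
plans = routing_plans(d_rels, ["stanford", "article", "award"])
print(plans)
```

For the toy input the only plan is the combination of all three sources; a query containing a keyword that no relationship mentions yields no plan.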
Theoretical Results
      If Steiner graphs can be found for a keyword query K in the
       data, then a corresponding keyword routing plan can be
       found in the KERG.
      Keyword routing plans derived from the summary are not
       necessarily valid, i.e., there might be no corresponding
       Steiner graph in the data.
      Detailed results, algorithms, and complexity results are in
       the paper.



Experiments

      Chunk of the BTC dataset containing 10M RDF
       triples from 154 sources, linked via 500K mappings

      30 manually crafted, valid multi-source keyword queries,
       i.e., queries that produce non-empty keyword answers and
       involve more than 2 sources, for example:
               Town River America
               Beijing Conference Database 2007




Validity
                P@k measures the percentage of valid plans among the top-k plans
                P@5 reaches up to 100% for E-KERG (dmax = 4), but only 6% for KS
                More valid plans were computed with higher values of dmax
                dmax = 3 seems to be a good trade-off
                Queries with a larger number of keywords resulted in lower precision


                [Figure: two plots of P@5 (y-axis, 0.0 to 1.0) for E-KERG, D-KERG, S-KERG and KS. Left: P@5 vs. dmax (0 to 4). Right: P@5 vs. number of keywords |K| (2 to 5).]
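
A small sketch of the P@k measure as described above, assuming plans are represented as sets of source names and that validity has been determined separately. The plan data is made up, and dividing by the number of plans actually returned (when fewer than k exist) is a choice made here, not something the slides specify.

```python
def precision_at_k(ranked_plans, valid_plans, k):
    """Fraction of the top-k ranked plans that are valid."""
    top = ranked_plans[:k]
    if not top:
        return 0.0
    valid = {frozenset(plan) for plan in valid_plans}
    hits = sum(1 for plan in top if frozenset(plan) in valid)
    return hits / len(top)

# Hypothetical ranking of plans over three sources A, B, C.
RANKED = [{"A", "B"}, {"A", "C"}, {"B", "C"}, {"A"}, {"B"}]
VALID = [{"A", "B"}, {"B", "C"}]

print(precision_at_k(RANKED, VALID, 5))
print(precision_at_k(RANKED, VALID, 1))
```

Two of the five ranked plans are valid, so P@5 is 0.4, while P@1 is 1.0 because the top-ranked plan is valid.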
Performance
                Times increased with higher values of dmax
                                       Sharply for E-KERG and S-KERG
                                       Relatively stable for D-KERG
                Times increased with the number of keywords
                                       All models except D-KERG performed poorly on complex queries
                                       E-KERG needed more than 100 s for queries with more than 2 keywords
                Time for D-KERG was no more than 10 ms on average

           [Figure: two log-scale plots of query processing time (ms, 1 to 1,000,000) for S-KERG, D-KERG, KS and E-KERG; left: against dmax (0 to 4); right: against the number of keywords |K| (2 to 5)]
Conclusions

      Keyword query routing helps users without knowledge of linked data
       and schemas to find combinations of sources that contain answers
       corresponding to their needs
      Summarizing relationships is essential for dealing with the large-scale
       linked data Web (E-KERG performed poorly, requiring more
       than 100 s for complex queries)
      Summarizing at the level of sources (D-KERG) represents the most
       practical trade-off: it produces results in less than 10 ms, of which
       every second one was valid
      However, validity is still low for complex queries (<30% with 4 keywords)

      These are baseline approaches for a novel problem
      Further improve validity and consider relevance!
      Combine keyword query routing with source and structured query
       processing to compute final results!
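To make the role of relationships concrete, here is a minimal sketch of routing with keyword sets only (one set of distinct keywords per source, as in the KS model). How KS actually assembles and ranks plans is not given here, so the combination step, the function name, and the lowercased example keyword sets are all assumptions for illustration.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set


def candidate_plans(keyword_sets: Dict[str, Set[str]], query: List[str]) -> Set[FrozenSet[str]]:
    """Combine, for every query keyword, one source that mentions it."""
    per_keyword = []
    for keyword in query:
        sources = [s for s, kws in keyword_sets.items() if keyword in kws]
        if not sources:
            return set()  # some keyword is mentioned by no source
        per_keyword.append(sources)
    # One source choice per keyword; duplicates collapse into a set of sources.
    return {frozenset(choice) for choice in product(*per_keyword)}


# Hypothetical keyword sets, loosely based on the running example.
keyword_sets = {
    "Freebase": {"stanford", "university", "john", "mccarthy"},
    "DBLP": {"article", "john", "mccarthy"},
    "DBPedia": {"turing", "award", "john", "mccarthy"},
}

plans = candidate_plans(keyword_sets, ["stanford", "article", "award"])
john_plans = candidate_plans(keyword_sets, ["john", "award"])
```

Plans found this way only guarantee that each keyword is mentioned somewhere in the chosen sources. They say nothing about whether the matching elements are connected, which is the information the KERG summaries add.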
Thanks for Your Attention!

                                                                  Institute AIFB, KIT

                                                               ducthanh.tran@kit.edu






Summary Models for Routing Keywords to Linked Data Sources

  • 1. Summary Models for Routing Keywords to Linked Data Sources Thanh Tran, Lei Zhang, Rudi Studer AIFB Institute, KIT Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 1 National Laboratory of the Helmholtz Association
  • 2. Agenda  Introduction  Opportunities & challenges  Contributions  Problem Definition  LOD Data  Keyword Query Answer  Keyword Query Routing  Summary Models  Keyword sets  Element-level vs. schema-level vs. source-level Summary  Validity of Results vs. complexity  Theo. / Exp. Results  2 Conclusions ducthanh.tran@kit.edu Thanh Tran, AIFB Institute, KIT, KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
  • 3. Semantic Data - 203 linked datasets serve 25 billion RDF triples interconnected by 395 million links - As of 09-2010 + other data (e.g. LON, ontologies, RDFa ) + increasing rapidly... Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 3 National Laboratory of the Helmholtz Association
  • 4. Opportunities “Articles from awarded researchers at Stanford ”  Freebase contains data about people  More complex information needs  DBPedia contains information about awards  More precise results  DBLP contains bibliographic data  More integrated results Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 4 National Laboratory of the Helmholtz Association
  • 5. Problems “Articles from awarded researchers at Stanford ”  Large number of unknown & irrelevant sources!  What is in there?  What is relevant? Formulating queries is a hard task! Processing queries is expensive! • Which data sources? USABILITY • Process against all data sources? SCALABILITY • Which schema elements? ( z). x, y.prizes(x, Turing Award) worksAt(x,y) name(y,Stanford) publication(x, z) Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 5 National Laboratory of the Helmholtz Association
  • 6. Keyword Query Routing  Given the needs expressed as sets of keywords,  are there “corresponding answers” in linked data?  and what combination of data sources can be used to produce them?  Identify valid combination of  Let user choose sources using keywords combination of sources  Present schema elements for  Process only relevant the user to formulate query combinations of sources Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 6 National Laboratory of the Helmholtz Association
  • 7. Contributions  Introduce the novel problem of keyword query routing  Propose the multi-level relationship graph to capture its search space.  Introduce various summary models, which aim to compactly represent the search space.  Investigate the resulting trade-offs between result quality and efficiency through theoretical analysis and practical experiments using publicly available linked data sources. Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 7 National Laboratory of the Helmholtz Association
  • 8. Agenda  Introduction  Opportunities & challenges  Contributions  Problem Definition  LOD Data  Keyword Query Answer  Keyword Query Routing  Summary Models  Keyword sets  Element-level vs. schema-level vs. source-level Summary  Validity of Results vs. complexity  Theo. / Exp. Results  8 Conclusions ducthanh.tran@kit.edu Thanh Tran, AIFB Institute, KIT, KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
  • 9. LOD Element-level Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Element-level graph vs. schema-level graph vs. source-level graph Freebase DBLP DBPedia … John Music John. Smith Award title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes employ author author per2 per1 per3 prize1 sameAs sameAs prizes name name name name label Stanford John John John Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 9 National Laboratory of the Helmholtz Association
  • 10. LOD Schema-level Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Element-level graph vs. schema-level graph vs. source-level graph Freebase DBLP DBPedia Written University Article Work employ author author Person Author Person Prize sameAs sameAs prizes Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 10 National Laboratory of the Helmholtz Association
  • 11. LOD Source-level Graph  Web data modeled as a set of interlinked data graphs  Each data graph represent a source  Element-level graph vs. schema-level graph vs. source-level graph Freebase DBLP DBPedia author sames sameAs Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 11 National Laboratory of the Helmholtz Association
  • 12. “Corresponding” Answers User information need „stanford article award“ Freebase DBLP DBPedia … John Music Article John. Smith Award type title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes employ author author per2 per1 per3 prize1 sameAs sameAs prizes name name name name label Stanford John John John Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 12 National Laboratory of the Helmholtz Association
  • 13. Problem Definition  Keyword query result (also called Steiner graph) is a subgraph of the union of the data- and schema-level graph that for every keyword, contains a matching element, and these elements are pairwise connected over a path.  d-max Steiner graph is a Steiner graph where paths between keyword elements is d-max or less.  Keyword query routing: compute valid set of data sources called keyword routing plan. A plan is valid if its sources produce non-empty keyword query results. Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 13 National Laboratory of the Helmholtz Association
  • 14. A Valid Keyword Routing Plan User information need „stanford article award“ Freebase DBLP DBPedia … John Music Article John. Smith Award type title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes employ author author per2 per1 per3 prize1 sameAs sameAs prizes name name name name label Stanford John John John Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 14 National Laboratory of the Helmholtz Association
  • 15. The Search Space  Multi-level inter-relationship graphs capture the entire search space  Relationships between elements  and between different levels  Search space is too large!  Naïve solution not applicable: apply existing approaches to keyword search for computing Steiner graphs  Steiner graphs might span several linked sources  Search space grow exponentially with the number of sources and their associated links Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 15 National Laboratory of the Helmholtz Association
  • 16. Agenda  Introduction  Opportunities & challenges  Contributions  Problem Definition  LOD Data  Keyword Query Answer  Keyword Query Routing  Summary Models  Keyword sets  Element-level vs. schema-level vs. source-level KERG  Validity of Results vs. complexity  Theo. / Exp. Results 16 Conclusions ducthanh.tran@kit.edu Thanh Tran, AIFB Institute, KIT, KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association
  • 17. Keyword Sets  One keyword set for every data source  Elements stand for distinct keywords mentioned in a source Freebase DBLP DBPedia … John Music Smith Music John. Smith Award title name label uni1 pub2 pub1 pub3 per4 prize2 author prizes author author per2 per1 per3 prize1 sameAs sameAs prizes employ Stanford John McCarthy John Award name name name label Stanford John John John Turing University McCarthy John McCarthy Turing University McCarthy Mccarthy McCarthy Award Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu KIT – University of the State of Baden-Wuerttemberg and 17 National Laboratory of the Helmholtz Association
  • 18. Element-level Keyword-Element Relationship Graph (E-KERG): A keyword-element captures a keyword k and the data element mentioning k. A relationship between two keyword-elements exists iff there is a path between their associated data elements. In a d-max KERG, only paths of length d-max or less are considered. [Figure: keyword-elements and their relationships shown above the example data graphs of Freebase, DBLP and DBPedia]
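
The E-KERG definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function name `build_e_kerg`, the treatment of links as undirected, and the toy graph (loosely modeled on the element names in the slides) are all hypothetical.

```python
from collections import deque

def build_e_kerg(edges, mentions, d_max):
    """Return keyword-element relationships for paths of length <= d_max.

    edges:    iterable of (u, v) links between data elements (assumed undirected)
    mentions: dict mapping a data element to the set of keywords it mentions
    """
    # Undirected adjacency list over data elements.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    def reachable(start):
        # Breadth-first search that stops expanding at depth d_max.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if dist[node] == d_max:
                continue
            for nxt in adj.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        return dist.keys()

    relationships = set()
    for elem, keywords in mentions.items():
        for other in reachable(elem):
            for k1 in keywords:
                for k2 in mentions.get(other, ()):
                    if (k1, elem) != (k2, other):
                        # A relationship is an unordered pair of keyword-elements.
                        relationships.add(frozenset([(k1, elem), (k2, other)]))
    return relationships

# Hypothetical toy graph: uni1 - per2 - per1 - pub1, and per1 - per3 - prize1.
edges = [("uni1", "per2"), ("per2", "per1"), ("per1", "pub1"),
         ("per1", "per3"), ("per3", "prize1")]
mentions = {"uni1": {"stanford"}, "pub1": {"article"}, "prize1": {"award"}}

e_kerg_3 = build_e_kerg(edges, mentions, d_max=3)
e_kerg_4 = build_e_kerg(edges, mentions, d_max=4)
```

In this toy graph the path between `uni1` and `prize1` has length 4, so the relationship between "stanford" and "award" only appears once d-max is raised from 3 to 4.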
  • 19. Schema-level Keyword-Element Relationship Graph (S-KERG): A keyword-element captures a keyword k and the schema element which contains some instances (data elements) mentioning k. A relationship between two keyword-elements exists if there is a path between some instances of their associated schema elements. Groups elements (relationships) when they capture the same pair of keywords in the same class (the same keyword relationships between the same pair of classes). [Figure: keyword-elements grouped by class (e.g. Article, Person, Prize, University, Author) shown above the example data graphs of Freebase, DBLP and DBPedia]
  • 20. Data-Source-level Keyword-Element Relationship Graph (D-KERG): A keyword-element captures a keyword k and the source which contains some instances (data elements) mentioning k. A relationship between two keyword-elements exists if there is a path between some instances of their associated sources. Groups elements (relationships) when they capture the same pair of keywords in the same source (the same keyword relationships between the same pair of sources). [Figure: keyword-elements grouped by source shown above the example data graphs of Freebase, DBLP and DBPedia]
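
The grouping step that turns element-level relationships into schema-level or source-level ones can be illustrated as follows. This is a minimal sketch under assumptions: the helper `summarize`, the element `per5`, and the class and source assignments are hypothetical and chosen only to show how relationships collapse at coarser levels.

```python
def summarize(relationships, group_of):
    """Lift element-level relationships to a coarser level.

    relationships: iterable of frozensets of (keyword, element) pairs
    group_of:      dict mapping an element to its class (schema level)
                   or to its data source (source level)
    """
    # Relationships that map to the same (keyword, group) pairs collapse into one.
    return {
        frozenset((keyword, group_of[elem]) for keyword, elem in rel)
        for rel in relationships
    }

# Hypothetical element-level relationships.
element_rels = {
    frozenset([("john", "per1"), ("article", "pub1")]),
    frozenset([("john", "per5"), ("article", "pub2")]),
    frozenset([("john", "per1"), ("john", "per3")]),  # e.g. via a sameAs link
}
class_of = {"per1": "Author", "per5": "Editor", "pub1": "Article",
            "pub2": "Article", "per3": "Person"}
source_of = {"per1": "DBLP", "per5": "DBLP", "pub1": "DBLP",
             "pub2": "DBLP", "per3": "DBPedia"}

s_kerg = summarize(element_rels, class_of)   # schema-level relationships
d_kerg = summarize(element_rels, source_of)  # source-level relationships
```

Here three element-level relationships stay three at the schema level (Author and Editor are different classes) but shrink to two at the source level, which mirrors why the source-level summary is the most compact and the least precise.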
  • 21. Agenda: Introduction (Opportunities & challenges; Contributions). Problem Definition (LOD Data; Keyword Query Answer; Keyword Query Routing). Summary Models (Keyword sets; Element-level vs. schema-level vs. source-level KERG; Validity of Results vs. complexity). Theo. / Exp. Results. Conclusions.
  • 22. Theoretical Results: When Steiner graphs can be found for K in the data, then there will be a keyword routing plan that can be found in the KERG. Keyword routing plans derived from the summary are not necessarily valid, i.e., there might be no corresponding Steiner graph in the data. Detailed results + algorithms + complexity results in the paper!
  • 23. Experiments: Chunk of the BTC dataset containing 10M RDF triples from 154 sources, linked via 500K mappings. 30 manually crafted, valid multi-data-source keyword queries, i.e., queries that produce non-empty keyword answers and involve more than 2 sources. Example queries: “Town River America”, “Beijing Conference Database 2007”.
  • 24. Validity: P@k measures the percentage of plans that are valid out of the top-k plans. P@5 up to 100% for E-KERG (dmax = 4), P@5 for KS only 6%. More valid plans were computed when a higher value was used for dmax; dmax = 3 seems to be a good tradeoff. Queries with a larger number of keywords resulted in lower precision. [Figure: P@5 of E-KERG, S-KERG, D-KERG and KS, plotted against dmax (0 to 4) and against the number of keywords |K| (2 to 5)]
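
As a small worked example of the P@k measure defined above, the following sketch computes the fraction of valid plans among the top k. The function name, the example plans, and the choice to divide by the number of plans actually returned (when fewer than k exist) are assumptions made for illustration.

```python
def precision_at_k(ranked_plans, valid_plans, k):
    """Fraction of the top-k ranked routing plans that are valid.

    ranked_plans: list of plans (each a frozenset of source names), best first
    valid_plans:  set of plans known to correspond to a Steiner graph in the data
    """
    top = ranked_plans[:k]
    if not top:
        return 0.0
    # Assumption: if fewer than k plans were produced, divide by that number.
    return sum(1 for plan in top if plan in valid_plans) / len(top)

# Hypothetical ranking of five plans, three of which are valid.
ranked = [
    frozenset({"Freebase", "DBLP", "DBPedia"}),
    frozenset({"Freebase", "DBLP"}),
    frozenset({"DBLP", "DBPedia"}),
    frozenset({"Freebase"}),
    frozenset({"DBPedia"}),
]
valid = {ranked[0], ranked[2], ranked[4]}

p_at_5 = precision_at_k(ranked, valid, k=5)
```

With three valid plans among the top five, `p_at_5` is 0.6.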
  • 25. Performance: Times increased with higher values for dmax, sharply for E-KERG and S-KERG, while remaining relatively stable for D-KERG. Times increase with the number of keywords. All models except D-KERG performed poorly on complex queries. E-KERG needed more than 100s for queries with more than 2 keywords. Time for D-KERG was no more than 10ms on average. [Figure: query processing time (ms, logarithmic scale from 1 to 1,000,000) of S-KERG, D-KERG, KS and E-KERG, plotted against dmax (0 to 4) and against the number of keywords |K| (2 to 5)]
  • 26. Conclusions: Keyword query routing helps users without knowledge of linked data and schemas to find combinations of sources that contain answers corresponding to their needs. Summarizing relationships is essential for dealing with the large-scale linked data Web (E-KERG achieved poor performance, requiring more than 100s for complex queries). Summarizing at the level of sources (D-KERG) represents the most practical trade-off: it produces results in less than 10ms, every second of which was valid. However, validity is still low for complex queries (<30% for 4 keywords). Baseline approaches for a novel problem. Further improve validity and consider relevance! Combine keyword query routing with source and structured query processing to compute final results!
  • 27. Thanks for Your Attention! Institute AIFB, KIT, ducthanh.tran@kit.edu

Editor's notes

  1. More complex information needs; more precise results; more integrated results.
  2. So far, these requirements have proven to be a large burden. Given that the amount of linked data is large and continuously evolving, it is inherently difficult to know what is in there (i.e., the data and the schema) and to formulate the corresponding structured queries for addressing some given information needs. Hence, it is desirable to have a mechanism that allows users to express information needs in their own words. Another aspect of dealing with the large Web of linked data is scalability. Processing the needs against the entire Web might be too time consuming and unnecessary, especially when users are interested in and want to choose some particular sources of information. Processing against a relevant subset of linked data identified by the user is more scalable and possibly the only practical solution for the large Web of linked data.
  3. (Rank combination of sources) (Automatically process relevant combination of sources) Concerning these problems, the question we deal with is: given the needs expressed by users as sets of keywords, are there corresponding answers in linked data, and what combination of data sources shall be used to produce them? Further, the aim is not to directly compute results but to quickly identify, and let users and system focus on, the combinations of sources that produce non-empty results. They recognized the fact that the computational complexity resulting from a large-scale setting can be partially addressed when allowing users to choose and retrieve answers from only some particular databases. Given a set of keywords, the goal is to find and rank the single most relevant databases that contain the answers. Following this line, we propose specific solutions for the linked data context. The differ-
  4. This novel keyword query routing problem raises additional challenges. Most notably, query keywords may be covered by several linked sources, resulting in a large search space. The size of this search space grows exponentially with the number of sources and their associated links. Targeting this problem of scale, we report the following contributions in this paper: - We propose solutions for keyword query routing which enable the exploitation of linked data. Without putting any burden on the users, this kind of approach helps to find relevant sources containing complex answers to ad-hoc information needs in the large and evolving Web of linked data. - We propose a multi-level relationship graph to capture the search space of the keyword query routing problem. Based on this, we elaborate on a family of summary models, which compactly represent the Web of linked data. These models capture information at different levels, representing summaries of different granularities. In a theoretical analysis, we prove that finer-grained models can improve the result quality. This, however, comes at the expense of higher complexity. Thus, the models represent different trade-offs between effectiveness and efficiency. - In the experiments, we investigate these trade-offs by analyzing the precision and the processing time needed using different models. The experiments were carried out in a real-world setting using more than 150 publicly available datasets, and an open-source implementation we made available at http://code.google.com/p/rdfstores/. Results of using summaries are promising. While the “best” one shall be determined w.r.t. a concrete application, there is one model that seems to represent the most practical trade-off: the D-KERG model, which summarizes elements according to sources, produces results in less than 10ms, out of which every second is a valid one.
  5. Linked data can be conceived as a set of data graphs, each representing a particular source. As a working definition, we present a simple graph-based model of linked data called the Web graph. In that model, we distinguish between - the Web data graph, representing relationships between individual data elements, - the Web schema graph, which captures information about groups of elements, and - the Web source graph, which contains information at the level of data sources. This is a simple model of linked data that omits details not necessary for this work. In particular, data elements may correspond to RDF resources, blank nodes or literals. Schema elements might stand for classes or data types. For keyword query routing, these distinctions are not relevant; what matters is that the elements can be recognized via their labels. While different kinds of links can be established, the ones frequently found are sameAs links, which denote that two RDF resources or two classes are the same. There is also no need to distinguish the types of links. Only the fact that sources can be reached via some kind of link m ∈ M matters.
  8. A valid plan in our example is RP = {Freebase, DBLP, DBPedia}. Note that validity does not imply relevance. That is, a valid plan ensures that results can be produced, but for the users, these results may differ in relevance. A proper account of relevance and the ranking of routing plans based on the relevance of their results go beyond the scope of this paper, which is focused on efficiency aspects of computing valid plans. We assume a fixed ranking function, which equally applies to all summaries discussed in this paper. We refer the interested readers to our report [8], which discusses relevance and the ranking function. Does not consider RELEVANCE, focus on EFFICIENCY.
  9. - Keywords map against elements of the entire data web. - Routing simply based on coverage. - Consider further factors for data source identification, i.e. characteristics of the data, the data sources and links between them. - Keyword query routing: keyword routing in a truly distributed setting such that several data sources might be used to answer a set of keywords. Only the highly relevant data sources are selected to answer the user query.
  10. Elements stand for all the keywords that are mentioned in elements of the graphs G. Every n^KS_k ∈ N^KS_K is in fact a tuple (k, G_k) that represents a keyword k and the graphs G_k ⊆ G mentioning k.
  12. As opposed to E-KERG, this one is indeed a summary model because it clusters two element-level relationships (⟨k_i, n^K_i(n_i, g_i, K_i)⟩, ⟨k_j, n^K_j(n_j, g_j, K_j)⟩) and (⟨k_v, n^K_v(n_v, g_v, K_v)⟩, ⟨k_w, n^K_w(n_w, g_w, K_w)⟩) to one schema-level relationship when they capture the same keyword relationships (i.e., k_i = k_v and k_j = k_w) between the same classes (i.e., n'_i = n'_v and n'_j =
  14. Intuitively speaking, this procedure simply retrieves sources that cover the keywords and, in order to cover all |K| query keywords, it uses |K|-combinations of these sources as routing plans.
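
The keyword-set (KS) routing procedure described above might look roughly like the following sketch. It reads "|K|-combinations" as picking one covering source per query keyword, which is an interpretation rather than something the notes state explicitly; the function name `ks_routing_plans` and the keyword sets are hypothetical.

```python
from itertools import product

def ks_routing_plans(keyword_sets, query):
    """Enumerate routing plans from keyword sets alone.

    keyword_sets: dict mapping a source name to the set of keywords it mentions
    query:        list of query keywords K
    """
    covering = []
    for keyword in query:
        sources = [s for s, kws in keyword_sets.items() if keyword in kws]
        if not sources:
            return set()  # some keyword is not covered by any source
        covering.append(sources)
    # One covering source per keyword; a plan is the resulting set of sources.
    return {frozenset(choice) for choice in product(*covering)}

# Hypothetical keyword sets, loosely following the running example.
keyword_sets = {
    "Freebase": {"stanford", "university", "john", "mccarthy"},
    "DBLP": {"article", "john", "mccarthy"},
    "DBPedia": {"award", "turing", "john", "mccarthy"},
}

plans = ks_routing_plans(keyword_sets, ["stanford", "article", "award"])
```

Because only coverage is checked, nothing here guarantees that the elements matching the keywords are actually connected, so a plan produced this way need not correspond to a Steiner graph in the data.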
  15. Valid plans (D-KERG) ≤ valid plans (S-KERG) ≤ valid plans (E-KERG). All plans are valid for E-KERG when d-max (summary) ≥ d-max (Steiner graph). This procedure is the same for all KERGs. Given that the underlying data contain results, we provide proofs in the report [8] to show that applying this procedure on the S-KERG summary will yield routing plans, i.e., when Steiner graphs can be found for K in the data, then there will be corresponding graphs that can be found in the summary. Thus, given K, the procedure will output a non-empty set of RP if W contains a result for K. In the same manner, it is straightforward to show that E-KERG and D-KERG can provide this guarantee. However, we show formally in [8] that the converse is not true, i.e., the graphs derived from the summary are not necessarily valid, such that there might be no corresponding Steiner graph in the data. Thus, the fact that a routing plan can be derived from the summaries does not guarantee that there exists a result for K. This formal result is interesting because it makes clear that while the … In summary, the percentage of valid plans for D-KERG is less than or equal to that for S-KERG, which in turn is less than or equal to that for E-KERG. When the d^sum_max value of E-KERG is sufficiently large to cover all paths relevant for Steiner graph computation, i.e., d^sum_max = d^data_max, this percentage is 100 for E-KERG. By chance, the percentage of valid plans for KS might be higher than that for the summary models, but in general it is expected to be lower (because relationships between elements are not considered). Compared to the KERG models, KS does not capture relationships between keywords at all. Given two keywords k_i, k_j, the sources which cover these keywords can be derived from KS, e.g. the graphs n''_i, n''_j. However, this does not imply that there exist two elements n_i ∈ n''_i and n_j ∈ n''_j with n_i → n_j. More generally, a combination of sources derived from KS covers all keywords but does not ensure that elements matching these keywords are connected, and thus does not necessarily correspond to a Steiner graph.
  16. … values represent the average computed for all 30 queries. Using E-KERG, precision was up to 100 percent, i.e., for d^sum_max = d^data_max = 4. With P@5 always above 0.6 when dmax > 1, S-KERG and D-KERG also achieved relatively good results. P@5 for KS was only 6%. Clearly, dmax had a positive effect: more valid plans were computed when a higher value was used for dmax. However, using dmax = 4 instead of 3 did not yield a clear improvement. Fig. 4b shows the effect of query length |K|. Quite clearly, queries with a larger number of keywords resulted in lower precision. It dropped as low as 0.23 when using D-KERG for queries with 5 keywords. KS is the model that produces only very few valid plans. This result was improved by one order of magnitude when relationships between keywords were used. The more fine-grained a model captures the relationships, the larger was the percentage of valid plans. Even a summary at the level of sources produced reasonably high quality results, i.e., every second plan was a valid one.
  17. Performance is measured as the average response time for computing routing plans. Fig. 5a shows the performance for queries at various settings using different values for dmax. This parameter had no effect on KS's results but clearly influenced the performance achieved with KERG summaries. Times increased with higher values for dmax. While this increase was sharp for E-KERG and S-KERG, the time performance of D-KERG was relatively stable. In particular, the time required by D-KERG was no more than 10ms on average. While the times shown are the actual times obtained for the other models, only the lower bound is shown for E-KERG. This is because we applied a timeout of 6min. Fig. 5c shows the exact times obtained for E-KERG and the queries that had to be aborted due to timeout. For dmax = 4, for instance, one out of every three queries was aborted. As expected, more time was needed when the number of query keywords increased, as illustrated in Fig. 5b. It seems that all models except D-KERG performed poorly on complex queries.
  18. We presented a solution to the novel problem of keyword query routing. It helps users without knowledge of the evolving linked data and schema to find combinations of sources that contain answers corresponding to their needs. This solution also partially addresses the aspect of efficiency, as queries can then be evaluated against the relevant sources identified by the user, instead of using the entire Web of linked data. We have proposed a family of summary models. Through theoretical and experimental analysis, we showed that it is important to capture keyword relationships. Compared to the KS model representing the naive baseline that stores only single keywords, the KERG models relying on relationships could produce a much