Summary Models for Routing Keywords to Linked Data Sources

Thanh Tran, Lei Zhang, Rudi Studer
AIFB Institute, KIT

Thanh Tran, AIFB Institute, KIT, ducthanh.tran@kit.edu
KIT – University of the State of Baden-Wuerttemberg and National Laboratory of the Helmholtz Association

Agenda

 Introduction
 Opportunities & challenges
 Contributions
 Problem Definition
 LOD Data
 Keyword Query Answer
 Keyword Query Routing
 Summary Models
 Keyword sets
 Element-level vs. schema-level vs. source-level Summary
 Validity of Results vs. complexity
 Theoretical / Experimental Results
 Conclusions

Semantic Data

- 203 linked datasets serve 25 billion RDF triples, interconnected by 395 million links (as of 09-2010)
- Plus other data (e.g. LON, ontologies, RDFa), and increasing rapidly

Opportunities
“Articles from awarded researchers at Stanford”

 Freebase contains data about people
 DBPedia contains information about awards
 DBLP contains bibliographic data

 More complex information needs
 More precise results
 More integrated results

Problems
“Articles from awarded researchers at Stanford”

 Large number of unknown & irrelevant sources!
 What is in there?
 What is relevant?

USABILITY: Formulating queries is a hard task!
• Which data sources?
• Which schema elements?

SCALABILITY: Processing queries is expensive!
• Process against all data sources?

Structured query for the information need, with answer variable z:
$\exists x, y.\ \mathit{prizes}(x, \text{Turing Award}) \wedge \mathit{worksAt}(x, y) \wedge \mathit{name}(y, \text{Stanford}) \wedge \mathit{publication}(x, z)$

Keyword Query Routing

 Given the needs expressed as sets of keywords,
 are there “corresponding answers” in linked data?
 and what combination of data sources can be used to
produce them?

 Identify valid combinations of sources using keywords
 Present schema elements for the user to formulate queries

 Let the user choose a combination of sources
 Process only relevant combinations of sources


Contributions
 Introduce the novel problem of keyword query routing

 Propose the multi-level relationship graph to capture its
search space.
 Introduce various summary models, which aim to
compactly represent the search space.

 Investigate the resulting trade-offs between result quality
and efficiency through theoretical analysis and practical
experiments using publicly available linked data sources.



LOD Element-level Graph
 Web data modeled as a set of interlinked data graphs
 Each data graph represents a source
 Element-level graph vs. schema-level graph vs. source-level graph

[Figure: element-level graph spanning the sources Freebase, DBLP and DBPedia. Elements: uni1, pub1, pub2, pub3, per1, per2, per3, per4, prize1, prize2. Edge labels: author, prizes, employ, sameAs, name, title, label. Literal values include Stanford University, John McCarthy, John Smith, Turing Award and Music Award.]


LOD Schema-level Graph


[Figure: schema-level graph with the classes University, Person, Article, Written Work, Author and Prize.]


LOD Source-level Graph


[Figure: source-level graph; the sources are connected by author and sameAs links.]


“Corresponding” Answers
User information need: “stanford article award”

[Figure: answer graph for the query; visible labels include Article, type, title, name, label, author and prizes.]




Problem Definition

 A keyword query result (also called a Steiner graph) is a
subgraph of the union of the data-level and schema-level graphs
that contains a matching element for every keyword, and in
which these elements are pairwise connected over a path.

 A d-max Steiner graph is a Steiner graph in which every path
between keyword elements has length d-max or less.

 Keyword query routing: compute the valid sets of data sources,
called keyword routing plans. A plan is valid if its sources
produce non-empty keyword query results.
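
The definitions above can be illustrated with a minimal sketch. It is not the algorithm from the paper: the toy elements, the edges and the function names are assumptions made for illustration, and plan validity is simplified to "one matching element per keyword, pairwise within d-max hops, using only elements of the plan's sources".

```python
from collections import deque
from itertools import product

# Hypothetical toy data: element -> (source, keywords the element mentions).
ELEMENTS = {
    "uni1":   ("Freebase", {"stanford", "university"}),
    "per2":   ("Freebase", {"john", "mccarthy"}),
    "per1":   ("DBLP",     {"john", "mccarthy"}),
    "pub1":   ("DBLP",     {"article"}),
    "per3":   ("DBPedia",  {"john", "mccarthy"}),
    "prize1": ("DBPedia",  {"turing", "award"}),
}
# Hypothetical undirected links (employ, sameAs, author, sameAs, prizes).
EDGES = [("uni1", "per2"), ("per2", "per1"), ("per1", "pub1"),
         ("per1", "per3"), ("per3", "prize1")]


def distances(start, allowed, adj):
    """BFS hop counts from start, visiting only elements in `allowed`."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt in allowed and nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def is_valid_plan(plan, keywords, d_max):
    """True if the sources in `plan` yield a d-max Steiner graph (simplified)."""
    allowed = {e for e, (src, _) in ELEMENTS.items() if src in plan}
    adj = {}
    for a, b in EDGES:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    dist = {e: distances(e, allowed, adj) for e in allowed}
    # One list of matching elements per keyword.
    candidates = [[e for e in allowed if k in ELEMENTS[e][1]] for k in keywords]
    for combo in product(*candidates):
        if all(dist[a].get(b, d_max + 1) <= d_max for a in combo for b in combo):
            return True
    return False
```

With this toy data, the plan {Freebase, DBLP, DBPedia} is valid for “stanford article award” when d-max is 4, because the longest pairwise path (uni1 to prize1) has four hops; it is not valid when d-max is 3.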


A Valid Keyword Routing Plan
User information need: “stanford article award”

[Figure: example graph for the query; visible labels include Article, type, title, name, label, author and prizes.]




The Search Space
 Multi-level inter-relationship graphs capture the entire search space
 Relationships between elements
 and between different levels

 Search space is too large!
 Naïve solution (applying existing keyword search approaches
to compute Steiner graphs) is not applicable
 Steiner graphs might span several linked sources
 Search space grows exponentially with the number of
sources and their associated links



Keyword Sets
 One keyword set for every data source
 Elements stand for distinct keywords mentioned in a source

[Figure: the example graph with one keyword set per source; keywords shown include Stanford, University, John, McCarthy, Smith, Turing, Music and Award.]
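
A keyword set per source can be sketched as follows. The literal values are hypothetical, and the routing step (return the minimal source combinations whose keyword sets jointly cover the query) is an assumption about how such a baseline could be used, not a description of the authors' implementation.

```python
from itertools import combinations

# Hypothetical literal values found in each source.
LITERALS = {
    "Freebase": ["Stanford University", "John McCarthy"],
    "DBLP":     ["John McCarthy", "Article"],
    "DBPedia":  ["John McCarthy", "Turing Award", "John Smith", "Music Award"],
}


def keyword_sets(literals):
    """One set of distinct, lower-cased keywords per source."""
    return {src: {tok.lower() for text in texts for tok in text.split()}
            for src, texts in literals.items()}


def covering_plans(ks, keywords):
    """Minimal source combinations whose keyword sets cover all keywords."""
    plans = []
    sources = sorted(ks)
    wanted = set(keywords)
    for size in range(1, len(sources) + 1):
        for combo in combinations(sources, size):
            covered = set().union(*(ks[s] for s in combo))
            # Skip supersets of a smaller plan that already covers the query.
            is_minimal = not any(set(p) <= set(combo) for p in plans)
            if wanted <= covered and is_minimal:
                plans.append(combo)
    return plans


KS = keyword_sets(LITERALS)
```

Because keyword sets ignore how elements are connected, a plan found this way only says that the keywords occur somewhere in the chosen sources, not that they are linked.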


Element-level Keyword-Element Relationship Graph (E-KERG)
 A keyword-element captures a keyword k and the data element mentioning k
 A relationship between two keyword-elements exists iff there is a path between
their associated data elements
 In d-max KERG, the paths to be considered have length d-max or less
[Figure: E-KERG for the example; keyword-elements pair keywords such as John, Smith, Music, Award, University, McCarthy and Turing with the elements uni1, pub1 to pub4, per1 to per4, prize1 and prize2.]
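
The construction of a d-max E-KERG can be sketched with a bounded breadth-first search. The toy elements and edges are hypothetical, and relating two keywords mentioned by the same element (path length 0) is a choice made for this sketch.

```python
from collections import deque

# Hypothetical toy data: element -> keywords, plus undirected links.
ELEMENTS = {
    "uni1":   {"stanford"},
    "per1":   {"mccarthy"},
    "pub1":   {"article"},
    "prize1": {"award"},
}
EDGES = [("uni1", "per1"), ("per1", "pub1"), ("per1", "prize1")]


def within(start, adj, d_max):
    """Elements reachable from start in at most d_max hops, with hop counts."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == d_max:
            continue  # do not expand beyond the d-max radius
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def build_e_kerg(elements, edges, d_max):
    """Return (keyword-elements, relationships) of the d-max E-KERG."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = {(k, e) for e, kws in elements.items() for k in kws}
    rels = set()
    for e, kws in elements.items():
        for other in within(e, adj, d_max):
            for k1 in kws:
                for k2 in elements.get(other, ()):
                    if (k1, e) != (k2, other):
                        rels.add(frozenset([(k1, e), (k2, other)]))
    return nodes, rels
```

On this toy graph, d-max = 1 yields three relationships (the direct links), while d-max = 2 yields six, since every pair of elements is then within reach through per1.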

Schema-level Keyword-Element Relationship Graph (S-KERG)
 A keyword-element captures a keyword k and the schema element which contains
some instances (data elements) mentioning k
 A relationship between two keyword-elements exists if there is a path between some
instances of their associated schema elements
 Groups elements (relationships) when they capture the same pair of keywords in the
same class (same keyword relationships between the same pair of classes)
[Figure: S-KERG for the example; keyword-elements are grouped under the classes University, Person, Author, Article and Prize.]
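
The grouping step can be sketched as a projection of element-level relationships onto classes. The class assignments and the input relationships are hypothetical; the sketch only shows how several element-level relationships collapse into one schema-level relationship.

```python
# Hypothetical class of each data element.
ELEMENT_CLASS = {
    "per1": "Person", "per3": "Person",
    "pub1": "Article", "pub2": "Article",
}
# Hypothetical element-level relationships between keyword-elements.
E_RELS = [
    (("mccarthy", "per1"), ("article", "pub1")),
    (("mccarthy", "per3"), ("article", "pub2")),
]


def build_s_kerg(e_rels, element_class):
    """Replace each data element by its class; duplicates merge in the sets."""
    nodes, rels = set(), set()
    for (k1, e1), (k2, e2) in e_rels:
        n1 = (k1, element_class[e1])
        n2 = (k2, element_class[e2])
        nodes.update((n1, n2))
        rels.add(frozenset((n1, n2)))
    return nodes, rels


S_NODES, S_RELS = build_s_kerg(E_RELS, ELEMENT_CLASS)
```

Here two element-level relationships become a single relationship between (mccarthy, Person) and (article, Article), which is the compaction the summary aims for.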

Data-Source-level Keyword-Element Relationship Graph (D-KERG)
 A keyword-element captures a keyword k and the source which contains some
instances (data elements) mentioning k
 A relationship between two keyword-elements exists if there is a path between some
instances of their associated sources
 Groups elements (relationships) when they capture the same pair of keywords in the
same source (same keyword relationships between the same pair of sources)
[Figure: D-KERG for the example graph; visible labels include the classes University, Person, Author, Article and Prize.]
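
A source-level summary and a simple way to read routing plans off it can be sketched as follows. The data is hypothetical, and the plan computation (pick one keyword-element per keyword such that all pairs are related) is a simplifying assumption; the slides defer the actual algorithms to the paper.

```python
from itertools import product

# Hypothetical source of each data element.
ELEMENT_SOURCE = {"uni1": "Freebase", "pub1": "DBLP", "prize1": "DBPedia"}
# Hypothetical element-level relationships (e.g. taken from a 4-max E-KERG).
E_RELS = [
    (("stanford", "uni1"), ("article", "pub1")),
    (("stanford", "uni1"), ("award", "prize1")),
    (("article", "pub1"), ("award", "prize1")),
]


def build_d_kerg(e_rels, element_source):
    """Replace each data element by its source; duplicates merge in the set."""
    return {frozenset(((k1, element_source[e1]), (k2, element_source[e2])))
            for (k1, e1), (k2, e2) in e_rels}


def routing_plans(d_rels, keywords):
    """Source sets whose keyword-elements are pairwise related in the summary."""
    nodes = {n for rel in d_rels for n in rel}
    candidates = [[n for n in nodes if n[0] == k] for k in keywords]
    plans = set()
    for combo in product(*candidates):
        pairs_ok = all(frozenset((a, b)) in d_rels
                       for i, a in enumerate(combo) for b in combo[i + 1:])
        if pairs_ok:
            plans.add(frozenset(src for _, src in combo))
    return plans


D_RELS = build_d_kerg(E_RELS, ELEMENT_SOURCE)
```

For the query “stanford article award” this sketch returns the single plan {Freebase, DBLP, DBPedia}.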


Theoretical Results
 If a Steiner graph can be found for K in the
data, then a corresponding keyword routing plan
can be found in the KERG.
 Keyword routing plans derived from a
summary are not necessarily valid: there might
be no corresponding Steiner graph in the data.
 Detailed results + algorithms + complexity results in
the paper!
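
The second point can be made concrete with a small counterexample. The sources A, B, C, the elements and the keywords k1, k2, k3 are all hypothetical, and "pairwise related" is used as a simplified stand-in for the existence of a Steiner graph.

```python
from itertools import product

# Hypothetical source of each data element.
SOURCE = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C", "c2": "C"}
# Hypothetical element-level relationships: each pair of keywords is related,
# but always through different elements.
E_RELS = {
    frozenset((("k1", "a1"), ("k2", "b1"))),
    frozenset((("k2", "b2"), ("k3", "c1"))),
    frozenset((("k1", "a2"), ("k3", "c2"))),
}


def lift(rel):
    """Project an element-level relationship to the source level."""
    return frozenset((k, SOURCE[e]) for k, e in rel)


def pairwise_related(rels, keywords):
    """True if one node per keyword can be chosen so that all pairs are related."""
    nodes = {n for rel in rels for n in rel}
    candidates = [[n for n in nodes if n[0] == k] for k in keywords]
    return any(
        all(frozenset((a, b)) in rels
            for i, a in enumerate(combo) for b in combo[i + 1:])
        for combo in product(*candidates)
    )


D_RELS = {lift(r) for r in E_RELS}
QUERY = ["k1", "k2", "k3"]
in_summary = pairwise_related(D_RELS, QUERY)  # plan {A, B, C} is derived
in_data = pairwise_related(E_RELS, QUERY)     # no matching elements exist
```

At the source level all three keyword pairs appear related, so the plan {A, B, C} is produced; at the element level no single choice of elements connects all three keywords, so the plan is not valid.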


Experiments

 Chunk of the BTC dataset containing 10M RDF
triples from 154 sources, linked via 500K mappings

 30 manually crafted, valid multi-data-source keyword
queries, i.e., queries that produce non-empty keyword
answers and involve more than 2 sources
 Town River America
 Beijing Conference Database 2007


Validity
 P@k measures the percentage of valid plans among the top-k plans
 P@5 up to 100% for E-KERG (dmax = 4); P@5 for KS only 6%
 More valid plans were computed when a higher value was used for dmax
 dmax = 3 seems to be a good trade-off
 Queries with a larger number of keywords resulted in lower precision

[Figure: P@5 (0.0 to 1.0) for E-KERG, S-KERG, D-KERG and KS; left panel over dmax (0 to 4), right panel over |K| (2 to 5).]
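
The P@k measure can be sketched in a few lines. The function name, the example plans and the validity oracle are hypothetical; in the experiments validity would be decided against the actual data.

```python
def precision_at_k(ranked_plans, is_valid, k):
    """Fraction of valid plans among the first k ranked plans."""
    top = ranked_plans[:k]
    if not top:
        return 0.0
    return sum(1 for plan in top if is_valid(plan)) / len(top)


# Hypothetical ranked plans and validity oracle for illustration.
RANKED = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "D"), ("C", "D"), ("B", "D")]
VALID = {("A", "B"), ("B", "C"), ("C", "D")}
p_at_5 = precision_at_k(RANKED, lambda plan: plan in VALID, 5)
```

In this example three of the top five plans are valid, so P@5 is 0.6.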

Performance
 Times increased with higher values for dmax
 Sharply for E-KERG and S-KERG
 Relatively stable for D-KERG
 Times increased with the number of keywords
 All models except D-KERG performed poorly on complex queries
 E-KERG needed more than 100 s for queries with more than 2 keywords
 Time for D-KERG was no more than 10 ms on average

[Figure: query processing time in ms (log scale, 1 to 1,000,000) for S-KERG, D-KERG, KS and E-KERG; left panel over dmax (0 to 4), right panel over |K| (2 to 5).]


Conclusions

 Keyword query routing helps users without knowledge of linked data
and schemas to find combinations of sources that contain answers
corresponding to their needs
 Summarizing relationships is essential for dealing with the large-scale
linked data Web (E-KERG performed poorly, requiring more
than 100 s for complex queries)
 Summarizing at the level of sources (D-KERG) represents the most
practical trade-off: it produces results in less than 10 ms, of which
every second one was valid
 However, validity is still low for complex queries (<30% with 4 keywords)

 Baseline approaches for a novel problem
 Further improve validity and consider relevance!
 Combine keyword query routing with source and structured query
processing to compute final results!

Thanks for Your Attention!

Institute AIFB, KIT

ducthanh.tran@kit.edu

