Linked Data Top-K Query Processing

Top-k Linked Data Query Processing
Andreas Wagner, Duc Thanh Tran, Günter Ladwig,
Andreas Harth, and Rudi Studer

Institute of Applied Informatics and Formal Description Methods (AIFB)

KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association www.kit.edu

Introduction and Motivation

Top-k Linked Data Query Processing

Evaluation Results


INTRODUCTION & MOTIVATION


Linked Data Query Processing

[Figure: a Linked Data query processing engine dereferences URIs via HTTP lookups and retrieves data from the data sources.]

Problems: Efficiency and Scalability


Top-K Query Processing

Users are usually interested in only a few results
Top-k query processing addresses the efficiency and scalability issues

Example data:

Src. 1:
    ex:beatles foaf:name "The Beatles";
        ex:album ex:sgt_pepper;
        ex:album ex:help.

Src. 2:
    ex:sgt_pepper foaf:name "Sgt. Pepper";
        ex:song "Lucy".

Src. 3:
    ex:help foaf:name "Help!";
        ex:song "Help!".

Example query:

    SELECT * WHERE {
        ex:beatles ex:album ?album .
        ?album ex:song ?song .
    }


Contributions

Transfer top-k query processing to the Linked Data setting

Linked Data specific improvements of the top-k approach

Evaluation using real-world data


TOP-K LINKED DATA QUERY
PROCESSING


Top-K Query Processing in a Linked Data
Setting (1) – Requirements (1)

Source index mapping triple patterns to sources containing
bindings (e.g., [1,2])
Ranking function determining the relevance of triple pattern
bindings

Example:
TP1: ex:beatles ex:album ?album .
TP2: ?album ex:song ?song .

[Figure: the Linked Data query processing engine consults the source index, which maps TP1 to Src. 1 and TP2 to Src. 2 and Src. 3.]

Src. 1, score ∈ [0,1]:
    ex:beatles foaf:name "The Beatles";
        ex:album ex:sgt_pepper;
        ex:album ex:help.

Src. 2, score ∈ [1,2]:
    ex:sgt_pepper foaf:name "Sgt. Pepper";
        ex:song "Lucy".

Src. 3, score ∈ [2,3]:
    ex:help foaf:name "Help!";
        ex:song "Help!".
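
The two requirements can be pictured with a small lookup structure. The following is a minimal sketch, assuming an in-memory dictionary as source index and one score interval per source; the names `SOURCE_INDEX`, `SOURCE_SCORES`, `candidate_sources`, and `score_upper_bound` are hypothetical and not taken from the slides.

```python
# Hypothetical in-memory source index: triple pattern -> candidate sources.
SOURCE_INDEX = {
    "ex:beatles ex:album ?album": ["Src. 1"],
    "?album ex:song ?song": ["Src. 2", "Src. 3"],
}

# Assumed score interval (lower, upper) per source, as in the example.
SOURCE_SCORES = {"Src. 1": (0, 1), "Src. 2": (1, 2), "Src. 3": (2, 3)}


def candidate_sources(pattern):
    """Look up the sources that may contain bindings for a triple pattern."""
    return SOURCE_INDEX.get(pattern, [])


def score_upper_bound(pattern):
    """Highest score any binding of the pattern can have, or None if unknown."""
    uppers = [SOURCE_SCORES[s][1] for s in candidate_sources(pattern)]
    return max(uppers) if uppers else None
```

For TP2 the lookup returns Src. 2 and Src. 3, so no binding of TP2 can score higher than 3 under these assumed intervals.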


Top-K Query Processing in a Linked Data
Setting (2) – Requirements (2)

Sorted access on each join input: bindings are retrieved with descending scores

TP1: ex:beatles ex:album ?album
TP2: ?album ex:song ?song

Scheduling strategy (order in which the sources are loaded):
1. Src. 1, score ∈ [0,1] (TP1)
2. Src. 3, score ∈ [2,3] (TP2)
3. Src. 2, score ∈ [1,2] (TP2)
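
One way to obtain sorted access over several sources is to load them by descending score upper bound and to release a buffered binding only once no unloaded source could still beat it. The sketch below illustrates this under stated assumptions: `sorted_access` and the `load` callback are hypothetical names, and the score 2 for the "Lucy" triple is an assumed value inside the interval of Src. 2.

```python
import heapq


def sorted_access(sources, load):
    """Yield (score, triple) pairs of one join input in descending score order.

    sources: list of (source_id, score_upper_bound) pairs.
    load:    callable source_id -> list of (score, triple) pairs.
    """
    pending = sorted(sources, key=lambda s: s[1], reverse=True)
    buffer = []  # max-heap emulated with negated scores
    i = 0
    while i < len(pending) or buffer:
        # Best score an unloaded source could still deliver.
        next_bound = pending[i][1] if i < len(pending) else float("-inf")
        if buffer and -buffer[0][0] >= next_bound:
            # No unloaded source can beat the best buffered binding.
            neg_score, triple = heapq.heappop(buffer)
            yield -neg_score, triple
        else:
            source_id, _ = pending[i]
            i += 1
            for score, triple in load(source_id):
                heapq.heappush(buffer, (-score, triple))


# Usage with assumed scores for the TP2 sources of the example.
DATA = {
    "Src. 3": [(3, ("ex:help", "ex:song", "Help!"))],
    "Src. 2": [(2, ("ex:sgt_pepper", "ex:song", "Lucy"))],
}
tp2_bindings = list(sorted_access([("Src. 2", 2), ("Src. 3", 3)], DATA.__getitem__))
```

Because the generator is lazy, a consumer that stops after the first binding never triggers the load of Src. 2.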


Top-K Query Processing in a Linked Data
Setting (3) – Push Bound Rank Join (1)

Scheduling strategy: load source 1, then source 3

Score  Seen Triples (TP1)
1      ex:beatles ex:album ex:sgt_pepper
1      ex:beatles ex:album ex:help

Score  Seen Triples (TP2)
3      ex:help ex:song "Help!"

Score  Query Bindings (Output Queue)
(no query bindings yet)

Sorted access for ex:beatles ex:album ?album . (Src. 1)
Sorted access for ?album ex:song ?song (Src. 3)

Top-K Query Processing in a Linked Data
Setting (4) – Push Bound Rank Join (2)

Threshold: 4

Score  Query Bindings (Output Queue)
4      ex:beatles ex:album ex:help .
       ex:help ex:song "Help!" .

Score  Seen Triples (TP1)
1      ex:beatles ex:album ex:sgt_pepper
1      ex:beatles ex:album ex:help

Score  Seen Triples (TP2)
3      ex:help ex:song "Help!"

Found query binding with score ≥ threshold: STOP

Sorted access for ex:beatles ex:album ?album .
Sorted access for ?album ex:song ?song (Src. 2 not yet loaded)
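
The stopping behaviour shown above can be sketched as a small rank join. This is a simplified illustration, not the operator from the slides: it reads the two inputs alternately instead of being driven by a push-based scheduler, joins the object of a TP1 triple with the subject of a TP2 triple, and sums the two scores. The function names and the score 2 for the "Lucy" triple are assumptions.

```python
import heapq


def corner_bound(inputs, seen):
    """Upper bound on the score of any join result not yet produced."""
    terms = []
    for i in (0, 1):
        j = 1 - i
        if len(seen[i]) == len(inputs[i]):
            continue  # input i has no unseen bindings left
        if not seen[i] or not seen[j]:
            return float("inf")  # nothing read yet, no usable bound
        min_i = seen[i][-1][0]  # lowest score read on input i
        max_j = seen[j][0][0]   # highest score read on input j
        terms.append(min_i + max_j)
    return max(terms, default=float("-inf"))


def rank_join_top_k(left, right, k):
    """left/right: lists of (score, (s, p, o)) in descending score order."""
    inputs, seen = (left, right), ([], [])
    candidates, output = [], []  # candidates: heap of (-score, t1, t2)
    turn = 0
    while len(output) < k:
        if len(seen[turn]) == len(inputs[turn]):
            turn = 1 - turn  # this input is exhausted, try the other one
        if len(seen[turn]) < len(inputs[turn]):
            score, triple = inputs[turn][len(seen[turn])]
            seen[turn].append((score, triple))
            for other_score, other in seen[1 - turn]:
                t1, t2 = (triple, other) if turn == 0 else (other, triple)
                if t1[2] == t2[0]:  # join condition: ?album
                    heapq.heappush(candidates, (-(score + other_score), t1, t2))
            turn = 1 - turn
        bound = corner_bound(inputs, seen)
        # Report every candidate that no unseen result can beat.
        while candidates and len(output) < k and -candidates[0][0] >= bound:
            neg_score, t1, t2 = heapq.heappop(candidates)
            output.append((-neg_score, t1, t2))
        if bound == float("-inf"):
            break  # both inputs exhausted
    return output


TP1 = [(1, ("ex:beatles", "ex:album", "ex:sgt_pepper")),
       (1, ("ex:beatles", "ex:album", "ex:help"))]
TP2 = [(3, ("ex:help", "ex:song", "Help!")),
       (2, ("ex:sgt_pepper", "ex:song", "Lucy"))]
top1 = rank_join_top_k(TP1, TP2, k=1)
```

With k = 1 the sketch stops after three reads: the binding with score 4 reaches the threshold of 4, and the "Lucy" triple is never consumed.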


Improving the Threshold Estimation (1)

Threshold estimation:
Threshold: max { max_1 + min_2 , max_2 + min_1 }
max_i: upper bound for the seen bindings of input i (highest score in Seen Triples (TPi))
min_i: upper bound for the unseen bindings of input i (lowest score in Seen Triples (TPi))

We improve the threshold estimation:
Star-shaped entity query bounds
Look-ahead bounds


Star-shaped Entity Query Bounds

Observation: Results for entity queries come from one single
source
Idea: Upper bound scores for triple pattern bindings via the
maximal possible triple score


Example entity query:
    ?x foaf:name ?z .
    ?x ex:song ?y .

Src. 2, score ∈ [1,2]:
    ex:sgt_pepper foaf:name "Sgt. Pepper";
        ex:song "Lucy".

Src. 3, score ∈ [2,3]:
    ex:help foaf:name "Help!";
        ex:song "Help!".

Upper bound for triple bindings of ?x ex:song ?y: 3
Upper bound for triple bindings of ?x foaf:name ?z: 3
Upper bound for entity query bindings: 3 + 3
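
The arithmetic behind this bound is simple enough to write down. A minimal sketch, assuming each source is described by a score interval (lower, upper) and that all bindings of one entity query result come from the same source; the function name `entity_query_upper_bound` is hypothetical.

```python
def entity_query_upper_bound(source_interval, num_patterns):
    """Upper bound for a star-shaped (entity) query answered by one source.

    Every triple pattern binding is bounded by the maximal possible triple
    score of the source, so the query bound is that maximum once per pattern.
    """
    _lower, upper = source_interval
    return num_patterns * upper


# Two patterns (foaf:name, ex:song) over a source with score in [2, 3].
bound = entity_query_upper_bound((2, 3), num_patterns=2)  # 3 + 3
```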

Look-ahead Bounds
Idea: Provide a more accurate upper bound for the scores of unseen
bindings via the "next possible" score

Threshold: max { 1 + 3 , 1 + 3 } = 4

Score  Query Bindings (Output Queue)
4      ex:beatles ex:album ex:help .
       ex:help ex:song "Help!" .

Score  Seen Triples (TP1)                    max_1 = 1, min_1 = 1
1      ex:beatles ex:album ex:sgt_pepper
1      ex:beatles ex:album ex:help

Score  Seen Triples (TP2)                    max_2 = 3, min_2 = 3
3      ex:help ex:song "Help!" (Src. 3)

Sorted access for ex:beatles ex:album ?album .
Sorted access for ?album ex:song ?song: the next source is Src. 2 with score ∈ [1,2]

Look-ahead: min_2 = 3 is replaced by min_2 = 2
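
The look-ahead idea can be sketched as a replacement for min_i. The code below is an illustration under two assumptions that are not stated on the slides: every binding of an already loaded source has been read, and the only information about unread sources is their score upper bound. The name `unseen_upper_bound` is hypothetical.

```python
def unseen_upper_bound(last_seen_score, unread_source_uppers, look_ahead=True):
    """Upper bound for the scores of bindings not yet read on one join input.

    last_seen_score:      lowest score read so far on this input (min_i).
    unread_source_uppers: score upper bounds of the sources not yet loaded.
    """
    if not look_ahead:
        return last_seen_score
    if not unread_source_uppers:
        return float("-inf")  # nothing left to read on this input
    # Unseen bindings can only come from sources that are still unread.
    return min(last_seen_score, max(unread_source_uppers))


# TP2 in the example: last seen score 3, only Src. 2 (upper bound 2) is unread.
min_2_plain = unseen_upper_bound(3, [2], look_ahead=False)
min_2_look_ahead = unseen_upper_bound(3, [2])
```

In the example this tightens min_2 from 3 to 2, matching the values shown above.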

EVALUATION


Evaluation – Setting

We implemented three systems
Push-based symmetric hash join operator [2,5]
Standard top-k operator [6]
Improved top-k operator

Query set: 20 queries (8 FedBench queries and 12 of our own), with
varying result size (1 to ~10,000) and complexity (2 to 5 triple
patterns)

Data set: ~2,000,000 triples, distributed over ~700,000 sources

Parameters: k ∈ {1,5,10,20} and score distributions ∈ {uniform,
normal, exponential}


Evaluation – Results (1)

Overall Results

Overview of processing times for all queries (k = 1, d = n)

Top-k strategies lead to runtime improvement of 35% on average
(compared to standard Linked Data processing)

Tighter bounds lead to further improvements of 12% on average
(compared to standard top-k processing)

Evaluation – Results (2)

Effect of K and Score Distributions


CONCLUSION


Conclusion

We showed that top-k processing techniques are applicable
to the Linked Data setting.

Top-k strategies lead to significant time savings for small
values of k (in our experiments 35% on average)

We showed that our improved top-k strategy leads to further
runtime advantages (in our experiments 12% on average)


QUESTIONS


REFERENCES


References
[1] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data
summaries for on-demand queries over linked data. In World Wide Web,
2010.
[2] G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In ISWC,
2010.
[3] M. Wu, L. Berti-Equille, A. Marian, C. M. Procopiuc, and D. Srivastava.
Processing top-k join queries. Proc. VLDB Endow., pages 860–870, 2010.
[4] A. Harth, S. Kinsella, and S. Decker. Using naming authority to rank data and
ontologies for web search. In ISWC, pages 277–292, 2009.
[5] G. Ladwig and T. Tran. SIHJoin: Querying Remote and Local Linked Data. In
ESWC, 2011.
[6] K. Schnaitter and N. Polyzotis. Optimal algorithms for evaluating rank joins in
database systems. ACM Trans. Database Syst., 35:6:1–6:47, 2010.


BACKUP SLIDES


Early Pruning of Partial Results

Motivation: Top-k join processing can be quite costly in terms of
memory consumption
Idea: Prune such partial query results that cannot contribute to
a final top-k result

Example entity query:
    ?x foaf:name ?z .
    ?x ex:song ?y .

Currently known top-2 results:
Rank  Query Bindings (Output Queue)
6     ex:help foaf:name "Help!".
      ex:help ex:song "Help!" .
4     ex:sgt_pepper foaf:name "Sgt. Pepper".
      ex:sgt_pepper ex:song "Lucy".

Currently known partial results:
Rank  Triple Pattern Binding
1     ex:sgt_pepper ex:song "Getting Better".

Upper bound for triple bindings (foaf:name ?z): 3
Maximal score of the partial result: 3 + 1 = 4 ≤ 4 (score of the current second-best result)
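
The pruning test can be sketched as a single comparison. This is a minimal illustration, assuming scores are summed and that a partial result may be dropped when its best possible score does not exceed the k-th best known result; the name `can_prune` and its parameters are hypothetical.

```python
def can_prune(partial_score, remaining_upper_bounds, known_scores, k):
    """Decide whether a partial result can no longer enter the top-k.

    partial_score:          score of the bindings collected so far.
    remaining_upper_bounds: upper bounds for the patterns still unbound.
    known_scores:           scores of the complete results found so far.
    """
    if len(known_scores) < k:
        return False  # fewer than k results known, keep everything
    kth_best = sorted(known_scores, reverse=True)[k - 1]
    best_possible = partial_score + sum(remaining_upper_bounds)
    return best_possible <= kth_best


# Example: partial score 1, one unbound pattern with bound 3, top-2 scores 6 and 4.
prune = can_prune(1, [3], [6, 4], k=2)  # 3 + 1 = 4 <= 4
```

Pruning on equality, as the comparison in the example suggests, is safe as long as ties among results with equal score may be broken arbitrarily.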
