Linked Data Top-K Query Processing
1. Top-k Linked Data Query Processing
Andreas Wagner, Duc Thanh Tran, Günter Ladwig,
Andreas Harth, and Rudi Studer
Institute of Applied Informatics and Formal Description Methods (AIFB)
KIT – University of the State of Baden-Wuerttemberg and
National Research Center of the Helmholtz Association www.kit.edu
2. Introduction and Motivation
Top-k Linked Data Query Processing
Evaluation Results
4. Linked Data Query Processing
[Diagram: the Linked Data query processing engine retrieves data from the data sources via HTTP lookups on source URIs.]
Problems: Efficiency and Scalability
5. Top-K Query Processing
Users are usually interested in only a few results
Top-K query processing addresses the efficiency and
scalability issues
Src. 1:
ex:beatles foaf:name "The Beatles";
    ex:album ex:sgt_pepper;
    ex:album ex:help.
Src. 2:
ex:sgt_pepper foaf:name "Sgt. Pepper";
    ex:song "Lucy".
Src. 3:
ex:help foaf:name "Help!";
    ex:song "Help!".
Query:
SELECT * WHERE {
    ex:beatles ex:album ?album .
    ?album ex:song ?song .
}
6. Contributions
Transfer top-k query processing to the Linked Data setting
Linked Data specific improvements of the top-k approach
Evaluation using real-world data
7. TOP-K LINKED DATA QUERY PROCESSING
8. Top-K Query Processing in a Linked Data Setting (1) – Requirements (1)
Source index mapping triple patterns to sources containing bindings (e.g., [1,2])
Ranking function determining the relevance of triple pattern bindings
TP1: ex:beatles ex:album ?album .
TP2: ?album ex:song ?song .
[Diagram: the Linked Data query processing engine consults the source index, which maps TP1 to Src. 1 and TP2 to Src. 2 and Src. 3.]
Src. 1, score ∈ [0,1]:
ex:beatles foaf:name "The Beatles";
    ex:album ex:sgt_pepper;
    ex:album ex:help.
Src. 2, score ∈ [1,2]:
ex:sgt_pepper foaf:name "Sgt. Pepper";
    ex:song "Lucy".
Src. 3, score ∈ [2,3]:
ex:help foaf:name "Help!";
    ex:song "Help!".
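The source index requirement above can be sketched as a simple lookup table. This is a minimal illustration, not the index of [1,2]: the names `SourceEntry`, `SOURCE_INDEX`, `sources_for`, and `score_bounds` are hypothetical, the slide labels stand in for source URIs, and exact-match lookup replaces real pattern matching.

```python
from typing import NamedTuple

class SourceEntry(NamedTuple):
    source: str        # slide label standing in for the source URI
    min_score: float   # lowest triple score in the source
    max_score: float   # highest triple score in the source

# Hypothetical index for the running example: triple pattern -> sources.
SOURCE_INDEX = {
    ("ex:beatles", "ex:album", "?album"): [SourceEntry("Src. 1", 0, 1)],
    ("?album", "ex:song", "?song"): [
        SourceEntry("Src. 2", 1, 2),
        SourceEntry("Src. 3", 2, 3),
    ],
}

def sources_for(pattern):
    """Return the sources that may contain bindings for the pattern."""
    return SOURCE_INDEX.get(pattern, [])

def score_bounds(pattern):
    """Return (lowest, highest) possible binding score, or None if unknown."""
    entries = sources_for(pattern)
    if not entries:
        return None
    return (min(e.min_score for e in entries),
            max(e.max_score for e in entries))

TP2 = ("?album", "ex:song", "?song")
tp2_sources = [e.source for e in sources_for(TP2)]
tp2_bounds = score_bounds(TP2)
```

Storing the score range next to each source is what later allows sources to be scheduled by score without fetching them first.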
9. Top-K Query Processing in a Linked Data Setting (2) – Requirements (2)
Sorted access on each join input
Scheduling Strategy: sources are loaded in the order Src. 1 (1), Src. 3 (2), Src. 2 (3), yielding bindings with descending scores
TP1: ex:beatles ex:album ?album (Src. 1, score ∈ [0,1])
TP2: ?album ex:song ?song (Src. 3, score ∈ [2,3]; Src. 2, score ∈ [1,2])
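Sorted access over whole sources can be sketched as follows. This is an assumption-laden illustration rather than the authors' scheduler: `sorted_access` is a hypothetical name, each source is modeled as a `(max_score, load)` pair, and the triple scores (3, 2, 1) are illustrative values inside the source ranges shown on the slide.

```python
import heapq

def sorted_access(sources):
    """Yield (score, triple) in descending score order.

    sources: list of (max_score, load) pairs, where load() fetches the
    whole source and returns a list of (score, triple) pairs.
    """
    pending = sorted(sources, key=lambda s: s[0], reverse=True)
    buffer = []  # max-heap emulated with negated scores
    for i, (_, load) in enumerate(pending):
        for score, triple in load():
            heapq.heappush(buffer, (-score, triple))
        # No triple of a not-yet-loaded source can score above this bound.
        if i + 1 < len(pending):
            next_bound = pending[i + 1][0]
        else:
            next_bound = float("-inf")
        # Emit buffered triples that no unloaded source can still outrank.
        while buffer and -buffer[0][0] >= next_bound:
            neg_score, triple = heapq.heappop(buffer)
            yield -neg_score, triple

# Illustrative TP2 input: Src. 3 has scores in [2,3], Src. 2 in [1,2].
src_3 = (3, lambda: [(3, 'ex:help ex:song "Help!"')])
src_2 = (2, lambda: [(1, 'ex:sgt_pepper ex:song "Getting Better"'),
                     (2, 'ex:sgt_pepper ex:song "Lucy"')])
ordered = list(sorted_access([src_2, src_3]))
```

Because the function is a generator, a consumer that stops after the first binding never triggers the load of Src. 2.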
10. Top-K Query Processing in a Linked Data Setting (3) – Push Bound Rank Join (1)
Scheduling Strategy: Load source 1, then source 3
Query Bindings – Output Queue (Score, Binding): no entries shown
Seen Triples (TP1), sorted access for ex:beatles ex:album ?album . (Src. 1):
1 ex:beatles ex:album ex:sgt_pepper
1 ex:beatles ex:album ex:help
Seen Triples (TP2), sorted access for ?album ex:song ?song (Src. 3):
3 ex:help ex:song "Help!"
11. Top-K Query Processing in a Linked Data Setting (4) – Push Bound Rank Join (2)
Threshold: 4
Query Bindings – Output Queue (Score, Binding):
4 ex:beatles ex:album ex:help . ex:help ex:song "Help!" .
Seen Triples (TP1), sorted access for ex:beatles ex:album ?album . :
1 ex:beatles ex:album ex:sgt_pepper
1 ex:beatles ex:album ex:help
Seen Triples (TP2), sorted access for ?album ex:song ?song :
3 ex:help ex:song "Help!"
Found query binding with score ≥ threshold: STOP
Src. 2 has not been loaded.
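The walk-through on this slide can be replayed with a small two-input rank join. This is a sketch under stated assumptions, not the operator evaluated in the talk: `rank_join_top_k` and the `(input_id, join_key, score, triple)` arrival format are hypothetical, scores are summed, and each input is assumed to deliver its triples in descending score order.

```python
import heapq

def rank_join_top_k(arrivals, k):
    """Push-style rank join over two inputs with early termination."""
    seen = {1: {}, 2: {}}        # per input: join key -> [(score, triple)]
    max_s = {1: None, 2: None}   # highest score seen per input
    min_s = {1: None, 2: None}   # latest (lowest) score seen per input
    queue = []                   # max-heap of joined candidates
    results = []
    for inp, key, score, triple in arrivals:
        other = 2 if inp == 1 else 1
        seen[inp].setdefault(key, []).append((score, triple))
        max_s[inp] = score if max_s[inp] is None else max(max_s[inp], score)
        min_s[inp] = score
        # Probe the other input's seen triples for join partners.
        for o_score, o_triple in seen[other].get(key, []):
            pair = (triple, o_triple) if inp == 1 else (o_triple, triple)
            heapq.heappush(queue, (-(score + o_score), pair))
        if None in max_s.values():
            continue             # no bound yet for an input without data
        threshold = max(max_s[1] + min_s[2], max_s[2] + min_s[1])
        # Report candidates that no unseen combination can outrank.
        while queue and -queue[0][0] >= threshold and len(results) < k:
            neg, pair = heapq.heappop(queue)
            results.append((-neg, pair))
        if len(results) == k:
            break                # STOP: remaining sources are never loaded
    # All inputs exhausted: whatever is queued is final.
    while queue and len(results) < k:
        neg, pair = heapq.heappop(queue)
        results.append((-neg, pair))
    return results

arrivals = iter([
    (1, "ex:sgt_pepper", 1, "ex:beatles ex:album ex:sgt_pepper"),  # Src. 1
    (1, "ex:help", 1, "ex:beatles ex:album ex:help"),              # Src. 1
    (2, "ex:help", 3, 'ex:help ex:song "Help!"'),                  # Src. 3
    (2, "ex:sgt_pepper", 2, 'ex:sgt_pepper ex:song "Lucy"'),       # Src. 2
])
top = rank_join_top_k(arrivals, k=1)
```

With k = 1 the loop stops after the Src. 3 triple arrives, so the last arrival (standing in for Src. 2) is never consumed.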
12. Improving the Threshold Estimation (1)
Threshold estimation:
Threshold: max { max_1 + min_2 , max_2 + min_1 }
max_1, max_2: upper bound on the scores of the seen triples (TP1, TP2)
min_1, min_2: upper bound on the scores of the unseen triples (TP1, TP2)
We improve the threshold estimation:
Star-shaped entity query bounds
Look-ahead bounds
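The baseline threshold on this slide is the two-input corner bound. The sketch below computes it and, as an assumption beyond the slide, generalizes it to any number of join inputs in the usual way (one input contributes its unseen bound, all others their maximum); `corner_bound` is a hypothetical name.

```python
def corner_bound(max_scores, min_scores):
    """Upper bound on the score of any join result not yet produced.

    max_scores[i]: highest score seen on input i.
    min_scores[i]: upper bound on the unseen scores of input i.
    """
    if len(max_scores) != len(min_scores) or not max_scores:
        raise ValueError("need one max and one min score per input")
    total_max = sum(max_scores)
    # For each input i: unseen bound of i plus the maxima of all others.
    return max(total_max - mx + mn
               for mx, mn in zip(max_scores, min_scores))

# Running example: max_1 = 1, min_1 = 1, max_2 = 3, min_2 = 3.
threshold = corner_bound([1, 3], [1, 3])
```

For two inputs this reduces to max { max_1 + min_2 , max_2 + min_1 }, which gives 4 for the running example.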
13. Improving the Threshold Estimation (2)
Star-shaped Entity Query Bounds
Observation: Results for entity queries come from one single
source
Idea: Upper bound scores for triple pattern bindings via the
maximal possible triple score
Entity query: ?x ex:song ?y . ?x foaf:name ?z .
Src. 2, score ∈ [1,2]:
ex:sgt_pepper foaf:name "Sgt. Pepper";
    ex:song "Lucy".
Src. 3, score ∈ [2,3]:
ex:help foaf:name "Help!";
    ex:song "Help!".
Upper bound for triple bindings of ?x ex:song ?y: 3
Upper bound for triple bindings of ?x foaf:name ?z: 3
Upper bound for entity query bindings: 3 + 3
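The arithmetic behind the entity query bound can be sketched in a few lines. This assumes the bound is taken per source and that scores are summed over the triple patterns; the function names are hypothetical.

```python
def entity_query_bound(num_patterns, source_max_score):
    """Upper bound for entity query bindings drawn from one source.

    Every triple pattern binding is bounded by the source's maximal
    triple score, and all bindings of a result share that source.
    """
    return num_patterns * source_max_score

def best_entity_query_bound(num_patterns, source_max_scores):
    """Largest bound over all candidate sources (label -> max score)."""
    return max(entity_query_bound(num_patterns, m)
               for m in source_max_scores.values())

# Two patterns (?x ex:song ?y, ?x foaf:name ?z), Src. 3 with max score 3.
bound_src_3 = entity_query_bound(2, 3)
bound_all = best_entity_query_bound(2, {"Src. 2": 2, "Src. 3": 3})
```

For Src. 3 this reproduces the 3 + 3 shown on the slide.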
14. Improving the Threshold Estimation (3)
Look-ahead Bounds
Idea: Provide a more accurate upper bound for the scores of the unseen bindings via the "next possible" score
Threshold: max { 1 + 3 , 1 + 3 } = 4
Query Bindings – Output Queue (Score, Binding):
4 ex:beatles ex:album ex:help . ex:help ex:song "Help!" .
Seen Triples (TP1), sorted access for ex:beatles ex:album ?album . (max_1 = 1, min_1 = 1):
1 ex:beatles ex:album ex:sgt_pepper
1 ex:beatles ex:album ex:help
Seen Triples (TP2), sorted access for ?album ex:song ?song (max_2 = 3):
3 ex:help ex:song "Help!" (Src. 3)
min_2 = 3 is replaced by the look-ahead bound min_2 = 2, the maximal score of the not yet loaded Src. 2 (score ∈ [1,2])
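The look-ahead idea can be sketched as replacing the last seen score of an input by the best score that can still arrive. This is an illustration under assumptions, not the authors' implementation: `look_ahead_bound` is a hypothetical name, all triples of a loaded source are assumed to have been pushed already, and an input without unloaded sources conservatively keeps its last seen score.

```python
def look_ahead_bound(last_seen, unloaded_source_maxes):
    """Upper bound on the next score an input can still deliver.

    last_seen: lowest score seen so far on this input.
    unloaded_source_maxes: maximal scores of sources not yet loaded.
    """
    if not unloaded_source_maxes:
        return last_seen  # conservative fallback, still a valid bound
    # Nothing unseen can beat the best unloaded source, nor last_seen.
    return min(last_seen, max(unloaded_source_maxes))

# TP2: last seen score 3 (Src. 3); Src. 2 with max score 2 is unloaded.
min_2 = look_ahead_bound(3, [2])
# TP1: last seen score 1; no further source for this pattern.
min_1 = look_ahead_bound(1, [])
max_1, max_2 = 1, 3
threshold = max(max_1 + min_2, max_2 + min_1)
```

In the running example the tighter min_2 lowers only the first term, so the threshold stays at 4; the gain shows up whenever that term is the larger one.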
15. EVALUATION
16. Evaluation – Setting
We implemented three systems
Push-based symmetric hash join operator [2,5]
Standard top-k operator [6]
Improved top-k operator
Query set: 20 queries (8 FedBench and 12 own queries), having
varying result size (1 to ~10,000) and complexity (2 to 5 triple
patterns)
Data set: ~2,000,000 triples, distributed over ~700,000 sources
Parameters: k ∈ {1,5,10,20} and score distributions ∈ {uniform,
normal, exponential}
17. Evaluation – Results (1)
Overall Results
Overview of processing times for all queries (k = 1, d = n)
Top-k strategies lead to a runtime improvement of 35% on average
(compared to standard Linked Data processing)
Tighter bounding leads to a further improvement of 12% on average
(compared to standard top-k processing)
18. Evaluation – Results (2)
Effect of K and Score Distributions
19. CONCLUSION
20. Conclusion
We showed that top-k processing techniques are applicable
to the Linked Data setting.
Top-k strategies lead to significant time savings w.r.t. small
values of k (in our experiments 35% on average)
We showed that our improved top-k strategy leads to further
runtime advantages (in our experiments 12% on average)
21. QUESTIONS
22. REFERENCES
23. References
[1] A. Harth, K. Hose, M. Karnstedt, A. Polleres, K. Sattler, and J. Umbrich. Data
summaries for on-demand queries over linked data. In World Wide Web,
2010.
[2] G. Ladwig and T. Tran. Linked Data Query Processing Strategies. In ISWC,
2010.
[3] M. Wu, L. Berti-Equille, A. Marian, C. M. Procopiuc, and D. Srivastava.
Processing top-k join queries. Proc. VLDB Endow., pages 860–870, 2010.
[4] A. Harth, S. Kinsella, and S. Decker. Using naming authority to rank data and
ontologies for web search. In ISWC, pages 277–292, 2009.
[5] G. Ladwig and T. Tran. SIHJoin: Querying Remote and Local Linked Data. In
ESWC, 2011.
[6] K. Schnaitter and N. Polyzotis. Optimal algorithms for evaluating rank joins in
database systems. ACM Trans. Database Syst., 35:6:1–6:47, 2010.
24. BACKUP SLIDES
25. Early Pruning of Partial Results
Motivation: Top-k join processing can be quite costly in terms of
memory consumption
Idea: Prune such partial query results that cannot contribute to
a final top-k result
Entity query: ?x ex:song ?y . ?x foaf:name ?z .
Currently known top-2 results (Rank, Query Bindings – Output Queue):
6 ex:help foaf:name "Help!". ex:help ex:song "Help!" .
4 ex:sgt_pepper foaf:name "Sgt. Pepper". ex:sgt_pepper ex:song "Lucy".
Currently known partial results (Rank, Triple Pattern Binding):
1 ex:sgt_pepper ex:song "Getting Better".
Upper bound for triple bindings of ?x foaf:name ?z: 3
Maximal score of the partial result: 3 + 1 = 4 ≤ 4 (smallest current top-2 score)
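The pruning test from this slide can be sketched as a single predicate. The function name `can_prune` and its parameters are hypothetical, and scores are assumed to be summed over triple patterns.

```python
def can_prune(partial_score, remaining_upper_bounds, top_k_scores, k):
    """Decide whether a partial result can be dropped from the buffers.

    partial_score: score of the triple pattern bindings joined so far.
    remaining_upper_bounds: upper bound per still unevaluated pattern.
    top_k_scores: scores of the currently known best results.
    """
    if len(top_k_scores) < k:
        return False  # fewer than k results known: nothing is safe to drop
    best_possible = partial_score + sum(remaining_upper_bounds)
    # Prune if even the best completion cannot beat the weakest top-k result.
    return best_possible <= min(top_k_scores)

# Slide example: partial score 1, bound 3 for ?x foaf:name ?z,
# current top-2 scores 6 and 4.
prune_getting_better = can_prune(1, [3], [6, 4], k=2)
```

Here 1 + 3 = 4 does not exceed the smallest top-2 score of 4, so the partial result for "Getting Better" is pruned.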
Editor's notes
Introduction:
* Challenges in Current Linked Data Query Processing
* Processing of Ranked Linked Data
* Our Contributions
Top-k:
* Top-K Query Processing in a Linked Data Setting
* Improving the Threshold Estimation
* Eager Pruning of Partial Results
* Special case of federated query processing
* Only HTTP lookups are available for data access
* Entire sources have to be retrieved
* Provides strategies for computing only the k top-ranked results
* Other (less relevant) results are not materialized
* For computing the top-1 result, no data from Src. 2 is needed.
* Tighter threshold estimation and early partial result pruning
* For instance, scores for triples can be obtained through PageRank-inspired ranking [4]
* However, no triples are indexed (i.e., each source must be scanned)
* Join inputs must be accessible in descending score order
* We store the min/max triple score per source, and allow sources to be accessed in descending score order (via a scheduling strategy)
* Given our ranking function, sorted access, and source index, we can employ a push-based rank join
* The threshold allows us to estimate the scores of the unseen query result bindings and terminate early
* Push-based symmetric hash join operator (shj)
* Rank-join operator with corner-bound (rj-cc) [6]
* Rank-join operator with tighter corner-bound and early pruning (rj-tc)
* All systems use push-based join processing and left-deep join trees
* Due to network latency issues, sources were downloaded and Linked Data access was simulated on one single machine
* Differences are due to less input data being retrieved
* Some queries (e.g., q10 or q20) perform equally because the result set is too small (i.e., all data had to be retrieved)
* Differences between rj-cc and rj-tc do not show properly in (a) because the evaluation ran on a local machine
* The outlier q19 is due to an implementation issue
* Q9: early pruning saved 8% of the buffered data. However, this had no "real" impact on efficiency; the main factor here is the number of sources to be retrieved
(b) Average number of sources (different k, d = n).
(c) Average evaluation time (different k, d = n).
(d) Average evaluation time (different n, k = 10).
(e) Average evaluation time with varying number of triple patterns (k = 1, d = n).
* ("seen" and "output" buffers)
* That is, any partial result whose (partial) score, together with the maximal possible score for the unevaluated query part, is ≤ the currently smallest top-k score