Federated query engines allow data consumers to query linked data from SPARQL endpoints. Replicating data fragments from different sources makes it possible to reorganize data to better fit the federated query processing needs of data consumers. However, existing federated query engines poorly support replication. In this paper, we propose a replication-aware federated query engine that extends the state-of-the-art federated query engines ANAPSID and FedX with Fedra, a source selection strategy that approximates the source selection problem with fragment replication (SSP-FR). For a given set of endpoints with replicated fragments and a SPARQL query, the problem is to find the endpoints to contact in order to minimize the number of tuples transferred from the endpoints to the federated query engine. We devise the Fedra source selection algorithm that approximates SSP-FR. We implement Fedra in FedX and ANAPSID and empirically evaluate their performance. Experimental results suggest that Fedra efficiently solves SSP-FR, reducing the number of selected SPARQL endpoints as well as the size of query intermediate results.
Federated SPARQL Query Processing With Replicated Fragment
1. Federated SPARQL Queries
Processing with Replicated
Fragments
Gabriela Montoya1
Hala Skaf-Molli1
Pascal Molli1
Maria-Esther Vidal2
1LINA – Nantes University, France
({first.last}@univ-nantes.fr)
2Universidad Simón Bolívar, Venezuela
(mvidal@ldc.usb.ve)
October 13, 2015
ISWC2015
2. Federated Query Engines poorly support
replication
Federated query engines let clients
consume linked data without
moving the data.
Unfortunately, in the presence of
replication, the performance of
federated query engines
degrades.
3. Replicated data decreases federated query
engines performance
select distinct ?p ?m ?n ?d where {
  ?p dbprop:name ?m .
  ?p dbprop:nationality ?n .
  ?p dbprop:doctoralAdvisor ?d
}
#DBpedia Replicas   FedX¹ Execution Time (ms)
1                   1,392
2                   215,907
¹ Schwarte et al. FedX: Optimization techniques for federated query processing on
linked data. In ISWC 2011.
4. Users may replicate only the fragments
relevant for their queries
A triple pattern fragment is defined by the
dataset it has been replicated from and a
CONSTRUCT query with a triple pattern.
Fragment with the doctoral advisors triples:
<http://dbpedia.org/sparql, CONSTRUCT
WHERE { ?p dbprop:doctoralAdvisor ?a }>
Replicating fragments from different datasets
provides new data localities and opens new
opportunities for optimization.
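The fragment definition above can be sketched as a small data structure. This is only an illustration: the class name `Fragment` and its field names are assumptions, not part of the slides, and the `dbprop` prefix is left undeclared exactly as in the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A triple pattern fragment: <source dataset, triple pattern> (hypothetical class)."""
    source: str          # endpoint the fragment was replicated from
    triple_pattern: str  # the single triple pattern of the CONSTRUCT query

    def construct_query(self) -> str:
        # CONSTRUCT WHERE short form: the template is the pattern itself.
        return "CONSTRUCT WHERE { " + self.triple_pattern + " }"


# The doctoral advisors fragment from the slide.
advisors = Fragment("http://dbpedia.org/sparql", "?p dbprop:doctoralAdvisor ?a")
print(advisors.construct_query())
```

Making the fragment immutable and hashable lets it serve as a key in a fragments mapping from fragment to endpoints.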
5. Existing public endpoints may not be the
best choice for federated queries
Figure: endpoints DBpedia, LinkedMDB and C1, and a client issuing the query below.
select distinct * where {
  ?director dbo:nationality ?nat .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie linkedmdb:genre ?genre }
6. Endpoints that replicate fragments give
rise to new data localities
Figure: endpoints DBpedia, LinkedMDB and C1, and a client issuing the query below.
select distinct * where {
  ?director dbo:nationality ?nat .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie linkedmdb:genre ?genre }
Fragments replicated at C1:
  ?d dbo:nationality ?n
  ?f dbo:director ?d
  ?m owl:sameAs ?f
  ?m linkedmdb:genre ?g
7. Selecting all the sources leads to poor
engine performance
Figure: endpoints DBpedia, LinkedMDB and C1, and a client issuing the query below.
select distinct * where {
  ?director dbo:nationality ?nat .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie linkedmdb:genre ?genre }
Fragments replicated at C1:
  ?d dbo:nationality ?n
  ?f dbo:director ?d
  ?m owl:sameAs ?f
  ?m linkedmdb:genre ?g
Triples to transfer, per source selection (s1 to s5):
Endpoint             s1        s2       s3       s4       s5
DBpedia              166,177   3,229    3,229    0        0
LinkedMDB            76,180    13,430   0        13,430   0
C1                   242,357   0        13,430   3,229    48
Execution Time (s)   20.22     2.64     2.50     2.79     0.65
8. Selecting non-overlapping data may not
be good enough
Figure: endpoints DBpedia, LinkedMDB and C1, and a client issuing the query below.
select distinct * where {
  ?director dbo:nationality ?nat .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie linkedmdb:genre ?genre }
Fragments replicated at C1:
  ?d dbo:nationality ?n
  ?f dbo:director ?d
  ?m owl:sameAs ?f
  ?m linkedmdb:genre ?g
Triples to transfer, per source selection (s1 to s5):
Endpoint             s1        s2       s3       s4       s5
DBpedia              166,177   3,229    3,229    0        0
LinkedMDB            76,180    13,430   0        13,430   0
C1                   242,357   0        13,430   3,229    48
Execution Time (s)   20.22     2.64     2.50     2.79     0.65
11. Selecting sources capable of evaluating joins
reduces the number of transferred tuples
Figure: endpoints DBpedia, LinkedMDB and C1, and a client issuing the query below.
select distinct * where {
  ?director dbo:nationality ?nat .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie linkedmdb:genre ?genre }
Fragments replicated at C1:
  ?d dbo:nationality ?n
  ?f dbo:director ?d
  ?m owl:sameAs ?f
  ?m linkedmdb:genre ?g
Triples to transfer, per source selection (s1 to s5):
Endpoint             s1        s2       s3       s4       s5
DBpedia              166,177   3,229    3,229    0        0
LinkedMDB            76,180    13,430   0        13,430   0
C1                   242,357   0        13,430   3,229    48
Execution Time (s)   20.22     2.64     2.50     2.79     0.65
12. The best choice transfers fewer intermediate
results
Figure: endpoints DBpedia, LinkedMDB and C1, and a client issuing the query below.
select distinct * where {
  ?director dbo:nationality ?nat .
  ?film dbo:director ?director .
  ?movie owl:sameAs ?film .
  ?movie linkedmdb:genre ?genre }
Triples to transfer, per source selection (s1 to s5):
Endpoint             s1        s2       s3       s4       s5
DBpedia              166,177   3,229    3,229    0        0
LinkedMDB            76,180    13,430   0        13,430   0
C1                   242,357   0        13,430   3,229    48
Execution Time (s)   20.22     2.64     2.50     2.79     0.65
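To compare the five source selections, the per-endpoint figures in the table can be summed per column. The sketch below only restates the table; reading the total as the sum over the three endpoints is an assumption about how the figures combine, and the variable names are illustrative.

```python
# Triples to transfer per endpoint, one entry per source selection s1..s5
# (values copied from the table above).
transfers = {
    "DBpedia":   [166_177, 3_229, 3_229, 0, 0],
    "LinkedMDB": [76_180, 13_430, 0, 13_430, 0],
    "C1":        [242_357, 0, 13_430, 3_229, 48],
}
selections = ["s1", "s2", "s3", "s4", "s5"]

# Total triples transferred under each selection (assumed: simple sum).
totals = {
    name: sum(row[i] for row in transfers.values())
    for i, name in enumerate(selections)
}
best = min(totals, key=totals.get)

for name in selections:
    print(name, totals[name])
print("fewest transferred triples:", best)
```

Under this reading, s2, s3 and s4 each transfer 16,659 triples, s1 transfers 484,714, and s5 transfers only 48, which matches the ranking of the execution times.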
13. ???
Figure: endpoints DBpedia and LinkedMDB, with replicated fragments at C1, C2 and C3.
select distinct
?director ?nat ?genre where {
  ?director dbo:nationality ?nat . (tp1)
  ?film dbo:director ?director . (tp2)
  ?movie owl:sameAs ?film . (tp3)
  ?movie linkedmdb:genre ?genre } (tp4)
Endpoint   Replicated fragments   Relevant triple patterns
C1         f2, f4, f6             tp1, tp2, tp4
C2         f2, f3, f5, f7         tp1, tp2, tp3, tp4
C3         f2, f3, f4             tp2, tp3, tp4
F    CONSTRUCT WHERE { %s% }
f2   ?film dbo:director ?director
f3   ?movie owl:sameAs ?film
f4   ?movie linkedmdb:genre ?genre
f5   ?movie linkedmdb:genre film_genre:14
f6   ?director dbo:nationality dbr:France
f7   ?director dbo:nationality dbr:United_Kingdom
14. Selecting fewer endpoints does not always
produce fewer intermediate results
Triple pattern                                 Fragment   Endpoints
?director dbo:nationality dbr:France           f5         C2, C4
?film dbo:director ?director                   f2         C3, C4, C5, C6

?director dbo:nationality dbr:United_Kingdom   f6         C2, C5
?film dbo:director ?director                   f2         C3, C4, C5, C6

?director dbo:nationality dbr:United_States    f7         C2, C6
?film dbo:director ?director                   f2         C3, C4, C5, C6
17. Source Selection Problem with Fragment
Replication (SSP-FR)
Given a SPARQL query and a set of SPARQL
endpoints with replicated fragments, choose the
SPARQL endpoints to contact for each query triple
pattern in order to produce a complete query
answer and transfer the minimum amount of
data
18. Fedra performs a BGP aware source
selection, and exploits fragment localities
to reduce intermediate results
1. Fedra selects relevant fragments per triple
pattern and prunes fragments using query
containment.
2. Multiple relevant fragments → UNION
Reduction: try to reduce to one fragment.
3. One relevant fragment → BGP Reduction:
reduce to set covering problem to evaluate in
as few endpoints as possible.
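Step 1 above prunes fragments using query containment. A minimal sketch of that idea for single triple patterns follows; it is not Fedra's implementation. The function names are hypothetical, the containment test ignores prefixes and blank nodes, and the dataset of the genre-14 fragment is assumed to be linkedmdb.

```python
def is_var(term):
    return term.startswith("?")


def contains(general, specific):
    """True if every triple matching `specific` also matches `general`."""
    binding = {}
    for g, s in zip(general, specific):
        if is_var(g):
            # A repeated variable in `general` must map to a single term.
            if binding.setdefault(g, s) != s:
                return False
        elif g != s:
            return False
    return True


def prune_contained(fragments):
    """Drop fragments strictly contained in another fragment of the same dataset."""
    kept = {}
    for name, (dataset, pattern) in fragments.items():
        dominated = any(
            other != name
            and ds == dataset
            and contains(pat, pattern)
            and not contains(pattern, pat)
            for other, (ds, pat) in fragments.items()
        )
        if not dominated:
            kept[name] = (dataset, pattern)
    return kept


# Two fragments relevant to "?movie linkedmdb:genre ?genre".
relevant = {
    "f4": ("linkedmdb", ("?movie", "linkedmdb:genre", "?genre")),
    "f5": ("linkedmdb", ("?movie", "linkedmdb:genre", "film_genre:14")),  # dataset assumed
}
print(sorted(prune_contained(relevant)))
```

Here the all-genres fragment contains the genre-14 fragment, so only the former needs to be kept for that triple pattern.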
19. BGP Reduction
BGP   Triple Pattern                     Relevant Fragments   Relevant Endpoints
tp1   ?director dbo:nationality ?nat     f1                   {C1}
tp2   ?film dbo:director ?director       f2                   {C1, C3}
tp3   ?movie owl:sameAs ?film            f3                   {C1, C2}
tp4   ?movie linkedmdb:genre ?genre      f4                   {C2, C4}
f1: <dbpedia, ?director dbo:nationality ?nat>
f2: <dbpedia, ?film dbo:director ?director>
f3: <linkedmdb, ?movie owl:sameAs ?film>
f4: <linkedmdb, ?movie linkedmdb:genre ?genre>
fragments mapping = {(f1, {C1}), (f2, {C1, C3}),
                     (f3, {C1, C2}), (f4, {C2, C4})}
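The reduction to set covering can be illustrated with a greedy heuristic over the table above. This is a sketch under assumptions: the slides do not say which set covering heuristic Fedra uses, the function name is hypothetical, and the join structure of the BGP is ignored here.

```python
def greedy_bgp_reduction(relevant):
    """Pick one endpoint per triple pattern, covering all patterns greedily.

    relevant: dict mapping a triple pattern id to its set of relevant endpoints.
    """
    # Invert the mapping: which triple patterns can each endpoint evaluate?
    covers = {}
    for tp, endpoints in relevant.items():
        for endpoint in endpoints:
            covers.setdefault(endpoint, set()).add(tp)

    uncovered = set(relevant)
    selection = {}
    while uncovered:
        if not covers:
            raise ValueError("no relevant endpoint for: %s" % sorted(uncovered))
        # Endpoint covering the most uncovered patterns; ties go to the
        # alphabetically first endpoint so the result is deterministic.
        best = max(sorted(covers), key=lambda e: len(covers[e] & uncovered))
        newly = covers[best] & uncovered
        if not newly:
            raise ValueError("no relevant endpoint for: %s" % sorted(uncovered))
        for tp in newly:
            selection[tp] = best
        uncovered -= newly
    return selection


relevant = {
    "tp1": {"C1"},
    "tp2": {"C1", "C3"},
    "tp3": {"C1", "C2"},
    "tp4": {"C2", "C4"},
}
selection = greedy_bgp_reduction(relevant)
print(selection)
```

On this input the heuristic sends tp1, tp2 and tp3 to C1 and tp4 to C2, so two endpoints suffice; C4 would be an equally small choice for tp4.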
24. Union Reduction
BGP   Triple Pattern                     Relevant Fragments   Relevant Endpoints
tp1   ?director dbo:nationality ?nat     f5                   {C2}
                                         f6                   {C1}
tp2   ?film dbo:director ?director       f2                   {C1, C3}
tp3   ?movie owl:sameAs ?film            f3                   {C1, C2, C4}
tp4   ?movie linkedmdb:genre ?genre      f4                   {C2}
f2: <dbpedia, ?film dbo:director ?director>
f3: <linkedmdb, ?movie owl:sameAs ?film>
f4: <linkedmdb, ?movie linkedmdb:genre ?genre>
f5: <dbpedia, ?director dbo:nationality dbr:France>
f6: <dbpedia, ?director dbo:nationality dbr:United_Kingdom>
fragments mapping = {(f2, {C1, C2}), (f3, {C1}),
                     (f4, {C1}), (f5, {C2}), (f6, {C1})}
45. Conclusions
We addressed the problem of partial replication
in Linked Data.
Fedra performs a BGP aware source
selection, and exploits fragment localities to
reduce intermediate results.
Experimental results demonstrated that
Fedra achieves a great reduction of the
number of selected sources and the number of
transferred tuples by ANAPSID and FedX.
46. Perspectives
Take into account replicated fragments that
diverge.
Take into account preferences about the
endpoints.
Take advantage of replicated data for parallel
query processing.
48. Results in the next slides are from a different setup
where Virtuoso 7.2.1 endpoints were used, and each
endpoint was deployed in a different cluster machine
53. FEDRA computes alternative sources per
fragment
Figure: endpoints DBpedia, LinkedMDB and C1, with the following fragments at C1:
  ?d dbo:nationality ?n
  ?f dbo:director ?d
  ?m owl:sameAs ?f
  ?m linkedmdb:genre ?g
f1: <dbpedia, ?director dbo:nationality ?nat>
f2: <dbpedia, ?film dbo:director ?director>
f3: <linkedmdb, ?movie owl:sameAs ?film>
f4: <linkedmdb, ?movie linkedmdb:genre ?genre>
fragments mapping = {(f1, {DBpedia, C1}), (f2, {DBpedia, C1}),
                     (f3, {LinkedMDB, C1}), (f4, {LinkedMDB, C1})}
54. Alternative Endpoints per Fragment are
Considered
BGP   Triple Pattern                     Relevant Fragments   Relevant Endpoints
tp1   ?director dbo:nationality ?nat     f1                   {DBpedia, C1}
tp2   ?film dbo:director ?director       f2                   {DBpedia, C1}
tp3   ?movie owl:sameAs ?film            f3                   {LinkedMDB, C1}
tp4   ?movie linkedmdb:genre ?genre      f4                   {LinkedMDB, C1}
f1: <dbpedia, ?director dbo:nationality ?nat>
f2: <dbpedia, ?film dbo:director ?director>
f3: <linkedmdb, ?movie owl:sameAs ?film>
f4: <linkedmdb, ?movie linkedmdb:genre ?genre>
fragments mapping = {(f1, {DBpedia, C1}), (f2, {DBpedia, C1}),
                     (f3, {LinkedMDB, C1}), (f4, {LinkedMDB, C1})}
57. It may be necessary to simplify to get the
best selection
BGP   Triple Pattern                     Relevant Fragments   Relevant Endpoints
tp1   ?director dbo:nationality ?nat     f5                   {C1, C2}
                                         f6                   {C1}
tp2   ?film dbo:director ?director       f2                   {C1, C3}
tp3   ?movie owl:sameAs ?film            f3                   {C1, C2, C4}
tp4   ?movie linkedmdb:genre ?genre      f4                   {C2}
f2: <dbpedia, ?film dbo:director ?director>
f3: <linkedmdb, ?movie owl:sameAs ?film>
f4: <linkedmdb, ?movie linkedmdb:genre ?genre>
f5: <dbpedia, ?director dbo:nationality dbr:France>
f6: <dbpedia, ?director dbo:nationality dbr:United_Kingdom>
fragments mapping = {(f2, {C1, C2}), (f3, {C1}),
                     (f4, {C1}), (f5, {C1, C2}), (f6, {C1})}
58. It may be necessary to simplify to get the
best selection
BGP   Triple Pattern                     Relevant Fragments   Relevant Endpoints
tp1   ?director dbo:nationality ?nat     f5                   {C1, C2}
                                         f6                   {C1}
tp2   ?film dbo:director ?director       f2                   {C1, C3}
tp3   ?movie owl:sameAs ?film            f3                   {C1, C2, C4}
tp4   ?movie linkedmdb:genre ?genre      f4                   {C2}
S = { tp1, tp2, tp3, tp4 }
CC1 = { tp1, tp2, tp3 }
CC2 = { tp3, tp4 }
CC3 = { tp2 }
CC4 = { tp3 }
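The sets CC1 to CC4 can be derived from the table in two steps. The sketch below is one reading of the simplification, not a statement of Fedra's algorithm: a triple pattern with several relevant fragments is reduced to the endpoints that hold all of them, and the result is then inverted into one candidate set per endpoint. Function and variable names are illustrative.

```python
def union_reduce(fragment_endpoints):
    """Endpoints holding every relevant fragment of one triple pattern.

    An empty result means no single endpoint suffices and a UNION over
    several endpoints remains necessary.
    """
    return set.intersection(*fragment_endpoints.values())


# Relevant fragments and their endpoints, per triple pattern (from the table).
table = {
    "tp1": {"f5": {"C1", "C2"}, "f6": {"C1"}},
    "tp2": {"f2": {"C1", "C3"}},
    "tp3": {"f3": {"C1", "C2", "C4"}},
    "tp4": {"f4": {"C2"}},
}

endpoints_per_tp = {tp: union_reduce(frags) for tp, frags in table.items()}

# Invert into the set covering instance: the patterns each endpoint can cover.
candidates = {}
for tp, endpoints in endpoints_per_tp.items():
    for endpoint in endpoints:
        candidates.setdefault(endpoint, set()).add(tp)

for endpoint in sorted(candidates):
    print(endpoint, sorted(candidates[endpoint]))
```

tp1 is dropped from the set for C2 because C2 holds f5 but not f6, which is why CC2 contains only tp3 and tp4.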
60. Statistical Significance of Data Redundancy Minimization
H0: Fedra selects the same number of sources as DAW does
Ha: Fedra selects fewer sources than DAW
Federation       p-value (ANAPSID)   p-value (FedX)
Diseasome        1.811e-08           8.371e-09
SWDF             2.28e-10            5.386e-11
LinkedMDB        5.082e-09           5.254e-11
Geocoordinates   1.301e-05           1.301e-05
WatDiv1          6.209e-07           1.006e-07
WatDiv100        1.563e-05           3.623e-07
For all the federations and engines, the obtained p-values² allow
rejecting the null hypothesis (H0) in favor of the
alternative hypothesis (Ha).
² The Wilcoxon signed rank test was computed using R
61. Statistical Significance of Data Transfer Minimization
H0: using sources selected by Fedra leads to transferring the same
number of tuples as using sources selected by DAW
Ha: using sources selected by Fedra leads to transferring fewer
tuples than using sources selected by DAW
Federation       p-value (ANAPSID)   p-value (FedX)
Diseasome        3.314e-12           2.821e-06
SWDF             1.472e-08           0.7621
LinkedMDB        2.368e-08           0.001274
Geocoordinates   1.921e-05           1.183e-06
WatDiv1          8.431e-05           7.246e-09
WatDiv100        9.986e-06           0.0001301
For all the federations and engines except SWDF+FedX, the
obtained p-values³ allow rejecting the null hypothesis (H0) in
favor of the alternative hypothesis (Ha).
³ The Wilcoxon signed rank test was computed using R
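The exception noted on this slide can be checked mechanically from the reported p-values. The sketch assumes a significance level of 0.05, which the slides do not state, and only re-reads the table; it does not recompute the Wilcoxon test.

```python
ALPHA = 0.05  # assumed significance level; not given in the slides

# p-values for data transfer minimization, copied from the table above.
p_values = {
    "Diseasome":      {"ANAPSID": 3.314e-12, "FedX": 2.821e-06},
    "SWDF":           {"ANAPSID": 1.472e-08, "FedX": 0.7621},
    "LinkedMDB":      {"ANAPSID": 2.368e-08, "FedX": 0.001274},
    "Geocoordinates": {"ANAPSID": 1.921e-05, "FedX": 1.183e-06},
    "WatDiv1":        {"ANAPSID": 8.431e-05, "FedX": 7.246e-09},
    "WatDiv100":      {"ANAPSID": 9.986e-06, "FedX": 0.0001301},
}

# Federation/engine pairs for which H0 cannot be rejected at ALPHA.
not_rejected = [
    (federation, engine)
    for federation, row in p_values.items()
    for engine, p in row.items()
    if p >= ALPHA
]
print(not_rejected)
```

At this level the only pair for which H0 is not rejected is SWDF with FedX, in line with the statement on the slide.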
62. Statistical Significance of Source Selection Time Reduction
H0: using sources selected by Fedra leads to the same source
selection time as using sources selected by DAW
Ha: using sources selected by Fedra leads to lower source
selection time than using sources selected by DAW
Federation       p-value (ANAPSID)   p-value (FedX)
Diseasome        1                   < 2.2e-16
SWDF             1                   < 2.2e-16
LinkedMDB        1.284e-11           < 2.2e-16
Geocoordinates   < 2.2e-16           < 2.2e-16
WatDiv1          1                   < 2.2e-16
WatDiv100        < 2.2e-16           < 2.2e-16
For all the federations and engines except Diseasome+ANAPSID, SWDF+ANAPSID
and WatDiv1+ANAPSID, the obtained p-values⁴ allow rejecting the null hypothesis
(H0) in favor of the alternative hypothesis (Ha).
⁴ The Wilcoxon signed rank test was computed using R
63. Statistical Significance of Execution Time Reduction
H0: using sources selected by Fedra leads to the same
execution time as using sources selected by DAW
Ha: using sources selected by Fedra leads to lower execution
time than using sources selected by DAW
Federation       p-value (ANAPSID)   p-value (FedX)
Diseasome        0.0001547           < 2.2e-16
SWDF             1                   6.794e-06
LinkedMDB        < 2.2e-16           9.223e-15
Geocoordinates   < 2.2e-16           7.87e-13
WatDiv1          1                   6.315e-16
WatDiv100        5.392e-09           1.384e-14
For all the federations and engines except SWDF+ANAPSID and
WatDiv1+ANAPSID, the obtained p-values⁵ allow rejecting the null hypothesis (H0)
in favor of the alternative hypothesis (Ha).
⁵ The Wilcoxon signed rank test was computed using R
64. Source Selection may not be enough
?director dbo:nationality ?nat
?film dbo:director ?director
f2: <dbpedia, ?film dbo:director ?director>
f5: <dbpedia, ?director dbo:nationality dbr:France>
f6: <dbpedia, ?director dbo:nationality dbr:United_Kingdom>
fragments mapping = {(f2, {C1, C2}), (f5, {C1}),
                     (f6, {C2})}
65. Source Selection may not be enough
?director dbo:nationality ?nat   f5
?director dbo:nationality ?nat   f6
?film dbo:director ?director     f2
f2: <dbpedia, ?film dbo:director ?director>
f5: <dbpedia, ?director dbo:nationality dbr:France>
f6: <dbpedia, ?director dbo:nationality dbr:United_Kingdom>
fragments mapping = {(f2, {C1, C2}), (f5, {C1}),
                     (f6, {C2})}
66. Source Selection may not be enough
?director dbo:nationality ?nat   f5   {C1}
?director dbo:nationality ?nat   f6   {C2}
?film dbo:director ?director     f2   {C1, C2}
f2: <dbpedia, ?film dbo:director ?director>
f5: <dbpedia, ?director dbo:nationality dbr:France>
f6: <dbpedia, ?director dbo:nationality dbr:United_Kingdom>
fragments mapping = {(f2, {C1, C2}), (f5, {C1}),
                     (f6, {C2})}