DAW: Duplicate-AWare Federated Query Processing over the Web of Data

Muhammad Saleem1, Axel-Cyrille Ngonga Ngomo1, Josiane Xavier Parreira2, Helena F. Deus3, Manfred Hauswirth2
1Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany
lastname@informatik.uni-leipzig.de
2Digital Enterprise Research Institute (DERI), National University of Ireland, Galway
firstname.lastname@deri.org
International Semantic Web Conference (ISWC), October 21-25, 2013, Sydney, Australia

Motivation
[Figure: architecture of a SPARQL query federation engine over four RDF sources S1, S2, S3, S4]
• Parser: get individual triple patterns
• Source Selection: identify capable sources against individual triple patterns
• Federator Optimizer: generate optimized sub-query execution plan
• Execute sub-queries
• Integrator: integrate sub-query results

Motivation
SELECT ?v1 ?v2
WHERE
{
?uri <p1> ?v1. # Triple Pattern 1 (TP1)
?uri <p2> ?v2. # Triple Pattern 2 (TP2)
}
[Figure: the source selection algorithm checks each triple pattern against the RDF sources S1, S2, S3, S4]
Triple pattern-wise source selection
TP1 = {S1, S2, S3}
TP2 = {S1, S2, S4}
Total triple pattern-wise selected sources = 6

Motivation
[Figure: retrieved results for TP1 (?uri <p1> ?v1) and retrieved results for TP2 (?uri <p2> ?v2)]
Triple pattern-wise source selection and skipping
TP1 = {S1, S2, S3}
TP2 = {S1, S2, S4}
Min. number of new triples (threshold) = 20
Total triple pattern-wise selected sources = 4
Total triple pattern-wise skipped sources = 2

Problem Statement
• Data is duplicated across LOD datasets
– E.g. DrugBank and Neurocommons are duplicated at the
DERI Health Care and Life Sciences Knowledge Base
• Retrieving duplicate results increases query
execution time and network traffic
• How can the overlap between data sources be
estimated before sub-queries are federated?

Sketches
• Data structures that provide dataset summaries
– Min-wise Independent Permutations (MIPs)
– Bloom filters
• Estimate overlap among different ID sets
• MIPs provide a good trade-off between estimation
error and space requirements
• MIPs of different lengths can be compared
• Sketches alone cannot be used in SPARQL
federation
– SPARQL queries become highly selective when the subject,
predicate, or object of a triple pattern is bound

Min-wise Independent Permutations
[Figure: N linear permutations, e.g. h1 = (7x + 3) mod 51, h2 = (5x + 6) mod 51, ..., hN = (3x + 9) mod 51, are applied to all IDs of the ID set (48, 24, 36, 18, 8, 20); the MIP vector (8, 9, ..., 9) is created from the minima of the permutations]

MIPs estimated operations
• Triples T1(s,p,o), ..., T6(s,p,o) are mapped to IDs with h(concat(s,o))
• Permutations: $h_i(x) = (a_i \cdot x + b_i) \bmod U$
• VA = (8, 9, 30, 24, 36, 9), VB = (8, 24, 20, 48, 36, 13)
• Union (VA, VB) = (8, 9, 20, 24, 36, 9)
• Resemblance (VA, VB) = 2/6 => 0.33
• Overlap (VA, VB) = 0.33*(6+6) / (1+0.33) => 3

$$\mathrm{Resemblance}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \approx \frac{|V_A \cap V_B|}{N}$$

$$\mathrm{Overlap}(S_A, S_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|S_A| + |S_B|)}{\mathrm{Resemblance}(V_A, V_B) + 1}$$

$$\text{Error estimation} = O(1/\sqrt{N})$$

$$S'_i = \begin{cases} S_i & \text{if neither subject nor object is bound} \\ S_i \times \mathrm{avgSbjSel}(S_p) & \text{if subject is bound} \\ S_i \times \mathrm{avgObjSel}(S_p) & \text{if object is bound} \end{cases}$$
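The MIP operations on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`make_permutations`, `mip_vector`, `union`, `resemblance`, `overlap`) are hypothetical, the random choice of the coefficients a_i and b_i is an assumption, and vectors of different lengths are compared on their first min(len) positions, which is one plausible reading of "MIPs of different lengths can be compared".

```python
import random


def make_permutations(n, modulus, seed=0):
    """Draw n linear permutations h_i(x) = (a_i * x + b_i) mod U (assumed random a_i, b_i)."""
    rng = random.Random(seed)
    return [(rng.randrange(1, modulus), rng.randrange(modulus)) for _ in range(n)]


def mip_vector(ids, perms, modulus):
    """Apply every permutation to all IDs and keep the minimum of each one."""
    return [min((a * x + b) % modulus for x in ids) for a, b in perms]


def union(va, vb):
    """MIP vector of the union of two sets: position-wise minimum."""
    return [min(a, b) for a, b in zip(va, vb)]


def resemblance(va, vb):
    """Fraction of positions where both vectors hold the same minimum."""
    n = min(len(va), len(vb))
    return sum(1 for i in range(n) if va[i] == vb[i]) / n


def overlap(va, vb, size_a, size_b):
    """Estimated |S_A intersect S_B| = R * (|S_A| + |S_B|) / (R + 1)."""
    r = resemblance(va, vb)
    return r * (size_a + size_b) / (r + 1)


# The two vectors shown on the slide.
VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]

print(union(VA, VB))                    # position-wise minima
print(round(resemblance(VA, VB), 2))    # two of six positions match
print(round(overlap(VA, VB, 6, 6), 2))  # about 3 shared elements
```

Because the union of two MIP vectors is again a MIP vector, a federation engine can keep one running vector for all sources selected so far and compare each further candidate against it.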

DAW
• A combination of MIPs with compact data
summaries
• Uses average selectivity values for bound
subjects and objects
• Can be combined with any existing SPARQL
endpoint federation system
• Can be used for partial result retrieval

DAW Index
[] a sd:Service ;
   sd:endpointUrl <http://localhost:8890/sparql> ;
   sd:capability [
      sd:predicate diseasome:name ;
      sd:totalTriples 147 ;
      sd:avgSbjSel "0.0068" ;
      sd:avgObjSel "0.0069" ;
      sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ; ] ;
   sd:capability [
      sd:predicate diseasome:chromosomalLocation ;
      sd:totalTriples 160 ;
      sd:avgSbjSel "0.0062" ;
      sd:avgObjSel "0.0072" ;
      sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ; ] ;
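As a rough illustration of how one index capability could feed the selectivity-adjusted size S'_i, the sketch below plugs in the diseasome:name entry. It assumes that S_i corresponds to sd:totalTriples for the predicate; the `Capability` class and the `estimated_size` function are hypothetical names, and the case where both subject and object are bound is left out because the slides do not define it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Capability:
    """One sd:capability entry of the index (field names are illustrative)."""
    predicate: str
    total_triples: int
    avg_sbj_sel: float
    avg_obj_sel: float


def estimated_size(cap: Capability, bound: Optional[str] = None) -> float:
    """Scale the triple count by the average selectivity of the bound position."""
    if bound is None:
        return float(cap.total_triples)
    if bound == "subject":
        return cap.total_triples * cap.avg_sbj_sel
    if bound == "object":
        return cap.total_triples * cap.avg_obj_sel
    raise ValueError("bound must be None, 'subject' or 'object'")


name_cap = Capability("diseasome:name", 147, 0.0068, 0.0069)
print(estimated_size(name_cap))             # unbound subject and object: 147 triples
print(estimated_size(name_cap, "subject"))  # bound subject: roughly one triple
```

With a bound subject the expected result size for diseasome:name drops from 147 triples to about one, which is why the raw sketch sizes alone would overestimate what a selective triple pattern returns.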

Triple Pattern-wise source ranking and skipping
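A minimal sketch of how ranking and skipping for a single triple pattern could work, assuming a greedy strategy: start with the source that has the largest estimated size, then repeatedly pick the source expected to contribute the most new triples, and skip every source whose estimated contribution falls below the threshold. The function `rank_and_skip`, the input layout, and the toy vectors are illustrative assumptions and not necessarily the exact procedure used in DAW.

```python
def resemblance(va, vb):
    """Fraction of positions where both MIP vectors agree."""
    n = min(len(va), len(vb))
    return sum(1 for i in range(n) if va[i] == vb[i]) / n


def rank_and_skip(sources, threshold):
    """sources maps a source name to (mip_vector, estimated_size).

    Returns (ranked, skipped) for one triple pattern.
    """
    remaining = dict(sources)
    # Start with the largest source (ties resolved by insertion order).
    first = max(remaining, key=lambda name: remaining[name][1])
    union_vec, union_size = remaining.pop(first)
    union_vec = list(union_vec)
    ranked, skipped = [first], []
    while remaining:
        best, best_new = None, -1.0
        for name, (vec, size) in remaining.items():
            r = resemblance(union_vec, vec)
            shared = r * (union_size + size) / (r + 1)  # estimated overlap
            new = max(0.0, size - shared)               # estimated new triples
            if new > best_new:
                best, best_new = name, new
        if best_new < threshold:
            # Even the best candidate adds too little, so skip all that remain.
            skipped.extend(remaining)
            break
        vec, _ = remaining.pop(best)
        union_vec = [min(a, b) for a, b in zip(union_vec, vec)]
        union_size += best_new
        ranked.append(best)
    return ranked, skipped


# Toy data: S2 duplicates S1, S3 is disjoint from both.
toy = {
    "S1": ([1, 2, 3, 4, 50, 60], 100),
    "S2": ([1, 2, 3, 4, 50, 60], 100),
    "S3": ([10, 20, 30, 40, 5, 6], 50),
}
print(rank_and_skip(toy, threshold=20))  # (['S1', 'S3'], ['S2'])
```

In the toy data S2 is never queried because everything it holds is already covered by S1, while S3 is kept because its estimated 50 new triples exceed the threshold of 20.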

Evaluation Setup
Datasets
Dataset      Total Size (MB)  Index Size (bytes)  No. of Slices  Discrepancy  No. of Dup. Slices  Index Gen. Time (sec)
Diseasome    18.62            0.17                10             1500         1                   4
Geo          274.14           1.63                10             50000        2                   133
LinkedMDB    448.93           1.66                10             100000       1                   201
Publication  39.07            0.2                 10             2500         1                   6

Queries Distribution
Dataset      STP  S-1  S-2  P-1  P-2  P-3  Total
Diseasome    5    5    5    4    5    2    26
Geo          5    5    5    -    -    -    15
LinkedMDB    5    -    -    -    -    -    5
Publication  5    5    5    7    7    4    33
Total        20   15   15   11   12   6    79

Endpoints
EndPoint  CPU (GHz)  RAM   Hard Disk
1         2.2, i3    4GB   300GB
2         2.9, i7    16GB  256GB SSD
3         2.6, i5    4GB   150GB
4         2.53, i5   4GB   300GB
5         2.3, i5    4GB   500GB
6         2.53, i5   4GB   300GB
7         2.9, i7    8GB   450GB
8         2.6, i5    8GB   400GB
9         2.6, i5    8GB   400GB
10        2.9, i7    16GB  500GB
• Slice generator tool [1] for random slicing and
duplicates
• We have extended FedX, SPLENDID, DARQ with
DAW
[1] http://goo.gl/trjGSJ

Triple Pattern-wise sources skipped
Entries give the number of sources skipped, with the total number of triple pattern-wise selected sources in parentheses.

DARQ
Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    14(35)   30(77)   40(107)  35(65)   65(125)  30(50)   214(459)   100%
Geo          22(40)   23(55)   37(101)  -        -        -        82(196)    99.99%
LinkedMDB    22(38)   -        -        -        -        -        22(38)     100%
Publication  9(30)    10(37)   15(86)   14(60)   21(120)  32(102)  101(435)   100%
Total        67(143)  63(169)  92(294)  49(125)  86(245)  62(152)  419(1128)

FedX and SPLENDID
Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    7(28)    30(77)   40(107)  35(65)   65(125)  30(50)   207(452)   100%
Geo          19(37)   23(55)   37(101)  -        -        -        79(193)    99.99%
LinkedMDB    15(31)   -        -        -        -        -        15(31)     100%
Publication  3(24)    10(37)   15(86)   14(60)   21(120)  32(102)  95(429)    100%
Total        44(120)  63(169)  92(294)  49(125)  86(245)  62(152)  396(1105)

FedX Extension with DAW
[Figure: execution time (sec) of FedX vs. DAW per query type: STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication; STP, S-1, S-2 for Geo Data; STP for Movie]

Overall performance evaluation (average execution time in sec)
          Diseasome  Publication  Geo Data  Movie  Overall
FedX      2.44       1.48         4.60      1.74   2.44
DAW       1.98       1.67         3.92      1.61   2.20
Gain %    18.79      -12.38       14.71     7.59   9.76

SPLENDID Extension with DAW
[Figure: execution time (sec) of SPLENDID vs. DAW for Diseasome, Publication, Geo, and Movie]

Overall performance evaluation (average execution time in sec)
          Diseasome  Publication  Geo Data  Movie  Overall
SPLENDID  3.78       2.18         7.27      1.9    3.71
DAW       3.04       2.37         6.22      1.688  3.30
Gain %    19.48      -8.94        14.40     11.16  11.11

DARQ Extension with DAW
[Figure: execution time (sec) of DARQ vs. DAW for Diseasome, Publication, Geo, and Movie]

Overall performance evaluation (average execution time in sec)
          Diseasome  Publication  Geo Data  Movie  Overall
DARQ      8.27       5.26         23.44     1.96   9.59
DAW       6.34       4.94         19.62     1.688  8.01
Gain %    23.34      6.14         16.31     13.88  16.46

Source Ranking vs Recall
[Figure: recall (%) vs. number of ranked sources, Optimal vs. DAW, one panel each for Diseasome and Publication]

Conclusion and Future Work
• A sub-query can retrieve results that were already retrieved by another sub-query
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• Sketches alone cannot be used due to the expressive nature of SPARQL queries
• We used MIPs applied to RDF predicates along with compact data summaries
• Performance improvement
– FedX: 9.76 %
– SPLENDID: 11.11 %
– DARQ: 16.46 %
• Future work: explore the effect of MIP sizes and threshold values to find the
optimal trade-off between execution time and recall
saleem@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany
