DAW: Duplicate-AWare Federated Query
Processing over the Web of Data
Muhammad Saleem1, Axel-Cyrille Ngonga Ngomo1, Josiane Xavier Parreira2, Helena F. Deus3, Manfred Hauswirth2
1Agile Knowledge Engineering and Semantic Web (AKSW), University of Leipzig, Germany
lastname@informatik.uni-leipzig.de
2Digital Enterprise Research Institute (DERI), National University of Ireland, Galway
firstname.lastname@deri.org
International Semantic Web Conference (ISWC), October 21-25, 2013, Sydney, Australia
Motivation
Federation engine components over four RDF sources (S1, S2, S3, S4):
• Parser: get individual triple patterns
• Source Selection: identify capable sources for individual triple patterns
• Federator Optimizer: generate an optimized sub-query execution plan
• Execute sub-queries
• Integrator: integrate sub-query results
Motivation
SELECT ?v1 ?v2
WHERE
{
  ?uri <p1> ?v1.  # Triple Pattern 1 (TP1)
  ?uri <p2> ?v2.  # Triple Pattern 2 (TP2)
}
Source selection algorithm over four RDF sources (S1, S2, S3, S4)
Triple pattern-wise source selection:
TP1 = {S1, S2, S3}
TP2 = {S1, S2, S4}
Total triple pattern-wise selected sources = 6
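The triple pattern-wise selection above can be sketched in a few lines of Python. This is only an illustration: the `CAPABILITIES` mapping and the function name `select_sources` are hypothetical, and the predicate-to-source assignment is chosen so that it reproduces the slide's outcome rather than taken from any real index.

```python
# Hypothetical capability index: which predicates each source can answer.
# The assignment is an assumption chosen to mirror the slide's result.
CAPABILITIES = {
    "S1": {"p1", "p2"},
    "S2": {"p1", "p2"},
    "S3": {"p1"},
    "S4": {"p2"},
}


def select_sources(triple_patterns, capabilities):
    """Return, per triple pattern, the sorted list of sources offering its predicate."""
    selection = {}
    for name, predicate in triple_patterns.items():
        selection[name] = sorted(
            source for source, preds in capabilities.items() if predicate in preds
        )
    return selection


patterns = {"TP1": "p1", "TP2": "p2"}
selection = select_sources(patterns, CAPABILITIES)
total_selected = sum(len(sources) for sources in selection.values())
print(selection)
print(total_selected)
```

Every source that offers the predicate is selected, whether or not its triples duplicate those of another source. That is why six sub-queries are issued here.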
Motivation
Triple pattern-wise source selection and skipping
• Retrieved results for TP1 (?uri <p1> ?v1): capable sources S1, S2, S3
• Retrieved results for TP2 (?uri <p2> ?v2): capable sources S1, S2, S4
• Min. number of new triples (threshold) = 20
• Total triple pattern-wise selected sources = 4
• Total triple pattern-wise skipped sources = 2
Problem Statement
• Data duplication in Linked Open Data (LOD) datasets
– E.g. DrugBank and Neurocommons are duplicated in the DERI Health Care and Life Sciences Knowledge Base
• Retrieving duplicate results increases query execution time and network traffic
• How can the overlap between data sources be estimated before sub-queries are federated?
Sketches
• Data structures that provide dataset summaries
– Min-wise Independent Permutations (MIPs)
– Bloom filters
• Estimate the overlap among different ID sets
• MIPs provide a good trade-off between estimation error and space requirements
• MIPs of different lengths can be compared
• Sketches alone cannot be used in SPARQL federation
– SPARQL queries are highly selective when the subject, predicate, or object is bound in a triple pattern
Min-wise Independent Permutations
• Apply N permutations to all IDs of the ID set; each permutation is a linear hash function
$$h_i(x) = (a_i \cdot x + b_i) \bmod U$$
– E.g. h1 = (7x + 3) mod 51, h2 = (5x + 6) mod 51, ..., hN = (3x + 9) mod 51
• Create the MIP vector from the minima of the permutations (8, 9, ..., 9 in the figure)
MIPs estimated operations
• Triples T1(s,p,o), ..., T6(s,p,o) are hashed as h(concat(s,o)) to obtain the IDs summarized by the MIP vectors VA and VB
• VA = (8, 9, 30, 24, 36, 9)
• VB = (8, 24, 20, 48, 36, 13)
• Union (VA, VB) = (8, 9, 20, 24, 36, 9)
• Resemblance (VA, VB) = 2/6 => 0.33
• Overlap (VA, VB) = 0.33*(6+6) / (1+0.33) => 3
$$\mathrm{Resemblance}(S_A, S_B) = \frac{|S_A \cap S_B|}{|S_A \cup S_B|} \approx \frac{|V_A \cap V_B|}{N}$$
$$\mathrm{Overlap}(S_A, S_B) \approx \frac{\mathrm{Resemblance}(V_A, V_B) \times (|S_A| + |S_B|)}{\mathrm{Resemblance}(V_A, V_B) + 1}$$
$$\text{Error estimation} = O(1/\sqrt{N})$$
$$S'_i = \begin{cases} S_i & \text{if neither subject nor object is bound} \\ S_i \times \mathit{avgSbjSel}_S(p) & \text{if subject is bound} \\ S_i \times \mathit{avgObjSel}_S(p) & \text{if object is bound} \end{cases}$$
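The MIP operations on this slide can be sketched as follows. The function names, the choice of SHA-1 for the unspecified hash h, and the position-wise comparison over the shorter vector are assumptions made for illustration; only the formulas and the example vectors VA and VB come from the slide.

```python
import hashlib


def triple_id(subject, obj):
    """Integer ID for a triple of a fixed predicate, i.e. h(concat(s, o)).
    SHA-1 is an arbitrary stand-in for the hash h, which the slide does not name."""
    digest = hashlib.sha1((subject + obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def mip_vector(ids, hash_params, modulus):
    """Keep one minimum per linear hash h_i(x) = (a_i * x + b_i) mod U."""
    return [min((a * x + b) % modulus for x in ids) for a, b in hash_params]


def resemblance(va, vb):
    """Fraction of positions where both vectors hold the same minimum.
    Vectors of different lengths are compared on the shorter prefix."""
    n = min(len(va), len(vb))
    matches = sum(1 for x, y in zip(va, vb) if x == y)
    return matches / n


def overlap(va, vb, size_a, size_b):
    """Estimated number of shared elements: R * (|S_A| + |S_B|) / (R + 1)."""
    r = resemblance(va, vb)
    return r * (size_a + size_b) / (r + 1)


def union(va, vb):
    """MIP vector of the union of both sets: position-wise minimum."""
    return [min(x, y) for x, y in zip(va, vb)]


# Example vectors from the slide.
VA = [8, 9, 30, 24, 36, 9]
VB = [8, 24, 20, 48, 36, 13]
print(round(resemblance(VA, VB), 2))
print(round(overlap(VA, VB, 6, 6), 2))
print(union(VA, VB))
```

Two of the six positions agree, so the resemblance is 2/6. With both sets of size 6, the overlap estimate is about 3, which matches the slide.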
DAW
• A combination of MIPs with compact data summaries
• Uses average selectivity values for bound subjects and objects
• Can be combined with any existing SPARQL endpoint federation system
• Can be used for partial result retrieval
DAW Index
[] a sd:Service ;
   sd:endpointUrl <http://localhost:8890/sparql> ;
   sd:capability [
      sd:predicate diseasome:name ;
      sd:totalTriples 147 ;
      sd:avgSbjSel "0.0068" ;
      sd:avgObjSel "0.0069" ;
      sd:MIPs "-6908232 -7090543 -6892373 -7064247 ..." ;
   ] ;
   sd:capability [
      sd:predicate diseasome:chromosomalLocation ;
      sd:totalTriples 160 ;
      sd:avgSbjSel "0.0062" ;
      sd:avgObjSel "0.0072" ;
      sd:MIPs "-7056448 -7056410 -6845713 -6966021 ..." ;
   ] ;
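As a worked example of how the index values could feed the size adjustment S'_i, here is a minimal sketch. The function name `estimated_size` is hypothetical. The slides do not say what happens when both subject and object are bound, so checking the subject first is an arbitrary assumption.

```python
def estimated_size(total_triples, avg_sbj_sel, avg_obj_sel,
                   subject_bound=False, object_bound=False):
    """Scale a predicate's triple count by the average selectivity of the bound position.
    Assumption: if both positions are bound, the subject selectivity is applied."""
    if subject_bound:
        return total_triples * avg_sbj_sel
    if object_bound:
        return total_triples * avg_obj_sel
    return float(total_triples)


# Values of the diseasome:name capability from the index excerpt.
print(estimated_size(147, 0.0068, 0.0069))                     # no bound position
print(estimated_size(147, 0.0068, 0.0069, subject_bound=True))
print(estimated_size(147, 0.0068, 0.0069, object_bound=True))
```

For diseasome:name, a bound subject reduces the expected result size from 147 triples to roughly one triple.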
Triple Pattern-wise source ranking and skipping
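The slide content for this step did not survive extraction, so the following is only a rough sketch of what threshold-based ranking and skipping could look like, not the authors' algorithm. It assumes a greedy strategy, and it uses exact Python sets of triple IDs where DAW would use MIP-based estimates. The name `rank_and_skip` and the sample data are hypothetical.

```python
def rank_and_skip(sources, threshold):
    """Greedily rank sources by the number of new triples they contribute.

    sources: dict mapping source name -> set of triple IDs (exact sets stand in
             for MIP sketches; this is a simplification).
    threshold: minimum number of new triples a source must add to be queried.
    Returns (selected, skipped) as lists of source names.
    """
    remaining = dict(sources)
    selected, skipped = [], []
    covered = set()
    while remaining:
        # Pick the source with the most not-yet-covered triples (ties: name order).
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        gain = len(remaining[best] - covered)
        if gain < threshold:
            # No remaining source adds enough new triples: skip them all.
            skipped.extend(sorted(remaining))
            break
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, skipped


example = {
    "S1": set(range(0, 100)),
    "S2": set(range(90, 105)),   # mostly duplicates of S1 and S3
    "S3": set(range(100, 150)),
}
selected, skipped = rank_and_skip(example, threshold=20)
print(selected, skipped)
```

With a threshold of 20 new triples, S2 is skipped because everything it holds is already covered once S1 and S3 have been selected.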
Evaluation Setup

Dataset      Total Size (MB)  Index Size (bytes)  No. of Slices  Discrepancy  No. of Dup. Slices  Index Gen. Time (sec)
Diseasome    18.62            0.17                10             1500         1                   4
Geo          274.14           1.63                10             50000        2                   133
LinkedMDB    448.93           1.66                10             100000       1                   201
Publication  39.07            0.2                 10             2500         1                   6

Queries Distribution

Dataset      STP  S-1  S-2  P-1  P-2  P-3  Total
Diseasome    5    5    5    4    5    2    26
Geo          5    5    5    -    -    -    15
LinkedMDB    5    -    -    -    -    -    5
Publication  5    5    5    7    7    4    33
Total        20   15   15   11   12   6    79

EndPoint  CPU (GHz)  RAM   Hard Disk
1         2.2, i3    4GB   300GB
2         2.9, i7    16GB  256GB SSD
3         2.6, i5    4GB   150GB
4         2.53, i5   4GB   300GB
5         2.3, i5    4GB   500GB
6         2.53, i5   4GB   300GB
7         2.9, i7    8GB   450GB
8         2.6, i5    8GB   400GB
9         2.6, i5    8GB   400GB
10        2.9, i7    16GB  500GB

• Slice generator tool [1] for random slicing and duplicates
• We have extended FedX, SPLENDID, and DARQ with DAW
[1] http://goo.gl/trjGSJ
Triple Pattern-wise sources skipped

DARQ
Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    14(35)   30(77)   40(107)  35(65)   65(125)  30(50)   214(459)   100%
Geo          22(40)   23(55)   37(101)  -        -        -        82(196)    99.99%
LinkedMDB    22(38)   -        -        -        -        -        22(38)     100%
Publication  9(30)    10(37)   15(86)   14(60)   21(120)  32(102)  101(435)   100%
Total        67(143)  63(169)  92(294)  49(125)  86(245)  62(152)  419(1128)

FedX and SPLENDID
Dataset      STP      S-1      S-2      P-1      P-2      P-3      Total      Recall
Diseasome    7(28)    30(77)   40(107)  35(65)   65(125)  30(50)   207(452)   100%
Geo          19(37)   23(55)   37(101)  -        -        -        79(193)    99.99%
LinkedMDB    15(31)   -        -        -        -        -        15(31)     100%
Publication  3(24)    10(37)   15(86)   14(60)   21(120)  32(102)  95(429)    100%
Total        44(120)  63(169)  92(294)  49(125)  86(245)  62(152)  396(1105)
FedX Extension with DAW
[Figure: execution time (sec) of FedX vs. DAW per query type: STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication; STP, S-1, S-2 for Geo Data; STP for Movie]

Overall performance evaluation

System  Diseasome         Publication       Geo Data          Movie             Overall
        Average  Gain %   Average  Gain %   Average  Gain %   Average  Gain %   Average  Gain %
FedX    2.44     18.79    1.48     -12.38   4.60     14.71    1.74     7.59     2.44     9.76
DAW     1.98              1.67              3.92              1.61              2.20
SPLENDID Extension with DAW
[Figure: execution time (sec) of SPLENDID vs. DAW per query type: STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication; STP, S-1, S-2 for Geo; STP for Movie]

Overall performance evaluation

System    Diseasome         Publication       Geo Data          Movie             Overall
          Average  Gain %   Average  Gain %   Average  Gain %   Average  Gain %   Average  Gain %
SPLENDID  3.78     19.48    2.18     -8.94    7.27     14.40    1.9      11.16    3.71     11.11
DAW       3.04              2.37              6.22              1.688             3.30
DARQ Extension with DAW
[Figure: execution time (sec) of DARQ vs. DAW per query type: STP, S-1, S-2, P-1, P-2, P-3 for Diseasome and Publication; STP, S-1, S-2 for Geo; STP for Movie]

Overall performance evaluation

System  Diseasome         Publication       Geo Data          Movie             Overall
        Average  Gain %   Average  Gain %   Average  Gain %   Average  Gain %   Average  Gain %
DARQ    8.27     23.34    5.26     6.14     23.44    16.31    1.96     13.88    9.59     16.46
DAW     6.34              4.94              19.62             1.688             8.01
Source Ranking vs Recall
[Figure: recall (%) against the number of ranked sources, Optimal vs. DAW, in two panels: Diseasome and Publication]
Conclusion and Future Work
• A sub-query can retrieve results that were already retrieved by another query
– Resources are wasted
– Query runtime is increased
– Extra traffic is generated
• Sketches alone cannot be used due to the expressive nature of SPARQL queries
• We used MIPs applied to RDF predicates along with compact data summaries
• Performance improvement
– FedX: 9.76 %
– SPLENDID: 11.11 %
– DARQ: 16.46 %
• Future work: explore the effect of MIP sizes and threshold values to find the optimal trade-off between execution time and recall
saleem@informatik.uni-leipzig.de
AKSW, University of Leipzig, Germany
