This document presents an empirical evaluation of seven RDF graph partitioning techniques: horizontal, subject-based, predicate-based, hierarchical, total communication volume minimization (TCV-Min), recursive-bisection, and minimal edgecut (Min-Edgecut) partitioning. The techniques were evaluated on partitioning time, query execution time on three systems (FedX, SemaGrow, Koral), the total number of distinct sources selected in a distributed environment, and overall ranking. The results show that TCV-Min partitioning had the smallest query runtimes overall and minimized the total number of sources selected, which generally leads to better runtime performance.
An Empirical Evaluation of RDF Graph Partitioning Techniques
1. An Empirical Evaluation of RDF Graph Partitioning Techniques
Adnan Akhter, Axel-Cyrille Ngonga Ngomo and Muhammad Saleem
EKAW, Nancy, France
November 14th, 2018
2. Motivation: Handling Big Datasets
* Image Reference https://lod-cloud.net/clouds/lod-cloud.svg
Linked Data has grown significantly
UniProt (Over 10 billion triples)
Linked TCGA (Over 20 billion triples)
Issues with bigger datasets
Performance
Availability
Security
Scalability
Maintenance
One of the solutions is partitioning
3. Motivation: Partitioning Techniques Used in RDF Clustered Triple Stores
System        Partitioning technique
AdPart        Subject hash + workload adaptive
AdPart-NA     Subject hash
CliqueSquare  Hybrid (Hash + VP)
DREAM         No partitioning; full replication
EAGRE         METIS
gStoreD       Partitioning agnostic
H-RDF-3X      METIS
H2RDF+        H-Base partitioner (range)
HadoopRDF     VP + predicate files on HDFS
PigSparql     Hash + Triple-based files
S2RDF         Extended vertical partitioning
Sedge         Subject hash
Sempala       VP
SHAPE         Semantic hash partitioning
SHARD         Hash
TriAD         Hash-based sharding
TriAD-SG      METIS + Horizontal sharding
WARP          METIS on query workload
* Table Reference https://bit.ly/2JUqH5H
Which partitioning technique leads to better performance?
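To make the compared techniques concrete, here is a minimal sketch of three of the simpler ones: horizontal, subject-based, and predicate-based partitioning. The slides do not give implementations, so the function names (`stable_hash`, `horizontal_partition`, `key_partition`), the use of MD5 as a stable hash, and the contiguous-chunk reading of horizontal partitioning are all assumptions made for illustration.

```python
import hashlib


def stable_hash(term: str) -> int:
    """Process-independent hash (assumption: any stable hash would do)."""
    return int(hashlib.md5(term.encode("utf-8")).hexdigest(), 16)


def horizontal_partition(triples, n):
    """Assumed reading: split the triple list into n contiguous chunks."""
    parts = [[] for _ in range(n)]
    size = -(-len(triples) // n)  # ceiling division
    for i, triple in enumerate(triples):
        parts[i // size].append(triple)
    return parts


def key_partition(triples, n, position):
    """Hash one term of each triple: position 0 = subject, 1 = predicate."""
    parts = [[] for _ in range(n)]
    for triple in triples:
        parts[stable_hash(triple[position]) % n].append(triple)
    return parts


# Hypothetical toy data, not taken from SWDF or DBpedia.
triples = [
    ("s1", "p1", "o1"),
    ("s1", "p2", "o2"),
    ("s2", "p1", "o3"),
    ("s3", "p3", "o1"),
    ("s2", "p2", "o4"),
]

horizontal = horizontal_partition(triples, 2)       # chunk sizes: 3 and 2
by_subject = key_partition(triples, 3, position=0)  # same subject, same partition
by_predicate = key_partition(triples, 3, position=1)
```

Subject-based partitioning keeps all triples of one subject together, which favors star-shaped queries, while predicate-based partitioning keeps all triples of one predicate together. The graph-based techniques (TCV-Min, Recursive-Bisection, Min-Edgecut) need a graph partitioner such as METIS and are not sketched here.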
8. Other Evaluation Setups (1 / 2)
Datasets
Semantic Web Dog Food (SWDF)
DBpedia
Benchmark queries (generated by FEASIBLE benchmark generator)
Basic Graph Pattern (BGP-only)
Fully Featured (FF)
Number of benchmark queries
300 queries of each type (BGP and fully featured) per dataset
1,200 queries in total
9. Other Evaluation Setups (2 / 2)
Number of partitions
10 partitions for each dataset (SWDF and DBpedia)
Timeout
Three minutes for each query
Performance metrics
Partitions generation time
Overall benchmark query execution time
Average query execution time
Number of timeout queries for each benchmark
The ranking score of the partitioning techniques
Total number of sources selected for the complete benchmark execution in a purely federated environment
Partitioning imbalance among the generated partitions
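The slides list partitioning imbalance as a metric without defining it. One way to quantify it, shown here purely as an assumption, is the Gini coefficient over partition sizes: 0 means all partitions hold the same number of triples, and values near 1 mean a few partitions hold almost everything. The function name `partition_imbalance` is hypothetical.

```python
def partition_imbalance(sizes):
    """Gini coefficient of partition sizes (assumed imbalance measure).

    For sizes sorted ascending as x_1 <= ... <= x_n:
        G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n
    """
    n = len(sizes)
    total = sum(sizes)
    if n == 0 or total == 0:
        return 0.0
    ordered = sorted(sizes)
    weighted = sum(rank * size for rank, size in enumerate(ordered, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical partition sizes for 10 partitions, as in the evaluation setup.
balanced = partition_imbalance([100] * 10)        # perfectly even split
skewed = partition_imbalance([0] * 9 + [1000])    # everything in one partition
```

With 10 partitions, the even split scores 0 and the fully skewed split scores about 0.9, the maximum this form of the coefficient reaches for n = 10.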
19. Conclusion
We presented an evaluation of seven RDF partitioning techniques
Our overall query runtime results suggest that TCV-Min leads to the smallest query runtimes, followed by Predicate-based, Horizontal, Recursive-Bisection, Subject-based, Hierarchical, and Min-Edgecut, in that order
The number of sources selected is directly related to query runtime
Thus, partitioning techniques that minimize the total number of sources selected generally lead to better runtime performance
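The link between partitioning and sources selected can be illustrated with a small sketch. It assumes a simplified, triple-pattern-wise source selection in which a partition counts as a relevant source for a pattern if it holds at least one matching triple. This is an illustration only, not the actual source selection of FedX or SemaGrow; `matches` and `sources_selected` are hypothetical helpers, and `None` stands for a variable.

```python
def matches(pattern, triple):
    """A pattern term of None is a variable and matches anything."""
    return all(p is None or p == t for p, t in zip(pattern, triple))


def sources_selected(query, partitions):
    """Sum, over all triple patterns, of the partitions holding a match."""
    total = 0
    for pattern in query:
        total += sum(
            1 for part in partitions if any(matches(pattern, t) for t in part)
        )
    return total


# Hypothetical data: the same four triples, split two different ways.
grouped_by_predicate = [
    [("s1", "p1", "o1"), ("s2", "p1", "o3")],
    [("s1", "p2", "o2"), ("s2", "p2", "o4")],
]
scattered = [
    [("s1", "p1", "o1"), ("s2", "p2", "o4")],
    [("s1", "p2", "o2"), ("s2", "p1", "o3")],
]

# Two triple patterns with bound predicates: ?s p1 ?o . ?s p2 ?o
query = [(None, "p1", None), (None, "p2", None)]

print(sources_selected(query, grouped_by_predicate))  # 2
print(sources_selected(query, scattered))             # 4
```

In this toy case the grouped layout needs one source per pattern, while the scattered layout forces every pattern to contact both partitions, so the query engine has to send and join twice as many sub-requests.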
20. This work was supported by grants from the EU H2020 Framework Program
provided for the project HOBBIT (GA no. 688227).