CostFed: Cost-Based Query Optimization for SPARQL Endpoint Federation

1. COSTFED: COST-BASED QUERY OPTIMIZATION FOR SPARQL ENDPOINT FEDERATION
Muhammad Saleem, Alexander Potocki, Tommaso Soru, Olaf Hartig, Axel-Cyrille Ngonga Ngomo
SEMANTiCS 2018, Vienna, Austria, September 13th, 2018
2. WHAT IS COSTFED?
• Federated SPARQL query processing engine
• Federation over multiple SPARQL endpoints
• Index-assisted
• Join-aware source selection
• Cost-based query planner
3. COSTFED QUERY PROCESSING
Figure: CostFed query-processing pipeline over four SPARQL endpoints (Endpoint 1–4), each exposing RDF data, with the CostFed index feeding source selection.
• Parsing: rewrite the query and extract the individual triple patterns
• Source selection: identify the capable sources for each individual triple pattern
• Federator/Optimizer: generate an optimized sub-query execution plan
• Execute the sub-queries against the selected endpoints
• Integrator: integrate the sub-query results
5. HOW IS COSTFED DIFFERENT?
• Skewed distribution of resources
• Construction of buckets
• Join-aware source selection using prefixes
• Effect of multi-valued predicates
• Cost-based query planning
6. SKEWED DISTRIBUTION OF RESOURCES AND CONSTRUCTION OF BUCKETS
• Bucket b0 (brown): store each resource along with its individual cardinality
• Bucket b1 (black): store each resource along with the avg. cardinality of all the resources in the bucket
• Bucket b2 (blue): only store the avg. cardinality of all the resources in the bucket
7. USING PREFIXES AND A TRIE DATA STRUCTURE
<wiwiss.fu-berlin.de/drugbank/resource/references/1002129>
<wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00201>
We use character-by-character insertion into a trie to compute common URI prefixes.
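Character-by-character trie insertion can be sketched like this, using the two DrugBank URIs from the slide; the rule for cutting the prefix at the first branching point is an illustrative assumption.

```python
# Hedged sketch: insert URIs character by character into a trie, then read
# off the shared prefix up to the first branching point. The branching rule
# used here is illustrative, not CostFed's exact criterion.
class TrieNode:
    def __init__(self):
        self.children = {}

def insert(root, uri):
    node = root
    for ch in uri:
        node = node.children.setdefault(ch, TrieNode())

def common_prefix(root):
    # Follow the unique path from the root until the trie branches.
    node, prefix = root, []
    while len(node.children) == 1:
        ch, child = next(iter(node.children.items()))
        prefix.append(ch)
        node = child
    return "".join(prefix)

root = TrieNode()
insert(root, "wiwiss.fu-berlin.de/drugbank/resource/references/1002129")
insert(root, "wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00201")
print(common_prefix(root))  # wiwiss.fu-berlin.de/drugbank/resource/
```

For the two example URIs the trie only branches after `resource/` (where `references/…` and `drugs/…` diverge), so that path is the stored prefix.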
8. COSTFED INDEX
Per capability (predicate), the index stores:
• The predicate as the capability
• The subject and object prefixes
• Bucket b0: subject resources along with their individual cardinalities; object resources along with their individual cardinalities
• Bucket b1: subject resources along with their avg. cardinality; object resources along with their avg. cardinality
• Bucket b2: only the avg. selectivity of all the subjects and objects
13. TRIPLE PATTERN CARDINALITY ESTIMATION
• T(p,D): total number of triples with predicate p in dataset D
• avgSS(p,D), avgOS(p,D): average subject resp. object selectivities of p in D in the corresponding bucket
• tS(D), tO(D): total number of distinct subjects resp. distinct objects in D
• tT(D): total number of triples in D
• R(tp): set of all relevant sources for tp
• b stands for bound
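Using the statistics defined above, a cardinality estimator for a single triple pattern might look like the following; how the statistics are combined here is an illustrative assumption (the exact CostFed formulas are given in the paper).

```python
# Hedged sketch: estimate the cardinality of a triple pattern (s, p, o) in
# dataset D from the index statistics named on the slide. "b" marks a bound
# position, "?" an unbound one. The combinations are illustrative.
def estimate_card(s, o, T_pD, avgSS_pD, avgOS_pD):
    # T_pD: total triples with predicate p in D
    # avgSS_pD / avgOS_pD: average subject / object selectivity of p in D
    if s == "b" and o == "b":
        return max(1.0, T_pD * avgSS_pD * avgOS_pD)
    if s == "b":
        return max(1.0, T_pD * avgSS_pD)
    if o == "b":
        return max(1.0, T_pD * avgOS_pD)
    return float(T_pD)  # (?s, p, ?o): all triples with p
```

The per-bucket averages from the index are what make this cheap: no endpoint needs to be contacted at optimization time.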
14. JOIN CARDINALITY ESTIMATION
• M(B): average frequency of the multi-valued predicate in the BGP B
• C(B): cardinality of B
• j(s), j(o): join on the subject resp. object of the triple pattern tp
Figure: example join tree with π at the root over B3 ⋈ B4, where B3 = B1 ⋈ B2, with B1 = tp1, B2 = tp2, B4 = tp3.
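As background for the estimation above, the classic textbook join-cardinality estimate is C(B1 ⋈ B2) = C(B1)·C(B2) / max(V1, V2), where V is the number of distinct join-variable values on each side; note this is the standard System R style estimate, not CostFed's exact M(B)-based formula.

```python
# Textbook join-cardinality estimate (System R style). CostFed's own
# formula additionally accounts for multi-valued predicates via M(B).
def join_card(c1, c2, v1, v2):
    # c1, c2: cardinalities of the joined BGPs B1, B2
    # v1, v2: distinct values of the join variable in B1, B2
    return (c1 * c2) / max(v1, v2)

print(join_card(1000, 500, 100, 50))  # 5000.0
```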
15. JOIN-COST ESTIMATION: HASH JOIN
The hash-join cost combines:
• the cost of receiving the results of the highest-cardinality BGP, e.g. a triple pattern, and
• the cost of intersecting the results of both of the BGPs.
Constants: TC = 2 (number of threads used), CSQ = 100 (cost of sending a SPARQL query), CRT = 0.01 (cost of receiving a single result tuple), CHT = 0.0025 (cost of intersecting a single result with another result).
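One plausible reading of this slide as a formula (an assumption for illustration; the paper gives the exact definition): receive the larger input in parallel over TC threads, then intersect both inputs tuple by tuple.

```python
# Hedged sketch of the hash-join cost built from the slide's constants.
TC = 2        # number of threads used
CSQ = 100     # cost of sending a SPARQL query
CRT = 0.01    # cost of receiving a single result tuple
CHT = 0.0025  # cost of intersecting a single result with another result

def hash_join_cost(c_left, c_right):
    # Fetch the higher-cardinality BGP's results (one query, TC threads).
    receive = CSQ + CRT * max(c_left, c_right) / TC
    # Intersect (build/probe) the results of both BGPs.
    intersect = CHT * (c_left + c_right)
    return receive + intersect
```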
16. JOIN-COST ESTIMATION: BIND JOIN
The bind-join cost combines:
• the cost of receiving the results of the smallest-cardinality BGP, e.g. a triple pattern, and
• the cost of binding those results and sending the bound results as SPARQL queries.
Constants: CSQ = 100 (cost of sending a SPARQL query), CRT = 0.01 (cost of receiving a single result tuple), BSZ = 20 (binding block size), CTC = 20 (number of threads used).
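Similarly, a hedged reading of the bind-join cost (again an illustrative assumption, not the paper's exact definition): receive the smaller input, then send its bindings in blocks of BSZ as SPARQL queries over CTC threads.

```python
import math

# Hedged sketch of the bind-join cost built from the slide's constants.
CSQ = 100   # cost of sending a SPARQL query
CRT = 0.01  # cost of receiving a single result tuple
BSZ = 20    # binding block size
CTC = 20    # number of threads used

def bind_join_cost(c_left, c_right):
    smaller = min(c_left, c_right)
    receive = CRT * smaller            # fetch the smaller BGP's results
    blocks = math.ceil(smaller / BSZ)  # bindings shipped in blocks of BSZ
    send = CSQ * blocks / CTC          # block queries issued in parallel
    return receive + send
```

Comparing `hash_join_cost` and `bind_join_cost` per join is what lets a cost-based planner pick the cheaper physical operator.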
18. EVALUATION SETUP
Benchmarks
• FedBench (9 datasets)
• LargeRDFBench (13 datasets)
Federation engines
• FedX, ANAPSID, SPLENDID, SemaGrow, HiBISCuS
Metrics
• Index compression ratio (1 − index size / total size)
• Index generation time
• Total number of triple-pattern-wise sources selected
• Number of ASK requests used during source selection
• Source selection time
• Query execution time
24. EVALUATION RESULTS: AVG. SOURCE SELECTION TIME
FedBench
• CostFed: 1.7 ms
• FedX: 3 ms (warm), 302 ms (cold)
• HiBISCuS: 137 ms
• SPLENDID: 46 ms
• ANAPSID: 463 ms
• SemaGrow: 46 ms
LargeRDFBench
• CostFed: 1 ms
• FedX: 500 ms (warm), 4 (cold)
• HiBISCuS: 154.7 ms
• SPLENDID: 7.8 ms
• ANAPSID: 33.3 ms
• SemaGrow: 7.8 ms
25. FEDBENCH QUERY RUNTIMES
Figure: runtime in msec (log scale) per query (CD1–CD7, LS1–LS7, Avg.) for FedX, SPLENDID, ANAPSID, SemaGrow, and CostFed.
• Ranked 1st in 11/14 queries
• 3 times faster than SemaGrow
• 17 times faster than FedX
• 28 times faster than ANAPSID
• 121 times faster than SPLENDID
26. LARGERDFBENCH QUERY RUNTIMES
Figure: runtime in msec (log scale) per query (C1–C10, Avg.) for FedX, SPLENDID, ANAPSID, SemaGrow, and CostFed.
• Ranked 1st in 8/10 queries
• 3 times faster than SemaGrow
• 2 times faster than FedX
• 1.20 times faster than ANAPSID
• 1.73 times faster than SPLENDID
• Missing bar indicates a
27. RESULTS ON LARGERDFBENCH (SPARQL 1.1)
• Ranked 1st in 12/14 queries
• 1.7 times faster than SemaGrow
• 2.71 times faster than FedX
• 7.34 times faster than ANAPSID
• SPLENDID does not support the SPARQL SERVICE clause
28. RESULTS ON LARGERDFBENCH (SPARQL 1.1)
• Ranked 1st in 6/9 queries
• 1.19 times faster than SemaGrow
• 1.13 times faster than FedX
• 1.19 times faster than ANAPSID
• SPLENDID does not support the SPARQL SERVICE clause