CostFed: Cost-Based Query Optimization for SPARQL Endpoint Federation

1. COSTFED: COST-BASED QUERY OPTIMIZATION FOR SPARQL ENDPOINT FEDERATION
Muhammad Saleem, Alexander Potocki, Tommaso Soru, Olaf Hartig, Axel-Cyrille Ngonga Ngomo
SEMANTiCS 2018, Vienna, Austria, September 13th, 2018
2. WHAT IS COSTFED?
• Federated SPARQL query processing engine
• Federation over multiple SPARQL endpoints
• Index-assisted
• Join-aware source selection
• Cost-based query planner
3. COSTFED QUERY PROCESSING
Figure: CostFed query-processing pipeline over four SPARQL endpoints (Endpoint 1–4), each exposing RDF data, with the CostFed index feeding source selection.
• Parsing: rewrite the query and extract the individual triple patterns
• Source selection: identify the capable sources for each individual triple pattern
• Federator/Optimizer: generate an optimized sub-query execution plan
• Execute the sub-queries against the selected endpoints
• Integrator: integrate the sub-query results
5. HOW IS COSTFED DIFFERENT?
• Skewed distribution of resources
• Construction of buckets
• Join-aware source selection using prefixes
• Effect of multi-valued predicates
• Cost-based query planning
6. SKEWED DISTRIBUTION OF RESOURCES AND CONSTRUCTION OF BUCKETS
• Bucket b0 (brown): store each resource along with its individual cardinality
• Bucket b1 (black): store each resource along with the avg. cardinality of all the resources in the bucket
• Bucket b2 (blue): only store the avg. cardinality of all the resources in the bucket
7. USING PREFIXES AND A TRIE DATA STRUCTURE
<wiwiss.fu-berlin.de/drugbank/resource/references/1002129>
<wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00201>
We use character-by-character insertion into a trie to compute common URI prefixes.
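Character-by-character trie insertion can be sketched like this, using the two DrugBank URIs from the slide; the rule for cutting the prefix at the first branching point is an illustrative assumption.

```python
# Hedged sketch: insert URIs character by character into a trie, then read
# off the shared prefix up to the first branching point. The branching rule
# used here is illustrative, not CostFed's exact criterion.
class TrieNode:
    def __init__(self):
        self.children = {}

def insert(root, uri):
    node = root
    for ch in uri:
        node = node.children.setdefault(ch, TrieNode())

def common_prefix(root):
    # Follow the unique path from the root until the trie branches.
    node, prefix = root, []
    while len(node.children) == 1:
        ch, child = next(iter(node.children.items()))
        prefix.append(ch)
        node = child
    return "".join(prefix)

root = TrieNode()
insert(root, "wiwiss.fu-berlin.de/drugbank/resource/references/1002129")
insert(root, "wiwiss.fu-berlin.de/drugbank/resource/drugs/DB00201")
print(common_prefix(root))  # wiwiss.fu-berlin.de/drugbank/resource/
```

For the two example URIs the trie only branches after `resource/` (where `references/…` and `drugs/…` diverge), so that path is the stored prefix.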
8. COSTFED INDEX
Per capability (predicate), the index stores:
• The predicate as the capability
• The subject and object prefixes
• Bucket b0: subject resources along with their individual cardinalities; object resources along with their individual cardinalities
• Bucket b1: subject resources along with their avg. cardinality; object resources along with their avg. cardinality
• Bucket b2: only the avg. selectivity of all the subjects and objects
13. TRIPLE PATTERN CARDINALITY ESTIMATION
• T(p,D): total number of triples with predicate p in dataset D
• avgSS(p,D), avgOS(p,D): average subject resp. object selectivities of p in D in the corresponding bucket
• tS(D), tO(D): total number of distinct subjects resp. distinct objects in D
• tT(D): total number of triples in D
• R(tp): set of all relevant sources for tp
• b stands for bound
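Using the statistics defined above, a cardinality estimator for a single triple pattern might look like the following; how the statistics are combined here is an illustrative assumption (the exact CostFed formulas are given in the paper).

```python
# Hedged sketch: estimate the cardinality of a triple pattern (s, p, o) in
# dataset D from the index statistics named on the slide. "b" marks a bound
# position, "?" an unbound one. The combinations are illustrative.
def estimate_card(s, o, T_pD, avgSS_pD, avgOS_pD):
    # T_pD: total triples with predicate p in D
    # avgSS_pD / avgOS_pD: average subject / object selectivity of p in D
    if s == "b" and o == "b":
        return max(1.0, T_pD * avgSS_pD * avgOS_pD)
    if s == "b":
        return max(1.0, T_pD * avgSS_pD)
    if o == "b":
        return max(1.0, T_pD * avgOS_pD)
    return float(T_pD)  # (?s, p, ?o): all triples with p
```

The per-bucket averages from the index are what make this cheap: no endpoint needs to be contacted at optimization time.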
14. JOIN CARDINALITY ESTIMATION
• M(B): average frequency of the multi-valued predicate in the BGP B
• C(B): cardinality of B
• j(s), j(o): join on the subject resp. object of the triple pattern tp
Figure: example join tree with π at the root over B3 ⋈ B4, where B3 = B1 ⋈ B2, with B1 = tp1, B2 = tp2, B4 = tp3.
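As background for the estimation above, the classic textbook join-cardinality estimate is C(B1 ⋈ B2) = C(B1)·C(B2) / max(V1, V2), where V is the number of distinct join-variable values on each side; note this is the standard System R style estimate, not CostFed's exact M(B)-based formula.

```python
# Textbook join-cardinality estimate (System R style). CostFed's own
# formula additionally accounts for multi-valued predicates via M(B).
def join_card(c1, c2, v1, v2):
    # c1, c2: cardinalities of the joined BGPs B1, B2
    # v1, v2: distinct values of the join variable in B1, B2
    return (c1 * c2) / max(v1, v2)

print(join_card(1000, 500, 100, 50))  # 5000.0
```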
15. JOIN-COST ESTIMATION: HASH JOIN
The hash-join cost combines:
• the cost of receiving the results of the highest-cardinality BGP, e.g. a triple pattern, and
• the cost of intersecting the results of both of the BGPs.
Constants: TC = 2 (number of threads used), CSQ = 100 (cost of sending a SPARQL query), CRT = 0.01 (cost of receiving a single result tuple), CHT = 0.0025 (cost of intersecting a single result with another result).
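One plausible reading of this slide as a formula (an assumption for illustration; the paper gives the exact definition): receive the larger input in parallel over TC threads, then intersect both inputs tuple by tuple.

```python
# Hedged sketch of the hash-join cost built from the slide's constants.
TC = 2        # number of threads used
CSQ = 100     # cost of sending a SPARQL query
CRT = 0.01    # cost of receiving a single result tuple
CHT = 0.0025  # cost of intersecting a single result with another result

def hash_join_cost(c_left, c_right):
    # Fetch the higher-cardinality BGP's results (one query, TC threads).
    receive = CSQ + CRT * max(c_left, c_right) / TC
    # Intersect (build/probe) the results of both BGPs.
    intersect = CHT * (c_left + c_right)
    return receive + intersect
```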
16. JOIN-COST ESTIMATION: BIND JOIN
The bind-join cost combines:
• the cost of receiving the results of the smallest-cardinality BGP, e.g. a triple pattern, and
• the cost of binding those results and sending the bound results as SPARQL queries.
Constants: CSQ = 100 (cost of sending a SPARQL query), CRT = 0.01 (cost of receiving a single result tuple), BSZ = 20 (binding block size), CTC = 20 (number of threads used).
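Similarly, a hedged reading of the bind-join cost (again an illustrative assumption, not the paper's exact definition): receive the smaller input, then send its bindings in blocks of BSZ as SPARQL queries over CTC threads.

```python
import math

# Hedged sketch of the bind-join cost built from the slide's constants.
CSQ = 100   # cost of sending a SPARQL query
CRT = 0.01  # cost of receiving a single result tuple
BSZ = 20    # binding block size
CTC = 20    # number of threads used

def bind_join_cost(c_left, c_right):
    smaller = min(c_left, c_right)
    receive = CRT * smaller            # fetch the smaller BGP's results
    blocks = math.ceil(smaller / BSZ)  # bindings shipped in blocks of BSZ
    send = CSQ * blocks / CTC          # block queries issued in parallel
    return receive + send
```

Comparing `hash_join_cost` and `bind_join_cost` per join is what lets a cost-based planner pick the cheaper physical operator.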
18. EVALUATION SETUP
Benchmarks
• FedBench (9 datasets)
• LargeRDFBench (13 datasets)
Federation engines
• FedX, ANAPSID, SPLENDID, SemaGrow, HiBISCuS
Metrics
• Index compression ratio (1 − index size / total size)
• Index generation time
• Total number of triple-pattern-wise sources selected
• Number of ASK requests used during source selection
• Source selection time
• Query execution time
24. EVALUATION RESULTS: AVG. SOURCE SELECTION TIME
FedBench
• CostFed: 1.7 ms
• FedX: 3 ms (warm), 302 ms (cold)
• HiBISCuS: 137 ms
• SPLENDID: 46 ms
• ANAPSID: 463 ms
• SemaGrow: 46 ms
LargeRDFBench
• CostFed: 1 ms
• FedX: 500 ms (warm), 4 (cold)
• HiBISCuS: 154.7 ms
• SPLENDID: 7.8 ms
• ANAPSID: 33.3 ms
• SemaGrow: 7.8 ms
25. FEDBENCH QUERY RUNTIMES
Figure: runtime in msec (log scale) per query (CD1–CD7, LS1–LS7, Avg.) for FedX, SPLENDID, ANAPSID, SemaGrow, and CostFed.
• Ranked 1st in 11/14 queries
• 3 times faster than SemaGrow
• 17 times faster than FedX
• 28 times faster than ANAPSID
• 121 times faster than SPLENDID
26. LARGERDFBENCH QUERY RUNTIMES
Figure: runtime in msec (log scale) per query (C1–C10, Avg.) for FedX, SPLENDID, ANAPSID, SemaGrow, and CostFed.
• Ranked 1st in 8/10 queries
• 3 times faster than SemaGrow
• 2 times faster than FedX
• 1.20 times faster than ANAPSID
• 1.73 times faster than SPLENDID
• Missing bar indicates a
27. RESULTS ON LARGERDFBENCH (SPARQL 1.1)
• Ranked 1st in 12/14 queries
• 1.7 times faster than SemaGrow
• 2.71 times faster than FedX
• 7.34 times faster than ANAPSID
• SPLENDID does not support the SPARQL SERVICE clause
28. RESULTS ON LARGERDFBENCH (SPARQL 1.1)
• Ranked 1st in 6/9 queries
• 1.19 times faster than SemaGrow
• 1.13 times faster than FedX
• 1.19 times faster than ANAPSID
• SPLENDID does not support the SPARQL SERVICE clause