3. Paper and authors
Gubichev, Andrey, and Thomas Neumann. "Exploiting the query structure for efficient join ordering in SPARQL queries." EDBT 2014.
Extending Database Technology – Qualis A2 / H-index 52
5. Problem
•The join ordering problem is a fundamental challenge that every query optimizer has to solve
•Different join orders can lead to vastly different computation times
•SQL solutions are not immediately capable of handling large SPARQL queries. The paper introduces a new join ordering algorithm that performs a SPARQL-tailored query simplification
6. Problem
•Cardinality estimation is an essential part of any cost-based query
optimizer
•Two different approaches:
• RDF-3X: query compilation time (dominated by finding the optimal
join order) is one order of magnitude higher than the actual
execution time
• Virtuoso 7: a greedy compilation algorithm keeps optimization fast but yields a sub-optimal join order, leading to slow execution
7. Solution
•Best of both worlds:
• A heuristic that spends a reasonable amount of time optimizing the query, yet still finds a decent join order
• The paper presents a SPARQL-tailored query simplification procedure that decomposes the query's join graph into star-shaped and chain-shaped subqueries
8. Challenges
•RDF can be very verbose
• TPC-H Query 2 written in SPARQL contains joins between 26 index scans (as opposed to joins between 5 tables in the SQL formulation)
• Number of plans:
• 5! = 120 plans in SQL vs. 26! ≈ 4 × 10^26 in SPARQL
•Lack of schema
• Foreign keys become structural correlations
9. Solution
• The characteristic set of a subject s is the set of properties (attributes) of that entity; it defines the entity's class (type), in the sense that subjects with the same characteristic set tend to be similar
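As a minimal illustration (the data and names here are hypothetical, not from the paper), a characteristic set is just the set of predicates that occur with a given subject:

```python
def characteristic_set(triples, subject):
    """Sc(subject): the set of predicates occurring with this subject."""
    return frozenset(p for s, p, o in triples if s == subject)

triples = [
    ("MarieCurie", "hasName", "Marie Curie"),
    ("MarieCurie", "bornIn", "Warsaw"),
    ("MarieCurie", "livedIn", "Paris"),
    ("Warsaw", "locatedIn", "Poland"),
]
# characteristic_set(triples, "MarieCurie") == {"hasName", "bornIn", "livedIn"}
```

Two subjects with the same characteristic set (e.g. two people with hasName, bornIn, livedIn) would be treated as instances of the same implicit "class".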
• Hierarchical Characterisation:
• 1. H_0 is the set of all characteristic sets of R
• 2. H_i = { argmin_{C ⊂ S ∧ |C| = |S|−1} cost(C) | ∀ S ∈ H_{i−1} }, that is, H_i consists of the subsets C of sets from H_{i−1} that minimize cost(C)
• 3. ∀ S ∈ H_k : |S| = 2
• 4. every S ∈ H_{i−1} stores a pointer to its cheapest subset C ∈ H_i
10. Algorithm 1 (part. 1)
• Line 2: S = [{created, bornIn, livedIn, hasName}, {bornIn, livedIn, hasName}, ...]
• Line 8: Init Banker's iteration, i.e., enumerating from the smallest to the biggest possible set of predicates
11. Algorithm 1 (part. 2)
• Line 12: guarantees that S_2 is smaller than S_1
• Lines 15-16: find the subsets that have smaller cost
• Cost:
• Although the Banker's iteration potentially enumerates all subsets of all predicates in the dataset, in practice it stops relatively early, since it is always bounded by the largest set in Sets
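The construction of the Hierarchical Characterisation can be sketched greedily (this is an illustrative reformulation, not the paper's Algorithm 1; `cost` stands in for the paper's cardinality-based cost function):

```python
def build_hierarchy(char_sets, cost):
    """Sketch: starting from the characteristic sets (H_0), replace each
    set S by its cheapest subset of size |S|-1 until every set has size 2.
    `parent` records the pointer from each set to its cheapest subset."""
    levels = [set(map(frozenset, char_sets))]           # H_0
    parent = {}                                         # S -> cheapest subset C
    while any(len(s) > 2 for s in levels[-1]):
        nxt = set()
        for s in levels[-1]:
            if len(s) <= 2:
                nxt.add(s)
                continue
            best = min((s - {p} for p in s), key=cost)  # argmin over |C| = |S|-1
            parent[s] = best
            nxt.add(best)
        levels.append(nxt)
    return levels, parent
```

For example, with a cost function that sums per-predicate cardinalities, {created, bornIn, livedIn, hasName} shrinks level by level to its cheapest pair.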
12. Algorithm 2 (part. 1/2)
• Objective: finding the optimal join order in (sub)queries of the form:
select * where { ?s p1 ?o1 . ... . ?s pk ?ok }
• Idea: extract the part of the Hierarchical
Characterisation of the dataset starting with the
set S
• Input: Star-shaped graph
• Output: Order of the joins
• Lines 1-9:
• While |S| > 2, find the most expensive subset and push it to the front of O
13. Algorithm 2 (part. 2/2)
• The first part yields the optimal order for star-shaped queries in time linear in the graph size
• However, it does not find the optimal solution if the query has constants:
select * where { ?s p1 "Berlin" . ... . ?s pk ?ok }
• Then:
• Lines 12-14: only one of the bound objects is in the triple with the key predicate, i.e., the entire star query is therefore a lookup of the properties of a specific entity
• Lines 15-16: otherwise (many objects are keys), keep pushing the constants down the join tree and stop when the cost of the corresponding index scan exceeds the cost of the join at that level of the tree
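A loose sketch of the basic (constant-free) case: walking from the full predicate set of the star down through its cheapest subsets, the predicate dropped at each step is joined last. This recomputes the subsets on the fly with an illustrative cost function, whereas the paper's Algorithm 2 reads them from the precomputed hierarchy:

```python
def order_star_joins(predicates, cost):
    """Sketch: repeatedly move to the cheapest subset of size |S|-1;
    the dropped predicate is joined last (pushed to the front of `tail`),
    so the two cheapest predicates end up joined first."""
    S = frozenset(predicates)
    tail = []                                    # joined later ... joined last
    while len(S) > 2:
        best = min((S - {p} for p in S), key=cost)
        (dropped,) = S - best                    # predicate missing from the cheapest subset
        tail.insert(0, dropped)
        S = best
    # join the two cheapest predicates first
    return sorted(S, key=lambda p: cost(frozenset({p}))) + tail
```

With per-predicate cardinalities as cost, the cheapest scans are joined first and the most expensive scan is joined last, which matches the intuition behind the star ordering.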
14. Algorithm 3 (part. 1/4)
• Objective: ordering joins in general SPARQL queries
(s1, hasName, "Marie Curie"),
(s1, bornIn, s2),
(s2, label, "Warsaw"),
(s2, locatedIn, "Poland")
• Problem: s2 links a person to a city, corresponding to a "foreign key", but RDF does not require any schema. Knowledge of such dependencies is extremely useful for the query optimizer: without it, the optimizer has to assume independence between two entities linked via the bornIn predicate, thus almost inevitably underestimating the selectivity of the join of the corresponding triple patterns
• Thus, the paper uses Characteristic Pairs (CP) to discover this kind of relation, where:
CP(Sc(s), Sc(o)) = { (Sc(s), Sc(o), p) | Sc(o) ≠ ∅ ∧ ∃p : (s, p, o) ∈ R }
• The CP is an in-memory structure. In theory, with n distinct characteristic sets we can get up to n² characteristic pairs; in real datasets, only a few pairs appear frequently enough to be stored. For example, in the YAGO-Facts dataset, of the 250,000 existing pairs only 5,292 appear more than 100 times. This way, the frequent characteristic pairs consume less than 16 KB.
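The CP synopsis can be sketched as a frequency count over the triples (an illustrative reconstruction; thresholds and names are assumptions):

```python
from collections import Counter, defaultdict

def characteristic_pairs(triples, min_count=100):
    """Sketch: count occurrences of (Sc(s), Sc(o), p) and keep only the
    frequent ones, mirroring the observation that few pairs occur often."""
    # characteristic set of every subject
    props = defaultdict(set)
    for s, p, o in triples:
        props[s].add(p)
    sc = {s: frozenset(ps) for s, ps in props.items()}
    counts = Counter(
        (sc[s], sc[o], p)
        for s, p, o in triples
        if o in sc               # Sc(o) != ∅: the object is itself a subject
    )
    return {trip: n for trip, n in counts.items() if n >= min_count}
```

On the Marie Curie example, only the bornIn triple connects two entities that both have characteristic sets, so only that pair is recorded.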
15. Algorithm 3 (part. 2/4)
• Idea: to decompose the query into star-shaped subqueries
connected by chains, and to collapse the subqueries into
meta-nodes
• Input: SPARQL graph
• Output: join ordering for this graph
• Lines 11-24: starts with clustering the query into disjoint
star-shaped subqueries around subjects
• Line 13: order the triple patterns in the query by subject
• Line 15: group triple patterns with identical subjects, since
they potentially form star-shaped subqueries
• Lines 20-23: find stars around objects
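The subject-side clustering step can be sketched as a grouping pass (illustrative only; the paper additionally clusters around objects, which is omitted here):

```python
from collections import defaultdict

def cluster_stars(patterns):
    """Sketch: group triple patterns that share a subject variable; groups
    with more than one pattern form star-shaped subqueries that will be
    collapsed into meta-nodes."""
    by_subject = defaultdict(list)
    for s, p, o in patterns:
        by_subject[s].append((s, p, o))
    return {s: group for s, group in by_subject.items() if len(group) > 1}
```

For the running example, the patterns around ?s1 (person) and ?s2 (city) form two stars, which Algorithm 2 then orders independently.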
16. Algorithm 3 (part. 3/4)
• Lines 4-5: for every star it adds the new meta-node to the
query graph and removes the intra-star edges
• Lines 6-7: the plan for the star subquery is computed using
the Hierarchical Characterisation (Algorithm 2) and added to
the DP table along with the meta-node
• Line 8: After all the star subqueries have been optimized, we
add the edges between meta-nodes to the query graph, if
the original graph has edges between the corresponding star
sub-queries
17. Algorithm 3 (part. 4/4)
• Line 10: selectivities associated with these edges are
computed using the Characteristic Pairs synopsis, and the
regular Dynamic Programming algorithm starts working on
this simplified graph
• In the following figure, simplifying the graph from 8 nodes to 3 nodes reduces the search space from 8! = 40,320 plans to 3! = 6 plans
• This algorithm is also linear in the size of the input graph
19. Conclusions
•The problem is very similar to matrix chain multiplication ordering
•The query simplification technique reduces the search-space size by simplifying the graph before the DP algorithm starts
•The timing analysis shows how important the complexity study is
•There is no complexity analysis, though the paper mentions DP and greedy algorithms throughout
•The tests did not turn the cache off
•The paper does not cover OPTIONAL clauses of SPARQL, which are equivalent to left outer joins and cannot be freely reordered with other joins