Daniel Janke and Steffen Staab. Tutorial at Reasoning Web
With the proliferation of semantic data, there is a need to cope with trillions of triples by horizontally scaling data management in the cloud. To this end, one needs to advance (i) strategies for data placement over compute and storage nodes, (ii) strategies for distributed query processing, and (iii) strategies for handling failures of compute and storage nodes. In this tutorial, we review these challenges and how research and development have addressed them over the last 15 years.
Storing and Querying Semantic Data in the Cloud
1. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Reasoning Web Summer School 2018 (RW 2018)
Daniel Janke & Steffen Staab
24.09.2018
2. Storing and Querying Semantic Data in the Cloud 2Daniel Janke & Steffen Staab
Amount of Available RDF Data Increases
Source: https://lod-cloud.net/
3. Why Use RDF Stores in the Cloud?
Example 1: Wikidata
• Dataset size: 4.9 billion triples (as of April 2018)
• Stored in a distributed BlazeGraph RDF store because of
– Higher query throughput
– Higher availability
Example 2: BBC
• On average 1 million SPARQL queries per day (in 2010)
• Stored in a distributed GraphDB RDF store because of
– Higher query throughput
– Higher availability
4. Assumptions of this talk
1. There are exceptions for (almost) everything
2. You are always allowed to ask questions
3. You have some knowledge
Required:
• RDF
• SPARQL
Helpful:
• Cloud processing frameworks like Hadoop or Spark
• Query processing in relational databases
If not → see 2.
Timeplan
5. How to deal with the increasing volume of RDF?
6. Centralized RDF Stores
• Graph database for storing RDF graphs (includes tasks like data storage, query processing, ...)
• All RDF store tasks are executed on a single computer
7. Terminology: RDF Graph
• Directed graph with labelled vertices and edges
• The labels of a start vertex, an edge and its end vertex form an RDF triple
• An RDF graph is a set of RDF triples
[Figure: example RDF graph. Vertices: w:WeST, g:Gesis, w:martin, w:daniel, g:wanja, g:bello, g:Dog and the literals “Martin”, “Daniel”, “Wanja”. Edge labels: e:employs, f:knows, f:givenname, r:type, e:ownedBy. One triple is annotated with its subject, property and object.]
8. Terminology: SPARQL Query
What are the employees of WeST called?
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
Each line between the braces is a triple pattern; ?v1 and ?name are variables.
9. Terminology: Query Execution Tree
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
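A query execution tree for this query can be pictured as two triple-pattern scans feeding a join, followed by a projection. The following is a minimal sketch of that idea, assuming a subset of the example triples that can be read off the slides; the helper names `match`, `join` and `project` are hypothetical and do not come from any particular RDF store.

```python
# Subset of the running example's triples (assumed from the slides' figures).
TRIPLES = {
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
}


def match(pattern, triples):
    """Leaf operator: one variable binding per triple that matches the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for p, t in zip(pattern, triple):
            if p.startswith("?"):
                if binding.get(p, t) != t:  # same variable bound to two values
                    break
                binding[p] = t
            elif p != t:                    # constant does not match
                break
        else:
            results.append(binding)
    return results


def join(left, right):
    """Inner operator: combine bindings that agree on all shared variables."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[v] == r[v] for v in l.keys() & r.keys())
    ]


def project(bindings, variables):
    """Root operator: keep only the selected variables, without duplicates."""
    return sorted({tuple(b[v] for v in variables) for b in bindings})


employees = match(("w:WeST", "e:employs", "?v1"), TRIPLES)
names = match(("?v1", "f:givenname", "?name"), TRIPLES)
result = project(join(employees, names), ["?name"])
print(result)
```

The nested-loop join is chosen for readability only; real stores typically use hash or merge joins.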
11. Centralized RDF Stores
Advantages
• Less complex than RDF stores running on several computers
Disadvantages
• The hardware of the computer limits the size of the processable RDF graph
• No fault tolerance
12. RDF Stores in the Cloud
• RDF store tasks are bundled into nodes
– Data storage tasks are bundled into storage nodes
– Query processing tasks are bundled into compute nodes
• Compute and storage nodes are distributed/replicated among several computers
Note: In the following, compute and storage nodes are referred to simply as compute nodes.
13. How to place the data?
[Figure: the example RDF graph]
14. Where to find the required data?
[Figure: the example RDF graph]
15. How to distribute the query processing?
[Figure: the example RDF graph annotated with intermediate result tables of the example query. Matching <w:WeST> <e:employs> ?v1 yields ?v1 = w:martin and ?v1 = w:daniel. Matching ?v1 <f:givenname> ?name yields (w:martin, “Martin”), (w:daniel, “Daniel”) and, in a separate table, (g:wanja, “Wanja”). The joined result contains (w:martin, “Martin”) and (w:daniel, “Daniel”); the final result is ?name = “Martin” and ?name = “Daniel”.]
16. RDF Stores in the Cloud
Advantages
• Scalable by adding new compute or storage nodes
– Scaling up the dataset size
– Scaling up the query throughput
• Possibly fault tolerant
Disadvantages
• Higher complexity
17. Challenges of RDF Stores in the Cloud
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Many ideas from 50 years of data engineering carry over
→ We focus on the approaches more commonly used for RDF
18. Number of Related Works about RDF Stores
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Rarely considered on its own
19. Architecture Types
How to design the architecture?
20. Properties of Architecture Types
Implementation complexity:
• How difficult is the implementation?
Freedom of data placement:
• To what extent can the data placement be influenced?
Query overhead:
• Which query overhead is caused by the architecture?
Scalability:
• To what extent do the storage and query processing capabilities increase if further compute nodes are added?
Fault tolerance:
• Do single points of failure exist?
• How easily can they be removed?
21. Architecture Types
• RDF stores using cloud computing frameworks
• Distributed RDF stores
• Federated RDF stores
23. RDF Stores Using Cloud Computing Frameworks
• Converts and loads the RDF graph into the cloud computing framework
• Translates SPARQL queries into task(s) for the cloud computing framework
Examples: SHARD, S2RDF, S2X, TripleRush, Jena-HBase, Sempala, D-SPARQ
24. Cloud Computing Framework Types
RDF stores using cloud computing frameworks (distinction based on implementation):
• Batch processing frameworks
• Graph processing frameworks
• NoSQL databases
– Key-value stores
– Column stores
– Document stores
25. Batch Processing Frameworks
• Example frameworks: Hadoop, Spark
• Queries need to be translated into one or several tasks
• Data exchange between compute nodes via the file system
Each task uses the distributed file system as follows:
1. Read input data
2. Process data
3. Write results back
26. Graph Processing Frameworks
• Examples: GraphX, Signal/Collect
• Queries are translated into vertex algorithms
At each vertex:
1. Receive messages
2. Process messages and update the vertex status
3. Send messages
Termination: the status of no vertex changes any more
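The receive/process/send loop of a vertex algorithm can be sketched as follows. This is a minimal, single-process illustration under stated assumptions: it does not use the GraphX or Signal/Collect APIs, the function name `run_vertex_program` is hypothetical, and the chosen vertex program (propagating the smallest vertex label, which labels connected components) is only an example and not a translated SPARQL query.

```python
def run_vertex_program(edges):
    """Synchronous vertex-centric loop: each vertex keeps the smallest label seen."""
    neighbours = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)  # edges treated as undirected

    state = {v: v for v in neighbours}              # initial vertex status
    inbox = {v: [] for v in neighbours}
    for v in neighbours:                            # first round: announce own label
        for n in neighbours[v]:
            inbox[n].append(state[v])

    while any(inbox.values()):                      # stop when no messages remain
        outbox = {v: [] for v in neighbours}
        for v, messages in inbox.items():           # 1. receive messages
            if not messages:
                continue
            candidate = min(messages)               # 2. process messages ...
            if candidate < state[v]:
                state[v] = candidate                #    ... and update vertex status
                for n in neighbours[v]:             # 3. send messages
                    outbox[n].append(candidate)
        inbox = outbox
    return state


# Hypothetical edge list based on the e:employs edges of the running example.
edges = [("w:WeST", "w:martin"), ("w:WeST", "w:daniel"), ("g:Gesis", "g:wanja")]
components = run_vertex_program(edges)
print(components)
```

Messages are only sent when a vertex status changes, so the loop ends exactly when the statuses have stopped changing, which mirrors the termination condition above.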
27. Key-Value Stores
• Example: DynamoDB
• Distributed map that assigns keys to arbitrary values
• Values are atomic
• Distribution based on, e.g., hash of the key, key ranges, …
• A query is translated into several lookups in the map and joins on the master
Example (key → value):
g:Gesis → e:employs g:wanja, ...
g:wanja → f:knows w:daniel, ...
...
w:WeST → e:employs w:martin, ...
w:martin → f:knows g:wanja, ...
...
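The combination of hash-based distribution, single lookups and a join on the master could look roughly like the sketch below. It is an in-memory stand-in rather than DynamoDB client code; `node_for`, `put`, `get` and `employee_names` are hypothetical names, and the value of each key is treated as one opaque blob from the store's point of view.

```python
import zlib

NUM_NODES = 2  # assumed number of storage nodes


def node_for(key):
    """Pick a storage node from a deterministic hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


# One dictionary per storage node; keys are subjects,
# values are lists of (property, object) pairs.
nodes = [dict() for _ in range(NUM_NODES)]


def put(subject, prop, obj):
    nodes[node_for(subject)].setdefault(subject, []).append((prop, obj))


def get(subject):
    return nodes[node_for(subject)].get(subject, [])


for s, p, o in [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
]:
    put(s, p, o)


def employee_names(employer):
    """Master-side join: one lookup for the employer, one lookup per employee."""
    names = []
    for prop, employee in get(employer):
        if prop != "e:employs":
            continue
        for prop2, value in get(employee):
            if prop2 == "f:givenname":
                names.append(value)
    return sorted(names)


print(employee_names("w:WeST"))
```

Every lookup result travels to the master before it is joined, which is why a large number of lookups can overload the master.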
28. Column Stores
• Examples: HBase, Cassandra, Accumulo, Impala
• Stores tabular data column-wise
• Maps column name and key to the corresponding value
• Values are atomic
• Distributes key-value mappings based on keys for each column separately
Column e:employs:
g:Gesis → g:wanja
w:WeST → w:martin, w:daniel
Column f:knows:
g:wanja → w:daniel
w:martin → g:wanja
w:daniel → w:martin
29. Document Stores
• Examples: Couchbase, MongoDB
• Store documents with internal structure (e.g., JSON), i.e., non-atomic documents = more freedom to model content
• Provide indices over documents
• Distribution based on a key within documents
{_id: “g:Gesis”, e:employs: “g:wanja”}
{_id: “w:WeST”, e:employs: [“w:daniel”, “w:martin”]}
{_id: “g:wanja”, f:knows: “w:daniel”, f:givenname: “Wanja”}
{_id: “w:martin”, f:knows: “g:wanja”, f:givenname: “Martin”}
30. RDF Stores Using Cloud Computing Frameworks
Pros:
• Low implementation complexity
• Fault tolerance provided by the cloud computing framework
• Scalability provided by the cloud computing framework
• The cloud computing framework is maintained and improved by a community
Cons:
• Limited influence on data placement
• High overhead introduced by the cloud computing framework
• Centralized joins of data obtained by single lookups in NoSQL databases might overload the master
32. Federated RDF Stores Architecture
• RDF stores
– Store RDF data
– Are administrated independently
• Query federator coordinates the query execution:
– Decompose query
– Query RDF stores
– Join query results
• Index: stores which data is contained in each RDF store
• Cache: caches data retrieved from previous queries
• Systems vary in their index and cache
• Examples: DARQ, FedX, SPLENDID
33. Federated RDF Stores
Pros:
• Low implementation complexity
• Scalability by adding new RDF stores
Cons:
• No influence on data placement
• Query federator is a single point of failure
• Centralized join of results from different RDF stores may become a bottleneck
• Identification of RDF stores contributing to a query may be costly
35. Distributed RDF Stores Architecture
• Master-slave architecture
• Peer-to-peer architecture
36. Master-Slave Architecture
Loading the graph:
L1. Translate strings to fixed-length identifiers
L2. Assign triples to slaves
L3. Store which data is stored at which slave
L4. Transfer triples to slaves
L5. Store RDF triples locally
Querying:
Q1. Translate constant strings to their integer identifiers
Q2. Check occurrences of constants
Q3. Decompose query and send subqueries to slaves
Q4. Execute subqueries on local data
Q5. Join intermediate results
Q6. Translate result ids back to strings
[Figure: master and slave components, annotated with the steps each one performs: (L1, Q1, Q6), (L2), (L3, Q2), (Q3, Q5), (Q4, Q5), (L5, Q4)]
Examples: GraphDB, BlazeGraph, TriAD, DiploCloud
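The loading steps can be illustrated with a small in-memory sketch. It is not the implementation of any of the systems named above: the class name `Master`, the subject-hash assignment and the use of Python sets as stand-ins for remote slaves are all assumptions made for illustration, and Python integers stand in for fixed-length identifiers.

```python
import zlib


class Master:
    def __init__(self, num_slaves):
        self.dictionary = {}   # string -> integer id (loading step 1)
        self.reverse = []      # integer id -> string (used in query step 6)
        self.index = {}        # subject id -> slave number (loading step 3)
        self.slaves = [set() for _ in range(num_slaves)]  # stand-ins for slaves

    def encode(self, term):
        """Dictionary encoding: assign the next free integer to unseen strings."""
        if term not in self.dictionary:
            self.dictionary[term] = len(self.reverse)
            self.reverse.append(term)
        return self.dictionary[term]

    def load(self, triples):
        for s, p, o in triples:
            encoded = (self.encode(s), self.encode(p), self.encode(o))  # step 1
            slave = zlib.crc32(s.encode("utf-8")) % len(self.slaves)    # step 2
            self.index[encoded[0]] = slave                              # step 3
            self.slaves[slave].add(encoded)                             # steps 4 and 5

    def decode(self, encoded):
        """Translate a triple of ids back to strings."""
        return tuple(self.reverse[i] for i in encoded)


TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
]
master = Master(num_slaves=2)
master.load(TRIPLES)
restored = {master.decode(t) for slave in master.slaves for t in slave}
print(restored == set(TRIPLES))
```

Because the dictionary and the index both live on the master, every load and every query passes through it, which is the bottleneck risk of this architecture.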
37. Peer-to-Peer Architecture
The responsibilities of the master are copied to all slaves, resulting in peer nodes with identical architecture but varying data
Examples: RDFPeers, Edutella, Grid Vine, 3RDF
38. Distributed RDF Stores
Pros:
• Full freedom on data placement
• Little query processing overhead
• Direct transfer of intermediate results
• Fault tolerance (in case of peer-to-peer)
Cons:
• High implementation complexity
• Master is a single point of failure (in case of master-slave)
• Handling of dictionary, index and query coordination may lead to a bottleneck at the master
39. Architecture Summary
Freedom of data placement:
• RDF stores using cloud computing frameworks: Low/Medium – the cloud computing framework decides about data placement
• Federated RDF stores: Low – RDF stores are administrated independently of the federator
• Distributed RDF stores: High – the data placement strategy needs to be implemented
Fault tolerance:
• RDF stores using cloud computing frameworks: High – master is stateless and can be replicated
• Federated RDF stores: Low – federator is a single point of failure
• Distributed RDF stores: High (peer-to-peer); Low – master is a single point of failure
Scalability:
• RDF stores using cloud computing frameworks: High/Medium – possible bottlenecks: disk I/O, master-based joins
• Federated RDF stores: Medium – federator can become a bottleneck
• Distributed RDF stores: High (peer-to-peer); Medium – if the master becomes a bottleneck
40. Architecture Summary
Query overhead:
• RDF stores using cloud computing frameworks: High – initialisation of the cloud computing framework
• Federated RDF stores: Medium – identification of required RDF stores
• Distributed RDF stores: Low – designed to execute queries efficiently
Implementation complexity:
• RDF stores using cloud computing frameworks: Low – only translation of the RDF dataset and SPARQL queries
• Federated RDF stores: Medium – dedicated querying, indexing and caching strategies required
• Distributed RDF stores: High – all components need to be implemented
41. Data Placement Strategies
How to distribute the data?
43. Terminology: Graph Cover and Graph Chunk
Graph cover (aka sharding): assignment of each triple to at least one compute node
Graph chunk (aka shard): set of triples assigned to a single compute node
[Figure: the example RDF graph split into two graph chunks, stored on Compute Node 1 and Compute Node 2]
44. Terminology: Path and Path Length
Path: a sequence of triples in which the object of a triple is the subject of the succeeding triple
Path length: the number of triples in the path
Example: w:daniel –f:knows→ w:martin –f:knows→ g:wanja –f:givenname→ “Wanja” (length = 3)
45. Terminology: Molecule, Anchor Vertex and Diameter
Molecule
• Set of triples that are contained in some paths starting at a vertex called the anchor vertex
• If a molecule contains a subject s, then all triples with s as subject are contained
(Directed) molecule diameter
• Longest shortest path between the anchor vertex and all objects contained in the molecule
[Figure: molecule with anchor vertex w:martin and diameter 2, containing the triples (w:martin, f:givenname, “Martin”), (w:martin, f:knows, g:wanja), (g:wanja, f:givenname, “Wanja”) and (g:wanja, f:knows, w:daniel)]
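A molecule of a given directed diameter can be collected with a breadth-first expansion from the anchor vertex. The sketch below is one possible reading of the definition; the function name `molecule` and the example triples are assumptions made for illustration.

```python
def molecule(triples, anchor, diameter):
    """Collect all triples on paths of at most `diameter` triples starting at `anchor`."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((s, p, o))

    result = set()
    frontier = {anchor}
    visited = set()
    for _ in range(diameter):
        next_frontier = set()
        for vertex in frontier - visited:
            visited.add(vertex)
            # An expanded subject contributes all of its triples.
            for triple in by_subject.get(vertex, []):
                result.add(triple)
                next_frontier.add(triple[2])
        frontier = next_frontier
    return result


TRIPLES = [
    ("w:martin", "f:givenname", "Martin"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:givenname", "Daniel"),
]
m1 = molecule(TRIPLES, "w:martin", 1)
m2 = molecule(TRIPLES, "w:martin", 2)
print(len(m1), len(m2))
```

With diameter 1 the molecule is simply the set of triples that share the anchor vertex as subject.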
46. Properties of Graph Cover Strategies
Complexity:
• How complex is the creation of the graph cover?
Balancing:
• How balanced are the sizes of the resulting graph chunks?
Storage size:
• Is the sum of all graph chunk sizes larger than the original graph size?
Path containment:
• How likely is it that a path can be traversed without leaving one chunk?
Query parallelisation:
• How well can the workload of one query be parallelized among several compute nodes?
Dynamics:
47. Overview of Graph Cover Strategies
• Static
– Cloud-computing-based
– Hash-based
– Graph-clustering-based
– Workload-aware
– N-hop replication
• Dynamic
49. Cloud-Computing-Based Graph Cover Strategies
• Data placement is mainly decided by the cloud computing framework
• Influenced only by
– Splitting the graph into files or tables
– Encoding of data within files or tables
• Goal: reduce the processing effort of queries
50. Molecule Graph Splits
• Split the graph into molecules of directed diameter 1
51. Molecule Graph Splits
• Store molecules in a key-value store (e.g., SHARD, Sempala)
g:Gesis → e:employs g:wanja
g:wanja → f:knows w:daniel, f:givenname “Wanja”
w:WeST → e:employs w:martin, e:employs w:daniel
w:martin → f:knows g:wanja, f:givenname “Martin”
...
• Store molecules in one or several files (e.g., D-SPARQ, RAPID+)
g:Gesis : (e:employs g:wanja)
g:wanja : (f:knows w:daniel), (f:givenname “Wanja”)
w:WeST : (e:employs w:martin), (e:employs w:daniel)
w:martin : (f:knows g:wanja), (f:givenname “Martin”)
...
52. Molecule Graph Splits
Pros:
• Easy to compute
• Selecting the required molecules is easy if the subjects are given in the query
• Subject-subject joins can be processed easily
Cons:
• If the subject is not given in the query, all molecules have to be processed
• Extending molecules by incoming edges or longer diameters increases the dataset size
53. Vertical Graph Splits
• Create a file/table for each property
• Store all triples with that property in the file/table
• Examples: Jena-HBase, SPARQLGX
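A vertical split amounts to grouping triples by their property. The following sketch keeps each "table" as an in-memory list of (subject, object) pairs; the function name `vertical_split` is hypothetical, and a real system would write each group to its own file or table.

```python
from collections import defaultdict


def vertical_split(triples):
    """Group triples by property; each group stands for one file or table."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))  # the property is implied by the table name
    return dict(tables)


TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:martin", "f:knows", "g:wanja"),
]
tables = vertical_split(TRIPLES)
for prop, rows in tables.items():
    print(prop, rows)
```

A triple pattern with a given property then only needs to read one table, while a pattern with a variable property has to read all of them.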
54. Vertical Graph Splits
Pros:
• Easy to compute
Cons:
• Queries that match a path of length l access up to l files/tables if the properties are given in the query
• Files/tables of frequent properties like rdf:type can become large
55. Hash-Based Graph Cover Strategies
• Assignment of triples based on a hash function
• Possible properties of hash functions
– Determinism: the same input always produces the same output
– Uniformity: inputs are evenly mapped over the output range
– Non-invertibility: the input datum cannot be reconstructed from its hash value
– Continuity: the order of the hash values reflects the order of the input values
56. Hash Cover
• A hash function is applied to the subjects of the triples
[Figure: hash values of the subjects and the resulting graph cover]
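A hash cover on subjects can be sketched in a few lines. The example assumes `zlib.crc32` as the hash function (Python's built-in `hash` is randomized per process for strings and is therefore not deterministic across runs); the function names `chunk_of` and `hash_cover` are hypothetical.

```python
import zlib


def chunk_of(subject, num_chunks):
    """Deterministically map a subject to a chunk number."""
    return zlib.crc32(subject.encode("utf-8")) % num_chunks


def hash_cover(triples, num_chunks):
    """Assign every triple to the chunk selected by the hash of its subject."""
    chunks = [set() for _ in range(num_chunks)]
    for s, p, o in triples:
        chunks[chunk_of(s, num_chunks)].add((s, p, o))
    return chunks


TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]
chunks = hash_cover(TRIPLES, 2)
for number, chunk in enumerate(chunks):
    print(number, sorted(chunk))
```

All triples with the same subject land in the same chunk, and the chunk of a subject can be recomputed from the hash function and the number of chunks without storing an explicit index.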
57. Hash Cover
Pros:
• Easy to compute
• Chunks are of almost equal size
Cons:
• Paths are more likely to contain triples that were assigned to different compute nodes
58. Graph-Clustering-Based Graph Cover Strategies
Graph clustering:
• Split the graph into pairwise disjoint graph chunks, i.e., partitions (aka shards)
• Usually vertices are assigned to partitions
• Partitions satisfy some clustering properties
Vertex-cut transformation:
• In RDF, triples cannot be cut
• Assign each triple to the partition to which its subject was assigned
59. Minimal Edge-Cut Cover
• The number of cut edges should be minimized
• The number of vertices in each partition should ideally be the same
• After the vertex-cut transformation, the number of edges per partition is unbalanced
• Examples: [Huang2011], [Peng2016]
60. Minimal Edge-Cut Cover
Pros:
• High likelihood that a path only contains triples of the same compute node
• #vertices per chunk is balanced
Cons:
• High computational effort (heuristic approaches are in O(|V|*log(|V|)))
• #triples per chunk is unbalanced
Example: one chunk contains 4 vertices and 7 triples, the other 4 vertices and 3 triples
61. Workload-Aware Graph Cover Strategies
General idea: assign triples based on a historic query workload
General procedure:
1. Generalize from actual queries to handle unseen queries
2. Identify triples that are required to answer generalized queries
3. Assign triples to compute nodes
– All triples required to produce all query results are assigned to the same compute node
– Distribute triple sets for the individual results equally among all compute nodes
Examples: WARP, DiploCloud
62. Workload-Aware Graph Cover Strategies
Pros:
• Good query performance for queries similar to the ones in the historic query workload
Cons:
• High computational effort
• Historic query workload required
63. n-hop Replication
• Based on an initial graph cover
• Replicate triples such that all paths of length n that start at a subject contained in a chunk consist only of triples assigned to that same chunk
Example: VB-Partitioner
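The replication step can be sketched as a hop-by-hop expansion of each chunk. This is an illustrative reading of the idea and not the VB-Partitioner algorithm; the function name `n_hop_replication` and the toy triples are assumptions.

```python
def n_hop_replication(triples, chunks, n):
    """Extend each chunk so that paths of length <= n starting at its subjects stay local."""
    by_subject = {}
    for s, p, o in triples:
        by_subject.setdefault(s, set()).add((s, p, o))

    extended = []
    for chunk in chunks:
        result = set(chunk)
        frontier = {s for s, _, _ in chunk}      # subjects of the initial chunk
        for _ in range(n):
            hop = set()
            for vertex in frontier:
                hop |= by_subject.get(vertex, set())
            result |= hop                        # replicate the triples of this hop
            frontier = {o for _, _, o in hop}    # continue from the reached objects
        extended.append(result)
    return extended


TRIPLES = [("a", "p", "b"), ("b", "p", "c"), ("c", "p", "d")]
initial = [{("a", "p", "b")}, {("b", "p", "c"), ("c", "p", "d")}]
replicated = n_hop_replication(TRIPLES, initial, 2)
print([sorted(c) for c in replicated])
```

In the toy example the triple (b, p, c) ends up in both chunks, so the total number of stored triples grows from 3 to 4.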
64. n-hop Replication
Pros:
• Paths of length <= n are guaranteed to belong to one chunk
Cons:
• Higher computational effort
• Dataset size increases
65. Summary of Static Graph Cover Strategies
Property               Cloud       Hash      Clustering  Workload  N-hop
Complexity             Low         Low       High        High      Medium
Chunk sizes            Imbalanced  Balanced  Imbalanced  -         -
Dataset size           100%        100%      100%        >= 100%   > 100%
Path containment       Low         Low       High        High      Medium
Query parallelization  Medium      High      Low         Low/High  -
67. Dynamic Graph Cover Strategies
• Adaptation of the graph cover during runtime
• Types of dynamics
– Adaptation of the graph cover to the actual query workload
– If one chunk becomes overloaded due to insertions of new triples, move triples to other chunks
68. Adaptation to Actual Query Workload
• Start with an initial static graph cover
• Keep track of how frequently triple patterns or molecules are queried together
• Replicate triples such that
– Data transfer is reduced
– Workload is equally distributed among compute nodes
Examples: PHD-Store, AdHash, Sedge
69. Dynamic Redistribution of Triples
• If one compute node stores too many triples (in comparison to others), redistribute triples based on their hash values
• If triples are stored in an ordered fashion, send one half to another compute node
Examples: [Battré2007], [Osorio2017]
70. Indices
How to identify compute nodes that store required data?
71. Example
Where is the information stored that is needed to answer the query "What are the employees of WeST called?"
[Figure: hash cover on subjects]
72. Properties of Indices
Graph cover independence:
• How independent is the index from the graph cover strategy?
Storage consumption:
• How much storage space is required for the index?
Access time:
• How fast can the location of an indexed element be retrieved?
Indexed elements:
• Which elements are indexed?
73. Overview of Indices
• Centralized
– Hash-based
– Statistics-based
– Summary-graph-based
• Decentralized
– Hash-based
– Schema-based
[Annotation: the index types range from faster access with a higher degree of aggregation to slower access with a lower degree of aggregation]
75. Centralized Hash-Based Index
• Applicable only for hash covers
• No explicit index required
• The location of a triple can be recomputed from the hash function and the number of chunks
• Examples: 4store, Trinity.RDF
Example: What are the employees of WeST called?
• hash(w:WeST) → compute node 2
• e:employs → ?
• f:givenname → ?
• (w:WeST, e:employs) → ?
76. Centralized Hash-Based Index
Pros:
• Easy to compute occurrences
• No explicit index required
– No storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)
77. Centralized Statistics-Based Index
• Collect occurrences of
– Subject, property, object labels
– Combinations of subject, property, object labels
– RDFs types
– Property sets of molecules
• Examples: DARQ, FedX, Sedge
Example: What are the employees of WeST called?
Occurrences per chunk (c1 and c2 are chunk IDs):
             Subject  Property  Object
             c1  c2   c1  c2    c1  c2
w:WeST       0   2    0   0     0   0
e:employs    0   0    1   2     0   0
f:givenname  0   0    2   1     0   0
...          ...      ...       ...
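Such occurrence counts can be built in one pass over the chunks and then used to select the chunks that may contribute to a triple pattern. The sketch below is illustrative only: the function names `build_statistics` and `relevant_chunks` are hypothetical, and the chunk contents are assumptions chosen so that the counts for w:WeST, e:employs and f:givenname reproduce the example table (which given-name triple sits in which chunk is a guess).

```python
from collections import defaultdict

POSITIONS = ("subject", "property", "object")


def build_statistics(chunks):
    """Count, per chunk, how often each label occurs as subject, property and object."""
    stats = defaultdict(lambda: {pos: [0] * len(chunks) for pos in POSITIONS})
    for chunk_id, chunk in enumerate(chunks):
        for triple in chunk:
            for pos, label in zip(POSITIONS, triple):
                stats[label][pos][chunk_id] += 1
    return stats


def relevant_chunks(stats, pattern, num_chunks):
    """Return the ids of chunks that may contain matches for a triple pattern."""
    candidates = set(range(num_chunks))
    for pos, term in zip(POSITIONS, pattern):
        if term.startswith("?"):
            continue  # variables do not restrict the set of chunks
        counts = stats[term][pos] if term in stats else [0] * num_chunks
        candidates &= {c for c, n in enumerate(counts) if n > 0}
    return sorted(candidates)


# Assumed chunk contents (index 0 = c1, index 1 = c2).
CHUNKS = [
    {("g:Gesis", "e:employs", "g:wanja"),
     ("g:wanja", "f:givenname", "Wanja"),
     ("w:martin", "f:givenname", "Martin")},
    {("w:WeST", "e:employs", "w:martin"),
     ("w:WeST", "e:employs", "w:daniel"),
     ("w:daniel", "f:givenname", "Daniel")},
]
stats = build_statistics(CHUNKS)
print(relevant_chunks(stats, ("w:WeST", "e:employs", "?v1"), 2))
print(relevant_chunks(stats, ("?v1", "f:givenname", "?name"), 2))
```

The counts also give a rough upper bound on the number of matches per chunk, which is what makes result-size estimation possible.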
78. Centralized Statistics-Based Index
Pros:
• Independent of graph cover strategy
• Can estimate number of results
• Fast access
Cons:
• Requires compression for storage
• Trade-off:
– Collecting only a few statistics → small size → less useful
– Collecting many statistics → large size (possibly size of dataset) → more useful
79. Centralized Summary-Graph-Based Index: TriAD
Summarization algorithm:
1) Each chunk is represented by a chunk vertex
2) Start and end vertices of edges are substituted by the corresponding chunk vertices
3) Duplicate edges are removed
Example query: What are the employees of WeST called?
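The three summarization steps can be sketched with a set comprehension. This is a simplified illustration and not TriAD's implementation: the function name `summarize`, the vertex-to-chunk mapping and the chunk names c1 and c2 are assumptions, and every vertex is assumed to belong to exactly one chunk.

```python
def summarize(triples, chunk_of):
    """Replace start and end vertices by their chunk vertices; the set drops duplicates."""
    return {(chunk_of[s], p, chunk_of[o]) for s, p, o in triples}


# Assumed assignment of vertices to chunks.
CHUNK_OF = {
    "g:Gesis": "c1", "g:wanja": "c1",
    "w:WeST": "c2", "w:martin": "c2", "w:daniel": "c2",
}
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
]
summary = summarize(TRIPLES, CHUNK_OF)
print(sorted(summary))
```

A triple pattern can first be matched against the summary graph: only chunks whose chunk vertex appears in a match need to be contacted.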
80. Centralized Summary-Graph-Based Index: EAGRE
Summarization algorithm:
1) Determine property sets of all subjects
2) Group similar property sets
3) Store occurrences of each property set
4) Property sets become vertices
5) Replace start and end vertices of edges by their property set vertices
81. Centralized Summary-Graph-Based Index: EAGRE
Example query: What are the employees of WeST called?
[Figure: EAGRE summary graph for the example query]
82. Centralized Summary-Graph-Based Index
Pros:
• Independent of graph cover strategy
• Identification of subqueries that can be answered locally
Cons:
• All triples with the same subject have to be assigned to the same compute node
• High storage consumption
• Summary graph needs to be queried
• Only properties are considered
83. Storing and Querying Semantic Data in the Cloud 83Daniel Janke & Steffen Staab
Overview Indices
Indices
– Centralized: hash-based, statistics-based, summary-graph-based
– Decentralized: hash-based, schema-based
Annotations on the slide:
• Faster access, higher degree of aggregation
• Slower access, lower degree of aggregation
84. Storing and Querying Semantic Data in the Cloud 84Daniel Janke & Steffen Staab
Decentralized Hash-Based Index
• Version 1:
– Centralized hash-based index on each compute node
– Knowledge of all compute nodes required
– Examples: HDRS, Virtuoso Clustered Edition
• Version 2:
– Each compute node knows a forward table for a few neighbours
▪ Ring structure overlay (e.g., RDFPeers, PAGE)
▪ Tree structure overlay (e.g., GridVine, 3RDF)
85. Storing and Querying Semantic Data in the Cloud 85Daniel Janke & Steffen Staab
Ring Structure Overlay
• Compute nodes are ordered
• Each compute node knows
– Its direct neighbour
– A few distant neighbours
• When a request arrives
1) The compute node storing the data is determined by the hash function
2) The request is forwarded to the (closest) compute node storing the data
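The forwarding in a ring overlay can be illustrated with a small Python sketch. It assumes a Chord-like layout in which node i knows its direct neighbour i+1 and distant neighbours at distances 2, 4, 8, ...; this layout and the function names `responsible_node`, `known_neighbours` and `route` are assumptions for illustration and are not taken from RDFPeers or PAGE.

```python
import hashlib


def responsible_node(key, num_nodes):
    # step 1: the hash function determines the compute node storing the data
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def known_neighbours(node, num_nodes):
    # direct neighbour (+1) plus a few distant neighbours (+2, +4, ...)
    neighbours = []
    step = 1
    while step < num_nodes:
        neighbours.append((node + step) % num_nodes)
        step *= 2
    return neighbours


def route(start, target, num_nodes):
    # step 2: forward to the known neighbour closest to the target,
    # never overshooting it in ring order
    path = [start]
    current = start
    while current != target:
        remaining = (target - current) % num_nodes
        current = max(
            (n for n in known_neighbours(current, num_nodes)
             if (n - current) % num_nodes <= remaining),
            key=lambda n: (n - current) % num_nodes,
        )
        path.append(current)
    return path


target = responsible_node("w:WeST", 8)
path = route(0, target, 8)
```

With distances that double, each hop at least halves the remaining distance, so a request needs only a logarithmic number of hops although every compute node stores just a few neighbours.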
86. Storing and Querying Semantic Data in the Cloud 86Daniel Janke & Steffen Staab
Tree Structure Overlay
• C1
– Stores all data whose hash value starts with prefix 00
– Knows C2 is responsible for prefix 01
– Knows C3 is responsible for prefix 1
• When a request arrives, C1
– Computes the hash value
– Forwards the request based on the known prefixes
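The prefix-based forwarding of C1 can be sketched as follows. The forward table mirrors the prefixes on the slide; the use of SHA-256, the 8-bit hash length and the names `hash_bits`, `C1_TABLE` and `next_hop` are assumptions made only for this illustration.

```python
import hashlib


def hash_bits(key, length=8):
    # binary string of the first hash byte, e.g. "01101001"
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return format(digest[0], "08b")[:length]


# forward table of C1: prefix of the hash value -> responsible compute node
C1_TABLE = {"00": "C1", "01": "C2", "1": "C3"}


def next_hop(bits, table):
    # the prefixes are disjoint, so at most one entry matches
    for prefix, node in table.items():
        if bits.startswith(prefix):
            return node
    raise KeyError("no compute node known for hash value " + bits)


node = next_hop(hash_bits("w:WeST"), C1_TABLE)
```

C1 answers requests with prefix 00 itself and forwards all others; C2 and C3 would hold analogous tables for their own subtrees.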
87. Storing and Querying Semantic Data in the Cloud 87Daniel Janke & Steffen Staab
Decentralized Hash-Based Index
Pros:
• Easy to compute occurrences
• Low storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)
88. Storing and Querying Semantic Data in the Cloud 88Daniel Janke & Steffen Staab
Decentralized Schema-Based Index
• Applicable for type-based graph covers
• Use type hierarchy as tree structure overlay
• Example: SQPeer
Figure: type hierarchy with root rdfs:Resource, which branches into rdf:Property (e:employs, f:givenname) and rdfs:Class (f:Person, e:Institute); the compute nodes C1 to C4 are attached to the hierarchy.
89. Storing and Querying Semantic Data in the Cloud 89Daniel Janke & Steffen Staab
Decentralized Schema-Based Index
Pros:
• Queries that contain types can be forwarded to the corresponding compute node(s)
• Low storage consumption
Cons:
• Efficiently applicable only for type-based graph covers
• Types of requested resources need to be identified
• Unbalanced index sizes
Used in combination with other indices
90. Storing and Querying Semantic Data in the Cloud 90Daniel Janke & Steffen Staab
Summary Indices
                                     | Centralized: Hash | Centralized: Statistics | Centralized: Summary graph | Decentralized: Hash | Decentralized: Schema
Applicable to graph cover strategies | Hash covers       | All                     | All                        | Hash covers         | Type-based covers
Storage consumption                  | Low               | High                    | High                       | Low                 | Low
Access time                          | Fast              | Slow                    | Slow                       | Medium              | Medium
Indexed elements                     | Hash dependent    | Various aggregations    | Properties                 | Hash dependent      | Typed elements
91. Storing and Querying Semantic Data in the Cloud 91Daniel Janke & Steffen Staab
Distributed Query Processing Strategies
How to distribute query processing?
92. Storing and Querying Semantic Data in the Cloud 92Daniel Janke & Steffen Staab
Terminology: SPARQL Query
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
Example query: What are the employees of WeST called?
• Variable: e.g., ?v1 or ?name
• Triple pattern: e.g., <w:WeST> <e:employs> ?v1
93. Storing and Querying Semantic Data in the Cloud 93Daniel Janke & Steffen Staab
Terminology: Query Execution Tree
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
95. Storing and Querying Semantic Data in the Cloud 95Daniel Janke & Steffen Staab
Distributed Query Processing
General procedure
1) Split query into subqueries that can be executed locally
2) Execute subqueries on compute nodes identified by index
3) Join results of subqueries
4) Return results
96. Storing and Querying Semantic Data in the Cloud 96Daniel Janke & Steffen Staab
Splitting Query into Subqueries
• Simplest case: each triple pattern forms a subquery
• Use knowledge about graph covers
– All triples with the same subject are stored on the same compute node
– Paths of length n can be executed locally
• Use index information
– Co-occurrences of subject-property or property-property
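As a small illustration of using graph cover knowledge, the following Python sketch groups triple patterns by subject. It assumes a graph cover that stores all triples with the same subject on the same compute node, so that each group can be evaluated locally; the function name `split_by_subject` and the third triple pattern (`f:familyname`) are hypothetical.

```python
from collections import defaultdict


def split_by_subject(patterns):
    """Group (subject, property, object) triple patterns into subqueries."""
    groups = defaultdict(list)
    for pattern in patterns:
        groups[pattern[0]].append(pattern)
    # each group shares one subject and can be executed without a remote join
    return list(groups.values())


query = [
    ("<w:WeST>", "<e:employs>", "?v1"),
    ("?v1", "<f:givenname>", "?name"),
    ("?v1", "<f:familyname>", "?surname"),  # hypothetical extra pattern
]
subqueries = split_by_subject(query)
```

The simplest strategy would produce three subqueries and two joins; grouping by subject yields two subqueries and a single join on ?v1.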
97. Storing and Querying Semantic Data in the Cloud 97Daniel Janke & Steffen Staab
Properties of Join Operations
Parallelisation:
• Is the join computation distributed among several or all compute nodes?
Computational effort:
• How many comparisons are performed during the join computation?
• How many subqueries result from the join computation?
Data transfer:
• How many intermediate results are transferred to compute the join?
Blocking:
• Do subqueries need to be finished before the join can be computed?
98. Storing and Querying Semantic Data in the Cloud 98Daniel Janke & Steffen Staab
Overview Join Processing
Joins
– Centralized (join is executed on a single compute node): nested-loop join, merge join, hash join, bind join
– Distributed (join is distributed over several compute nodes): replication-based join, hash join, merge join, bind join
100. Storing and Querying Semantic Data in the Cloud 100Daniel Janke & Steffen Staab
Centralized Nested Loop Join
Compare each element of the first list with every element of the second list
Examples: SPLENDID, DARQ
Pros:
• Does not require an ordering
• Arbitrary join conditions possible
Cons:
• Inefficient, since every pair of results is compared
Example: the first subquery returns ?v1 = w:martin, w:daniel; the second subquery returns (?v1, ?name) = (w:martin, “Martin”), (g:wanja, “Wanja”), (w:daniel, “Daniel”)
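A minimal Python sketch of a nested loop join on the example results. Representing a result as a dictionary from variable to value, and the function name `nested_loop_join`, are choices made here for illustration and not taken from SPLENDID or DARQ.

```python
def nested_loop_join(left, right, condition):
    # compare every result of the first list with every result of the second
    return [{**l, **r} for l in left for r in right if condition(l, r)]


first = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
second = [
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
# the join condition is an arbitrary predicate; here: equal bindings of ?v1
joined = nested_loop_join(first, second, lambda l, r: l["?v1"] == r["?v1"])
```

The sketch performs 2 × 3 = 6 comparisons for two matches, which illustrates why the approach does not scale to large intermediate result sets.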
101. Storing and Querying Semantic Data in the Cloud 101Daniel Janke & Steffen Staab
Centralized Merge Join
• Requires sorted intermediate result lists
• Compare one result r only with results that are <= r
• Example: Partout
Pros:
• Fast for ordered result sets
Cons:
• Slow for unordered result sets
• Intermediate result set size might lead to a bottleneck
Example: the sorted results of the first subquery are ?v1 = w:daniel, w:martin; the sorted results of the second subquery are (?v1, ?name) = (g:wanja, “Wanja”), (w:daniel, “Daniel”), (w:martin, “Martin”)
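The merge step can be sketched in Python as follows, assuming both result lists are already sorted by the join variable. The function name `merge_join` and the dictionary representation of results are illustrative choices, not Partout's implementation.

```python
def merge_join(left, right, var):
    """Join two lists of results that are both sorted by var."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        a, b = left[i][var], right[j][var]
        if a < b:
            i += 1            # left value has no partner, advance left
        elif a > b:
            j += 1            # right value has no partner, advance right
        else:
            # emit all right results with the same value, then advance left;
            # j stays put so that duplicate left values find the same run
            k = j
            while k < len(right) and right[k][var] == a:
                result.append({**left[i], **right[k]})
                k += 1
            i += 1
    return result


first = [{"?v1": "w:daniel"}, {"?v1": "w:martin"}]
second = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
    {"?v1": "w:martin", "?name": "Martin"},
]
joined = merge_join(first, second, "?v1")
```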
102. Storing and Querying Semantic Data in the Cloud 102Daniel Janke & Steffen Staab
Centralized Hash Join
• Assign results to buckets based on their hashes
• Join a result only with the corresponding bucket
• Examples: ANAPSID, LHD
• A non-blocking symmetric version exists
Example: the results of the second subquery are assigned to buckets by ?v1, e.g., one bucket for (g:wanja, “Wanja”), one for (w:daniel, “Daniel”) and one for (w:martin, “Martin”); the results of the first subquery (?v1 = w:daniel, w:martin) are each compared only with their bucket
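A minimal Python sketch of the blocking hash join, using a dictionary as the set of buckets. The function name `hash_join` and the result representation are illustrative and not taken from ANAPSID or LHD; the symmetric, non-blocking variant is not shown.

```python
from collections import defaultdict


def hash_join(build, probe, var):
    # build phase: assign each result to the bucket of its join value
    buckets = defaultdict(list)
    for row in build:
        buckets[row[var]].append(row)
    # probe phase: compare each result only with its own bucket
    return [{**b, **p} for p in probe for b in buckets.get(p[var], [])]


second = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
    {"?v1": "w:martin", "?name": "Martin"},
]
first = [{"?v1": "w:daniel"}, {"?v1": "w:martin"}]
joined = hash_join(second, first, "?v1")
```

The build phase has to see all results of one subquery before probing starts, which is why this version is a blocking operation.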
103. Storing and Querying Semantic Data in the Cloud 103Daniel Janke & Steffen Staab
Centralized Hash Join
Pros:
• No ordering required
• On average almost constant time complexity per lookup
Cons:
• Intermediate result set size might lead to a bottleneck
104. Storing and Querying Semantic Data in the Cloud 104Daniel Janke & Steffen Staab
Bind Join
• Substitute variables of the second subquery based on results from the first subquery
• Second query is executed multiple times
• Examples: FedX, Avalanche, SemaGrow
Example: for the result ?v1 = w:martin the second subquery returns (w:martin, “Martin”); for the result ?v1 = w:daniel it returns (w:daniel, “Daniel”)
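The following Python sketch shows the idea of a bind join. The in-memory triple list `TRIPLES` and the helper `givenname_subquery` stand in for a remote SPARQL endpoint and are hypothetical; real systems such as FedX send the bound subqueries over the network and may batch several bindings per request.

```python
TRIPLES = [
    ("w:martin", "f:givenname", "Martin"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:daniel", "f:givenname", "Daniel"),
]


def givenname_subquery(bound):
    # second subquery "?v1 <f:givenname> ?name" with ?v1 already substituted
    return [{"?v1": s, "?name": o} for s, p, o in TRIPLES
            if p == "f:givenname" and s == bound["?v1"]]


def bind_join(first_results, run_subquery, var):
    joined = []
    for binding in first_results:
        # one execution of the second subquery per result of the first one
        for row in run_subquery({var: binding[var]}):
            joined.append({**binding, **row})
    return joined


first = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
joined = bind_join(first, givenname_subquery, "?v1")
```

The result for g:wanja is never transferred, but the second subquery is executed twice instead of once.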
105. Storing and Querying Semantic Data in the Cloud 105Daniel Janke & Steffen Staab
Bind Join
Pros:
• Reduces the amount of intermediate results
Cons:
• Increases number of executed subqueries
• Possible bottlenecks:
– Large intermediate result set sizes
– Large number of subqueries
106. Storing and Querying Semantic Data in the Cloud 106Daniel Janke & Steffen Staab
Summary Centralized Joins
                     | Nested | Merge                              | Hash | Symmetric hash | Bind
Computational effort | High   | Medium (extra effort for ordering) | Low  | Low            | Medium (effort of many subqueries)
# executed queries   | Low    | Low                                | Low  | Low            | High
Blocking operation   | Yes    | Yes                                | Yes  | No             | Yes
108. Storing and Querying Semantic Data in the Cloud 108Daniel Janke & Steffen Staab
Replication-Based Distributed Join
All results of the first subquery are sent to all compute nodes on which the second subquery is executed
Example: SemStore
Figure: the results of the first subquery (?v1 = w:daniel, w:martin) are copied to both compute nodes that execute the second subquery; one of them holds (w:martin, “Martin”), the other holds (w:daniel, “Daniel”).
109. Storing and Querying Semantic Data in the Cloud 109Daniel Janke & Steffen Staab
Replication-Based Distributed Join
Pros:
• Not all compute nodes are necessarily involved in joining
• Using data locality → less transferred data
Cons:
• Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
• One subtree needs to be finished before the join can be executed
110. Storing and Querying Semantic Data in the Cloud 110Daniel Janke & Steffen Staab
Distributed Hash Join
Hash join in which each compute node serves as a bucket
Example: DiploCloud
Figure: before the join, the results for ?v1 (w:martin, w:daniel) and for (?v1, ?name) ((w:martin, “Martin”), (g:wanja, “Wanja”), (w:daniel, “Daniel”)) are spread over Compute Node 1 and Compute Node 2. They are redistributed by the hash of ?v1: all results with hash(w:martin) are sent to one compute node, all results with hash(w:daniel) to the other.
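A single-process Python sketch of the distributed hash join, in which list entries simulate compute nodes. The function names and the use of `zlib.crc32` as hash function are assumptions for illustration (Python's built-in `hash` is salted per process and therefore unsuitable for agreeing on a target node); DiploCloud's actual implementation differs.

```python
import zlib
from collections import defaultdict


def node_for(value, num_nodes):
    # every compute node uses the same hash function to pick the target node
    return zlib.crc32(value.encode("utf-8")) % num_nodes


def distributed_hash_join(left, right, var, num_nodes):
    # shuffle phase: each compute node acts as one bucket
    inbox = [([], []) for _ in range(num_nodes)]
    for row in left:
        inbox[node_for(row[var], num_nodes)][0].append(row)
    for row in right:
        inbox[node_for(row[var], num_nodes)][1].append(row)
    # join phase: every compute node joins only what it received
    result = []
    for local_left, local_right in inbox:
        index = defaultdict(list)
        for row in local_right:
            index[row[var]].append(row)
        for row in local_left:
            result.extend({**row, **match} for match in index.get(row[var], []))
    return result


first = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
second = [
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
joined = distributed_hash_join(first, second, "?v1", num_nodes=2)
```

Results with equal join values always meet on the same compute node, but nearly every intermediate result has to be shipped during the shuffle phase.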
111. Storing and Querying Semantic Data in the Cloud 111Daniel Janke & Steffen Staab
Distributed Hash Join
Pros:
• All compute nodes are involved in join processing
• Bottleneck is unlikely due to distribution of the intermediate result set over all compute nodes
Cons:
• No usage of data locality → high data transfer
• One subtree needs to be finished before the join can be executed
112. Storing and Querying Semantic Data in the Cloud 112Daniel Janke & Steffen Staab
Distributed Merge Join
• Results of subqueries are ordered
• Each compute node is responsible for a range of results
• Examples: H2RDF+, SHARD, SparkRDF, SPARQLGX
Figure: Compute Node 1 is responsible for the range a:a-w:d and Compute Node 2 for the range w:e-z:z. After redistribution, Compute Node 1 holds ?v1 = w:daniel together with (g:wanja, “Wanja”) and (w:daniel, “Daniel”); Compute Node 2 holds ?v1 = w:martin together with (w:martin, “Martin”).
113. Storing and Querying Semantic Data in the Cloud 113Daniel Janke & Steffen Staab
Distributed Merge Join
Pros:
• All compute nodes are involved in join processing
• Bottleneck is unlikely due to distribution of the intermediate result set over all compute nodes
Cons:
• Results need to be ordered
• Agreement on result ranges required
• No usage of data locality → high data transfer
• One subtree needs to be finished before the join can be executed
114. Storing and Querying Semantic Data in the Cloud 114Daniel Janke & Steffen Staab
Distributed Bind Join
Join algorithm:
1) Get results of first subquery
2) For each following bind join query:
a) Identify compute nodes with matches
b) Fork query execution to remote compute nodes
Examples: RDFPeers, GridVine, Atlas, TripleRush, Trinity.RDF
Figure: one compute node holds the results of the first subquery (?v1 = w:daniel, w:martin). The query execution is forked: the binding ?v1 = w:martin is sent to the compute node storing (w:martin, “Martin”) and the binding ?v1 = w:daniel to the compute node storing (w:daniel, “Daniel”).
115. Storing and Querying Semantic Data in the Cloud 115Daniel Janke & Steffen Staab
Distributed Bind Join
Pros:
• Join is computed without waiting for any subtree to be finished
• Usage of data locality → less transferred data
• Results of the last join operation do not need to be sent to other compute nodes
Cons:
• Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
116. Storing and Querying Semantic Data in the Cloud 116Daniel Janke & Steffen Staab
Distributed Joins Summary
                | Centralized joins | Distributed replication | Distributed hash | Distributed merge | Distributed bind
Data transfer   | High              | Low                     | High             | High              | Low
Parallelisation | Low               | Medium                  | High             | High              | Medium
# Subqueries    | Low               | Low                     | Low              | Low               | High
117. Storing and Querying Semantic Data in the Cloud 117Daniel Janke & Steffen Staab
Fault Tolerance
How to achieve fault tolerance?
118. Storing and Querying Semantic Data in the Cloud 118Daniel Janke & Steffen Staab
Mirroring
• There exist several identical copies of each compute node
• If one compute node fails, its copy continues working
• Example: Virtuoso Clustered Edition
Pros:
• Query workload can be distributed among all copies
Cons:
• Copies need to be kept up to date
• Replicas of different chunks are not combined to increase data locality
Figure: Compute Node 1 and Compute Node 2 with their mirrors Compute Node 1’ and Compute Node 2’
119. Storing and Querying Semantic Data in the Cloud 119Daniel Janke & Steffen Staab
Data Replication
• All compute nodes are ordered in a ring
• Data from one compute node is replicated on neighbours
• If one compute node fails, data remains available on neighbours
• Examples: 4store, RDFPeers
Pros:
• Data locality of initial graph cover is increased
Cons:
• Copies need to be kept up to date
Figure: Compute Nodes 1 to 3 each store their own chunk (1, 2, 3) together with a replica of another compute node's chunk (1’, 2’, 3’)
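Ring-based replication can be sketched in Python as follows. The sketch assumes that chunk i has its primary copy on compute node i and that replicas are placed on the next compute nodes in ring order; the number of replicas, the zero-based numbering and the function names are assumptions for illustration, not the scheme of 4store or RDFPeers.

```python
def placement(num_nodes, replicas=1):
    # chunk i: primary copy on node i, replicas on the following ring neighbours
    return {
        chunk: [(chunk + k) % num_nodes for k in range(replicas + 1)]
        for chunk in range(num_nodes)
    }


def nodes_serving(chunk, chunk_placement, failed):
    # a chunk stays available as long as one node holding a copy is alive
    return [n for n in chunk_placement[chunk] if n not in failed]


ring = placement(3)
survivors = nodes_serving(0, ring, failed={0})
```

With one replica per chunk, any single compute node may fail without data loss; a chunk becomes unavailable only if its node and the neighbour holding its replica fail at the same time.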
120. Storing and Querying Semantic Data in the Cloud 120Daniel Janke & Steffen Staab
Evaluation Methodology
How to evaluate?
121. Storing and Querying Semantic Data in the Cloud 121Daniel Janke & Steffen Staab
Properties of Evaluation Methodologies
Realism:
Do the measurement results reflect the performance of real RDF
stores?
Modularity:
Can alternative implementations of individual components be
evaluated?
Evaluation depth:
Is the system evaluated only as a whole, or is the performance of the individual components evaluated as well?
Difficulty:
How difficult is it to apply the evaluation methodology?
122. Storing and Querying Semantic Data in the Cloud 122Daniel Janke & Steffen Staab
Black Box Evaluation
Evaluation of RDF stores as a whole
Some problems (of many):
• How fast is your network?
• How large are your images?
• Which processor configuration do you use?
• What are the structures of your caches?
Do you evaluate the RDF store or your hardware configuration?
Figure: a dataset and a set of queries are given to the RDF store under evaluation
123. Storing and Querying Semantic Data in the Cloud 123Daniel Janke & Steffen Staab
Black Box Evaluation
Evaluation of RDF stores as a whole
Pros:
• Easy to perform evaluation since no implementation knowledge is required
• Measurements reflect the behaviour of a real RDF store
Cons:
• Only superficial evaluations possible
• No performance evaluation of individual components possible
124. Storing and Querying Semantic Data in the Cloud 124Daniel Janke & Steffen Staab
Glass Box Evaluation
• Evaluation of RDF stores as a whole
• Collecting performance measurements of components by
– Using a profiling system like Granula
– Adapting source code to perform measurements
Figure: a dataset and a set of queries are given to the RDF store, whose individual components are measured
125. Storing and Querying Semantic Data in the Cloud 125Daniel Janke & Steffen Staab
Glass Box Evaluation
Pros:
• In-depth performance evaluation possible
• Measurements reflect the behaviour of a real RDF store
Cons:
• Source code needs to be extended to collect measurements
• Individual components can hardly be exchanged by alternative implementations
126. Storing and Querying Semantic Data in the Cloud 126Daniel Janke & Steffen Staab
Simulation-based Glass Box Evaluation
Evaluation of alternative implementations of a single component by simulating the behaviour of a real RDF store
Pros:
• Performance evaluation of individual components possible
• Alternative implementations of individual components can be evaluated
Cons:
• Evaluation environment (simulator) needs to be implemented
• Questionable whether performance measurements reflect the behaviour of a real RDF store
Figure: a dataset and a set of queries are given to a simulator into which alternative implementations of a component are plugged
127. Storing and Querying Semantic Data in the Cloud 127Daniel Janke & Steffen Staab
Glass Box Evaluation Platform
An RDF store that
• allows the exchange of individual components by alternative implementations
• measures the performance of individual components
Figure: a dataset and a set of queries are given to an RDF store with exchangeable components, e.g., several alternative graph cover creators
128. Storing and Querying Semantic Data in the Cloud 128Daniel Janke & Steffen Staab
Glass Box Evaluation Platform
Pros:
• In-depth performance evaluation possible
• Alternative implementations of individual components can be evaluated
• Measurements reflect the behaviour of a real RDF store
Cons:
• Development of a glass box evaluation platform is difficult
• Interdependencies might limit the exchangeability of components
129. Storing and Querying Semantic Data in the Cloud 129Daniel Janke & Steffen Staab
Evaluation Methodology Summary
                 | Black box | Glass box | Simulation | Glass box platform
Realism          | High      | High      | Low        | Medium
Modularity       | Low       | Low       | High       | High
Evaluation depth | Low       | High      | High       | High
Difficulty       | Easy      | Medium    | Medium     | Hard
130. Storing and Querying Semantic Data in the Cloud 130Daniel Janke & Steffen Staab
Conclusion & Open Challenges
131. Storing and Querying Semantic Data in the Cloud 131Daniel Janke & Steffen Staab
Conclusion
Challenges of RDF stores in the cloud:
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
132. Storing and Querying Semantic Data in the Cloud 132Daniel Janke & Steffen Staab
Example RDF Stores in the Cloud
                         | Virtuoso Clustered Edition                        | BlazeGraph          | GraphDB
Architecture             | Master-slave                                      | Master-slave        | Master-slave
Graph cover strategy     | Hash cover                                        | Distributed B+-tree | Replication of graph on all slaves
Index                    | Centralized hash-based index on each compute node | Distributed B+-tree | Not necessary
Query execution strategy | Distributed bind join                             | Centralized join    | Centralized join
Fault tolerance          | Mirroring                                         | None                | Mirroring
133. Storing and Querying Semantic Data in the Cloud 133Daniel Janke & Steffen Staab
Example RDF Stores in the Cloud
                         | DiploCloud                                                                   | S2RDF                               | Trinity.RDF
Architecture             | Master-slave                                                                 | Batch processing framework          | Master-slave
Graph cover strategy     | Workload-aware                                                               | Vertical graph splits               | Hash cover
Index                    | Centralized statistics-based index                                           | None                                | Distributed chunk-integrated summary graph
Query execution strategy | Centralized join (for small result sets), distributed hash join (otherwise)  | Distributed joins                   | Distributed bind join
Fault tolerance          | None                                                                         | Based on batch processing framework | None
134. Storing and Querying Semantic Data in the Cloud 134Daniel Janke & Steffen Staab
Challenges Not Presented
• How to achieve transactional security?
• How to perform online analytical processing (OLAP) queries?
• How to process property paths?
• How to perform distributed reasoning?
• How to perform distributed stream processing?
135. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Thank you for your Attention!
Daniel Janke, Steffen Staab
136. Storing and Querying Semantic Data in the Cloud 136Daniel Janke & Steffen Staab
Image References
• https://openclipart.org/detail/155101/server
• https://openclipart.org/detail/213252/gear-icon
• https://openclipart.org/detail/204067/bpm-mail-symbol
• https://openclipart.org/detail/169757/check-and-cross-marks
• https://openclipart.org/detail/153577/stopwatch
137. Storing and Querying Semantic Data in the Cloud 137Daniel Janke & Steffen Staab
References
[Huang2011] Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF Graphs. PVLDB
4(11), 1123–1134 (2011)
[Peng2016] Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D.: Processing SPARQL Queries over Distributed
RDF Graphs. The VLDB Journal 25(2), 243–268 (apr 2016).
[Battré2007] Battré, D., Heine, F., Höing, A., Kao, O.: On Triple Dissemination, Forward-Chaining, and Load
Balancing in DHT Based RDF Stores. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, J.H., Ouksel, A.M.
(eds.) Databases, Information Systems, and Peer-to-Peer Computing. pp. 343–354. Springer Berlin
Heidelberg, Berlin, Heidelberg (2007)
[Osorio1017] Osorio, M., Aranda, C.B.: Storage Balancing in P2P Based Distributed RDF Data Stores. In:
Proceedings of the Workshop on Decentralizing the Semantic Web 2017 co-located with 16th International
Semantic Web Conference (ISWC 2017) (2017).