Storing and Querying Semantic Data in the Cloud

Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Storing and Querying Semantic Data
in the Cloud
Reasoning Web Summer School 2018 (RW 2018)
Daniel Janke & Steffen Staab
24.09.2018

Amount of Available RDF Data Increases
Source: https://lod-cloud.net/

Why Use RDF Stores in the Cloud?
Example 1: Wikidata
• Dataset size: 4.9 billion triples (as of April 2018)
• Stored in distributed BlazeGraph RDF store because of
  – Higher query throughput
  – Higher availability
Example 2: BBC
• On average 1 million SPARQL queries per day (in 2010)
• Stored in distributed GraphDB RDF store because of
  – Higher query throughput
  – Higher availability

Assumptions of this talk
1. There are exceptions for (almost) everything
2. You are always allowed to ask questions
3. You have some knowledge
   Required
   • RDF
   • SPARQL
   Helpful
   • Cloud processing frameworks like Hadoop or Spark
   • Query processing in relational databases
   If not -> See 2.
Timeplan

How to deal with increasing volume of RDF?

Centralized RDF Stores
Ÿ Graph database for storing RDF graphs
(includes tasks like data storage, query processing, ...)
Ÿ All RDF store tasks are executed on a single computer

Terminology: RDF Graph
Ÿ Directed graph with labelled vertices and edges
Ÿ Labels of start vertex, edge and end vertex are an RDF triple
Ÿ RDF graph is a set of RDF triples
[Figure: example RDF graph with the vertices w:martin, g:wanja, w:daniel, w:WeST, g:Gesis, g:bello and g:Dog, the literals "Martin", "Wanja" and "Daniel", and edges labelled f:givenname, e:employs, f:knows, r:type and e:ownedBy. One triple is annotated with its subject, property and object.]

Terminology: SPARQL Query
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
Question: What are the employees of WeST called?
Figure labels: Variable (e.g., ?v1, ?name) and Triple Pattern (e.g., <w:WeST> <e:employs> ?v1)

Terminology: Query Execution Tree
[Figure: query execution tree; its content is not recoverable from the extracted text]

Centralized Query Processing
[Figure: the example query is evaluated on the example RDF graph]
Matches of <w:WeST> <e:employs> ?v1:
  ?v1
  w:martin
  w:daniel
Matches of ?v1 <f:givenname> ?name:
  ?v1       ?name
  g:wanja   "Wanja"
  w:martin  "Martin"
  w:daniel  "Daniel"
Join on ?v1, projected to ?name:
  ?name
  "Martin"
  "Daniel"
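The evaluation steps of this slide can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the triples are transcribed from the running example as far as the query needs them, and the helper names match and join are hypothetical, not part of any RDF store.

```python
# Triples of the running example that are relevant for the query.
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]


def match(pattern, triples):
    """Return one variable binding (a dict) per triple matching the pattern."""
    bindings = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must be bound consistently within one pattern.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            bindings.append(binding)
    return bindings


def join(left, right):
    """Nested-loop join on all variables shared by two bindings."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[v] == r[v] for v in l.keys() & r.keys())
    ]


employees = match(("w:WeST", "e:employs", "?v1"), TRIPLES)
given_names = match(("?v1", "f:givenname", "?name"), TRIPLES)
answer = sorted(b["?name"] for b in join(employees, given_names))
print(answer)
```

On a single computer both triple patterns and the join run against the same local triple list; the distributed variants discussed later differ in where these three steps are executed.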

Centralized RDF Stores
Ÿ Graph database for storing RDF graphs
(includes tasks like data storage, query processing, ...)
Ÿ All RDF store tasks are executed on a single computer
Advantages
Ÿ Less complex than RDF stores running on several computers
Disadvantages
Ÿ Hardware of computer limits the size of processable RDF graph
Ÿ No fault tolerance

RDF Stores in the Cloud
• RDF store tasks are bundled into nodes
  – Data storage tasks are bundled into storage nodes
  – Query processing tasks are bundled into compute nodes
• Compute and storage nodes [1] are distributed/replicated among several computers
[1] In the following, compute and storage nodes are referred to simply as compute nodes.

How to place the data?
[Figure: the example RDF graph, whose triples have to be placed on several compute nodes]

Where to find the required data?
[Figure: the example RDF graph distributed over several compute nodes]

How to distribute the query processing?
[Figure: the example RDF graph distributed over several compute nodes. The intermediate results ?v1 = w:martin, w:daniel and the (?v1, ?name) results, e.g. (g:wanja, "Wanja"), are produced on different compute nodes and have to be joined to obtain ?name = "Martin", "Daniel".]

RDF Stores in the Cloud
• RDF store tasks are bundled into nodes
  – Data storage tasks are bundled into storage nodes
  – Query processing tasks are bundled into compute nodes
• Compute and storage nodes [1] are distributed/replicated among several computers
Advantages
• Scalable by adding new compute or storage nodes
  – Scaling up the dataset size
  – Scaling up the query throughput
• Possibly fault tolerant
Disadvantages
• Higher complexity
[1] In the following, compute and storage nodes are referred to simply as compute nodes.

Challenges of RDF Stores in the Cloud
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Many ideas from 50 years of data engineering carry over
-> We focus on approaches more commonly used for RDF

Number of Related Works about RDF Stores
[Figure: number of related works per challenge]
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
6) How to evaluate?
Figure annotation: Rarely considered on its own

Architecture Types
How to design the architecture?

Properties of Architecture Types
Implementation complexity:
• How difficult is the implementation?
Freedom of data placement:
• To what extent can the data placement be influenced?
Query overhead:
• Which query overhead is caused by the architecture?
Scalability:
• To what extent do the storage and query processing capabilities increase if further compute nodes are added?
Fault tolerance:
• Do single points of failure exist?
• How easily can they be removed?

Architecture Types
• RDF stores using cloud computing frameworks
• Distributed RDF stores
• Federated RDF stores


RDF Stores Using Cloud Computing Frameworks
[Figure: architecture with two components on top of a cloud computing framework]
• One component converts and loads the RDF graph into the cloud computing framework
• One component translates SPARQL queries into task(s) for the cloud computing framework
Examples: SHARD, S2RDF, S2X, TripleRush, Jena-HBase, Sempala, D-SPARQ

Cloud Computing Framework Types
RDF stores using cloud computing frameworks
• Batch processing frameworks
• Graph processing frameworks
• NoSQL databases
  – Key-value stores
  – Column stores
  – Document stores
Distinction based on implementation

Batch Processing Frameworks
• Example frameworks: Hadoop, Spark
• Queries need to be translated into one or several tasks
• Data exchange between compute nodes via file system
[Figure: each task interacts with the distributed file system]
1. Read input data
2. Process data
3. Write results back

Graph Processing Frameworks
• Examples: GraphX, Signal/Collect
• Translation of queries into vertex algorithms
At each vertex:
1. Receive messages
2. Process messages and update vertex status
3. Send messages
Termination:
The status of no vertex changes any more
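The receive/process/send loop can be illustrated with a toy vertex algorithm. The sketch below is not a SPARQL translation and does not use GraphX or Signal/Collect; it assumes a simple label propagation task (every vertex adopts the smallest label it hears about), and the function name propagate_min_label is hypothetical.

```python
def propagate_min_label(edges):
    """Run a synchronous vertex program until no vertex status changes."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    # The status of a vertex is the smallest label it has seen so far.
    state = {v: v for v in neighbours}
    # Superstep 0: every vertex announces its own label to its neighbours.
    inbox = {v: [state[n] for n in neighbours[v]] for v in neighbours}

    while any(inbox.values()):
        outbox = {v: [] for v in neighbours}
        for vertex, messages in inbox.items():
            if not messages:
                continue
            smallest = min(messages)            # 1. receive, 2. process
            if smallest < state[vertex]:
                state[vertex] = smallest        # update vertex status
                for n in neighbours[vertex]:    # 3. send
                    outbox[n].append(smallest)
        inbox = outbox
    # Termination: no status changed, so no messages were sent.
    return state


labels = propagate_min_label([("a", "b"), ("b", "c"), ("x", "y")])
print(labels)
```

A vertex only sends messages when its status changed, so an empty set of messages is equivalent to the termination condition stated on the slide.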

Key-Value Stores
• Example: DynamoDB
• Distributed map that assigns keys to arbitrary values
• Values are atomic
• Distribution based on, e.g., hash of the key, key ranges, …
• Query translated to several lookups in the map and joins on the master
[Figure: key-value pairs stored on two compute nodes]
  g:Gesis   -> e:employs g:wanja, ...
  g:wanja   -> f:knows w:daniel, ...
  ...
  w:WeST    -> e:employs w:martin, ...
  w:martin  -> f:knows g:wanja, ...
  ...
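How a query turns into several lookups plus a join on the master can be sketched as follows. The sketch assumes subjects as keys, a CRC32-based assignment of keys to two nodes, and plain Python dictionaries standing in for the store; none of this reflects the DynamoDB API.

```python
import zlib

NUM_NODES = 2

# Key = subject, value = its outgoing (property, object) pairs.
MOLECULES = {
    "g:Gesis": [("e:employs", "g:wanja")],
    "g:wanja": [("f:knows", "w:daniel"), ("f:givenname", "Wanja")],
    "w:WeST": [("e:employs", "w:martin"), ("e:employs", "w:daniel")],
    "w:martin": [("f:knows", "g:wanja"), ("f:givenname", "Martin")],
    "w:daniel": [("f:givenname", "Daniel")],
}


def node_for(key):
    """Assumed placement rule: hash of the key modulo the number of nodes."""
    return zlib.crc32(key.encode("utf-8")) % NUM_NODES


# Every node holds one slice of the distributed map.
nodes = [dict() for _ in range(NUM_NODES)]
for key, value in MOLECULES.items():
    nodes[node_for(key)][key] = value


def lookup(key, prop):
    """Single lookup: fetch the value of a key and filter by property."""
    entries = nodes[node_for(key)].get(key, [])
    return [obj for p, obj in entries if p == prop]


# The master performs one lookup per intermediate result and joins locally.
employees = lookup("w:WeST", "e:employs")
names = sorted(n for e in employees for n in lookup(e, "f:givenname"))
print(names)
```

Every employee found by the first lookup triggers a further lookup issued by the master, which is why the master can become a bottleneck for queries with many intermediate results.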

Column Stores
• Examples: HBase, Cassandra, Accumulo, Impala
• Stores tabular data column-wise
• Maps column name and key to corresponding value
• Values are atomic
• Distributes key-value mappings based on keys for each column separately
Column e:employs
  g:Gesis   -> g:wanja
  w:WeST    -> w:martin, w:daniel
Column f:knows
  g:wanja   -> w:daniel
  w:martin  -> g:wanja
  w:daniel  -> w:martin

Document Stores
Ÿ Examples: Couchbase, MongoDB
Ÿ Store documents with internal structure (e.g., JSON)
(i.e., non-atomic documents = more freedom to model content)
Ÿ Provide indices over documents
Ÿ Distribution based on a key within documents
{_id: “g:Gesis”,
e:employs: “g:wanja”}
{_id: “w:WeST”,
e:employs: [“w:daniel”, “w:martin”]}
{_id: “g:wanja”,
f:knows: “w:daniel”,
f:givenname: “Wanja”}
{_id: “w:martin”,
f:knows: “g:wanja”,
f:givenname: “Martin”}

RDF Stores Using Cloud Computing Frameworks
Pros:
Ÿ Low implementation complexity
Ÿ Fault tolerance provided by cloud computing framework
Ÿ Scalability provided by cloud computing framework
Ÿ Cloud computing framework is maintained and improved by a
community
Cons:
Ÿ Influence on data placement limited
Ÿ High overhead introduced by cloud computing framework
Ÿ Centralized join of data obtained by single lookups in NoSQL
databases might overload master


Federated RDF Stores
[Figure: a query federator with an index and a cache on top of several RDF stores]
RDF stores:
• Store RDF data
• Administered independently
Query federator coordinates query execution:
• Decompose query
• Query RDF stores
• Join query results
Index: stores which data is contained in each RDF store
Cache: caches data retrieved from previous queries
• Systems vary in their index and cache
• Examples: DARQ, FedX, SPLENDID

Federated RDF Stores
Pros:
• Low implementation complexity
• Scalability by adding new RDF stores
Cons:
• No influence on data placement
• Query federator is a single point of failure
• Centralized join of results from different RDF stores may become a bottleneck
• Identification of RDF stores contributing to a query may be costly


Distributed RDF Stores
• Master-slave architecture
• Peer-to-peer architecture

Master-Slave Architecture
Loading graph:
L1. Translate strings to fixed-length identifiers
L2. Assign triples to slaves
L3. Store which data is stored at which slave
L4. Transfer triples to slaves
L5. Store RDF triples locally
Querying:
Q1. Translate constant strings to their integer identifiers
Q2. Check occurrences of constants
Q3. Decompose query and send subqueries to slaves
Q4. Execute subqueries on local data
Q5. Join intermediate results
Q6. Translate result ids back to strings
[Figure: components of master and slaves, annotated with the steps they perform: (L1, Q1, Q6), (L2), (L3, Q2), (Q3, Q5), (Q4, Q5), (L5, Q4)]
Examples: GraphDB, BlazeGraph, TriAD, DiploCloud
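The loading steps and the first querying steps can be sketched as a small Python class. This is an illustration only: the class name Master, the assignment rule in step L2 (subject identifier modulo the number of slaves) and the in-memory lists standing in for slaves are assumptions, not the design of any of the listed systems.

```python
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]


class Master:
    def __init__(self, num_slaves):
        self.dictionary = {}   # string -> integer identifier
        self.strings = []      # integer identifier -> string
        self.index = {}        # identifier -> slaves storing triples with it
        self.slaves = [[] for _ in range(num_slaves)]

    def encode(self, term):
        """L1 / Q1: translate a string to its fixed-length identifier."""
        if term not in self.dictionary:
            self.dictionary[term] = len(self.strings)
            self.strings.append(term)
        return self.dictionary[term]

    def load(self, triples):
        for triple in triples:
            ids = tuple(self.encode(term) for term in triple)    # L1
            slave = ids[0] % len(self.slaves)                    # L2 (assumed rule)
            for identifier in ids:
                self.index.setdefault(identifier, set()).add(slave)  # L3
            self.slaves[slave].append(ids)                       # L4 and L5

    def slaves_for(self, constant):
        """Q1 and Q2: which slaves store triples containing the constant?"""
        identifier = self.dictionary.get(constant)
        if identifier is None:
            return set()
        return self.index[identifier]

    def decode(self, ids):
        """Q6: translate identifiers back to strings."""
        return tuple(self.strings[i] for i in ids)


master = Master(num_slaves=2)
master.load(TRIPLES)
print(master.slaves_for("w:WeST"))
```

Dictionary, index and coordination all live in the single Master object here, which mirrors why the master can become a bottleneck and a single point of failure.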

Peer-to-Peer Architecture
The responsibilities of the master are copied to all slaves, resulting in peer nodes with identical architecture but varying data
Examples: RDFPeers, Edutella, GridVine, 3RDF

Distributed RDF Stores
Pros:
• Full freedom on data placement
• Little query processing overhead
• Direct transfer of intermediate results
• Fault tolerance (in case of peer-to-peer)
Cons:
• High implementation complexity
• Master is a single point of failure
• Handling of dictionary, index and query coordination may lead to a bottleneck at master

Architecture Summary
Freedom of data placement
• RDF stores using cloud computing frameworks: Low/Medium – cloud computing framework decides about data placement
• Federated RDF stores: Low – RDF stores are administered independently of the federator
• Distributed RDF stores: High – data placement strategy needs to be implemented
Fault tolerance
• RDF stores using cloud computing frameworks: High – master is stateless and can be replicated
• Federated RDF stores: Low – federator is a single point of failure
• Distributed RDF stores: High (peer-to-peer); Low – master is a single point of failure
Scalability
• RDF stores using cloud computing frameworks: High/Medium – possible bottlenecks: disk I/O, master-based joins
• Federated RDF stores: Medium – federator can become a bottleneck
• Distributed RDF stores: High (peer-to-peer); Medium – if master becomes a bottleneck
Architecture Summary (continued)
Query overhead
• RDF stores using cloud computing frameworks: High – initialisation of cloud computing framework
• Federated RDF stores: Medium – identification of required RDF stores
• Distributed RDF stores: Low – designed to execute queries efficiently
Implementation complexity
• RDF stores using cloud computing frameworks: Low – only translation of RDF dataset and SPARQL queries
• Federated RDF stores: Medium – dedicated querying, indexing and caching strategies required
• Distributed RDF stores: High – all components need to be implemented
Data Placement Strategies
How to distribute the data?


Terminology: Graph Cover and Graph Chunk
Graph cover (aka sharding)
Assignment of each triple to at least one compute node
Graph chunk (aka shard)
Set of triples assigned to a single compute node
[Figure: example graph cover that splits the example RDF graph into two graph chunks, one stored on Compute Node 1 and one on Compute Node 2]

Terminology: Path and Path Length
Path
A sequence of triples in which the object of a triple is the subject of the
succeeding triple
Path length
The number of triples in the path
Example: w:daniel -f:knows-> w:martin -f:knows-> g:wanja -f:givenname-> "Wanja"
Length = 3

Terminology: Molecule, Anchor Vertex and Diameter
Molecule
• Set of triples that are contained in some paths starting at a vertex called the anchor vertex
• If a molecule contains a subject s, then all triples with s as subject are contained
(Directed) molecule diameter
Longest shortest path between the anchor vertex and all objects contained in the molecule
[Figure: molecule with anchor vertex w:martin and diameter = 2. It contains the vertices w:martin, g:wanja and w:daniel, the literals "Martin" and "Wanja", two f:givenname edges and two f:knows edges.]
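A molecule with a given anchor vertex and directed diameter can be computed by a breadth-first traversal. The sketch below is an illustration under assumptions: the triple list is transcribed from the running example as far as it is recoverable from the slides (the e:ownedBy triple is omitted), and the function name molecule is hypothetical.

```python
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
    ("g:bello", "r:type", "g:Dog"),
]


def molecule(anchor, diameter, triples):
    """Collect all triples on paths of length <= diameter starting at anchor."""
    frontier = {anchor}
    seen = set()
    result = set()
    for _ in range(diameter):
        seen |= frontier
        # Taking every triple of a reached subject keeps the molecule
        # complete with respect to that subject.
        step = {t for t in triples if t[0] in frontier}
        result |= step
        frontier = {obj for _, _, obj in step} - seen
    return result


martin_molecule = molecule("w:martin", 2, TRIPLES)
for triple in sorted(martin_molecule):
    print(triple)
```

With anchor vertex w:martin and diameter 2, the result contains the four triples shown in the figure of this slide.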

Properties of Graph Cover Strategies
Complexity:
• How complex is the creation of the graph cover?
Balancing:
• How balanced are the sizes of the resulting graph chunks?
Storage size:
• Is the sum of all graph chunk sizes larger than the original graph size?
Path containment:
• How likely is it that a path can be traversed without leaving one chunk?
Query parallelisation:
• How well can the workload of one query be parallelized among several compute nodes?
Dynamics:

Overview Graph Cover Strategies
Graph cover strategies
• Static
  – Cloud-computing-based
  – Hash-based
  – Graph-clustering-based
  – Workload-aware
  – N-hop replication
• Dynamic


Cloud-Computing-Based Graph Cover Strategies
• Data placement is mainly decided by cloud computing framework
• Influenced only by
  – Splitting graph into files or tables
  – Encoding of data within files or tables
• Goal: Reduce the processing effort of queries

Molecule Graph Splits
• Split graph into molecules of directed diameter 1
• Store molecules in key-value store (e.g., SHARD, Sempala)
    g:Gesis   -> e:employs g:wanja
    g:wanja   -> f:knows w:daniel, f:givenname "Wanja"
    w:WeST    -> e:employs w:martin, e:employs w:daniel
    w:martin  -> f:knows g:wanja, f:givenname "Martin"
    ...
• Store molecules in one or several files (e.g., D-SPARQ, RAPID+)
    g:Gesis : (e:employs g:wanja)
    g:wanja : (f:knows w:daniel), (f:givenname "Wanja")
    w:WeST : (e:employs w:martin), (e:employs w:daniel)
    w:martin : (f:knows g:wanja), (f:givenname "Martin")
    ...

Pros:
Ÿ Easy to compute
Ÿ Selection of required molecules easy, if subjects are given in the
context
Ÿ Subject-subject joins can be easily processed
Cons:
Ÿ If subject is not given in the context all molecules have to be
processed
Ÿ Extending molecules by incoming edges or longer diameters
increases dataset size

Vertical Graph Splits
Ÿ Create a file/table for each property
Ÿ Store all triples with that property in the file/table
Ÿ Examples: Jena-HBase, SPARQLGX
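A vertical graph split can be sketched as grouping triples by property. The following Python sketch keeps the per-property tables in an in-memory dictionary, which is an assumption for illustration; Jena-HBase and SPARQLGX use their own storage layouts.

```python
from collections import defaultdict

TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
    ("g:bello", "r:type", "g:Dog"),
]


def vertical_split(triples):
    """Create one (subject, object) table per property."""
    tables = defaultdict(list)
    for subject, prop, obj in triples:
        tables[prop].append((subject, obj))
    return dict(tables)


tables = vertical_split(TRIPLES)

# The example query only touches the tables of the two given properties.
employees = [o for s, o in tables["e:employs"] if s == "w:WeST"]
names = sorted(o for s, o in tables["f:givenname"] if s in employees)
print(sorted(tables), names)
```

Since both properties of the example query are constants, only two of the four tables are read.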

Vertical Graph Splits
Pros:
• Easy to compute
Cons:
• Queries that match a path of length l will match at most l files/tables, if the property is given in the context
• Files/tables of frequent properties like rdf:type can become large

Hash-Based
Ÿ Assignment of triples based on a hash function
Ÿ Possible properties of hash functions
– Determinism
The same input will always produce the same output
– Uniformity
Inputs are evenly mapped over output range
– Non-invertible
Based on a hash value the input datum cannot be reconstructed
– Continuity
The order of the hash values reflect the order of the input values

Hash Cover
Hash function applied on the subjects:
[Figure: hash values of the subjects; not recoverable from the extracted text]
Result:
[Figure: resulting graph cover; not recoverable from the extracted text]
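A hash cover on subjects can be sketched in a few lines. The sketch assumes CRC32 as the hash function and two compute nodes; the hash function and the resulting assignment on the original slide are not recoverable, so the chunks computed here need not match the figure.

```python
import zlib

TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
    ("g:bello", "r:type", "g:Dog"),
]


def hash_cover(triples, num_nodes):
    """Assign every triple to the chunk determined by its subject's hash."""
    chunks = [[] for _ in range(num_nodes)]
    for triple in triples:
        subject = triple[0]
        node = zlib.crc32(subject.encode("utf-8")) % num_nodes
        chunks[node].append(triple)
    return chunks


chunks = hash_cover(TRIPLES, 2)
for node, chunk in enumerate(chunks):
    print(node, len(chunk))
```

All triples with the same subject end up in the same chunk, while the triples of a path are only stored together if the hashes of all its subjects happen to select the same node.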

Hash Cover
Pros:
• Easy to compute
• Chunks are of almost equal size
Cons:
• Paths are more likely to contain triples that were assigned to different compute nodes

Graph-Clustering-Based
Graph clustering
• Split graph into pairwise disjoint graph chunks, i.e., partitions (aka shards)
• Usually vertices are assigned to partitions
• Partitions satisfy some clustering properties
Vertex-cut transformation:
• In RDF, triples cannot be cut
• Assign each triple to the partition to which its subject was assigned
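The vertex-cut transformation can be sketched as follows. The vertex partitioning below is a hand-picked assumption (all w: vertices in partition 1, all g: vertices in partition 2) rather than the output of a clustering algorithm, and the triple list is transcribed from the running example as far as it is recoverable.

```python
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
    ("g:bello", "r:type", "g:Dog"),
]

# Assumed result of a graph clustering step: vertex -> partition.
PARTITION_OF = {
    "w:WeST": 1, "w:martin": 1, "w:daniel": 1,
    "g:Gesis": 2, "g:wanja": 2, "g:bello": 2, "g:Dog": 2,
}


def vertex_cut_transformation(triples, partition_of):
    """Assign each triple to the partition of its subject."""
    chunks = {}
    for triple in triples:
        chunks.setdefault(partition_of[triple[0]], []).append(triple)
    return chunks


def cut_edges(triples, partition_of):
    """Triples whose subject and object lie in different partitions."""
    return [
        (s, p, o) for s, p, o in triples
        if o in partition_of and partition_of[s] != partition_of[o]
    ]


chunks = vertex_cut_transformation(TRIPLES, PARTITION_OF)
cut = cut_edges(TRIPLES, PARTITION_OF)
print({k: len(v) for k, v in chunks.items()}, cut)
```

Literals are not listed in PARTITION_OF in this sketch; they simply travel with the triple of their subject.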

Minimal Edge-Cut Cover
Ÿ Number of cut edges should be reduced
Ÿ Number of vertices in each partition should be ideally the same
Ÿ After vertex-cut transformation:
Number of edges per partition is unbalanced
Ÿ Examples: [Huang2011], [Peng2016]

Minimal Edge-Cut Cover
Pros:
• Likelihood that a path only contains triples of the same compute node is high
• #vertices per chunk is balanced
Cons:
• High computational effort (heuristic approaches are in O(|V|*log(|V|)))
• #triples per chunk is unbalanced
[Figure: example cover with two chunks, one with 4 vertices and 7 triples, the other with 4 vertices and 3 triples]

Workload-Aware
General idea:
Assign triples based on a historic query workload
General procedure:
1. Generalize from actual queries to handle unseen queries
2. Identify triples that are required to answer generalized queries
3. Assign triples to compute nodes
– All triples required to produce all query results are assigned to
the same compute node
– Distribute triple sets for the individual results equally among all
compute nodes
Examples: WARP, DiploCloud

Workload-Aware
Pros:
• Good query performance for queries similar to the ones in the historic query workload
Cons:
• High computational effort
• Historic query workload required

n-hop Replication
• Based on an initial graph cover with chunks
• Replicate triples such that all paths of length n
  – Starting at a subject contained in a chunk
  – Consist of triples assigned to that chunk
Example: VB-Partitioner
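n-hop replication on top of an initial cover can be sketched as a bounded traversal per chunk. The initial cover below (subjects with prefix w: versus g:) is an assumption for illustration, and the function is not taken from VB-Partitioner.

```python
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
    ("g:bello", "r:type", "g:Dog"),
]


def n_hop_replication(chunks, triples, n):
    """Extend every chunk by the triples on paths of length <= n that
    start at a subject contained in the chunk."""
    by_subject = {}
    for triple in triples:
        by_subject.setdefault(triple[0], []).append(triple)

    extended = []
    for chunk in chunks:
        result = set(chunk)
        frontier = {t[0] for t in chunk}
        for _ in range(n):
            step = {t for s in frontier for t in by_subject.get(s, [])}
            result |= step
            frontier = {t[2] for t in step}
        extended.append(result)
    return extended


# Assumed initial cover: split by the prefix of the subject.
initial = [
    [t for t in TRIPLES if t[0].startswith("w:")],
    [t for t in TRIPLES if t[0].startswith("g:")],
]
extended = n_hop_replication(initial, TRIPLES, n=2)
print([len(c) for c in extended])
```

For n = 2 the two chunks grow from 6 and 4 triples to 8 and 6 triples, so the stored dataset becomes larger than the original graph, as stated in the cons of this strategy.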

n-hop Replication
Pros:
• Paths of length <= n are guaranteed to belong to one chunk
Cons:
• Higher computational effort
• Dataset size increases

Summary of Static Graph Cover Strategies
                       Cloud       Hash      Clustering  Workload  N-hop
Complexity             Low         Low       High        High      Medium
Chunk sizes            Imbalanced  Balanced  Imbalanced  -         -
Dataset size           100%        100%      100%        >= 100%   > 100%
Path containment       Low         Low       High        High      Medium
Query parallelization  Medium      High      Low         Low/High  -


Dynamic Graph Cover Strategies
Ÿ Adaptation of graph cover during runtime
Ÿ Types of dynamics
– Adaptation of graph cover to actual query workload
– If one chunk becomes overloaded due to insertions of new
triples, move triples to other chunks

Adaptation to Actual Query Workload
• Initial static graph cover
• Keep track of how frequently
  – triple patterns
  – molecules
  are queried together
• Replicate triples such that
  – Data transfer is reduced
  – Workload is equally distributed among compute nodes
Examples: PHD-Store, AdHash, Sedge

Dynamic Redistribution of Triples
Ÿ If one compute node stores too many triples (in comparison to
others), redistribute triples based on their hash values
Ÿ If triples are stored in an ordered fashion, send one half to another
compute node
Examples: [Battré2007], [Osorio2017]

Indices
How to identify compute nodes that store required data?

Example
Where is the information stored to answer the query:
Hash cover on subjects

Properties of Indices
Graph cover independence:
• How independent is the index from the graph cover strategy?
Storage consumption:
• How much storage space is required for the index?
Access time:
• How fast can the location of an indexed element be retrieved?
Indexed elements:
• Which elements are indexed?

Overview Indices
Indices
• Centralized
  – Hash-based
  – Statistics-based
  – Summary-graph-based
• Decentralized
  – Hash-based
  – Schema-based
Figure annotations: faster access and higher degree of aggregation at one end, slower access and lower degree of aggregation at the other end


Centralized Hash-Based Index
• Applicable only for hash covers
• No explicit index required
• Location of a triple can be recomputed by the hash function and the number of chunks
• Examples: 4store, Trinity.RDF
Example lookups:
  hash(w:WeST) → compute node 2
  e:employs ?
  f:givenname ?
  (w:WeST, e:employs) ?
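Recomputing the location of a triple from the hash function can be sketched as follows. The sketch assumes a hash cover on subjects with CRC32 as hash function; the function name candidate_nodes is hypothetical and the node numbers it returns need not match the example on the slide.

```python
import zlib


def candidate_nodes(pattern, num_nodes):
    """Return the compute nodes that may store matches of a triple pattern,
    assuming a hash cover on subjects."""
    subject = pattern[0]
    if subject.startswith("?"):
        # The hashed element is unknown: every node may contribute.
        return set(range(num_nodes))
    return {zlib.crc32(subject.encode("utf-8")) % num_nodes}


with_subject = candidate_nodes(("w:WeST", "e:employs", "?v1"), 4)
without_subject = candidate_nodes(("?v1", "f:givenname", "?name"), 4)
print(with_subject, without_subject)
```

The second call shows the main limitation: a pattern whose subject is a variable cannot be located, because only the hashed element is covered by this kind of index.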

Centralized Hash-Based Index
Pros:
• Easy to compute occurrences
• No explicit index required
  – No storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)

Centralized Statistics-Based Index
• Collect occurrences of
  – Subject, property, object labels
  – Combinations of subject, property, object labels
  – RDFs types
  – Property sets of molecules
• Examples: DARQ, FedX, Sedge
Occurrences per chunk (c1 and c2 are chunk IDs):
               Subject    Property   Object
               c1   c2    c1   c2    c1   c2
w:WeST         0    2     0    0     0    0
e:employs      0    0     1    2     0    0
f:givenname    0    0     2    1     0    0
...            ...        ...        ...
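Collecting per-chunk occurrence counts of labels can be sketched with a counter. The chunk assignment below is an assumption chosen so that the three rows of the table on this slide are reproduced; the function names are hypothetical.

```python
from collections import Counter

# Assumed chunks c1 (index 0) and c2 (index 1).
CHUNKS = [
    [
        ("g:Gesis", "e:employs", "g:wanja"),
        ("g:wanja", "f:givenname", "Wanja"),
        ("g:wanja", "f:knows", "w:daniel"),
        ("w:martin", "f:givenname", "Martin"),
        ("w:martin", "f:knows", "g:wanja"),
    ],
    [
        ("w:WeST", "e:employs", "w:martin"),
        ("w:WeST", "e:employs", "w:daniel"),
        ("w:daniel", "f:givenname", "Daniel"),
        ("w:daniel", "f:knows", "w:martin"),
    ],
]

POSITIONS = ("subject", "property", "object")


def build_statistics(chunks):
    """Count how often each label occurs per position and chunk."""
    stats = Counter()
    for chunk_id, chunk in enumerate(chunks):
        for triple in chunk:
            for position, label in zip(POSITIONS, triple):
                stats[(label, position, chunk_id)] += 1
    return stats


def chunks_with(stats, label, position, num_chunks):
    """Chunks in which the label occurs at the given position."""
    return [c for c in range(num_chunks) if stats[(label, position, c)] > 0]


stats = build_statistics(CHUNKS)
print(chunks_with(stats, "w:WeST", "subject", 2))
print(stats[("f:givenname", "property", 0)], stats[("f:givenname", "property", 1)])
```

Besides locating chunks, the counts can serve as a rough estimate of the number of results, which is one of the advantages listed for this index type.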

Centralized Statistics-Based Index
Pros:
• Independent of graph cover strategy
• Can estimate number of results
• Fast access
Cons:
• Requires compression for storage
• Trade-off:
  – Collecting only a few statistics → small size → less useful
  – Collecting many statistics → large size (possibly size of dataset) → more useful

Centralized Summary-Graph-Based Index: TriAD
Summarization algorithm:
1) Each chunk is represented by a chunk vertex
2) Start and end vertices of edges are substituted by corresponding chunk vertices
3) Duplicate edges are removed
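The three summarization steps can be sketched as a set comprehension. This is an illustration of the steps listed on the slide, not TriAD's implementation: the chunk assignment is assumed, and literals are treated as belonging to the chunk of their subject.

```python
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
    ("g:bello", "r:type", "g:Dog"),
]

# Assumed cover: vertex -> chunk vertex (step 1).
CHUNK_OF = {
    "w:WeST": "c1", "w:martin": "c1", "w:daniel": "c1",
    "g:Gesis": "c2", "g:wanja": "c2", "g:bello": "c2", "g:Dog": "c2",
}


def summary_graph(triples, chunk_of):
    """Substitute vertices by chunk vertices (step 2); the set removes
    duplicate edges (step 3)."""
    return {
        (chunk_of[s], p, chunk_of.get(o, chunk_of[s]))
        for s, p, o in triples
    }


summary = summary_graph(TRIPLES, CHUNK_OF)
for edge in sorted(summary):
    print(edge)
```

An edge such as (c1, f:knows, c2) in the summary tells the query processor that f:knows paths may cross from chunk c1 to chunk c2.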

Centralized Summary-Graph-Based Index: EAGRE
1) Determine property sets of all subjects
2) Group similar property sets
3) Store occurrences of each property set
4) Property sets become vertices
5) Replace start and end vertices of edges by their property set vertices

Centralized Summary-Graph-Based Index
Pros:
• Independent of graph cover strategy
• Identification of subqueries that can be answered locally
Cons:
• All triples with same subject have to be assigned to the same compute node
• High storage consumption
• Summary graph needs to be queried
• Only properties are considered


Decentralized Hash-Based Index
• Version 1:
  – Centralized hash-based index on each compute node
  – Knowledge of all compute nodes required
  – Examples: HDRS, Virtuoso Clustered Edition
• Version 2:
  – Each compute node knows a forward table for a few neighbours
    ▪ Ring structure overlay (e.g., RDFPeers, PAGE)
    ▪ Tree structure overlay (e.g., GridVine, 3RDF)

Ring Structure Overlay
• Compute nodes are ordered
• Each compute node knows
  – Its direct neighbour
  – A few distant neighbours
• When a request arrives
  1) The compute node storing the data is determined by the hash function
  2) Request is forwarded to the (closest) compute node storing the data

Tree Structure Overlay
• C1
  – Stores all data whose hash value starts with prefix 00
  – Knows C2 is responsible for prefix 01
  – Knows C3 is responsible for prefix 1
• When a request arrives, C1
  – Computes the hash value
  – Forwards the request based on the known prefixes
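The prefix-based forwarding of C1 can be sketched as follows. The class Peer and its fields are hypothetical, and hash values are given directly as bit strings instead of being computed by a concrete hash function.

```python
class Peer:
    def __init__(self, name, own_prefix, known):
        self.name = name
        self.own_prefix = own_prefix   # prefix this peer is responsible for
        self.known = known             # other prefix -> responsible peer

    def route(self, hash_bits):
        """Return the peer that should handle a request with this hash."""
        if hash_bits.startswith(self.own_prefix):
            return self.name
        for prefix, peer in self.known.items():
            if hash_bits.startswith(prefix):
                return peer
        raise LookupError("no peer known for hash " + hash_bits)


# Forward table of C1 as described on the slide.
c1 = Peer("C1", "00", {"01": "C2", "1": "C3"})

print(c1.route("0010"), c1.route("0111"), c1.route("1000"))
```

C1 only needs to know one peer per sibling prefix; a peer such as C3 may itself forward the request further if the prefix 1 is split among several peers.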

Decentralized Hash-Based Index
Pros:
• Easy to compute occurrences
• Low storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)

Decentralized Schema-Based Index
• Applicable for type-based graph covers
• Use type hierarchy as tree structure overlay
• Example: SQPeer
[Figure: type hierarchy with the labels rdfs:Resource, rdf:Property, rdfs:Class, e:employs, f:givenname, f:Person and e:Institute, assigned to the compute nodes C1 to C4]

Decentralized Schema-Based Index
Pros:
• Queries that contain types can be forwarded to corresponding compute node(s)
• Low storage consumption
Cons:
• Efficiently applicable only for type-based graph covers
• Types of requested resources need to be identified
• Unbalanced index sizes
Used in combination with other indices

Summary Indices
                        Centralized                                      Decentralized
                        Hash            Statistics            Summary graph  Hash            Schema
Applicable to graph
cover strategies        Hash covers     All                   All            Hash covers     Type-based covers
Storage consumption     Low             High                  High           Low             Low
Access time             Fast            Slow                  Slow           Medium          Medium
Indexed elements        Hash dependent  Various aggregations  Properties     Hash dependent  Typed elements

Distributed Query Processing Strategies
How to distribute query processing?


Distributed Query Processing
General procedure
1) Split query into subqueries that can be executed locally
2) Execute subqueries on compute nodes identified by index
3) Join results of subqueries
4) Return results

Splitting Query into Subqueries
Ÿ Simplest case: each triple pattern forms a subquery
Ÿ Use knowledge about graph covers
– All triples with same subject are stored on the same compute
node
– Paths of length n can be executed locally
Ÿ Use index information
– Co-occurrences of subject-property or property-property

Properties of Join Operations
Parallelisation:
Ÿ Is the join computation distributed among several or all compute
nodes?
Computational effort:
Ÿ How many comparisons are performed during the join
computation?
Ÿ How many subqueries result out of the join computation?
Data transfer:
Ÿ How many intermediate results are transferred to compute the join?
Blocking:
Ÿ Do subqueries need to be finished before the join can be
computed?

Overview Join Processing
Joins
• Centralized: join is executed on a single compute node
  – Nested-loop join
  – Merge join
  – Hash join
  – Bind join
• Distributed: join is distributed over several compute nodes
  – Replication-based join
  – Hash join
  – Merge join
  – Bind join


Centralized Nested Loop Join
Compare each element of the first list with every element of the second list
Examples: SPLENDID, DARQ
Pros:
• Does not require an ordering
• Arbitrary join conditions possible
Cons:
• Inefficient
[Figure: every ?v1 result (w:martin, w:daniel) is compared with every (?v1, ?name) result, e.g. (g:wanja, "Wanja")]

Centralized Merge Join
• Requires sorted intermediate result lists
• Compare one result r only with results that are <= r
• Example: Partout
Pros:
• Fast for ordered result sets
Cons:
• Slow for unordered result sets
• Intermediate result set size might lead to a bottleneck
[Figure: the sorted ?v1 results (w:daniel, w:martin) are merged with the sorted (?v1, ?name) results, starting with (g:wanja, "Wanja")]
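A merge join over two lists of variable bindings can be sketched as follows. The sketch sorts its inputs itself, which corresponds to the extra ordering effort for unordered result sets; the function name merge_join is hypothetical and not taken from Partout.

```python
def merge_join(left, right, key):
    """Join two lists of bindings on one shared variable."""
    left = sorted(left, key=lambda b: b[key])
    right = sorted(right, key=lambda b: b[key])
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][key], right[j][key]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the run of equal keys on the right side ...
            j_end = j
            while j_end < len(right) and right[j_end][key] == left_key:
                j_end += 1
            # ... and combine it with every equal key on the left side.
            while i < len(left) and left[i][key] == left_key:
                result.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return result


employees = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
given_names = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
joined = merge_join(employees, given_names, "?v1")
print(joined)
```

Both cursors only move forward, so each sorted list is scanned once.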

Centralized Hash Join
• Assign results to buckets based on their hashes
• Join a result only with corresponding bucket
• Examples: ANAPSID, LHD
[Figure: the ?v1 results (w:daniel, w:martin) and the (?v1, ?name) results, e.g. (g:wanja, "Wanja"), are assigned to hash buckets and joined per bucket]
A non-blocking symmetric version exists
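The blocking variant of the hash join can be sketched with a dictionary that plays the role of the buckets. This is an illustration only; the function name hash_join is hypothetical, and the non-blocking symmetric version mentioned on the slide is not shown.

```python
from collections import defaultdict


def hash_join(build, probe, key):
    """Build buckets from one input, then probe them with the other."""
    buckets = defaultdict(list)
    for binding in build:
        buckets[binding[key]].append(binding)
    return [
        {**b, **p}
        for p in probe
        for b in buckets.get(p[key], [])
    ]


employees = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
given_names = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
joined = hash_join(employees, given_names, "?v1")
print(joined)
```

The build input has to be complete before probing starts, which is what makes this variant a blocking operation.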

Centralized Hash Join
Pros:
• No ordering required
• On average almost constant time complexity
Cons:
• Intermediate result set size might lead to a bottleneck

Bind Join
• Substitute variables of the second subquery based on results from the first subquery
• Second query is executed multiple times
• Examples: FedX, Avalanche, SemaGrow
[Figure: for each ?v1 result of the first subquery (w:martin, w:daniel), the second subquery (?v1, ?name) is executed with ?v1 bound to that result]
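A bind join can be sketched by substituting the bindings of the first subquery into the second triple pattern. The helpers match and execute are hypothetical stand-ins for a remote RDF store; the triples are transcribed from the running example.

```python
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]


def match(pattern, triples):
    """Return one variable binding per triple matching the pattern."""
    bindings = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            bindings.append(binding)
    return bindings


executed = []  # records every subquery that is sent


def execute(pattern):
    executed.append(pattern)
    return match(pattern, TRIPLES)


def bind_join(first_results, second_pattern, execute):
    """Execute the second pattern once per result of the first subquery."""
    joined = []
    for binding in first_results:
        bound = tuple(binding.get(term, term) for term in second_pattern)
        for result in execute(bound):
            joined.append({**binding, **result})
    return joined


employees = match(("w:WeST", "e:employs", "?v1"), TRIPLES)
rows = bind_join(employees, ("?v1", "f:givenname", "?name"), execute)
print(rows, len(executed))
```

The unneeded result (g:wanja, "Wanja") is never transferred, at the price of one executed subquery per employee.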

Bind Join
Pros:
• Reduces the amount of intermediate results
Cons:
• Increases number of executed subqueries
• Possible bottlenecks:
  – Large intermediate result set sizes
  – Large number of subqueries

Summary Centralized Joins
                      Nested  Merge                               Hash  Symmetric  Bind
Computational effort  High    Medium - extra effort for ordering  Low   Low        Medium - effort of many subqueries
# executed queries    Low     Low                                 Low   Low        High
Blocking operation    Yes     Yes                                 Yes   No         Yes


Replication-Based Distributed Join
All results of the first subquery are sent to all compute nodes on which the second subquery is executed
Example: SemStore
[Figure: the ?v1 results (w:daniel, w:martin) of the first subquery are replicated to the compute nodes that execute the (?v1, ?name) subquery; the figure labels the nodes Compute Node 1 and Compute Node 2]

Replication-Based Distributed Join
Pros:
• Not all compute nodes are necessarily involved in joining
• Using data locality → Less transferred data
Cons:
• Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
• One subtree needs to be finished before the join can be executed

Distributed Hash Join
Hash join in which each compute node serves as a bucket
Example: DiploCloud
[Figure: the ?v1 results (w:martin, w:daniel) and the (?v1, ?name) results, e.g. (g:wanja, "Wanja"), are sent to Compute Node 1 or Compute Node 2 according to the hash of ?v1, e.g. hash(w:martin) and hash(w:daniel), and joined there]

Distributed Hash Join
Pros:
Ÿ All compute nodes are involved in join processing
Ÿ Bottleneck is unlikely due to distribution of intermediate result set
over all compute nodes
Cons:
Ÿ No usage of data locality → high data transfer
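The idea that each compute node serves as a bucket can be sketched as follows. This is an illustration only: bucket_of and distributed_hash_join are hypothetical names, CRC32 is an arbitrary choice of stable hash function, and the nodes are simulated by lists in one process.

```python
import zlib


def bucket_of(value, num_nodes):
    """Map a join-key value to a node; CRC32 keeps the mapping stable."""
    return zlib.crc32(value.encode("utf-8")) % num_nodes


def distributed_hash_join(left, right, key, num_nodes):
    """Route both inputs by hash of the join key, then join per node."""
    nodes = [{"left": [], "right": []} for _ in range(num_nodes)]
    # Shuffle phase: every row travels to the node that owns its bucket.
    for row in left:
        nodes[bucket_of(row[key], num_nodes)]["left"].append(row)
    for row in right:
        nodes[bucket_of(row[key], num_nodes)]["right"].append(row)
    # Join phase: each node joins only the rows of its own bucket.
    joined = []
    for node in nodes:
        table = {}
        for row in node["left"]:
            table.setdefault(row[key], []).append(row)
        for row in node["right"]:
            for other in table.get(row[key], []):
                joined.append({**other, **row})
    return joined


employees = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
names = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
joined = distributed_hash_join(employees, names, "?v1", num_nodes=2)
```

Rows with equal join keys always meet on the same node, but both inputs are shuffled regardless of where they were produced, which matches the "high data transfer" drawback.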

Distributed Merge Join
Ÿ Results of subqueries are ordered
Ÿ Each compute node is responsible for a range of results
Ÿ Examples: H2RDF+, SHARD, SparkRDF, SPARQLGX
[Figure: Compute Node 1 is responsible for the range a:a-w:d and Compute Node 2 for the range w:e-z:z. The binding w:daniel and the result (g:wanja, “Wanja”) are sent to Compute Node 1, the binding w:martin to Compute Node 2.]

Distributed Merge Join
Pros:
Ÿ All compute nodes are involved in join processing
Ÿ Bottleneck is unlikely due to distribution of intermediate result set
over all compute nodes
Cons:
Ÿ Results need to be ordered
Ÿ Agreement on result ranges required
Ÿ No usage of data locality → high data transfer
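A sketch of the range-partitioned merge join, assuming the nodes have already agreed on the range boundaries. The boundary "w:e" mirrors the two ranges of the example slide; merge_join and distributed_merge_join are hypothetical names, and the nodes are simulated by lists.

```python
from bisect import bisect_right


def merge_join(left, right, key):
    """Merge join over two lists that are sorted by `key`."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        l_key, r_key = left[i][key], right[j][key]
        if l_key < r_key:
            i += 1
        elif l_key > r_key:
            j += 1
        else:
            # Pair the current left row with every right row of equal key.
            k = j
            while k < len(right) and right[k][key] == l_key:
                joined.append({**left[i], **right[k]})
                k += 1
            i += 1
    return joined


def distributed_merge_join(left, right, key, boundaries):
    """Send each row to the node owning its key range, then merge per node.

    `boundaries` is the sorted list of range starts for node 2, node 3, ...
    """
    parts = [([], []) for _ in range(len(boundaries) + 1)]
    for row in left:
        parts[bisect_right(boundaries, row[key])][0].append(row)
    for row in right:
        parts[bisect_right(boundaries, row[key])][1].append(row)
    joined = []
    for l_rows, r_rows in parts:
        l_rows.sort(key=lambda row: row[key])  # ordering is the extra effort
        r_rows.sort(key=lambda row: row[key])
        joined.extend(merge_join(l_rows, r_rows, key))
    return joined


employees = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
names = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
joined = distributed_merge_join(employees, names, "?v1", boundaries=["w:e"])
```

With the boundary "w:e", w:daniel and g:wanja land on the first node and w:martin on the second, as in the example; concatenating the per-node outputs in range order yields a globally sorted result.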

Distributed Bind Join
Join algorithm:
1) Get results of first subquery
2) For each following bind join query,
   a) Identify compute nodes with matches
   b) Fork query execution to remote compute nodes
Examples: RDFPeers, GridVine, Atlas, TripleRush, Trinity.RDF
[Figure: Compute Node 1 holds the ?v1 bindings w:daniel and w:martin of the first subquery. Query execution is forked so that each binding is processed on the compute node that stores its (?v1, ?name) matches.]

Distributed Bind Join
Pros:
Ÿ Join computed without waiting for any subtree to be finished
Ÿ Usage of data locality → Less transferred data
Ÿ Results of last join operation do not need to be sent to other
compute nodes
Cons:
Ÿ Intermediate result set size may become bottleneck if second
subquery is executed on a single compute node
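The two steps of the distributed bind join algorithm can be sketched as below. Everything here is a simplification under assumptions: the NODES dictionary stands in for the chunks stored on two compute nodes, nodes_with_subject stands in for whatever index a real store uses to identify nodes with matches, and forking is modelled as a local function call instead of a network message.

```python
# Triples stored per compute node (hypothetical placement).
NODES = {
    "node1": [
        ("w:WeST", "e:employs", "w:martin"),
        ("w:WeST", "e:employs", "w:daniel"),
        ("w:martin", "f:givenname", "Martin"),
    ],
    "node2": [
        ("w:daniel", "f:givenname", "Daniel"),
        ("g:wanja", "f:givenname", "Wanja"),
    ],
}


def nodes_with_subject(subject):
    """Stand-in for the index lookup that identifies nodes with matches."""
    return [node for node, triples in NODES.items()
            if any(s == subject for s, _, _ in triples)]


def local_lookup(node, subject, prop):
    """Evaluate the pattern (subject, prop, ?o) on one node's local data."""
    return [o for s, p, o in NODES[node] if s == subject and p == prop]


def distributed_bind_join(start, first_prop, second_prop):
    results, forks = [], []
    # Step 1: get the results of the first subquery.
    for node in nodes_with_subject(start):
        for v1 in local_lookup(node, start, first_prop):
            # Step 2a: identify the compute nodes with matches for v1.
            for target in nodes_with_subject(v1):
                # Step 2b: fork execution to that node; results stay there.
                forks.append((v1, target))
                for name in local_lookup(target, v1, second_prop):
                    results.append((v1, name))
    return results, forks


results, forks = distributed_bind_join("w:WeST", "e:employs", "f:givenname")
```

Each binding is forwarded as soon as it is found, so the join does not wait for the first subquery to finish, and only single bindings cross node boundaries.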

Distributed Joins Summary
                 Centralized  Distributed  Distributed  Distributed  Distributed
                 joins        replication  hash         merge        bind
Data transfer    High         Low          High         High         Low
Parallelisation  Low          Medium       High         High         Medium
# Subqueries     Low          Low          Low          Low          High

Fault Tolerance
How to achieve fault tolerance?

Mirroring
Ÿ There exist several identical copies of each compute node
Ÿ If one compute node fails, its copy continues working
Ÿ Example: Virtuoso Clustered Edition
Pros:
Ÿ Query workload can be distributed among all copies
Cons:
Ÿ Keeping copies up to date
Ÿ Replicas of different chunks are not combined to increase data
locality
[Figure: Compute Node 1 and Compute Node 2 with their identical copies Compute Node 1’ and Compute Node 2’]

Data Replication
Ÿ All compute nodes are ordered in a ring
Ÿ Data from one compute node is replicated on neighbours
Ÿ If one compute node fails, data remains available on neighbours
Ÿ Examples: 4store, RDFPeers
Pros:
Ÿ Data locality of initial graph cover is increased
Cons:
Ÿ Keeping copies up to date
[Figure: Compute Nodes 1, 2 and 3 arranged in a ring; the chunks 1, 2 and 3 have replicas 1’, 2’ and 3’ stored on neighbouring compute nodes]
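The ring-based replication can be sketched as follows. As a simplifying assumption, each chunk is copied only to the next node(s) in ring order; the slides do not say how many neighbours are used, and the names ring_placement and available_holder are hypothetical.

```python
def ring_placement(nodes, replicas=1):
    """Map each node's chunk to the node itself plus its ring successors."""
    n = len(nodes)
    return {
        nodes[i]: [nodes[(i + k) % n] for k in range(replicas + 1)]
        for i in range(n)
    }


def available_holder(chunk_owner, placement, failed):
    """Return the first live node that still stores the owner's chunk."""
    for node in placement[chunk_owner]:
        if node not in failed:
            return node
    return None  # primary copy and all replicas are lost


placement = ring_placement(["node1", "node2", "node3"])
fallback = available_holder("node2", placement, failed={"node2"})
```

With one replica per chunk, the data survives any single node failure, but not the failure of a node together with its successor.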

Evaluation Methodology
How to evaluate?

Properties of Evaluation Methodologies
Realism:
Do the measurement results reflect the performance of real RDF
stores?
Modularity:
Can alternative implementations of individual components be
evaluated?
Evaluation depth:
Is the system evaluated only as a whole, or is the performance of the
individual components evaluated as well?
Difficulty:
How difficult is it to apply the evaluation methodology?

Black Box Evaluation
Evaluation of RDF stores as a whole
Some problems (of many):
Ÿ How fast is your network?
Ÿ How large are your images?
Ÿ Which processor configuration do you use?
Ÿ What are the structures of your caches?
Do you evaluate the RDF store or your hardware configuration?
[Figure: a dataset and several queries are given to the RDF store, which is evaluated as a whole]

Black Box Evaluation
Evaluation of RDF stores as a whole
Pros:
Ÿ Easy to perform evaluation since no implementation knowledge is
required
Ÿ Measurements reflect the behaviour of a real RDF store
Cons:
Ÿ Only superficial evaluations possible
Ÿ No performance evaluation of individual components possible

Glass Box Evaluation
Ÿ Evaluation of RDF stores as a whole
Ÿ Collecting performance measurements of components by
– Using a profiling system like Granula
– Adapting source code to perform measurements

Pros:
Ÿ In-depth performance evaluation possible
Cons:
Ÿ Source code needs to be extended to collect measurements
Ÿ Individual components can hardly be exchanged by alternative
implementations
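Adapting source code to perform measurements could look like the following sketch. It is not the Granula API; ComponentTimer is a hypothetical helper that wraps the code of one component and accumulates its wall-clock time, and the component names are made up for illustration.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ComponentTimer:
    """Accumulates wall-clock time and call counts per named component."""

    def __init__(self):
        self.totals = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def measure(self, component):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the time even if the measured code raises.
            self.totals[component] += time.perf_counter() - start
            self.calls[component] += 1


timer = ComponentTimer()
with timer.measure("query_parsing"):
    tokens = "SELECT ?name WHERE { ?v1 <f:givenname> ?name }".split()
with timer.measure("join_processing"):
    pairs = [(a, b) for a in range(100) for b in range(100) if a == b]
```

Every measured component needs such a wrapper in the store's source code, which is the extension effort listed under Cons.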

Simulation-based Glass Box Evaluation
Evaluation of alternative implementations of a single component by
simulating the behaviour of a real RDF store
Pros:
Ÿ Performance evaluation of individual components possible
Ÿ Evaluation of alternative implementations of individual components is possible
Cons:
Ÿ Evaluation environment (simulator) needs to be implemented
Ÿ Questionable whether performance measurements reflect behaviour of
real RDF store
[Figure: a dataset is given to alternative implementations of a single component]

Glass Box Evaluation Platform
RDF store that
Ÿ allows the exchange of individual components by alternative
implementations
Ÿ measures the performance of individual components
[Figure: a dataset and queries as inputs of the RDF store; the graph cover creator is shown as a component with several exchangeable implementations]

Pros:
Ÿ In-depth performance evaluation possible
Ÿ Alternative implementations of individual components can be
evaluated
Cons:
Ÿ Development of glass box evaluation platform difficult
Ÿ Interdependencies might limit the exchangeability of components
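How exchangeable components and per-component measurements fit together can be sketched with a small interface. This is an assumption-laden illustration: the GraphCoverCreator protocol, the two implementations and the evaluate function are hypothetical, and the hash cover is assumed to assign triples by the hash of their subject.

```python
import time
import zlib
from typing import Dict, List, Protocol, Tuple

Triple = Tuple[str, str, str]


class GraphCoverCreator(Protocol):
    """Interface every exchangeable graph cover component must fulfil."""

    def create_cover(self, triples: List[Triple],
                     num_nodes: int) -> List[List[Triple]]: ...


class HashCover:
    def create_cover(self, triples, num_nodes):
        chunks = [[] for _ in range(num_nodes)]
        for triple in triples:
            # Assumption: triples are assigned by the hash of their subject.
            index = zlib.crc32(triple[0].encode("utf-8")) % num_nodes
            chunks[index].append(triple)
        return chunks


class RoundRobinCover:
    """Made-up alternative implementation used only for comparison."""

    def create_cover(self, triples, num_nodes):
        chunks = [[] for _ in range(num_nodes)]
        for position, triple in enumerate(triples):
            chunks[position % num_nodes].append(triple)
        return chunks


def evaluate(creators: Dict[str, GraphCoverCreator],
             triples: List[Triple], num_nodes: int) -> Dict[str, dict]:
    """Run every implementation on the same input and measure it."""
    report = {}
    for name, creator in creators.items():
        start = time.perf_counter()
        chunks = creator.create_cover(triples, num_nodes)
        report[name] = {
            "seconds": time.perf_counter() - start,
            "chunk_sizes": [len(chunk) for chunk in chunks],
        }
    return report


triples = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
]
report = evaluate({"hash": HashCover(), "round_robin": RoundRobinCover()},
                  triples, num_nodes=2)
```

The rest of the platform only depends on the interface, so a new graph cover strategy can be evaluated without touching other components, as long as no hidden interdependencies exist.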

Evaluation Methodology Summary
                  Black box  Glass box  Simulation  Glass box platform
Realism           High       High       Low         Medium
Modularity        Low        Low        High        High
Evaluation depth  Low        High       High        High
Difficulty        Easy       Medium     Medium      Hard

Conclusion & Open Challenges

Conclusion
Challenges of RDF stores in the cloud:
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
6) How to evaluate?

Example RDF Stores in the Cloud
                          Virtuoso Clustered Edition  BlazeGraph           GraphDB
Architecture              Master-slave                Master-slave         Master-slave
Graph cover strategy      Hash cover                  Distributed B+-tree  Replication of graph
                                                                           on all slaves
Index                     Centralized hash-based      Distributed B+-tree  Not necessary
                          index on each compute node
Query execution strategy  Distributed bind join       Centralized join     Centralized join
Fault tolerance           Mirroring                   None                 Mirroring

Example RDF Stores in the Cloud
                          DiploCloud               S2RDF                  Trinity.RDF
Architecture              Master-slave             Batch processing       Master-slave
                                                   framework
Graph cover strategy      Workload-aware           Vertical graph splits  Hash cover
Index                     Centralized              None                   Distributed
                          statistics-based index                          chunk-integrated
                                                                          summary graph
Query execution strategy  Centralized join (for    Distributed joins      Distributed bind join
                          small result sets),
                          distributed hash join
                          (otherwise)
Fault tolerance           None                     Based on batch         None
                                                   processing framework

Challenges Not Presented
Ÿ How to achieve transactional security?
Ÿ How to perform online analytical processing (OLAP) queries?
Ÿ How to process property paths?
Ÿ How to perform distributed reasoning?
Ÿ How to perform distributed stream processing?

Thank you for your Attention!
Daniel Janke, Steffen Staab

Image References
Ÿ https://openclipart.org/detail/155101/server
Ÿ https://openclipart.org/detail/213252/gear-icon
Ÿ https://openclipart.org/detail/204067/bpm-mail-symbol
Ÿ https://openclipart.org/detail/169757/check-and-cross-marks
Ÿ https://openclipart.org/detail/153577/stopwatch

References
[Huang2011] Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 1123–1134 (2011)
[Peng2016] Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D.: Processing SPARQL Queries over Distributed RDF Graphs. The VLDB Journal 25(2), 243–268 (2016)
[Battré2007] Battré, D., Heine, F., Höing, A., Kao, O.: On Triple Dissemination, Forward-Chaining, and Load Balancing in DHT Based RDF Stores. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, J.H., Ouksel, A.M. (eds.) Databases, Information Systems, and Peer-to-Peer Computing. pp. 343–354. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)
[Osorio1017] Osorio, M., Aranda, C.B.: Storage Balancing in P2P Based Distributed RDF Data Stores. In: Proceedings of the Workshop on Decentralizing the Semantic Web 2017 co-located with 16th International Semantic Web Conference (ISWC 2017) (2017)
