Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Storing and Querying Semantic Data
in the Cloud
Reasoning Web Summer School 2018 (RW 2018)
Daniel Janke & Steffen Staab
24.09.2018
Storing and Querying Semantic Data in the Cloud 2Daniel Janke & Steffen Staab
Amount of Available RDF Data Increases
Source: https://lod-cloud.net/
Why Use RDF Stores in the Cloud?
Example 1: Wikidata
• Dataset size: 4.9 billion triples (as of April 2018)
• Stored in a distributed BlazeGraph RDF store for
– Higher query throughput
– Higher availability
Example 2: BBC
• On average 1 million SPARQL queries per day (in 2010)
• Stored in a distributed GraphDB RDF store for
– Higher query throughput
– Higher availability
Assumptions of this talk
1. There are exceptions for (almost) everything
2. You are always allowed to ask questions
3. You have some knowledge
Required
• RDF
• SPARQL
Helpful
• Cloud processing frameworks like Hadoop or Spark
• Query processing in relational databases
If not → see 2.
Timeplan
How to deal with the increasing volume of RDF data?
Centralized RDF Stores
• Graph database for storing RDF graphs
(includes tasks like data storage, query processing, ...)
• All RDF store tasks are executed on a single computer
Terminology: RDF Graph
• Directed graph with labelled vertices and edges
• The labels of a start vertex, an edge and an end vertex form an RDF triple
• An RDF graph is a set of RDF triples
[Figure: example RDF graph. Vertices: w:WeST, g:Gesis, w:martin, w:daniel, g:wanja, g:bello, g:Dog and the literals “Martin”, “Daniel”, “Wanja”. Edge labels: e:employs, f:givenname, f:knows, r:type, e:ownedBy. One triple is highlighted with its subject, property and object.]
Terminology: SPARQL Query
What are the employees of WeST called?
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
• Variable: e.g., ?v1, ?name
• Triple pattern: e.g., <w:WeST> <e:employs> ?v1
Terminology: Query Execution Tree
[Figure: query execution tree for the query above]
Centralized Query Processing
[Figure: the example RDF graph with the intermediate results of the query]
Matches of <w:WeST> <e:employs> ?v1:
?v1
w:martin
w:daniel
Matches of ?v1 <f:givenname> ?name:
?v1 ?name
g:wanja “Wanja”
w:martin “Martin”
w:daniel “Daniel”
Join on ?v1:
?v1 ?name
w:martin “Martin”
w:daniel “Daniel”
Projection to ?name:
?name
“Martin”
“Daniel”
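The intermediate results of this slide can be reproduced with a minimal Python sketch. It is an illustration under assumptions, not how any particular RDF store works: the triple set is a partial reconstruction of the example graph, and `match`, `join` and `employee_names` are hypothetical helper names.

```python
# Assumed reconstruction of the part of the example graph the query touches.
TRIPLES = {
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
}

def match(pattern, triples):
    """Return one variable binding (dict) per triple matching the pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must be bound consistently within one triple.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results

def join(left, right):
    """Merge all pairs of bindings that agree on their shared variables."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[v] == r[v] for v in l.keys() & r.keys())
    ]

def employee_names(triples):
    employed = match(("w:WeST", "e:employs", "?v1"), triples)
    named = match(("?v1", "f:givenname", "?name"), triples)
    # Projection to ?name after joining on ?v1.
    return sorted(b["?name"] for b in join(employed, named))
```

Calling `employee_names(TRIPLES)` yields the two names shown in the projection step.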
Centralized RDF Stores
Advantages
• Less complex than RDF stores running on several computers
Disadvantages
• The hardware of the computer limits the size of the processable RDF graph
• No fault tolerance
RDF Stores in the Cloud
• RDF store tasks are bundled into nodes
– Data storage tasks are bundled into storage nodes
– Query processing tasks are bundled into compute nodes
• Compute and storage nodes¹ are distributed/replicated among several computers
¹ In the following, compute and storage nodes are referred to simply as compute nodes.
How to place the data?
[Figure: the example RDF graph to be placed on compute nodes]
Where to find the required data?
[Figure: the example RDF graph placed on compute nodes]
How to distribute the query processing?
[Figure: the example RDF graph placed on compute nodes; the triple patterns are matched on the compute nodes (one of them contributes g:wanja “Wanja”, another w:martin “Martin” and w:daniel “Daniel”) and joined into the result ?name = “Martin”, “Daniel”]
RDF Stores in the Cloud
Advantages
• Scalable by adding new compute or storage nodes
– Scaling up the dataset size
– Scaling up the query throughput
• Possibly fault tolerant
Disadvantages
• Higher complexity
Challenges of RDF Stores in the Cloud
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Many ideas from 50 years of data engineering carry over
→ We focus on approaches more commonly used for RDF
Amount of Related Work about RDF Stores
[Figure: amount of related work per challenge 1) to 6); annotation: rarely considered on its own]
Architecture Types
How to design the architecture?
Properties of Architecture Types
Implementation complexity:
• How difficult is the implementation?
Freedom of data placement:
• To what extent can the data placement be influenced?
Query overhead:
• Which query overhead is caused by the architecture?
Scalability:
• To what extent do the storage and query processing capabilities increase if further compute nodes are added?
Fault tolerance:
• Do single points of failure exist?
• How easily can they be removed?
Architecture Types
• RDF stores using cloud computing frameworks
• Distributed RDF stores
• Federated RDF stores
RDF Stores Using Cloud Computing Frameworks
• Converts and loads the RDF graph into the cloud computing framework
• Translates SPARQL queries into task(s) for the cloud computing framework
Examples: SHARD, S2RDF, S2X, TripleRush, Jena-HBase, Sempala, D-SPARQ
Cloud Computing Framework Types
RDF stores using cloud computing frameworks
• Batch processing frameworks
• Graph processing frameworks
• NoSQL databases
– Key-value stores
– Column stores
– Document stores
Distinction based on implementation
Batch Processing Frameworks
• Example frameworks: Hadoop, Spark
• Queries need to be translated into one or several tasks
• Data exchange between compute nodes via the distributed file system
Each task:
1. Read input data
2. Process data
3. Write results back
Graph Processing Frameworks
• Examples: GraphX, Signal/Collect
• Queries are translated into vertex algorithms
At each vertex:
1. Receive messages
2. Process messages and update vertex status
3. Send messages
Termination:
The status of all vertices does not change any more
Key-Value Stores
• Example: DynamoDB
• Distributed map that assigns keys to arbitrary values
• Values are atomic
• Distribution based on, e.g., hash of the key, key ranges, …
• A query is translated into several lookups in the map and joins on the master
Example (key → value):
g:Gesis → e:employs g:wanja, ...
g:wanja → f:knows w:daniel, ...
...
w:WeST → e:employs w:martin, ...
w:martin → f:knows g:wanja, ...
...
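A minimal sketch of how a query turns into several lookups plus a join on the master. A plain Python dict stands in for the distributed map; the stored values are an assumed reconstruction of the example graph, and `lookup` and `names_of_employees` are hypothetical names, not the API of any key-value store.

```python
# Hypothetical in-memory stand-in for a distributed key-value store:
# key = subject, value = list of (property, object) pairs.
KV_STORE = {
    "g:Gesis": [("e:employs", "g:wanja")],
    "g:wanja": [("f:knows", "w:daniel"), ("f:givenname", "Wanja")],
    "w:WeST": [("e:employs", "w:martin"), ("e:employs", "w:daniel")],
    "w:martin": [("f:knows", "g:wanja"), ("f:givenname", "Martin")],
    "w:daniel": [("f:givenname", "Daniel")],
}

def lookup(subject, prop):
    """One lookup in the map: all objects of `subject` for property `prop`."""
    return [o for p, o in KV_STORE.get(subject, []) if p == prop]

def names_of_employees(employer):
    """Join performed on the master: one extra lookup per intermediate result."""
    names = []
    for employee in lookup(employer, "e:employs"):
        names.extend(lookup(employee, "f:givenname"))
    return names
```

Every intermediate result triggers another lookup, which is why the master can become a bottleneck for queries with many intermediate results.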
Column Stores
• Examples: HBase, Cassandra, Accumulo, Impala
• Store tabular data column-wise
• Map column name and key to the corresponding value
• Values are atomic
• Distribute the key-value mappings based on keys for each column separately
Column e:employs:
g:Gesis → g:wanja
w:WeST → w:martin, w:daniel
Column f:knows:
g:wanja → w:daniel
w:martin → g:wanja
w:daniel → w:martin
Document Stores
• Examples: Couchbase, MongoDB
• Store documents with internal structure (e.g., JSON),
i.e., non-atomic documents = more freedom to model content
• Provide indices over documents
• Distribution based on a key within documents
{_id: "g:Gesis", e:employs: "g:wanja"}
{_id: "w:WeST", e:employs: ["w:daniel", "w:martin"]}
{_id: "g:wanja", f:knows: "w:daniel", f:givenname: "Wanja"}
{_id: "w:martin", f:knows: "g:wanja", f:givenname: "Martin"}
RDF Stores Using Cloud Computing Frameworks
Pros:
• Low implementation complexity
• Fault tolerance provided by the cloud computing framework
• Scalability provided by the cloud computing framework
• The cloud computing framework is maintained and improved by a community
Cons:
• Limited influence on data placement
• High overhead introduced by the cloud computing framework
• Centralized join of data obtained by single lookups in NoSQL databases might overload the master
Federated RDF Stores Architecture
RDF stores:
• Store RDF data
• Administrated independently
Query federator coordinates query execution:
• Decompose query
• Query RDF stores
• Join query results
Index: stores which data is contained in each RDF store
Cache: caches data retrieved from previous queries
• Federated RDF stores vary in their index and cache
• Examples: DARQ, FedX, SPLENDID
Federated RDF Stores
Pros:
• Low implementation complexity
• Scalability by adding new RDF stores
Cons:
• No influence on data placement
• Query federator is a single point of failure
• Centralized join of results from different RDF stores may become a bottleneck
• Identification of RDF stores contributing to a query may be costly
Distributed RDF Stores Architecture
Distributed RDF stores
• Master-slave architecture
• Peer-to-peer architecture
Master-Slave Architecture
Loading a graph:
1. Translate strings to fixed-length identifiers
2. Assign triples to slaves
3. Store which data is stored at which slave
4. Transfer triples to slaves
5. Store RDF triples locally
Querying:
1. Translate constant strings to their integer identifiers
2. Check occurrences of constants
3. Decompose query and send subqueries to slaves
4. Execute subqueries on local data
5. Join intermediate results
6. Translate result ids back to strings
[Figure: components of master and slaves, annotated with the loading (L) and querying (Q) steps they perform: L1, Q1, Q6; L2; L3, Q2; Q3, Q5; Q4, Q5; L5, Q4]
Examples: GraphDB, BlazeGraph, TriAD, DiploCloud
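The loading steps can be sketched in a few lines of Python. This is a simplified illustration under assumptions, not the implementation of any of the listed systems: `Dictionary` and `load` are hypothetical names, slaves are plain lists, and triples are placed by subject id modulo the number of slaves.

```python
class Dictionary:
    """Maps strings to integer ids and back (loading step 1, querying steps 1 and 6)."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def encode(self, term):
        if term not in self._ids:
            self._ids[term] = len(self._strings)
            self._strings.append(term)
        return self._ids[term]

    def decode(self, term_id):
        return self._strings[term_id]

def load(triples, num_slaves):
    """Encode triples, assign them to slaves and record where each subject lives."""
    dictionary = Dictionary()
    slaves = [[] for _ in range(num_slaves)]  # local stores of the slaves
    location = {}                             # master index: subject id -> slave
    for triple in triples:
        encoded = tuple(dictionary.encode(term) for term in triple)
        slave = encoded[0] % num_slaves       # assumed placement rule
        location[encoded[0]] = slave
        slaves[slave].append(encoded)
    return dictionary, location, slaves
```

At query time, the master would encode the constants of the query with the same dictionary and consult `location` to decide which slaves receive a subquery.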
Peer-to-Peer Architecture
The responsibilities of the master are copied to all slaves, resulting in peer nodes with identical architecture but varying data
Examples: RDFPeers, Edutella, Grid Vine, 3RDF
Distributed RDF Stores
Pros:
• Full freedom on data placement
• Little query processing overhead
• Direct transfer of intermediate results
• Fault tolerance (in case of peer-to-peer)
Cons:
• High implementation complexity
• Master is a single point of failure
• Handling of dictionary, index and query coordination may lead to a bottleneck at the master
Architecture Summary
Freedom of data placement
• RDF stores using cloud computing frameworks: Low/Medium: the cloud computing framework decides about data placement
• Federated RDF stores: Low: RDF stores are administrated independently of the federator
• Distributed RDF stores: High: a data placement strategy needs to be implemented
Fault tolerance
• RDF stores using cloud computing frameworks: High: master is stateless and can be replicated
• Federated RDF stores: Low: federator is a single point of failure
• Distributed RDF stores: High (peer-to-peer); Low: master is a single point of failure
Scalability
• RDF stores using cloud computing frameworks: High/Medium: possible bottlenecks are disk I/O and master-based joins
• Federated RDF stores: Medium: federator can become a bottleneck
• Distributed RDF stores: High (peer-to-peer); Medium: if master becomes a bottleneck
Query overhead
• RDF stores using cloud computing frameworks: High: initialisation of the cloud computing framework
• Federated RDF stores: Medium: identification of required RDF stores
• Distributed RDF stores: Low: designed to execute queries efficiently
Implementation complexity
• RDF stores using cloud computing frameworks: Low: only translation of RDF dataset and SPARQL queries
• Federated RDF stores: Medium: dedicated querying, indexing and caching strategies required
• Distributed RDF stores: High: all components need to be implemented
Data Placement Strategies
How to distribute the data?
Terminology: Graph Cover and Graph Chunk
Graph cover (aka sharding)
Assignment of each triple to at least one compute node
Graph chunk (aka shard)
Set of triples assigned to a single compute node
[Figure: the example RDF graph split into two graph chunks, stored on compute node 1 and compute node 2]
Terminology: Path and Path Length
Path
A sequence of triples in which the object of a triple is the subject of the succeeding triple
Path length
The number of triples in the path
[Figure: example path of length 3 over the vertices w:martin, g:wanja, w:daniel and “Wanja” with the edges f:knows, f:knows and f:givenname]
Terminology: Molecule, Anchor Vertex and Diameter
Molecule
• Set of triples that are contained in some paths starting at a vertex called the anchor vertex
• If a molecule contains a subject s, then all triples with s as subject are contained
(Directed) molecule diameter
Longest shortest path between the anchor vertex and all objects contained in the molecule
[Figure: molecule of diameter 2 with anchor vertex w:martin, containing the vertices w:martin, “Martin”, g:wanja, “Wanja”, w:daniel and the edges f:givenname, f:givenname, f:knows, f:knows]
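One way to read the definition is as a breadth-first expansion from the anchor vertex that follows outgoing edges for a fixed number of hops. The sketch below assumes this reading; the function name `molecule` and the triple list (a partial reconstruction of the example graph) are illustrative assumptions.

```python
# Assumed excerpt of the example graph.
TRIPLES = [
    ("w:martin", "f:givenname", "Martin"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:givenname", "Daniel"),
]

def molecule(triples, anchor, diameter):
    """Collect all triples on outgoing paths of length <= diameter from anchor."""
    frontier, seen, result = {anchor}, {anchor}, set()
    for _ in range(diameter):
        next_frontier = set()
        for s, p, o in triples:
            if s in frontier:
                # All triples of a reached subject belong to the molecule.
                result.add((s, p, o))
                if o not in seen:
                    seen.add(o)
                    next_frontier.add(o)
        frontier = next_frontier
    return result
```

With `w:martin` as anchor vertex, a diameter of 1 returns the two triples with subject `w:martin`, and a diameter of 2 additionally returns the two triples with subject `g:wanja`.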
Properties of Graph Cover Strategies
Complexity:
• How complex is the creation of the graph cover?
Balancing:
• How balanced are the sizes of the resulting graph chunks?
Storage size:
• Is the sum of all graph chunk sizes larger than the original graph size?
Path containment:
• How likely is it that a path can be traversed without leaving one chunk?
Query parallelisation:
• How well can the workload of one query be parallelized among several compute nodes?
Dynamics:
Overview Graph Cover Strategies
Static
• Cloud-computing-based
• Hash-based
• Graph-clustering-based
• Workload-aware
• N-hop replication
Dynamic
Cloud-Computing-Based Graph Cover Strategies
• Data placement is mainly decided by the cloud computing framework
• Influenced only by
– Splitting the graph into files or tables
– Encoding of data within files or tables
• Goal: Reduce the processing effort of queries
Molecule Graph Splits
• Split graph into molecules of directed diameter 1
• Store molecules in a key-value store (e.g., SHARD, Sempala)
g:Gesis → e:employs g:wanja
g:wanja → f:knows w:daniel, f:givenname “Wanja”
w:WeST → e:employs w:martin, e:employs w:daniel
w:martin → f:knows g:wanja, f:givenname “Martin”
...
• Store molecules in one or several files (e.g., D-SPARQ, RAPID+)
g:Gesis : (e:employs g:wanja)
g:wanja : (f:knows w:daniel), (f:givenname “Wanja”)
w:WeST : (e:employs w:martin), (e:employs w:daniel)
w:martin : (f:knows g:wanja), (f:givenname “Martin”)
...
Molecule Graph Splits
Pros:
• Easy to compute
• Selection of required molecules is easy if subjects are given in the context
• Subject-subject joins can be easily processed
Cons:
• If the subject is not given in the context, all molecules have to be processed
• Extending molecules by incoming edges or longer diameters increases the dataset size
Vertical Graph Splits
• Create a file/table for each property
• Store all triples with that property in the file/table
• Examples: Jena-HBase, SPARQLGX
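A vertical split amounts to grouping triples by property. The following sketch keeps one in-memory list per property instead of a file or table; the data is an assumed excerpt of the example graph, and `vertical_split` and `names_of_employees` are hypothetical names.

```python
from collections import defaultdict

# Assumed excerpt of the example graph.
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]

def vertical_split(triples):
    """One table of (subject, object) pairs per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)

def names_of_employees(tables, employer):
    """A path query of length 2 reads exactly two property tables."""
    employees = {o for s, o in tables["e:employs"] if s == employer}
    return sorted(o for s, o in tables["f:givenname"] if s in employees)
```

The query only reads the tables of the properties it mentions, but each of those tables has to be scanned completely in this simple form.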
Vertical Graph Splits
Pros:
• Easy to compute
Cons:
• A query that matches a path of length l has to access up to l files/tables, if the properties are given in the context
• Files/tables of frequent properties like rdf:type can become large
Hash-Based Graph Cover Strategies
• Assignment of triples based on a hash function
• Possible properties of hash functions
– Determinism
The same input will always produce the same output
– Uniformity
Inputs are evenly mapped over the output range
– Non-invertibility
The input datum cannot be reconstructed from its hash value
– Continuity
The order of the hash values reflects the order of the input values
Hash Cover
Hash function applied to the subjects:
[Figure: hash values of the subjects]
Result:
[Figure: resulting graph chunks]
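A hash cover on subjects can be sketched as follows. The choice of CRC32 as hash function is an assumption made only because it is deterministic across runs (Python's built-in `hash` is randomized for strings); `chunk_of` and `hash_cover` are hypothetical names and the triples are an assumed excerpt of the example graph.

```python
import zlib

# Assumed excerpt of the example graph.
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]

def chunk_of(subject, num_chunks):
    """Deterministic hash of the subject, reduced to a chunk number."""
    return zlib.crc32(subject.encode("utf-8")) % num_chunks

def hash_cover(triples, num_chunks):
    """Assign every triple to the chunk determined by its subject."""
    chunks = [set() for _ in range(num_chunks)]
    for triple in triples:
        chunks[chunk_of(triple[0], num_chunks)].add(triple)
    return chunks
```

All triples with the same subject end up in the same chunk, while the triples along a path are spread over chunks according to the hash values of their subjects.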
Hash Cover
Pros:
• Easy to compute
• Chunks are of almost equal size
Cons:
• Paths are more likely to contain triples that were assigned to different compute nodes
Graph-Clustering-Based Graph Cover Strategies
Graph clustering
• Split graph into pairwise disjoint graph chunks, i.e., partitions (aka shards)
• Usually vertices are assigned to partitions
• Partitions satisfy some clustering properties
Vertex-cut transformation:
• In RDF, triples cannot be cut
• Assign each triple to the partition to which its subject was assigned
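The vertex-cut transformation can be illustrated with a small sketch. The vertex partitioning below is hand-picked for illustration (a real system would obtain it from a graph clustering algorithm), and the function names and the triple list are assumptions.

```python
# Assumed excerpt of the example graph (resources only).
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
]

# Hypothetical result of a graph clustering: vertex -> partition.
PARTITION_OF = {
    "w:WeST": 0, "w:martin": 0, "w:daniel": 0,
    "g:Gesis": 1, "g:wanja": 1,
}

def vertex_cut_transformation(triples, partition_of):
    """Assign every triple to the partition of its subject."""
    chunks = {}
    for s, p, o in triples:
        chunks.setdefault(partition_of[s], []).append((s, p, o))
    return chunks

def cut_edges(triples, partition_of):
    """Number of triples whose subject and object lie in different partitions."""
    return sum(1 for s, _, o in triples if partition_of[s] != partition_of[o])
```

Here two of the five triples cross the partition boundary, and the chunks hold three and two triples respectively.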
Minimal Edge-Cut Cover
• The number of cut edges should be reduced
• The number of vertices in each partition should ideally be the same
• After vertex-cut transformation:
The number of edges per partition is unbalanced
• Examples: [Huang2011], [Peng2016]
Minimal Edge-Cut Cover
Pros:
• The likelihood that a path only contains triples of the same compute node is high
• #vertices per chunk is balanced
Cons:
• High computational effort (heuristic approaches are in O(|V|*log(|V|)))
• #triples per chunk is unbalanced
[Figure: example cover with two chunks, one with 4 vertices and 7 triples, the other with 4 vertices and 3 triples]
Workload-Aware Graph Cover Strategies
General idea:
Assign triples based on a historic query workload
General procedure:
1. Generalize from actual queries to handle unseen queries
2. Identify triples that are required to answer generalized queries
3. Assign triples to compute nodes
– All triples required to produce all query results are assigned to the same compute node
– Distribute triple sets for the individual results equally among all compute nodes
Examples: WARP, DiploCloud
Workload-Aware Graph Cover Strategies
Pros:
• Good query performance for queries similar to the ones in the historic query workload
Cons:
• High computational effort
• Historic query workload required
n-hop Replication
• Based on an initial graph cover with chunks
• Replicate triples such that all paths of length n
– starting at a subject contained in a chunk
– consist of triples assigned to that chunk
Example: VB-Partitioner
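The replication rule can be sketched as an expansion of every chunk along outgoing edges. This is a simplified reading of the slide, not the algorithm of VB-Partitioner; `n_hop_replication` is a hypothetical name and chunks are modelled as Python sets of triples.

```python
def n_hop_replication(chunks, n):
    """Extend each chunk so that it contains all paths of length <= n
    that start at one of its subjects."""
    by_subject = {}
    for triple in set().union(*chunks):
        by_subject.setdefault(triple[0], set()).add(triple)

    extended = []
    for chunk in chunks:
        result = set(chunk)
        frontier = {s for s, _, _ in chunk}
        for _ in range(n):
            reached = set()
            for subject in frontier:
                for triple in by_subject.get(subject, ()):
                    result.add(triple)      # replicate the triple if missing
                    reached.add(triple[2])  # continue at its object
            frontier = reached
        extended.append(result)
    return extended

# Assumed initial cover with two chunks.
CHUNKS = [
    {("w:WeST", "e:employs", "w:martin"), ("w:WeST", "e:employs", "w:daniel")},
    {("w:martin", "f:givenname", "Martin"), ("w:daniel", "f:givenname", "Daniel")},
]
```

With n = 1 the assumed cover stays unchanged, while n = 2 copies the two name triples into the first chunk, which illustrates why the dataset size grows with n.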
n-hop Replication
Pros:
• Paths of length <= n are guaranteed to belong to one chunk
Cons:
• Higher computational effort
• Dataset size increases
Summary of Static Graph Cover Strategies
                       Cloud       Hash      Clustering  Workload  N-hop
Complexity             Low         Low       High        High      Medium
Chunk sizes            Imbalanced  Balanced  Imbalanced  -         -
Dataset size           100%        100%      100%        >= 100%   > 100%
Path containment       Low         Low       High        High      Medium
Query parallelization  Medium      High      Low         Low/High  -
Dynamic Graph Cover Strategies
• Adaptation of the graph cover during runtime
• Types of dynamics
– Adaptation of the graph cover to the actual query workload
– If one chunk becomes overloaded due to insertions of new triples, move triples to other chunks
Adaptation to Actual Query Workload
• Initial static graph cover
• Keep track of how frequently
– triple patterns
– molecules
are queried together
• Replicate triples such that
– Data transfer is reduced
– Workload is equally distributed among compute nodes
Examples: PHD-Store, AdHash, Sedge
Dynamic Redistribution of Triples
• If one compute node stores too many triples (in comparison to others), redistribute triples based on their hash values
• If triples are stored in an ordered fashion, send one half to another compute node
Examples: [Battré2007], [Osorio2017]
Indices
How to identify compute nodes that store required data?
Example
Where is the information stored that is needed to answer the query
“What are the employees of WeST called?”
[Figure: hash cover on subjects]
Properties of Indices
Graph cover independence:
• How independent is the index from the graph cover strategy?
Storage consumption:
• How much storage space is required for the index?
Access time:
• How fast can the location of an indexed element be retrieved?
Indexed elements:
• Which elements are indexed?
Overview Indices
Centralized
• Hash-based
• Statistics-based
• Summary-graph-based
Decentralized
• Hash-based
• Schema-based
[Figure annotation: the index types range from faster access and a higher degree of aggregation to slower access and a lower degree of aggregation]
Centralized Hash-Based Index
• Applicable only for hash covers
• No explicit index required
• The location of a triple can be recomputed from the hash function and the number of chunks
• Examples: 4store, Trinity.RDF
What are the employees of WeST called?
hash(w:WeST) → compute node 2
e:employs ?
f:givenname ?
(w:WeST, e:employs) ?
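The idea that no explicit index is needed can be sketched as recomputing the hash at query time. The sketch assumes a hash cover on subjects with CRC32 as a deterministic stand-in hash function; `chunk_of` and `candidate_chunks` are hypothetical names.

```python
import zlib

def chunk_of(term, num_chunks):
    """Same (assumed) hash function that was used to create the hash cover."""
    return zlib.crc32(term.encode("utf-8")) % num_chunks

def candidate_chunks(pattern, num_chunks):
    """Chunks that may contain matches of a (subject, property, object) pattern."""
    subject = pattern[0]
    if subject.startswith("?"):
        # Only the hashed element (here: the subject) can be located;
        # without it every chunk has to be asked.
        return list(range(num_chunks))
    return [chunk_of(subject, num_chunks)]
```

For the pattern `("w:WeST", "e:employs", "?v1")` exactly one chunk is returned, whereas `("?v1", "f:givenname", "?name")` has to be sent to all chunks because its subject is a variable.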
Centralized Hash-Based Index
Pros:
• Easy to compute occurrences
• No explicit index required
– No storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)
Centralized Statistics-Based Index
• Collect occurrences of
– Subject, property, object labels
– Combinations of subject, property, object labels
– RDFs types
– Property sets of molecules
• Examples: DARQ, FedX, Sedge
What are the employees of WeST called?
Occurrences per chunk ID (c1, c2):
             Subject  Property  Object
             c1  c2   c1  c2    c1  c2
w:WeST       0   2    0   0     0   0
e:employs    0   0    1   2     0   0
f:givenname  0   0    2   1     0   0
...          ...      ...       ...
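The occurrence table can be computed with simple counters. The graph cover below is an assumption chosen so that its counts agree with the table on this slide; `build_statistics` and `chunks_containing` are hypothetical names.

```python
from collections import Counter

# Assumed graph cover that is consistent with the occurrence table.
CHUNKS = {
    "c1": [
        ("g:Gesis", "e:employs", "g:wanja"),
        ("g:wanja", "f:givenname", "Wanja"),
        ("w:martin", "f:givenname", "Martin"),
    ],
    "c2": [
        ("w:WeST", "e:employs", "w:martin"),
        ("w:WeST", "e:employs", "w:daniel"),
        ("w:daniel", "f:givenname", "Daniel"),
    ],
}

def build_statistics(chunks):
    """Count per chunk how often each label occurs as subject, property, object."""
    stats = {"subject": Counter(), "property": Counter(), "object": Counter()}
    for chunk_id, triples in chunks.items():
        for s, p, o in triples:
            stats["subject"][(s, chunk_id)] += 1
            stats["property"][(p, chunk_id)] += 1
            stats["object"][(o, chunk_id)] += 1
    return stats

def chunks_containing(stats, position, label, chunk_ids):
    """Chunk ids in which `label` occurs at the given triple position."""
    return [c for c in chunk_ids if stats[position][(label, c)] > 0]
```

For the first triple pattern of the example query, `chunks_containing(stats, "subject", "w:WeST", ["c1", "c2"])` narrows the search to chunk c2, and the counts themselves can serve as estimates for the number of results.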
Centralized Statistics-Based Index
Pros:
• Independent of graph cover strategy
• Can estimate the number of results
• Fast access
Cons:
• Requires compression for storage
• Trade-off:
– Collecting only a few statistics → small size → less useful
– Collecting many statistics → large size (possibly size of dataset) → more useful
Centralized Summary-Graph-Based Index: TriAD
Summarization algorithm:
1) Each chunk is represented by a chunk vertex
2) Start and end vertices of edges are substituted by the corresponding chunk vertices
3) Duplicate edges are removed
Example query: What are the employees of WeST called?
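The three summarization steps map directly onto a set comprehension. This sketch follows only the steps as listed on the slide and makes no claim about TriAD's actual implementation; the vertex-to-chunk assignment and the name `summary_graph` are assumptions.

```python
# Assumed excerpt of the example graph.
TRIPLES = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
]

# Hypothetical assignment of vertices to chunks.
CHUNK_OF_VERTEX = {
    "w:WeST": "c2",
    "w:martin": "c1", "Martin": "c1",
    "w:daniel": "c1", "Daniel": "c1",
}

def summary_graph(triples, chunk_of_vertex):
    """Replace start and end vertices by their chunk vertices;
    the set removes duplicate edges."""
    return {(chunk_of_vertex[s], p, chunk_of_vertex[o]) for s, p, o in triples}
```

The four triples collapse into two summary edges, `("c2", "e:employs", "c1")` and `("c1", "f:givenname", "c1")`. Matching the example query against this summary graph indicates that the second triple pattern can only have matches in chunk c1.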
Centralized Summary-Graph-Based Index: EAGRE
Summarization algorithm:
1) Determine property sets of all subjects
2) Group similar property sets
3) Store occurrences of each property set
4) Property sets become vertices
5) Replace start and end vertices of edges by their property set vertices
Example query: What are the employees of WeST called?
Centralized Summary-Graph-Based Index
Pros:
• Independent of graph cover strategy
• Identification of subqueries that can be answered locally
Cons:
• All triples with the same subject have to be assigned to the same compute node
• High storage consumption
• Summary graph needs to be queried
• Only properties are considered
Storing and Querying Semantic Data in the Cloud 83Daniel Janke & Steffen Staab
Overview Indices
• Centralized
– Hash-based
– Statistics-based
– Summary-graph-based
• Decentralized
– Hash-based
– Schema-based
Trade-off across index types:
• Faster access, higher degree of aggregation
• Slower access, lower degree of aggregation
Storing and Querying Semantic Data in the Cloud 84Daniel Janke & Steffen Staab
Decentralized Hash-Based Index
• Version 1:
– Centralized hash-based index on each compute node
– Knowledge of all compute nodes required
– Examples: HDRS, Virtuoso Clustered Edition
• Version 2:
– Each compute node knows a forward table for a few neighbours
▪ Ring structure overlay (e.g., RDFPeers, PAGE)
▪ Tree structure overlay (e.g., GridVine, 3RDF)
Storing and Querying Semantic Data in the Cloud 85Daniel Janke & Steffen Staab
Ring Structure Overlay
• Compute nodes are ordered
• Each compute node knows
– Its direct neighbour
– A few distant neighbours
• When a request arrives
1) The compute node storing the data is determined by the hash function
2) The request is forwarded to the (closest) compute node storing the data
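The forwarding scheme can be sketched as follows. The sketch assumes a ring of n compute nodes numbered 0 to n-1, distant neighbours at distances 2, 4, 8, ... (a common choice in ring overlays, not stated on the slide) and SHA-1 as hash function; none of this is taken from a specific system.

```python
import hashlib


def responsible_node(term, n):
    """Step 1: the hash function determines the compute node storing `term`."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % n


def neighbours(node, n):
    """Direct neighbour (distance 1) plus distant neighbours (2, 4, 8, ...)."""
    result = []
    step = 1
    while step < n:
        result.append((node + step) % n)
        step *= 2
    return result


def route(start, target, n):
    """Step 2: forward the request hop by hop; returns the visited nodes."""
    path = [start]
    current = start
    while current != target:
        remaining = (target - current) % n
        # Choose the known neighbour that gets closest to the target
        # without passing it (clockwise distance on the ring).
        current = min(
            (nb for nb in neighbours(current, n) if (target - nb) % n < remaining),
            key=lambda nb: (target - nb) % n,
        )
        path.append(current)
    return path


path = route(0, 5, 8)  # node 0 knows nodes 1, 2 and 4
```

With eight nodes, a request arriving at node 0 for data on node 5 is first forwarded to the distant neighbour 4 and from there to its direct neighbour 5.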
Storing and Querying Semantic Data in the Cloud 86Daniel Janke & Steffen Staab
Tree Structure Overlay
• C1
– Stores all data whose hash value starts with prefix 00
– Knows C2 is responsible for prefix 01
– Knows C3 is responsible for prefix 1
• When a request arrives, C1
– Computes the hash value
– Forwards the request based on the known prefixes
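The prefix-based forwarding of C1 can be sketched as a small lookup table. The table content follows the example above; the hash function (first byte of SHA-1 written as a bit string) and the function names are assumptions for illustration.

```python
import hashlib

# Forward table of C1: hash prefix -> responsible compute node.
FORWARD_TABLE_C1 = {"00": "C1", "01": "C2", "1": "C3"}


def hash_bits(term):
    """First 8 bits of a SHA-1 hash as a bit string (assumed hash function)."""
    digest = hashlib.sha1(term.encode("utf-8")).digest()
    return format(digest[0], "08b")


def forward(bits, table):
    """Return the compute node whose prefix matches the hash bits."""
    for prefix, node in table.items():
        if bits.startswith(prefix):
            return node
    raise KeyError(f"no prefix matches {bits}")


target = forward(hash_bits("w:WeST"), FORWARD_TABLE_C1)
```

Because the prefixes 00, 01 and 1 cover every possible bit string exactly once, each request is either answered by C1 itself or forwarded to exactly one other compute node.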
Storing and Querying Semantic Data in the Cloud 87Daniel Janke & Steffen Staab
Decentralized Hash-Based Index
Pros:
• Easy to compute occurrences
• Low storage consumption
Cons:
• Only applicable for hash covers
• Only applicable for hashed elements (subject, property, object)
Storing and Querying Semantic Data in the Cloud 88Daniel Janke & Steffen Staab
Decentralized Schema-Based Index
• Applicable for type-based graph covers
• Use type hierarchy as tree structure overlay
• Example: SQPeer
[Figure: type hierarchy over rdfs:Resource, rdf:Property, rdfs:Class, e:employs, f:givenname, f:Person and e:Institute, with its parts assigned to compute nodes C1 to C4]
Storing and Querying Semantic Data in the Cloud 89Daniel Janke & Steffen Staab
Decentralized Schema-Based Index
Pros:
• Queries that contain types can be forwarded to the corresponding compute node(s)
• Low storage consumption
Cons:
• Efficiently applicable only for type-based graph covers
• Types of requested resources need to be identified
• Unbalanced index sizes
Used in combination with other indices
Storing and Querying Semantic Data in the Cloud 90Daniel Janke & Steffen Staab
Summary Indices
Property | Centralized: Hash | Centralized: Statistics | Centralized: Summary graph | Decentralized: Hash | Decentralized: Schema
Applicable to graph cover strategies | Hash covers | All | All | Hash covers | Type-based covers
Storage consumption | Low | High | High | Low | Low
Access time | Fast | Slow | Slow | Medium | Medium
Indexed elements | Hash dependent | Various aggregations | Properties | Hash dependent | Typed elements
Storing and Querying Semantic Data in the Cloud 91Daniel Janke & Steffen Staab
Distributed Query Processing Strategies
How to distribute query processing?
Storing and Querying Semantic Data in the Cloud 92Daniel Janke & Steffen Staab
Terminology: SPARQL Query
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
Question: What are the names of the employees of WeST?
[Figure annotations: ?v1 and ?name are variables; each line of the WHERE clause is a triple pattern]
Storing and Querying Semantic Data in the Cloud 93Daniel Janke & Steffen Staab
Terminology: Query Execution Tree
SELECT ?name WHERE {
<w:WeST> <e:employs> ?v1.
?v1 <f:givenname> ?name
}
Storing and Querying Semantic Data in the Cloud 94Daniel Janke & Steffen Staab
Centralized Query Processing
[Figure: the example RDF graph and the query execution tree, evaluated bottom-up with the intermediate results listed here]
Matches of <w:WeST> <e:employs> ?v1:
?v1
w:martin
w:daniel
Matches of ?v1 <f:givenname> ?name:
?v1 ?name
g:wanja “Wanja”
w:martin “Martin”
w:daniel “Daniel”
Join on ?v1:
?v1 ?name
w:martin “Martin”
w:daniel “Daniel”
Projection on ?name:
?name
“Martin”
“Daniel”
Storing and Querying Semantic Data in the Cloud 95Daniel Janke & Steffen Staab
Distributed Query Processing
General procedure
1) Split query into subqueries that can be executed locally
2) Execute subqueries on compute nodes identified by index
3) Join results of subqueries
4) Return results
Storing and Querying Semantic Data in the Cloud 96Daniel Janke & Steffen Staab
Splitting Query into Subqueries
• Simplest case: each triple pattern forms a subquery
• Use knowledge about graph covers
– All triples with the same subject are stored on the same compute node
– Paths of length n can be executed locally
• Use index information
– Co-occurrences of subject-property or property-property
Storing and Querying Semantic Data in the Cloud 97Daniel Janke & Steffen Staab
Properties of Join Operations
Parallelisation:
• Is the join computation distributed among several or all compute nodes?
Computational effort:
• How many comparisons are performed during the join computation?
• How many subqueries result from the join computation?
Data transfer:
• How many intermediate results are transferred to compute the join?
Blocking:
• Do subqueries need to be finished before the join can be computed?
Storing and Querying Semantic Data in the Cloud 98Daniel Janke & Steffen Staab
Overview Join Processing
• Centralized joins (join is executed on a single compute node)
– Nested-loop join
– Merge join
– Hash join
– Bind join
• Distributed joins (join is distributed over several compute nodes)
– Replication-based join
– Hash join
– Merge join
– Bind join
Storing and Querying Semantic Data in the Cloud 100Daniel Janke & Steffen Staab
Centralized Nested Loop Join
Compare each element of the first list with every element of the second list
Examples: SPLENDID, DARQ
Pros:
• Does not require an ordering
• Arbitrary join conditions possible
Cons:
• Inefficient
[Figure: the list ?v1 = (w:martin, w:daniel) is compared with every entry of the list (?v1, ?name) = (w:martin “Martin”), (g:wanja “Wanja”), (w:daniel “Daniel”)]
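A nested loop join over the two example result lists can be sketched as follows. Representing each result as a Python dictionary from variable name to value is an assumption of this sketch.

```python
def nested_loop_join(left, right, condition):
    """Compare every left result with every right result."""
    return [{**l, **r} for l in left for r in right if condition(l, r)]


left = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
right = [
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]

# The join condition is an arbitrary predicate; here: equal ?v1 bindings.
result = nested_loop_join(left, right, lambda l, r: l["?v1"] == r["?v1"])
```

The join performs 2 x 3 = 6 comparisons regardless of the order of the inputs, which illustrates both the flexibility (any predicate works) and the inefficiency listed above.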
Storing and Querying Semantic Data in the Cloud 101Daniel Janke & Steffen Staab
Centralized Merge Join
• Requires sorted intermediate result lists
• Compare one result r only with results that are <= r
• Example: Partout
Pros:
• Fast for ordered result sets
Cons:
• Slow for unordered result sets
• Intermediate result set size might lead to a bottleneck
[Figure: the sorted list ?v1 = (w:daniel, w:martin) is merged with the sorted list (?v1, ?name) = (g:wanja “Wanja”), (w:daniel “Daniel”), (w:martin “Martin”)]
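A merge join over two result lists that are already sorted by the join variable can be sketched like this. The dictionary representation of results is an assumption of the sketch; sorting itself is left to the caller.

```python
def merge_join(left, right, key):
    """Join two lists of results that are both sorted by `key`."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1          # left value too small, advance left
        elif lk > rk:
            j += 1          # right value too small, advance right
        else:
            # Emit all right-hand results that share this key value.
            k = j
            while k < len(right) and right[k][key] == lk:
                result.append({**left[i], **right[k]})
                k += 1
            i += 1
    return result


left = [{"?v1": "w:daniel"}, {"?v1": "w:martin"}]
right = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
    {"?v1": "w:martin", "?name": "Martin"},
]
result = merge_join(left, right, "?v1")
```

Each list is scanned only once, so the join itself is linear in the input sizes; the cost of producing sorted inputs is the extra effort mentioned above.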
Storing and Querying Semantic Data in the Cloud 102Daniel Janke & Steffen Staab
Centralized Hash Join
• Assign results to buckets based on their hashes
• Join a result only with the corresponding bucket
• Examples: ANAPSID, LHD
• A non-blocking symmetric version exists
[Figure: the results ?v1 = (w:daniel, w:martin) are joined against hash buckets of (?v1, ?name) results, one holding (g:wanja “Wanja”), one (w:daniel “Daniel”) and one (w:martin “Martin”)]
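Both variants can be sketched with Python dictionaries serving as hash buckets. The blocking version builds all buckets for one input before probing; the symmetric version keeps one bucket table per input and emits matches as soon as a result arrives from either side. The function names and the stream format of `(side, result)` pairs are assumptions of this sketch.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Blocking hash join: build buckets for `right`, then probe with `left`."""
    buckets = defaultdict(list)
    for r in right:
        buckets[r[key]].append(r)
    return [{**l, **r} for l in left for r in buckets.get(l[key], [])]


def symmetric_hash_join(stream, key):
    """Non-blocking variant: `stream` yields ("left" | "right", result) pairs."""
    tables = {"left": defaultdict(list), "right": defaultdict(list)}
    for side, result in stream:
        other = "right" if side == "left" else "left"
        tables[side][result[key]].append(result)
        for match in tables[other].get(result[key], []):
            yield {**result, **match}


left = [{"?v1": "w:daniel"}, {"?v1": "w:martin"}]
right = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
    {"?v1": "w:martin", "?name": "Martin"},
]
blocking = hash_join(left, right, "?v1")

# Results arrive interleaved from both subqueries.
stream = [
    ("left", {"?v1": "w:daniel"}),
    ("right", {"?v1": "w:martin", "?name": "Martin"}),
    ("right", {"?v1": "w:daniel", "?name": "Daniel"}),
    ("left", {"?v1": "w:martin"}),
]
streamed = list(symmetric_hash_join(stream, "?v1"))
```

In the symmetric variant the first joined result is produced after the third incoming result, without waiting for either subquery to finish.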
Storing and Querying Semantic Data in the Cloud 103Daniel Janke & Steffen Staab
Centralized Hash Join
Pros:
• No ordering required
• On average almost constant time complexity per bucket lookup
Cons:
• Intermediate result set size might lead to a bottleneck
Storing and Querying Semantic Data in the Cloud 104Daniel Janke & Steffen Staab
Bind Join
• Substitute variables of the second subquery based on results from the first subquery
• Second subquery is executed multiple times
• Examples: FedX, Avalanche, SemaGrow
[Figure: each result of the first subquery (?v1 = w:martin, ?v1 = w:daniel) is bound into its own execution of the second subquery, which returns (w:martin “Martin”) and (w:daniel “Daniel”)]
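A bind join can be sketched as a loop that re-executes the second subquery once per result of the first. The function `givenname_subquery` is a hypothetical stand-in for a remote store evaluating `?v1 <f:givenname> ?name`; it and the in-memory triples are assumptions for illustration.

```python
TRIPLES = [
    ("w:martin", "f:givenname", "Martin"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:daniel", "f:givenname", "Daniel"),
]


def givenname_subquery(bound):
    """Hypothetical remote subquery with ?v1 already substituted."""
    return [
        {"?v1": s, "?name": o}
        for s, p, o in TRIPLES
        if p == "f:givenname" and s == bound["?v1"]
    ]


def bind_join(first_results, variable, run_subquery):
    results = []
    for binding in first_results:
        # One execution of the second subquery per result of the first.
        for match in run_subquery({variable: binding[variable]}):
            results.append({**binding, **match})
    return results


first_results = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
result = bind_join(first_results, "?v1", givenname_subquery)
```

The result for g:wanja is never produced or transferred, but the second subquery runs twice, which reflects the trade-off between fewer intermediate results and more executed subqueries.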
Storing and Querying Semantic Data in the Cloud 105Daniel Janke & Steffen Staab
Bind Join
Pros:
• Reduces the amount of intermediate results
Cons:
• Increases number of executed subqueries
• Possible bottlenecks:
– Large intermediate result set sizes
– Large number of subqueries
Storing and Querying Semantic Data in the Cloud 106Daniel Janke & Steffen Staab
Summary Centralized Joins
Property | Nested | Merge | Hash | Symmetric hash | Bind
Computational effort | High | Medium (extra effort for ordering) | Low | Low | Medium (effort of many subqueries)
# executed queries | Low | Low | Low | Low | High
Blocking operation | Yes | Yes | Yes | No | Yes
Storing and Querying Semantic Data in the Cloud 108Daniel Janke & Steffen Staab
Replication-Based Distributed Join
All results of the first subquery are sent to all compute nodes on which the second subquery is executed
Example: SemStore
[Figure: the first subquery's results ?v1 = (w:daniel, w:martin) are replicated to the compute nodes executing the second subquery; one of them holds (w:martin “Martin”), the other (w:daniel “Daniel”)]
Storing and Querying Semantic Data in the Cloud 109Daniel Janke & Steffen Staab
Replication-Based Distributed Join
Pros:
• Not all compute nodes are necessarily involved in joining
• Using data locality → less transferred data
Cons:
• Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
• One subtree needs to be finished before the join can be executed
Storing and Querying Semantic Data in the Cloud 110Daniel Janke & Steffen Staab
Distributed Hash Join
Hash join in which each compute node serves as a bucket
Example: DiploCloud
[Figure: Compute Node 1 and Compute Node 2 redistribute their intermediate results by the hash of ?v1; hash(w:martin) and hash(w:daniel) select the compute node on which the results for w:martin and w:daniel are joined]
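The idea of using compute nodes as buckets can be simulated in a single process. In this sketch the n compute nodes are represented by partition numbers 0 to n-1, and CRC32 is used as hash function only because it is deterministic across runs; both choices are assumptions, not DiploCloud's design.

```python
import zlib
from collections import defaultdict


def node_for(value, n):
    """Compute node (bucket) responsible for a join value."""
    return zlib.crc32(value.encode("utf-8")) % n


def partition(results, key, n):
    """Send every result to the node selected by the hash of its join value."""
    parts = defaultdict(list)
    for r in results:
        parts[node_for(r[key], n)].append(r)
    return parts


def distributed_hash_join(left, right, key, n):
    left_parts = partition(left, key, n)
    right_parts = partition(right, key, n)
    result = []
    # In a real store every iteration would run on a different compute node.
    for node in range(n):
        for l in left_parts.get(node, []):
            for r in right_parts.get(node, []):
                if l[key] == r[key]:
                    result.append({**l, **r})
    return result


left = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
right = [
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
result = distributed_hash_join(left, right, "?v1", n=2)
```

Results with equal join values always land on the same node, so every node can join its partition independently; the price is that almost all intermediate results have to be shipped to another node first.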
Storing and Querying Semantic Data in the Cloud 111Daniel Janke & Steffen Staab
Distributed Hash Join
Pros:
• All compute nodes are involved in join processing
• Bottleneck is unlikely due to distribution of the intermediate result set over all compute nodes
Cons:
• No usage of data locality → high data transfer
• One subtree needs to be finished before the join can be executed
Storing and Querying Semantic Data in the Cloud 112Daniel Janke & Steffen Staab
Distributed Merge Join
• Results of subqueries are ordered
• Each compute node is responsible for a range of results
• Examples: H2RDF+, SHARD, SparkRDF, SPARQLGX
[Figure: one compute node is responsible for the range a:a-w:d and receives ?v1 = w:daniel together with (g:wanja “Wanja”) and (w:daniel “Daniel”); the other is responsible for the range w:e-z:z and receives ?v1 = w:martin together with (w:martin “Martin”)]
Storing and Querying Semantic Data in the Cloud 113Daniel Janke & Steffen Staab
Distributed Merge Join
Pros:
• All compute nodes are involved in join processing
• Bottleneck is unlikely due to distribution of the intermediate result set over all compute nodes
Cons:
• Results need to be ordered
• Agreement on result ranges required
• No usage of data locality → high data transfer
• One subtree needs to be finished before the join can be executed
Storing and Querying Semantic Data in the Cloud 114Daniel Janke & Steffen Staab
Distributed Bind Join
Join algorithm:
1) Get results of first subquery
2) For each following bind join query,
a) Identify compute nodes with matches
b) Fork query execution to remote compute nodes
Examples: RDFPeers, GridVine, Atlas, TripleRush, Trinity.RDF
[Figure: a compute node holding the first subquery's results ?v1 = (w:daniel, w:martin) forks the second subquery to the compute nodes with matches; one receives ?v1 = w:martin and returns (w:martin “Martin”), the other receives ?v1 = w:daniel and returns (w:daniel “Daniel”)]
Storing and Querying Semantic Data in the Cloud 115Daniel Janke & Steffen Staab
Distributed Bind Join
Pros:
• Join computed without waiting for any subtree to be finished
• Usage of data locality → less transferred data
• Results of the last join operation do not need to be sent to other compute nodes
Cons:
• Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
Storing and Querying Semantic Data in the Cloud 116Daniel Janke & Steffen Staab
Distributed Joins Summary
Property | Centralized joins | Distributed replication | Distributed hash | Distributed merge | Distributed bind
Data transfer | High | Low | High | High | Low
Parallelisation | Low | Medium | High | High | Medium
# Subqueries | Low | Low | Low | Low | High
Storing and Querying Semantic Data in the Cloud 117Daniel Janke & Steffen Staab
Fault Tolerance
How to achieve fault tolerance?
Storing and Querying Semantic Data in the Cloud 118Daniel Janke & Steffen Staab
Mirroring
• There exist several identical copies of each compute node
• If one compute node fails, its copy continues working
• Example: Virtuoso Clustered Edition
Pros:
• Query workload can be distributed among all copies
Cons:
• Copies need to be kept up to date
• Replicas of different chunks are not combined to increase data locality
[Figure: Compute Node 1 and Compute Node 2 with their mirrors Compute Node 1’ and Compute Node 2’]
Storing and Querying Semantic Data in the Cloud 119Daniel Janke & Steffen Staab
Data Replication
• All compute nodes are ordered in a ring
• Data from one compute node is replicated on its neighbours
• If one compute node fails, its data remains available on the neighbours
• Examples: 4store, RDFPeers
Pros:
• Data locality of initial graph cover is increased
Cons:
• Copies need to be kept up to date
[Figure: Compute Nodes 1, 2 and 3; each stores its own chunk (1, 2, 3) together with a replica of another node's chunk (1’, 2’, 3’)]
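Ring-based replica placement can be sketched as follows. The sketch assumes that chunk i is stored on node i and copied to the next `copies` nodes of the ring; the number of copies and the direction of replication are assumptions, and nodes are numbered from 0 here.

```python
def replica_placement(n, copies=1):
    """Chunk i lives on node i and is replicated on the next `copies` ring nodes."""
    return {
        chunk: [(chunk + k) % n for k in range(copies + 1)]
        for chunk in range(n)
    }


def available_chunks(placement, failed):
    """Chunks that are still stored on at least one node that has not failed."""
    return {
        chunk
        for chunk, nodes in placement.items()
        if any(node not in failed for node in nodes)
    }


placement = replica_placement(3)
after_one_failure = available_chunks(placement, failed={1})
after_two_failures = available_chunks(placement, failed={1, 2})
```

With three nodes and one replica per chunk, any single failure is tolerated; if two neighbouring nodes fail at the same time, the chunk stored only on those two nodes becomes unavailable.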
Storing and Querying Semantic Data in the Cloud 120Daniel Janke & Steffen Staab
Evaluation Methodology
How to evaluate?
Storing and Querying Semantic Data in the Cloud 121Daniel Janke & Steffen Staab
Properties of Evaluation Methodologies
Realism:
Do the measurement results reflect the performance of real RDF stores?
Modularity:
Can alternative implementations of individual components be evaluated?
Evaluation depth:
Is the system evaluated only as a whole, or is the performance of the individual components evaluated as well?
Difficulty:
How difficult is it to apply the evaluation methodology?
Storing and Querying Semantic Data in the Cloud 122Daniel Janke & Steffen Staab
Black Box Evaluation
Evaluation of RDF stores as a whole
Some problems (of many):
• How fast is your network?
• How large are your images?
• Which processor configuration do you use?
• What are the structures of your caches?
Do you evaluate the RDF store or your hardware configuration?
[Figure: a dataset and several queries are given to the RDF store, which is treated as a black box]
Storing and Querying Semantic Data in the Cloud 123Daniel Janke & Steffen Staab
Black Box Evaluation
Evaluation of RDF stores as a whole
Pros:
• Easy to perform, since no implementation knowledge is required
• Measurements reflect the behaviour of a real RDF store
Cons:
• Only superficial evaluations possible
• No performance evaluation of individual components possible
Storing and Querying Semantic Data in the Cloud 124Daniel Janke & Steffen Staab
Glass Box Evaluation
• Evaluation of RDF stores as a whole
• Collecting performance measurements of components by
– Using a profiling system like Granula
– Adapting source code to perform measurements
[Figure: a dataset and several queries are given to an RDF store whose internal components are measured]
Storing and Querying Semantic Data in the Cloud 125Daniel Janke & Steffen Staab
Glass Box Evaluation
Pros:
• In-depth performance evaluation possible
• Measurements reflect the behaviour of a real RDF store
Cons:
• Source code needs to be extended to collect measurements
• Individual components can hardly be exchanged by alternative implementations
Storing and Querying Semantic Data in the Cloud 126Daniel Janke & Steffen Staab
Simulation-based Glass Box Evaluation
Evaluation of alternative implementations of a single component by simulating the behaviour of a real RDF store
Pros:
• Performance evaluation of individual components possible
• Evaluation of alternative implementations of individual components is possible
Cons:
• Evaluation environment (simulator) needs to be implemented
• Questionable whether performance measurements reflect the behaviour of a real RDF store
[Figure: a dataset and several queries are given to a simulator into which alternative component implementations are plugged]
Storing and Querying Semantic Data in the Cloud 127Daniel Janke & Steffen Staab
Glass Box Evaluation Platform
RDF store that
• Allows the exchange of individual components by alternative implementations
• Measures the performance of individual components
[Figure: a dataset and several queries are given to the platform, in which alternative graph cover creators can be exchanged]
Storing and Querying Semantic Data in the Cloud 128Daniel Janke & Steffen Staab
Glass Box Evaluation Platform
Pros:
• In-depth performance evaluation possible
• Alternative implementations of individual components can be evaluated
• Measurements reflect the behaviour of a real RDF store
Cons:
• Development of a glass box evaluation platform is difficult
• Interdependencies might limit the exchangeability of components
Storing and Querying Semantic Data in the Cloud 129Daniel Janke & Steffen Staab
Evaluation Methodology Summary
Property | Black box | Glass box | Simulation | Glass box platform
Realism | High | High | Low | Medium
Modularity | Low | Low | High | High
Evaluation depth | Low | High | High | High
Difficulty | Easy | Medium | Medium | Hard
Storing and Querying Semantic Data in the Cloud 130Daniel Janke & Steffen Staab
Conclusion & Open Challenges
Storing and Querying Semantic Data in the Cloud 131Daniel Janke & Steffen Staab
Conclusion
Challenges of RDF stores in the cloud:
1) How to design the architecture?
2) How to distribute the data?
3) How to identify compute nodes that store required data?
4) How to distribute query processing?
5) How to achieve fault tolerance?
6) How to evaluate?
Storing and Querying Semantic Data in the Cloud 132Daniel Janke & Steffen Staab
Example RDF Stores in the Cloud
Property | Virtuoso Clustered Edition | BlazeGraph | GraphDB
Architecture | Master-slave | Master-slave | Master-slave
Graph cover strategy | Hash cover | Distributed B+-tree | Replication of graph on all slaves
Index | Centralized hash-based index on each compute node | Distributed B+-tree | Not necessary
Query execution strategy | Distributed bind join | Centralized join | Centralized join
Fault tolerance | Mirroring | None | Mirroring
Storing and Querying Semantic Data in the Cloud 133Daniel Janke & Steffen Staab
Example RDF Stores in the Cloud
Property | DiploCloud | S2RDF | Trinity.RDF
Architecture | Master-slave | Batch processing framework | Master-slave
Graph cover strategy | Workload-aware | Vertical graph splits | Hash cover
Index | Centralized statistics-based index | None | Distributed chunk-integrated summary graph
Query execution strategy | Centralized join (for small result sets), distributed hash join (otherwise) | Distributed joins | Distributed bind join
Fault tolerance | None | Based on batch processing framework | None
Storing and Querying Semantic Data in the Cloud 134Daniel Janke & Steffen Staab
Challenges Not Presented
• How to achieve transactional security?
• How to perform online analytical processing (OLAP) queries?
• How to process property paths?
• How to perform distributed reasoning?
• How to perform distributed stream processing?
Institute for Web Science and Technologies · University of Koblenz-Landau, Germany
Thank you for your Attention!
Daniel Janke, Steffen Staab
Storing and Querying Semantic Data in the Cloud 136Daniel Janke & Steffen Staab
Image References
• https://openclipart.org/detail/155101/server
• https://openclipart.org/detail/213252/gear-icon
• https://openclipart.org/detail/204067/bpm-mail-symbol
• https://openclipart.org/detail/169757/check-and-cross-marks
• https://openclipart.org/detail/153577/stopwatch
Storing and Querying Semantic Data in the Cloud 137Daniel Janke & Steffen Staab
References
[Huang2011] Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF Graphs. PVLDB
4(11), 1123–1134 (2011)
[Peng2016] Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D.: Processing SPARQL Queries over Distributed
RDF Graphs. The VLDB Journal 25(2), 243–268 (apr 2016).
[Battré2007] Battré, D., Heine, F., Höing, A., Kao, O.: On Triple Dissemination, Forward-Chaining, and Load
Balancing in DHT Based RDF Stores. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, J.H., Ouksel, A.M.
(eds.) Databases, Information Systems, and Peer-to-Peer Computing. pp. 343–354. Springer Berlin
Heidelberg, Berlin, Heidelberg (2007)
[Osorio1017] Osorio, M., Aranda, C.B.: Storage Balancing in P2P Based Distributed RDF Data Stores. In:
Proceedings of the Workshop on Decentralizing the Semantic Web 2017 co-located with 16th International
Semantic Web Conference (ISWC 2017) (2017).

Contenu connexe

Tendances

Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Karanjeet Singh
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Simplilearn
 
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...Simplilearn
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesTony Hammond
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Simplilearn
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet odsc
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopIJTET Journal
 
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar  : Talend : The Non-Programmer's Swiss Knife for Big DataWebinar  : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar : Talend : The Non-Programmer's Swiss Knife for Big DataEdureka!
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...Gezim Sejdiu
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopDataWorks Summit
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Simplilearn
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosEUCLID project
 
Is Hadoop a necessity for Data Science
Is Hadoop a necessity for Data ScienceIs Hadoop a necessity for Data Science
Is Hadoop a necessity for Data ScienceEdureka!
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudDhaval Thakker
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...Edureka!
 
Packages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージPackages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージHiroki K
 
LinkedGov extension for Google Refine
LinkedGov extension for Google RefineLinkedGov extension for Google Refine
LinkedGov extension for Google Refinedanpaulsmith
 

Tendances (20)

Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017Sparkler Presentation for Spark Summit East 2017
Sparkler Presentation for Spark Summit East 2017
 
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
 
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
What Is Hadoop? | What Is Big Data & Hadoop | Introduction To Hadoop | Hadoop...
 
The nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologiesThe nature.com ontologies portal: nature.com/ontologies
The nature.com ontologies portal: nature.com/ontologies
 
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
Hadoop Training | Hadoop Training For Beginners | Hadoop Architecture | Hadoo...
 
Spark, Python and Parquet
Spark, Python and Parquet Spark, Python and Parquet
Spark, Python and Parquet
 
Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache HadoopA Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
A Survey on Approaches for Frequent Item Set Mining on Apache Hadoop
 
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar  : Talend : The Non-Programmer's Swiss Knife for Big DataWebinar  : Talend : The Non-Programmer's Swiss Knife for Big Data
Webinar : Talend : The Non-Programmer's Swiss Knife for Big Data
 
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
The Best of Both Worlds: Unlocking the Power of (big) Knowledge Graphs with S...
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
IPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for HadoopIPython Notebook as a Unified Data Science Interface for Hadoop
IPython Notebook as a Unified Data Science Interface for Hadoop
 
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
Hadoop Interview Questions And Answers Part-1 | Big Data Interview Questions ...
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Is Hadoop a necessity for Data Science
Is Hadoop a necessity for Data ScienceIs Hadoop a necessity for Data Science
Is Hadoop a necessity for Data Science
 
Information Extraction and Linked Data Cloud
Information Extraction and Linked Data CloudInformation Extraction and Linked Data Cloud
Information Extraction and Linked Data Cloud
 
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
What is Hadoop | Introduction to Hadoop | Hadoop Tutorial | Hadoop Training |...
 
Packages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージPackages for data wrangling データ前処理のためのパッケージ
Packages for data wrangling データ前処理のためのパッケージ
 
Real-World NoSQL Schema Design
Real-World NoSQL Schema DesignReal-World NoSQL Schema Design
Real-World NoSQL Schema Design
 
LinkedGov extension for Google Refine
LinkedGov extension for Google RefineLinkedGov extension for Google Refine
LinkedGov extension for Google Refine
 







Storing and Querying Semantic Data in the Cloud

  • 1. Institute for Web Science and Technologies · University of Koblenz-Landau, Germany Storing and Querying Semantic Data in the Cloud Reasoning Web Summer School 2018 (RW 2018) Daniel Janke & Steffen Staab 24.09.2018
  • 2. Amount of Available RDF Data Increases
       Source: https://lod-cloud.net/
  • 3. Why Use RDF Stores in the Cloud?
       Example 1: Wikidata
       - Dataset size: 4.9 billion triples (as of April 2018)
       - Stored in a distributed BlazeGraph RDF store for higher query throughput and higher availability
       Example 2: BBC
       - On average 1 million SPARQL queries per day (in 2010)
       - Stored in a distributed GraphDB RDF store for higher query throughput and higher availability
  • 4. Assumptions of this talk
       1. There are exceptions for (almost) everything
       2. You are always allowed to ask questions
       3. You have some knowledge
          Required:
          - RDF
          - SPARQL
          Helpful:
          - Cloud processing frameworks like Hadoop or Spark
          - Query processing in relational databases
          If not -> See 2.
       Timeplan
  • 5. How to deal with increasing volume of RDF?
  • 6. Centralized RDF Stores
       - Graph database for storing RDF graphs (includes tasks like data storage, query processing, ...)
       - All RDF store tasks are executed on a single computer
  • 7. Terminology: RDF Graph
       - Directed graph with labelled vertices and edges
       - Labels of start vertex, edge and end vertex are an RDF triple
       - RDF graph is a set of RDF triples
       [Figure: example RDF graph with the vertices w:WeST, g:Gesis, w:martin, w:daniel, g:wanja, g:bello, g:Dog and the literals "Martin", "Daniel", "Wanja"; edge labels f:givenname, e:employs, f:knows, r:type, e:ownedBy. One triple is annotated with Subject, Property, Object.]
  • 8. Terminology: SPARQL Query
       What are the employees of WeST called?
         SELECT ?name WHERE {
           <w:WeST> <e:employs> ?v1 .
           ?v1 <f:givenname> ?name
         }
       [Figure annotations on the query: Variable, Triple Pattern]
  • 9. Terminology: Query Execution Tree
         SELECT ?name WHERE {
           <w:WeST> <e:employs> ?v1 .
           ?v1 <f:givenname> ?name
         }
       [Figure: query execution tree for this query]
  • 10. Centralized Query Processing
        [Figure: the example RDF graph, annotated with the intermediate results of the example query]
        Bindings for ?v1:
          w:martin
          w:daniel
        Bindings for ?v1, ?name:
          g:wanja   "Wanja"
          w:martin  "Martin"
          w:daniel  "Daniel"
        After the join on ?v1:
          w:martin  "Martin"
          w:daniel  "Daniel"
        Final result for ?name:
          "Martin"
          "Daniel"
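The intermediate results on slide 10 can be reproduced with a minimal Python sketch. It is not how any particular RDF store is implemented: the helper names `match`, `join` and `employee_names` are hypothetical, and the triple set is only the subset of the running example needed for this query.

```python
# Subset of the slides' example graph, stored as a set of (subject, property, object) triples.
TRIPLES = {
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("w:daniel", "f:givenname", "Daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
}


def match(pattern, triples):
    """Return one variable binding (a dict) per triple that matches the triple pattern."""
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable matches anything, but must be bound consistently.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def join(left, right):
    """Natural join: merge bindings that agree on all shared variables."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[v] == r[v] for v in l.keys() & r.keys())
    ]


def employee_names(triples):
    employed = match(("w:WeST", "e:employs", "?v1"), triples)   # bindings for ?v1
    named = match(("?v1", "f:givenname", "?name"), triples)     # bindings for ?v1, ?name
    joined = join(employed, named)                              # join on ?v1
    return sorted(b["?name"] for b in joined)                   # projection to ?name


print(employee_names(TRIPLES))
```

The binding for g:wanja is produced by the second pattern but dropped by the join, which matches the tables on the slide.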
  • 11. Centralized RDF Stores
        - Graph database for storing RDF graphs (includes tasks like data storage, query processing, ...)
        - All RDF store tasks are executed on a single computer
        Advantages
        - Less complex than RDF stores running on several computers
        Disadvantages
        - Hardware of computer limits the size of processable RDF graph
        - No fault tolerance
  • 12. RDF Stores in the Cloud
        - RDF store tasks are bundled into nodes
          - Data storage tasks are bundled into storage nodes
          - Query processing tasks are bundled into compute nodes
        - Compute and storage nodes [1] are distributed/replicated among several computers
        [1] In the following, compute and storage nodes are referred to simply as compute nodes.
  • 13. How to place the data?
        [Figure: the example RDF graph]
  • 14. Where to find the required data?
        [Figure: the example RDF graph]
  • 15. How to distribute the query processing?
        [Figure: the example RDF graph, annotated with the intermediate results of the example query]
        Bindings for ?v1:
          w:martin
          w:daniel
        Bindings for ?v1, ?name (two separate tables in the figure):
          w:martin  "Martin"
          w:daniel  "Daniel"
        and
          g:wanja   "Wanja"
        After the join on ?v1:
          w:martin  "Martin"
          w:daniel  "Daniel"
        Final result for ?name:
          "Martin"
          "Daniel"
  • 16. RDF Stores in the Cloud
        - RDF store tasks are bundled into nodes
          - Data storage tasks are bundled into storage nodes
          - Query processing tasks are bundled into compute nodes
        - Compute and storage nodes [1] are distributed/replicated among several computers
        Advantages
        - Scalable by adding new compute or storage nodes
          - Scaling up the dataset size
          - Scaling up the query throughput
        - Possibly fault tolerant
        Disadvantages
        - Higher complexity
        [1] In the following, compute and storage nodes are referred to simply as compute nodes.
  • 17. Challenges of RDF Stores in the Cloud
        1) How to design the architecture?
        2) How to distribute the data?
        3) How to identify compute nodes that store required data?
        4) How to distribute query processing?
        5) How to achieve fault tolerance?
        6) How to evaluate?
        Many ideas from 50 years of data engineering carry over -> We focus on approaches more commonly used for RDF
  • 18. #Related Work about RDF Stores
        1) How to design the architecture?
        2) How to distribute the data?
        3) How to identify compute nodes that store required data?
        4) How to distribute query processing?
        5) How to achieve fault tolerance?
        6) How to evaluate?
        Figure annotation: Rarely considered on its own
  • 19. Architecture Types
        How to design the architecture?
  • 20. Properties of Architecture Types
        Implementation complexity:
        - How difficult is the implementation?
        Freedom of data placement:
        - To what extent can the data placement be influenced?
        Query overhead:
        - What query overhead is caused by the architecture?
        Scalability:
        - To what extent do the storage and query processing capabilities increase if further compute nodes are added?
        Fault tolerance:
        - Do single points of failure exist?
        - How easily can they be removed?
  • 21. Architecture Types
        Architecture
        - RDF stores using cloud computing frameworks
        - Distributed RDF stores
        - Federated RDF stores
  • 22. Architecture Types
        Architecture
        - RDF stores using cloud computing frameworks
        - Distributed RDF stores
        - Federated RDF stores
  • 23. RDF Stores Using Cloud Computing Frameworks
        - Converts and loads RDF graph into cloud computing framework
        - Translates SPARQL queries into task(s) for cloud computing framework
        Examples: SHARD, S2RDF, S2X, TripleRush, Jena-HBase, Sempala, D-SPARQ
  • 24. Cloud Computing Framework Types
        RDF stores using cloud computing frameworks
        - Batch processing frameworks
        - Graph processing frameworks
        - NoSQL databases
          - Key-value stores
          - Column stores
          - Document stores
        Distinction based on implementation
  • 25. Storing and Querying Semantic Data in the Cloud 25Daniel Janke & Steffen Staab Batch Processing Frameworks Ÿ Example frameworks: Hadoop, Spark Ÿ Queries need to be translated into one or several tasks Ÿ Data exchange between compute nodes via file system Cloud computing Batch Graph NoSQL Distributed file system 1. Read input data 2. Process data 3. Write results back
  • 26. Graph Processing Frameworks
      - Examples: GraphX, Signal/Collect
      - Queries are translated into vertex algorithms
      At each vertex:
      1. Receive messages
      2. Process messages and update vertex status
      3. Send messages
      Termination: the status of no vertex changes any more
  • 27. Key-Value Stores
      - Example: DynamoDB
      - Distributed map that assigns keys to arbitrary values
      - Values are atomic
      - Distribution based on, e.g., hash of the key, key ranges, …
      - Query translated to several lookups in the map and joins on the master
      Example (key → value):
      - g:Gesis → e:employs g:wanja, ...
      - g:wanja → f:knows w:daniel, ...
      - w:WeST → e:employs w:martin, ...
      - w:martin → f:knows g:wanja, ...
  • 28. Storing and Querying Semantic Data in the Cloud 28Daniel Janke & Steffen Staab Column Stores Ÿ Examples: HBase, Cassandra, Accumulo, Impala Ÿ Stores tabular data column-wise Ÿ Maps column name and key to corresponding value Ÿ Values are atomic Ÿ Distributes key-value mappings based on keys for each column separately g:Gesis w:WeST g:wanja w:martin, w:daniel g:wanja w:martin w:daniel w:daniel g:wanja w:martin Column e:employs Column f:knows Cloud computing Batch Graph NoSQL
  • 29. Storing and Querying Semantic Data in the Cloud 29Daniel Janke & Steffen Staab Document Stores Ÿ Examples: Couchbase, MongoDB Ÿ Store documents with internal structure (e.g., JSON) (i.e., non-atomic documents = more freedom to model content) Ÿ Provide indices over documents Ÿ Distribution based on a key within documents {_id: “g:Gesis”, e:employs: “g:wanja”} {_id: “w:WeST”, e:employs: [“w:daniel”, “w:martin”]} {_id: “g:wanja”, f:knows: “w:daniel”, f:givenname: “Wanja”} {_id: “w:martin”, f:knows: “g:wanja”, f:givenname: “Martin”} Cloud computing Batch Graph NoSQL
  • 30. Storing and Querying Semantic Data in the Cloud 30Daniel Janke & Steffen Staab RDF Stores Using Cloud Computing Frameworks Pros: Ÿ Low implementation complexity Ÿ Fault tolerance provided by cloud computing framework Ÿ Scalability provided by cloud computing framework Ÿ Cloud computing framework is maintained and improved by a community Cons: Ÿ Influence on data placement limited Ÿ High overhead introduced by cloud computing framework Ÿ Centralized join of data obtained by single lookups in NoSQL databases might overload master Architecture Cloud computing Distributed Federated
  • 31. Storing and Querying Semantic Data in the Cloud 31Daniel Janke & Steffen Staab Architecture Types Architecture RDF stores using cloud computing frameworks Distributed RDF stores Federated RDF stores
  • 32. Federated RDF Stores
      RDF stores:
      - Store RDF data
      - Administered independently
      Query federator coordinates query execution:
      - Decompose query
      - Query RDF stores
      - Join query results
      Index: stores which data is contained in each RDF store
      Cache: caches data retrieved from previous queries
      - Systems vary in their index and cache
      - Examples: DARQ, FedX, SPLENDID
  • 33. Storing and Querying Semantic Data in the Cloud 33Daniel Janke & Steffen Staab Pros: Ÿ Low implementation complexity Ÿ Scalability by adding new RDF stores Cons: Ÿ No influence on data placement Ÿ Query federator is a single point of failure Ÿ Centralized join of results from different RDF stores may become a bottleneck Ÿ Identification of RDF stores contributing to a query may be costly Architecture Cloud computing Distributed Federated Federated RDF Stores
  • 34. Storing and Querying Semantic Data in the Cloud 34Daniel Janke & Steffen Staab Architecture Types Architecture RDF stores using cloud computing frameworks Distributed RDF stores Federated RDF stores
  • 35. Storing and Querying Semantic Data in the Cloud 35Daniel Janke & Steffen Staab Distributed RDF Stores Architecture Cloud computing Distributed Federated Distributed RDF stores Master-slave architecture Peer-to-peer architecture Architecture
  • 36. Master-Slave Architecture
      Loading a graph:
      1. Translate strings to fixed-length identifiers
      2. Assign triples to slaves
      3. Store which data is stored at which slave
      4. Transfer triples to slaves
      5. Store RDF triples locally
      Querying:
      1. Translate constant strings to their integer identifiers
      2. Check occurrences of constants
      3. Decompose query and send subqueries to slaves
      4. Execute subqueries on local data
      5. Join intermediate results
      6. Translate result ids back to strings
      [Figure: architecture diagram whose components are annotated with the step labels L1, Q1, Q6, L2, L3, Q2, Q3, Q5, Q4, Q5, L5, Q4]
      Examples: GraphDB, BlazeGraph, TriAD, DiploCloud
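The dictionary used in loading step 1 and querying steps 1 and 6 can be sketched as follows. This is a minimal Python illustration, assuming consecutive integers serve as the fixed-length identifiers; the names `build_dictionary` and `encode` are hypothetical and not taken from any of the listed systems.

```python
def build_dictionary(triples):
    """Assign a consecutive integer id to every distinct term (assumed scheme)."""
    term_to_id = {}
    for triple in triples:
        for term in triple:
            # setdefault only inserts when the term has not been seen before
            term_to_id.setdefault(term, len(term_to_id))
    id_to_term = {i: t for t, i in term_to_id.items()}
    return term_to_id, id_to_term


def encode(triples, term_to_id):
    """Replace every string in every triple by its integer id."""
    return [tuple(term_to_id[t] for t in triple) for triple in triples]


# A subset of the running example graph
triples = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("w:martin", "f:givenname", "Martin"),
]
term_to_id, id_to_term = build_dictionary(triples)
encoded = encode(triples, term_to_id)
# Translating ids back to strings (querying step 6)
decoded = [tuple(id_to_term[i] for i in t) for t in encoded]
```

In a master-slave store the dictionary lives on the master, which is one reason the master can become a bottleneck.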
  • 37. Storing and Querying Semantic Data in the Cloud 37Daniel Janke & Steffen Staab Peer-to-Peer Architecture Master-slave Peer-to-peer Architecture Cloud computing Distributed Federated Responsibilities of master are copied to all slaves resulting in peer nodes with identical architecture but varying data Examples: RDFPeers, Edutella, Grid Vine, 3RDF
  • 38. Storing and Querying Semantic Data in the Cloud 38Daniel Janke & Steffen Staab Pros: Ÿ Full freedom on data placement Ÿ Little query processing overhead Ÿ Direct transfer of intermediate results Ÿ Fault tolerance (in case of peer-to-peer) Cons: Ÿ High implementation complexity Ÿ Master is a single point of failure Ÿ Handling of dictionary, index and query coordination may lead to a bottleneck at master Architecture Cloud computing Distributed Federated Distributed RDF Stores
  • 39. Architecture Summary
      Freedom of data placement:
      - RDF stores using cloud computing frameworks: Low/Medium – cloud computing framework decides about data placement
      - Federated RDF stores: Low – RDF stores are administered independently of the federator
      - Distributed RDF stores: High – data placement strategy needs to be implemented
      Fault tolerance:
      - RDF stores using cloud computing frameworks: High – master is stateless and can be replicated
      - Federated RDF stores: Low – federator is a single point of failure
      - Distributed RDF stores: High (peer-to-peer); Low – master is a single point of failure
      Scalability:
      - RDF stores using cloud computing frameworks: High/Medium – possible bottlenecks: disk I/O, master-based joins
      - Federated RDF stores: Medium – federator can become a bottleneck
      - Distributed RDF stores: High (peer-to-peer); Medium – if master becomes a bottleneck
  • 40. Architecture Summary
      Query overhead:
      - RDF stores using cloud computing frameworks: High – initialisation of cloud computing framework
      - Federated RDF stores: Medium – identification of required RDF stores
      - Distributed RDF stores: Low – designed to execute queries efficiently
      Implementation complexity:
      - RDF stores using cloud computing frameworks: Low – only translation of RDF dataset and SPARQL queries
      - Federated RDF stores: Medium – dedicated querying, indexing and caching strategies required
      - Distributed RDF stores: High – all components need to be implemented
  • 41. Storing and Querying Semantic Data in the Cloud 41Daniel Janke & Steffen Staab Data Placement Strategies How to distribute the data?
  • 42. Storing and Querying Semantic Data in the Cloud 42Daniel Janke & Steffen Staab Terminology: RDF Graph Ÿ Directed graph with labelled vertices and edges Ÿ Labels of start vertex, edge and end vertex are an RDF triple Ÿ RDF graph is a set of RDF triples w:martin “Martin“ g:wanja “Wanja“ w:daniel “Daniel“ w:WeST g:Gesis f:givenname f:givenname f:givenname e:employs e:employs e:employs f:knows f:knows f:knowsg:bello r:type e:ownedBy g:Dog Triple Subject Property Object
  • 43. Storing and Querying Semantic Data in the Cloud 43Daniel Janke & Steffen Staab Terminology: Graph Cover and Graph Chunk Graph cover (aka sharding) Assignment of each triple to at least one compute node Graph chunk (aka shard) Set of triples assigned to a single compute node Compute Node 1 Compute Node 2 w:martin “Martin“ g:wanja “Wanja“ w:daniel “Daniel“ w:WeST g:Gesis f:givenname f:givenname f:givenname e:employs e:employsf:knows f:knows f:knows g:bello r:type e:employs e:ownedBy g:Dog
  • 44. Storing and Querying Semantic Data in the Cloud 44Daniel Janke & Steffen Staab Terminology: Path and Path Length Path A sequence of triples in which the object of a triple is the subject of the succeeding triple Path length The number of triples in the path w:martin g:wanja “Wanja“w:daniel f:givennamef:knowsf:knows Length = 3
  • 45. Terminology: Molecule, Anchor Vertex and Diameter
      Molecule:
      - Set of triples that are contained in some paths starting at a vertex called the anchor vertex
      - If the molecule contains a subject s, then all triples with s as subject are contained
      (Directed) molecule diameter:
      - Longest shortest path between the anchor vertex and all objects contained in the molecule
      [Figure: example molecule over w:daniel, w:martin and g:wanja with f:knows and f:givenname edges and a marked anchor vertex; diameter = 2]
  • 46. Properties of Graph Cover Strategies
      Complexity:
      - How complex is the creation of the graph cover?
      Balancing:
      - How balanced are the sizes of the resulting graph chunks?
      Storage size:
      - Is the sum of all graph chunk sizes larger than the original graph size?
      Path containment:
      - How likely is it that a path can be traversed without leaving one chunk?
      Query parallelisation:
      - How well can the workload of one query be parallelized among several compute nodes?
      Dynamics:
  • 47. Storing and Querying Semantic Data in the Cloud 47Daniel Janke & Steffen Staab Overview Graph Cover Strategies Graph Cover Strategies Static Dynamic Cloud-computing-based Hash-based Graph-clustering-based Workload-aware N-hop replication
  • 49. Storing and Querying Semantic Data in the Cloud 49Daniel Janke & Steffen Staab Cloud-Computing-Based Graph Cover Strategies Ÿ Data placement is mainly decided by cloud computing framework Ÿ Influenced only by – Splitting graph into files or tables – Encoding of data within files or tables Ÿ Goal: Reduce the processing effort of queries Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 50. Storing and Querying Semantic Data in the Cloud 50Daniel Janke & Steffen Staab Molecule Graph Splits Ÿ Split graph into molecules of directed diameter 1 Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 51. Molecule Graph Splits
      - Store molecules in key-value store (e.g., SHARD, Sempala)
      - Store molecules in one or several files (e.g., D-SPARQ, RAPID+)
      Example molecules (subject : property-object pairs):
        g:Gesis : (e:employs gesis:wanja)
        g:wanja : (f:knows w:daniel), (f:givenname “Wanja”)
        w:WeST : (e:employs w:martin), (e:employs w:daniel)
        w:martin : (f:knows g:wanja), (f:givenname “Martin”)
        ...
  • 52. Storing and Querying Semantic Data in the Cloud 52Daniel Janke & Steffen Staab Pros: Ÿ Easy to compute Ÿ Selection of required molecules easy, if subjects are given in the context Ÿ Subject-subject joins can be easily processed Cons: Ÿ If subject is not given in the context all molecules have to be processed Ÿ Extending molecules by incoming edges or longer diameters increases dataset size Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop Molecule Graph Splits
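A molecule split of directed diameter 1 amounts to grouping triples by subject. The following Python sketch illustrates this under the assumption that a plain in-memory dictionary stands in for the key-value store or file; `molecule_split` and `lookup` are hypothetical helper names.

```python
from collections import defaultdict


def molecule_split(triples):
    """Group triples by subject: each subject anchors a molecule of diameter 1."""
    molecules = defaultdict(list)
    for s, p, o in triples:
        molecules[s].append((p, o))
    return dict(molecules)


def lookup(molecules, subject, prop):
    """If the subject is known, only one molecule has to be inspected."""
    return [o for p, o in molecules.get(subject, []) if p == prop]


# A subset of the running example graph
triples = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("g:wanja", "f:givenname", "Wanja"),
]
molecules = molecule_split(triples)
wanja_name = lookup(molecules, "g:wanja", "f:givenname")
```

When the subject is not given in the query, every molecule has to be scanned, which is the first drawback listed above.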
  • 53. Storing and Querying Semantic Data in the Cloud 53Daniel Janke & Steffen Staab Vertical Graph Splits Ÿ Create a file/table for each property Ÿ Store all triples with that property in the file/table Ÿ Examples: Jena-HBase, SPARQLGX Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 54. Storing and Querying Semantic Data in the Cloud 54Daniel Janke & Steffen Staab Pros: Ÿ Easy to compute Cons: Ÿ Queries that match with a path of length l will match with at most l files/tables, if the property is given in the context Ÿ Files/tables of frequent properties like rdf:type can become large Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop Vertical Graph Splits
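A vertical split can be illustrated in a few lines of Python. This is a sketch only: in-memory lists stand in for the files or tables, and `vertical_split` is a hypothetical name. The example shows that a path query of length 2 with known properties touches exactly two tables.

```python
from collections import defaultdict


def vertical_split(triples):
    """Create one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    return dict(tables)


# A subset of the running example graph
triples = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:givenname", "Martin"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:daniel", "f:givenname", "Daniel"),
]
tables = vertical_split(triples)

# "How are the employees of WeST called?" reads the e:employs table
# and then the f:givenname table.
employees = [o for s, o in tables["e:employs"] if s == "w:WeST"]
names = [o for s, o in tables["f:givenname"] if s in employees]
```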
  • 55. Storing and Querying Semantic Data in the Cloud 55Daniel Janke & Steffen Staab Hash-Based Graph Cover Strategies Ÿ Assignment of triples based on a hash function Ÿ Possible properties of hash functions – Determinism The same input will always produce the same output – Uniformity Inputs are evenly mapped over output range – Non-invertible Based on a hash value the input datum cannot be reconstructed – Continuity The order of the hash values reflect the order of the input values Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 56. Storing and Querying Semantic Data in the Cloud 56Daniel Janke & Steffen Staab Hash Cover Hash function applied on the subjects: Result: Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 57. Storing and Querying Semantic Data in the Cloud 57Daniel Janke & Steffen Staab Pros: Ÿ Easy to compute Ÿ Chunks are of almost equal size Cons: Ÿ Paths are more likely to contain triples that were assigned to different compute nodes Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop Hash Cover
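A hash cover on subjects can be sketched as below. The choice of CRC32 as hash function and the names `node_for` and `hash_cover` are assumptions made for illustration; a real store may use any deterministic, uniform hash.

```python
import zlib
from collections import defaultdict


def node_for(subject, num_nodes):
    """Deterministic hash of the subject, mapped onto the compute nodes."""
    return zlib.crc32(subject.encode("utf-8")) % num_nodes


def hash_cover(triples, num_nodes):
    """Assign every triple to the compute node selected by its subject."""
    chunks = defaultdict(list)
    for triple in triples:
        chunks[node_for(triple[0], num_nodes)].append(triple)
    return dict(chunks)


# A subset of the running example graph
triples = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:givenname", "Wanja"),
]
chunks = hash_cover(triples, 2)
```

Because the assignment is a pure function of the subject, the location of all triples with a given subject can be recomputed later without storing an explicit index. Python's built-in `hash` is avoided here because it is randomized per process for strings.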
  • 58. Storing and Querying Semantic Data in the Cloud 58Daniel Janke & Steffen Staab Graph-Clustering-Based Graph Cover Strategies Graph clustering Ÿ Split graph into pairwise disjoint graph chunks, i.e., partitions (aka shards) Ÿ Usually vertices are assigned to partitions Ÿ Partitions satisfy some clustering properties Vertex-cut transformation: Ÿ In RDF triples cannot be cut Ÿ Assign triples to partition to which the subject was assigned to Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 59. Storing and Querying Semantic Data in the Cloud 59Daniel Janke & Steffen Staab Minimal Edge-Cut Cover Ÿ Number of cut edges should be reduced Ÿ Number of vertices in each partition should be ideally the same Ÿ After vertex-cut transformation: Number of edges per partition is unbalanced Ÿ Examples: [Huang2011], [Peng2016] Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 60. Minimal Edge-Cut Cover
      Pros:
      - Likelihood that a path only contains triples of the same compute node is high
      - #vertices per chunk is balanced
      Cons:
      - High computational effort (heuristic approaches are in O(|V|*log(|V|)))
      - #triples per chunk is unbalanced
      Example: one chunk with 4 vertices and 7 triples, another chunk with 4 vertices and 3 triples
  • 61. Storing and Querying Semantic Data in the Cloud 61Daniel Janke & Steffen Staab Workload-Aware Graph Cover Strategies General idea: Assign triples based on a historic query workload General procedure: 1. Generalize from actual queries to handle unseen queries 2. Identify triples that are required to answer generalized queries 3. Assign triples to compute nodes – All triples required to produce all query results are assigned to the same compute node – Distribute triple sets for the individual results equally among all compute nodes Examples: WARP, DiploCloud Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 62. Storing and Querying Semantic Data in the Cloud 62Daniel Janke & Steffen Staab Pros: Ÿ Good query performance for queries similar to the ones in the historic query workload Cons: Ÿ High computational effort Ÿ Historic query workload required Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop Workload-Aware Graph Cover Strategies
  • 63. n-hop Replication
      - Based on an initial graph cover with chunks
      - Replicate triples such that all paths of length n
        – starting at a subject contained in a chunk
        – consist of triples assigned to that same chunk
      Example: VB-Partitioner
  • 64. Storing and Querying Semantic Data in the Cloud 64Daniel Janke & Steffen Staab Pros: Ÿ Paths of length <=n are guaranteed to belong to one chunk Cons: Ÿ Higher computational effort Ÿ Dataset size increases Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop n-hop Replication
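The replication step can be sketched in Python as a bounded breadth-first expansion per chunk. This is an illustrative sketch under the assumption that the whole graph is accessible while the cover is computed; `n_hop_replication` is a hypothetical name and not the VB-Partitioner implementation.

```python
def n_hop_replication(chunks, all_triples, n):
    """Extend each chunk so that every path of length <= n starting at one of
    its subjects consists only of triples in that chunk."""
    by_subject = {}
    for t in all_triples:
        by_subject.setdefault(t[0], []).append(t)

    result = {}
    for node, chunk in chunks.items():
        extended = set(chunk)
        frontier = {s for s, _, _ in chunk}
        for _ in range(n):
            reached = set()
            for s in frontier:
                for t in by_subject.get(s, []):
                    extended.add(t)   # replicate the triple into this chunk
                    reached.add(t[2])  # continue the path at its object
            frontier = reached
        result[node] = extended
    return result


# Hypothetical initial cover of a subset of the running example graph
chunks = {
    1: {("w:WeST", "e:employs", "w:martin"), ("w:WeST", "e:employs", "w:daniel")},
    2: {("w:martin", "f:givenname", "Martin"), ("w:daniel", "f:givenname", "Daniel")},
}
all_triples = chunks[1] | chunks[2]
replicated = n_hop_replication(chunks, all_triples, 2)
```

In this example chunk 1 grows from two to four triples, which illustrates why the dataset size exceeds 100% after replication.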
  • 65. Summary of Static Graph Cover Strategies
      - Complexity: Cloud: Low; Hash: Low; Clustering: High; Workload: High; N-hop: Medium
      - Chunk sizes: Cloud: Imbalanced; Hash: Balanced; Clustering: Imbalanced; Workload: -; N-hop: -
      - Dataset size: Cloud: 100%; Hash: 100%; Clustering: 100%; Workload: >= 100%; N-hop: > 100%
      - Path containment: Cloud: Low; Hash: Low; Clustering: High; Workload: High; N-hop: Medium
      - Query parallelization: Cloud: Medium; Hash: High; Clustering: Low; Workload: Low/High; N-hop: -
  • 66. Storing and Querying Semantic Data in the Cloud 66Daniel Janke & Steffen Staab Overview Graph Cover Strategies Graph Cover Strategies Static Dynamic Cloud-computing-based Hash-based Graph-clustering-based Workload-aware N-hop replication
  • 67. Storing and Querying Semantic Data in the Cloud 67Daniel Janke & Steffen Staab Dynamic Graph Cover Strategies Ÿ Adaptation of graph cover during runtime Ÿ Types of dynamics – Adaptation of graph cover to actual query workload – If one chunk becomes overloaded due to insertions of new triples, move triples to other chunks Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 68. Storing and Querying Semantic Data in the Cloud 68Daniel Janke & Steffen Staab Adaptation to Actual Query Workload Ÿ Initial static graph cover Ÿ Keep track how frequently - triple patterns - molecules are queried together Ÿ Replicate triples such that – Data transfer is reduced – Workload is equally distributed among compute nodes Examples: PHD-Store, AdHash, Sedge Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 69. Storing and Querying Semantic Data in the Cloud 69Daniel Janke & Steffen Staab Dynamic Redistribution of Triples Ÿ If one compute node stores too many triples (in comparison to others), redistribute triples based on their hash values Ÿ If triples are stored in an ordered fashion, send one half to another compute node Examples: [Battré2007], [Osorio2017] Graph Cover Strategies Static Dynamic Cloud Hash Clustering Workload N-hop
  • 70. Storing and Querying Semantic Data in the Cloud 70Daniel Janke & Steffen Staab Indices How to identify compute nodes that store required data?
  • 71. Storing and Querying Semantic Data in the Cloud 71Daniel Janke & Steffen Staab Example Where is the information stored to answer the query: How are the employees of WeST called? Hash cover on subjects
  • 72. Properties of Indices
      Graph cover independence:
      - How independent is the index from the graph cover strategy?
      Storage consumption:
      - How much storage space is required for the index?
      Access time:
      - How fast can the location of an indexed element be retrieved?
      Indexed elements:
      - Which elements are indexed?
  • 73. Storing and Querying Semantic Data in the Cloud 73Daniel Janke & Steffen Staab Overview Indices Indices Centralized Decentralized Hash-based Statistics-based Summary-graph-based Hash-based Schema-based l Faster access l Higher degree of aggregation l Slower access l Lower degree of aggregation
  • 75. Storing and Querying Semantic Data in the Cloud 75Daniel Janke & Steffen Staab Centralized Hash-Based Index Ÿ Applicable only for hash covers Ÿ No explicit index required Ÿ Location of a triple can be recomputed by the hash function and the number of chunks Ÿ Examples: 4store, Trinity.RDF How are the employees of WeST called? hash(w:WeST) → compute node 2 e:employs ? f:givenname ? (w:WeST, e:employs) ? Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 76. Storing and Querying Semantic Data in the Cloud 76Daniel Janke & Steffen Staab Pros: Ÿ Easy to compute occurrences Ÿ No explicit index required – No storage consumption Cons: Ÿ Only applicable for hash covers Ÿ Only applicable for hashed elements (subject, property, object) Centralized Hash-Based Index Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 77. Centralized Statistics-Based Index
      - Collect occurrences of
        – Subject, property, object labels
        – Combinations of subject, property, object labels
        – RDFs types
        – Property sets of molecules
      - Examples: DARQ, FedX, Sedge
      Example for the query "How are the employees of WeST called?" (occurrences per chunk ID c1, c2):
      - w:WeST: as subject c1: 0, c2: 2; as property c1: 0, c2: 0; as object c1: 0, c2: 0
      - e:employs: as subject c1: 0, c2: 0; as property c1: 1, c2: 2; as object c1: 0, c2: 0
      - f:givenname: as subject c1: 0, c2: 0; as property c1: 2, c2: 1; as object c1: 0, c2: 0
      - ...
  • 78. Storing and Querying Semantic Data in the Cloud 78Daniel Janke & Steffen Staab Pros: Ÿ Independent of graph cover strategy Ÿ Can estimate number of results Ÿ Fast access Cons: Ÿ Requires compression for storage Ÿ Trade off: – Collecting only a few statistics → small size → less useful – Collecting many statistics → large size (possibly size of dataset) → more useful Centralized Statistics-Based Index Indices Centralized Decentralized Hash Statistics Summary Hash Schema
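A statistics-based index that counts subject, property and object labels per chunk can be sketched as follows. The chunk contents are an assumption chosen to be consistent with the counts shown for w:WeST, e:employs and f:givenname; `build_statistics` and `candidate_chunks` are hypothetical names.

```python
from collections import Counter


def build_statistics(chunks):
    """Count how often each label occurs per position and chunk."""
    stats = Counter()
    for chunk_id, triples in chunks.items():
        for s, p, o in triples:
            stats[("subject", s, chunk_id)] += 1
            stats[("property", p, chunk_id)] += 1
            stats[("object", o, chunk_id)] += 1
    return stats


def candidate_chunks(stats, chunk_ids, position, label):
    """Chunks that can contribute matches for a constant in a triple pattern."""
    return [c for c in chunk_ids if stats[(position, label, c)] > 0]


# Assumed assignment of triples to chunks c1 and c2
chunks = {
    "c1": [
        ("g:Gesis", "e:employs", "g:wanja"),
        ("g:wanja", "f:givenname", "Wanja"),
        ("w:martin", "f:givenname", "Martin"),
    ],
    "c2": [
        ("w:WeST", "e:employs", "w:martin"),
        ("w:WeST", "e:employs", "w:daniel"),
        ("w:daniel", "f:givenname", "Daniel"),
    ],
}
stats = build_statistics(chunks)
west_chunks = candidate_chunks(stats, ["c1", "c2"], "subject", "w:WeST")
name_chunks = candidate_chunks(stats, ["c1", "c2"], "property", "f:givenname")
```

The counts also give a rough upper bound on the number of matches per chunk, which is what allows result sizes to be estimated.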
  • 79. Storing and Querying Semantic Data in the Cloud 79Daniel Janke & Steffen Staab Centralized Summary-Graph-Based Index: TriAD Summarization algorithm: 1) Each chunk represented by chunk vertex 2) Start and end vertices of edges are substituted by corresponding chunk vertices 3) Duplicate edges are removed How are the employees of WeST called? Indices Centralized Decentralized Hash Statistics Summary Hash Schema
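The three summarization steps can be written as a single set comprehension in Python. This is a sketch of the steps as listed on the slide, not TriAD's actual code; the vertex-to-chunk assignment `chunk_of` is hypothetical.

```python
def summarize(triples, chunk_of):
    """Replace start and end vertices by their chunk vertices; the set removes
    duplicate edges."""
    return {(chunk_of[s], p, chunk_of[o]) for s, p, o in triples}


# A subset of the running example graph
triples = [
    ("w:WeST", "e:employs", "w:martin"),
    ("w:WeST", "e:employs", "w:daniel"),
    ("g:Gesis", "e:employs", "g:wanja"),
    ("w:martin", "f:knows", "g:wanja"),
    ("g:wanja", "f:knows", "w:daniel"),
    ("w:daniel", "f:knows", "w:martin"),
]
# Hypothetical assignment of vertices to two chunks
chunk_of = {
    "w:WeST": "c2", "g:Gesis": "c2",
    "w:martin": "c1", "w:daniel": "c1", "g:wanja": "c1",
}
summary = summarize(triples, chunk_of)
```

Here six triples collapse into two summary edges. A query is first matched against the summary graph to find out which chunks can contain results.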
  • 80. Storing and Querying Semantic Data in the Cloud 80Daniel Janke & Steffen Staab Centralized Summary-Graph-Based Index: EAGRE Summarization algorithm: 1) Determine property sets of all subjects 2) Group similar property sets 3) Store occurrences of each property set 4) Property sets become vertices 5) Replace start and end vertices of edges by their property set vertices Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 81. Storing and Querying Semantic Data in the Cloud 81Daniel Janke & Steffen Staab Centralized Summary-Graph-Based Index: EAGRE Summarization algorithm: 1) Determine property sets of all subjects 2) Group similar property sets 3) Store occurrences of each property set 4) Property sets become vertices 5) Replace start and end vertices of edges by their property set vertices How are the employees of WeST called? Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 82. Storing and Querying Semantic Data in the Cloud 82Daniel Janke & Steffen Staab Centralized Summary-Graph-Based Index Pros: Ÿ Independent of graph cover strategy Ÿ Identification of subqueries that can be answered locally Cons: Ÿ All triples with same subject have to be assigned to the same compute node Ÿ High storage consumption Ÿ Summary graph needs to be queried Ÿ Only properties are considered Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 83. Storing and Querying Semantic Data in the Cloud 83Daniel Janke & Steffen Staab Overview Indices Indices Centralized Decentralized Hash-based Statistics-based Summary-graph-based Hash-based Schema-based l Faster access l Higher degree of aggregation l Slower access l Lower degree of aggregation
  • 84. Storing and Querying Semantic Data in the Cloud 84Daniel Janke & Steffen Staab Decentralized Hash-Based Index Ÿ Version 1: – Centralized hash-based index on each compute node – Knowledge of all compute nodes required – Examples: HDRS, Virtuoso Clustered Edition Ÿ Version 2: – Each compute node knows a forward table for a few neighbours ▪ Ring structure overlay (e.g., RDFPeers, PAGE) ▪ Tree structure overlay (e.g., Grid Vine, 3RDF) Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 85. Storing and Querying Semantic Data in the Cloud 85Daniel Janke & Steffen Staab Ring Structure Overlay Ÿ Compute nodes are ordered Ÿ Each compute node knows – Its direct neighbour – A few distant neighbours Ÿ When a request arrives 1)The compute node storing the data is determined by the hash function 2)Request is forwarded to the (closest) compute node storing the data Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 86. Storing and Querying Semantic Data in the Cloud 86Daniel Janke & Steffen Staab Tree Structure Overlay Ÿ C1 – stores all data whose hash value starts with prefix 00 – Knows C2 is responsible for prefix 01 – Knows C3 is responsible for prefix 1 Ÿ When request arrives C1 – Computes hash value – Forwards request based on the known prefixes Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 87. Storing and Querying Semantic Data in the Cloud 87Daniel Janke & Steffen Staab Pros: Ÿ Easy to compute occurrences Ÿ Low storage consumption Cons: Ÿ Only applicable for hash covers Ÿ Only applicable for hashed elements (subject, property, object) Decentralized Hash-Based Index Indices Centralized Decentralized Hash Statistics Summary Hash Schema
  • 88. Decentralized Schema-Based Index
      - Applicable for type-based graph covers
      - Use type hierarchy as tree structure overlay
      - Example: SQPeer
      [Figure: type hierarchy over rdfs:Resource, rdf:Property, e:employs, f:givenname, rdfs:Class, f:Person and e:Institute, assigned to compute nodes C1 to C4]
  • 89. Storing and Querying Semantic Data in the Cloud 89Daniel Janke & Steffen Staab Pros: Ÿ Queries that contain types can be forwarded to corresponding compute node(s) Ÿ Low storage consumption Cons: Ÿ Efficiently applicable only for type-based graph covers Ÿ Types of requested resources need to be identified Ÿ Unbalanced index sizes Indices Centralized Decentralized Hash Statistics Summary Hash Schema Decentralized Schema-Based Index Used in combination with other indices
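  • 90. Summary Indices
      - Applicable to graph cover strategies: Centralized hash: Hash covers; Centralized statistics: All; Centralized summary graph: All; Decentralized hash: Hash covers; Decentralized schema: Type-based covers
      - Storage consumption: Centralized hash: Low; Centralized statistics: High; Centralized summary graph: High; Decentralized hash: Low; Decentralized schema: Low
      - Access time: Centralized hash: Fast; Centralized statistics: Slow; Centralized summary graph: Slow; Decentralized hash: Medium; Decentralized schema: Medium
      - Indexed elements: Centralized hash: Hash dependent; Centralized statistics: Various aggregations; Centralized summary graph: Properties; Decentralized hash: Hash dependent; Decentralized schema: Typed elements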
  • 91. Storing and Querying Semantic Data in the Cloud 91Daniel Janke & Steffen Staab Distributed Query Processing Strategies How to distribute query processing?
  • 92. Storing and Querying Semantic Data in the Cloud 92Daniel Janke & Steffen Staab Terminology: SPARQL Query SELECT ?name WHERE { <w:WeST> <e:employs> ?v1. ?v1 <f:givenname> ?name } How are the employees of WeST called? Variable Triple Pattern
  • 93. Storing and Querying Semantic Data in the Cloud 93Daniel Janke & Steffen Staab Terminology: Query Execution Tree SELECT ?name WHERE { <w:WeST> <e:employs> ?v1. ?v1 <f:givenname> ?name }
  • 94. Centralized Query Processing
      [Figure: example RDF graph]
      Evaluation of the example query:
      - <w:WeST> <e:employs> ?v1 yields ?v1: w:martin, w:daniel
      - ?v1 <f:givenname> ?name yields (?v1, ?name): (g:wanja, “Wanja”), (w:martin, “Martin”), (w:daniel, “Daniel”)
      - Join on ?v1 yields (?v1, ?name): (w:martin, “Martin”), (w:daniel, “Daniel”)
      - Projection on ?name yields “Martin”, “Daniel”
  • 95. Distributed Query Processing
      General procedure:
      1) Split the query into subqueries that can be executed locally
      2) Execute subqueries on compute nodes identified by index
      3) Join results of subqueries
      4) Return results
  • 96. Storing and Querying Semantic Data in the Cloud 96Daniel Janke & Steffen Staab Splitting Query into Subqueries Ÿ Simplest case: each triple pattern forms a subquery Ÿ Use knowledge about graph covers – All triples with same subject are stored on the same compute node – Paths of length n can be executed locally Ÿ Use index information – Co-occurrences of subject-property or property-property
  • 97. Properties of Join Operations
      Parallelisation:
      - Is the join computation distributed among several or all compute nodes?
      Computational effort:
      - How many comparisons are performed during the join computation?
      - How many subqueries result from the join computation?
      Data transfer:
      - How many intermediate results are transferred to compute the join?
      Blocking:
      - Do subqueries need to be finished before the join can be computed?
  • 98. Storing and Querying Semantic Data in the Cloud 98Daniel Janke & Steffen Staab Overview Join Processing Joins Centralized Distributed Hash join Bind join Replication-based join Hash join Merge join Merge join Nested-loop join Bind join Join is executed on a single compute node Join is distributed over several compute nodes
  • 100. Storing and Querying Semantic Data in the Cloud 100Daniel Janke & Steffen Staab Centralized Nested Loop Join Compare each element of first list with every element of second list Examples: SPLENDID, DARQ Pros: Ÿ Does not require an ordering Ÿ Arbitrary join conditions possible Cons: Ÿ Inefficient Joins Centralized Distributed Hash Bind Replication Hash Bind Merge Merge Nested ?v1 w:martin w:daniel ?v1 ?name w:martin “Martin” g:wanja “Wanja” w:daniel “Daniel”
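A nested loop join over the two intermediate result lists of the example can be sketched as follows. Variable bindings are modelled as Python dictionaries, which is an assumption of this sketch; `compatible` and `nested_loop_join` are hypothetical names.

```python
def compatible(a, b):
    """Two bindings are compatible if they agree on all shared variables."""
    return all(b[k] == v for k, v in a.items() if k in b)


def nested_loop_join(left, right):
    """Compare each element of the first list with every element of the second."""
    return [{**a, **b} for a in left for b in right if compatible(a, b)]


left = [{"?v1": "w:martin"}, {"?v1": "w:daniel"}]
right = [
    {"?v1": "w:martin", "?name": "Martin"},
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
]
joined = nested_loop_join(left, right)
```

The join performs len(left) * len(right) comparisons, here 2 * 3 = 6, which is why it is listed as inefficient. The `compatible` check could be replaced by any other join condition.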
  • 101. Storing and Querying Semantic Data in the Cloud 101Daniel Janke & Steffen Staab Centralized Merge Join Ÿ Requires sorted intermediate result lists Ÿ Compare one result r only with results that are <= r Ÿ Example: Partout Pros: Ÿ Fast for ordered result sets Cons: Ÿ Slow for unordered result sets Ÿ Intermediate result set size might lead to a bottleneck ?v1 w:daniel w:martin ?v1 ?name g:wanja “Wanja” w:daniel “Daniel” w:martin “Martin” Joins Centralized Distributed Hash Bind Replication Hash Bind Merge Merge Nested
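A merge join over the sorted lists of the example can be sketched as a textbook sort-merge join in Python. It assumes that both inputs are already sorted on the join variable and that bindings are dictionaries; `merge_join` is a hypothetical name and not Partout's implementation.

```python
def merge_join(left, right, var):
    """Join two lists of bindings that are both sorted on the variable var."""
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        lv, rv = left[i][var], right[j][var]
        if lv < rv:
            i += 1
        elif lv > rv:
            j += 1
        else:
            # Find the run of right-hand bindings that share this join value
            j_end = j
            while j_end < len(right) and right[j_end][var] == lv:
                j_end += 1
            # Combine every left binding with that value with the whole run
            while i < len(left) and left[i][var] == lv:
                for k in range(j, j_end):
                    result.append({**left[i], **right[k]})
                i += 1
            j = j_end
    return result


left = [{"?v1": "w:daniel"}, {"?v1": "w:martin"}]
right = [
    {"?v1": "g:wanja", "?name": "Wanja"},
    {"?v1": "w:daniel", "?name": "Daniel"},
    {"?v1": "w:martin", "?name": "Martin"},
]
joined = merge_join(left, right, "?v1")
```

Each list is scanned once, so the cost is linear in the input sizes plus the output size, provided the ordering already exists.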
  • 102. Storing and Querying Semantic Data in the Cloud 102Daniel Janke & Steffen Staab Centralized Hash Join Ÿ Assign results to buckets based on their hashes Ÿ Join a result only with corresponding bucket Ÿ Examples: ANAPSID, LHD ?v1 w:daniel w:martin ?v1 ?name g:wanja “Wanja” ... ?v1 ?name w:daniel “Daniel” ... ?v1 ?name w:martin “Martin” ... Joins Centralized Distributed Hash Bind Replication Hash Bind Merge Merge Nested A non-blocking symmetric version exists
  • 103. Centralized Hash Join
      Pros:
      - No ordering required
      - Bucket lookups take almost constant time on average
      Cons:
      - Intermediate result set size might lead to a bottleneck
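Both the blocking hash join and the non-blocking symmetric variant can be sketched in Python. This is a minimal sketch, not the code of ANAPSID or LHD; the function names and the `("L", row)` / `("R", row)` stream encoding are assumptions made for illustration.

```python
from collections import defaultdict


def hash_join(left, right, key):
    """Blocking hash join: build buckets from `left`, then probe with `right`."""
    buckets = defaultdict(list)
    for l in left:                       # build phase needs all of `left`
        buckets[l[key]].append(l)
    return [{**l, **r} for r in right for l in buckets.get(r[key], [])]


def symmetric_hash_join(stream, key):
    """Non-blocking variant: rows of both inputs arrive interleaved."""
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, row in stream:
        other = "R" if side == "L" else "L"
        tables[side][row[key]].append(row)           # insert into own table
        for match in tables[other].get(row[key], []):
            yield {**match, **row}                   # emit results immediately


left = [{"v1": "w:martin"}, {"v1": "w:daniel"}]
right = [
    {"v1": "w:martin", "name": "Martin"},
    {"v1": "g:wanja", "name": "Wanja"},
    {"v1": "w:daniel", "name": "Daniel"},
]

blocking_result = hash_join(left, right, "v1")

# Hypothetical arrival order of rows from both subqueries.
stream = [("R", right[0]), ("L", left[0]), ("L", left[1]), ("R", right[1]), ("R", right[2])]
streaming_result = list(symmetric_hash_join(stream, "v1"))
```

The symmetric version keeps a hash table per input, so it can emit a result as soon as both matching rows have arrived, at the cost of holding both inputs in memory.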
  • 104. Bind Join
      - Substitute variables of the second subquery based on results from the first subquery
      - The second subquery is executed multiple times
      - Examples: FedX, Avalanche, SemaGrow
      [Figure: for each ?v1 result (w:martin, w:daniel), the second subquery is executed with ?v1 bound, returning (w:martin, “Martin”) and (w:daniel, “Daniel”)]
  • 105. Bind Join
      Pros:
      - Reduces the amount of intermediate results
      Cons:
      - Increases number of executed subqueries
      - Possible bottlenecks:
          - Large intermediate result set sizes
          - Large number of subqueries
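The trade-off of the bind join, fewer intermediate results but more subqueries, can be seen in a small sketch. The endpoint is simulated by a local function; `NAMES`, `query_names` and `bind_join` are hypothetical names, and real systems such as FedX typically batch several bindings into one subquery.

```python
# Data held by the (simulated) source that answers the second subquery.
NAMES = [
    {"v1": "w:martin", "name": "Martin"},
    {"v1": "g:wanja", "name": "Wanja"},
    {"v1": "w:daniel", "name": "Daniel"},
]

executed_subqueries = []


def query_names(bindings):
    """Simulated second subquery with some variables already bound."""
    executed_subqueries.append(bindings)
    return [row for row in NAMES
            if all(row[var] == value for var, value in bindings.items())]


def bind_join(left, run_subquery, key):
    """Run the second subquery once per result of the first subquery."""
    joined = []
    for l in left:
        for r in run_subquery({key: l[key]}):   # variable substituted by a value
            joined.append({**l, **r})
    return joined


left = [{"v1": "w:martin"}, {"v1": "w:daniel"}]
result = bind_join(left, query_names, "v1")
```

The row for `g:wanja` is never transferred, but two subqueries are executed instead of one.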
  • 106. Summary Centralized Joins
      Columns: Nested; Merge; Hash; Symmetric; Bind
      - Computational effort: Nested = High; Merge = Medium (extra effort for ordering); Hash = Low; Symmetric = Low; Bind = Medium (effort of many subqueries)
      - # executed queries: Nested = Low; Merge = Low; Hash = Low; Symmetric = Low; Bind = High
      - Blocking operation: Nested = Yes; Merge = Yes; Hash = Yes; Symmetric = No; Bind = Yes
  • 108. Replication-Based Distributed Join
      All results of the first subquery are sent to all compute nodes on which the second subquery is executed.
      Example: SemStore
      [Figure: the ?v1 results {w:daniel, w:martin} are sent to each compute node that executes the second subquery; the nodes produce (w:martin, “Martin”) and (w:daniel, “Daniel”)]
  • 109. Replication-Based Distributed Join
      Pros:
      - Not all compute nodes are necessarily involved in joining
      - Uses data locality → less data transferred
      Cons:
      - Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
      - One subtree needs to be finished before the join can be executed
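A replication-based distributed join can be simulated in one process. This is only a sketch: compute nodes are modelled as dictionary entries, the placement of rows on `node1` and `node2` is an assumption, and `replication_join` is a hypothetical name rather than SemStore's API.

```python
def replication_join(left, chunks, key):
    """Ship all results of the first subquery to every node that runs the second.

    `chunks` maps a node name to the second subquery's results stored on it.
    Returns the joined rows and the number of rows sent over the network.
    """
    joined = []
    rows_sent = 0
    for node, local_rows in chunks.items():
        shipped = list(left)              # full copy of the first subquery's results
        rows_sent += len(shipped)
        wanted = {}
        for l in shipped:
            wanted.setdefault(l[key], []).append(l)
        for r in local_rows:              # local join on the receiving node
            for l in wanted.get(r[key], []):
                joined.append({**l, **r})
    return joined, rows_sent


left = [{"v1": "w:daniel"}, {"v1": "w:martin"}]
chunks = {
    "node1": [{"v1": "w:martin", "name": "Martin"}, {"v1": "g:wanja", "name": "Wanja"}],
    "node2": [{"v1": "w:daniel", "name": "Daniel"}],
}

result, rows_sent = replication_join(left, chunks, "v1")
```

Only the usually small result list of the first subquery travels; the second subquery's results stay where they were produced, which is the data locality mentioned in the pros.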
  • 110. Distributed Hash Join
      Hash join in which each compute node serves as a bucket.
      Example: DiploCloud
      [Figure: the results held on Compute Node 1 and Compute Node 2 are redistributed by hash(w:martin) and hash(w:daniel), so that rows with the same ?v1 value meet on the same compute node]
  • 111. Distributed Hash Join
      Pros:
      - All compute nodes are involved in join processing
      - Bottleneck is unlikely due to distribution of the intermediate result set over all compute nodes
      Cons:
      - No usage of data locality → high data transfer
      - One subtree needs to be finished before the join can be executed
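The idea that each compute node acts as one hash bucket can be sketched as follows. The nodes are simulated in a single process, `zlib.crc32` is chosen only to obtain a hash that is stable across runs, and the function names are hypothetical rather than DiploCloud's.

```python
import zlib


def node_for(value, n_nodes):
    """Pick the compute node (bucket) responsible for a join value."""
    return zlib.crc32(value.encode("utf-8")) % n_nodes


def distributed_hash_join(left, right, key, n_nodes):
    # Redistribution phase: every row of both inputs is sent to its bucket node.
    parts = [{"left": [], "right": []} for _ in range(n_nodes)]
    for row in left:
        parts[node_for(row[key], n_nodes)]["left"].append(row)
    for row in right:
        parts[node_for(row[key], n_nodes)]["right"].append(row)

    # Join phase: each node joins only the rows it received.
    joined = []
    for part in parts:
        buckets = {}
        for l in part["left"]:
            buckets.setdefault(l[key], []).append(l)
        for r in part["right"]:
            for l in buckets.get(r[key], []):
                joined.append({**l, **r})
    return joined


left = [{"v1": "w:martin"}, {"v1": "w:daniel"}]
right = [
    {"v1": "w:martin", "name": "Martin"},
    {"v1": "g:wanja", "name": "Wanja"},
    {"v1": "w:daniel", "name": "Daniel"},
]

result = sorted(distributed_hash_join(left, right, "v1", n_nodes=2),
                key=lambda row: row["v1"])
```

Every row of both inputs is moved once, which mirrors the high data transfer listed under the cons, while the join work itself is spread over all nodes.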
  • 112. Distributed Merge Join
      - Results of subqueries are ordered
      - Each compute node is responsible for a range of results
      - Examples: H2RDF+, SHARD, SparkRDF, SPARQLGX
      [Figure: one compute node handles range a:a-w:d and joins w:daniel with (g:wanja, “Wanja”), (w:daniel, “Daniel”); the other handles range w:e-z:z and joins w:martin with (w:martin, “Martin”)]
  • 113. Distributed Merge Join
      Pros:
      - All compute nodes are involved in join processing
      - Bottleneck is unlikely due to distribution of the intermediate result set over all compute nodes
      Cons:
      - Results need to be ordered
      - Agreement on result ranges required
      - No usage of data locality → high data transfer
      - One subtree needs to be finished before the join can be executed
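Range partitioning followed by local merge joins can be sketched as below. The boundary `"w:e"` mimics the two ranges from the slide's figure; the helper names are hypothetical and the compute nodes are simulated in one process.

```python
from bisect import bisect_right


def merge_sorted(left, right, key):
    """Merge join of two lists that are already sorted by `key`."""
    joined, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][key] < right[j][key]:
            i += 1
        elif left[i][key] > right[j][key]:
            j += 1
        else:
            k = j
            while k < len(right) and right[k][key] == left[i][key]:
                joined.append({**left[i], **right[k]})
                k += 1
            i += 1
    return joined


def distributed_merge_join(left, right, key, boundaries):
    """`boundaries` are the agreed split points between the nodes' ranges."""
    parts = [([], []) for _ in range(len(boundaries) + 1)]
    for row in left:
        parts[bisect_right(boundaries, row[key])][0].append(row)
    for row in right:
        parts[bisect_right(boundaries, row[key])][1].append(row)

    joined = []
    for local_left, local_right in parts:      # one iteration per compute node
        local_left.sort(key=lambda row: row[key])
        local_right.sort(key=lambda row: row[key])
        joined.extend(merge_sorted(local_left, local_right, key))
    return joined


left = [{"v1": "w:martin"}, {"v1": "w:daniel"}]
right = [
    {"v1": "w:martin", "name": "Martin"},
    {"v1": "g:wanja", "name": "Wanja"},
    {"v1": "w:daniel", "name": "Daniel"},
]

# Node 0 handles values below "w:e", node 1 handles the rest.
result = distributed_merge_join(left, right, "v1", boundaries=["w:e"])
```

Since the ranges are disjoint and ordered, concatenating the per-node outputs yields a globally sorted result.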
  • 114. Distributed Bind Join
      Join algorithm:
      1) Get results of the first subquery
      2) For each following bind join query:
          a) Identify compute nodes with matches
          b) Fork query execution to remote compute nodes
      Examples: RDFPeers, GridVine, Atlas, TripleRush, Trinity.RDF
      [Figure: query execution is forked with ?v1 = w:martin and ?v1 = w:daniel to the compute nodes holding (w:martin, “Martin”) and (w:daniel, “Daniel”)]
  • 115. Distributed Bind Join
      Pros:
      - Join computed without waiting for any subtree to be finished
      - Usage of data locality → less data transferred
      - Results of the last join operation do not need to be sent to other compute nodes
      Cons:
      - Intermediate result set size may become a bottleneck if the second subquery is executed on a single compute node
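The two steps of the distributed bind join, identifying compute nodes with matches and forking execution to them, can be sketched like this. The lookup index, the node names and both function names are assumptions made for illustration; they are not the mechanisms of RDFPeers, TripleRush or Trinity.RDF.

```python
def build_index(nodes, key):
    """Map each join value to the set of compute nodes that store matches."""
    index = {}
    for node, rows in nodes.items():
        for row in rows:
            index.setdefault(row[key], set()).add(node)
    return index


def distributed_bind_join(left, nodes, key):
    index = build_index(nodes, key)
    joined, forks = [], 0
    for l in left:                                   # result of the first subquery
        for node in sorted(index.get(l[key], ())):   # step a: nodes with matches
            forks += 1                               # step b: fork execution to that node
            for r in nodes[node]:
                if r[key] == l[key]:
                    joined.append({**l, **r})
    return joined, forks


nodes = {
    "node1": [{"v1": "w:martin", "name": "Martin"}, {"v1": "g:wanja", "name": "Wanja"}],
    "node2": [{"v1": "w:daniel", "name": "Daniel"}],
}
left = [{"v1": "w:martin"}, {"v1": "w:daniel"}, {"v1": "w:unknown"}]

result, forks = distributed_bind_join(left, nodes, "v1")
```

Each binding is forwarded as soon as it exists and only to nodes that can contribute, so no subtree has to finish first and bindings without matches cause no fork at all.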
  • 116. Distributed Joins Summary
      Columns: Centralized Joins; Distributed Replication; Distributed Hash; Distributed Merge; Distributed Bind
      - Data transfer: Centralized Joins = High; Distributed Replication = Low; Distributed Hash = High; Distributed Merge = High; Distributed Bind = Low
      - Parallelisation: Centralized Joins = Low; Distributed Replication = Medium; Distributed Hash = High; Distributed Merge = High; Distributed Bind = Medium
      - # Subqueries: Centralized Joins = Low; Distributed Replication = Low; Distributed Hash = Low; Distributed Merge = Low; Distributed Bind = High
  • 117. Fault Tolerance: How to achieve fault tolerance?
  • 118. Mirroring
      - There exist several identical copies of each compute node
      - If one compute node fails, its copy continues working
      - Example: Virtuoso Clustered Edition
      Pros:
      - Query workload can be distributed among all copies
      Cons:
      - Copies must be kept up to date
      - Replicas of different chunks are not combined to increase data locality
      [Figure: Compute Node 1 and Compute Node 2 with their copies Compute Node 1’ and Compute Node 2’]
  • 119. Data Replication
      - All compute nodes are ordered in a ring
      - Data from one compute node is replicated on its neighbours
      - If one compute node fails, its data remains available on the neighbours
      - Examples: 4store, RDFPeers
      Pros:
      - Data locality of the initial graph cover is increased
      Cons:
      - Copies must be kept up to date
      [Figure: Compute Nodes 1, 2 and 3 arranged in a ring, holding chunks 1, 2, 3 and the replicas 1’, 2’, 3’]
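Ring-based data replication can be sketched with simple modular arithmetic. The sketch assumes that each chunk is replicated on the next node(s) in ring order; the actual neighbour choice in 4store or RDFPeers may differ, and both function names are hypothetical.

```python
def replica_holders(owner, n_nodes, copies=1):
    """Nodes storing a chunk: its owner plus the next `copies` nodes in the ring."""
    return [(owner + step) % n_nodes for step in range(copies + 1)]


def readable_from(owner, n_nodes, failed, copies=1):
    """Nodes from which the owner's chunk can still be read after failures."""
    return [node for node in replica_holders(owner, n_nodes, copies)
            if node not in failed]


# Three compute nodes (0, 1, 2), one replica per chunk.
holders_of_last_chunk = replica_holders(2, 3)          # wraps around the ring
after_one_failure = readable_from(1, 3, failed={1})
after_two_failures = readable_from(1, 3, failed={1, 2})
```

With one replica per chunk the data survives any single node failure, but not the simultaneous failure of a node and its successor.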
  • 120. Evaluation Methodology: How to evaluate?
  • 121. Properties of Evaluation Methodologies
      - Realism: Do the measurement results reflect the performance of real RDF stores?
      - Modularity: Can alternative implementations of individual components be evaluated?
      - Evaluation depth: Is the system evaluated only as a whole, or is the performance of the individual components evaluated as well?
      - Difficulty: How difficult is it to apply the evaluation methodology?
  • 122. Black Box Evaluation
      Evaluation of RDF stores as a whole.
      Some problems (of many):
      - How fast is your network?
      - How large are your images?
      - Which processor configuration do you use?
      - What are the structures of your caches?
      Do you evaluate the RDF store or your hardware configuration?
      [Figure: Dataset, Queries]
  • 123. Black Box Evaluation
      Evaluation of RDF stores as a whole.
      Pros:
      - Easy to perform evaluation since no implementation knowledge is required
      - Measurements reflect the behaviour of a real RDF store
      Cons:
      - Only superficial evaluations possible
      - No performance evaluation of individual components possible
      [Figure: Dataset, Queries]
  • 124. Glass Box Evaluation
      - Evaluation of RDF stores as a whole
      - Collecting performance measurements of components by
          - Using a profiling system like Granula
          - Adapting source code to perform measurements
      [Figure: Dataset, Queries]
  • 125. Glass Box Evaluation
      Pros:
      - In-depth performance evaluation possible
      - Measurements reflect the behaviour of a real RDF store
      Cons:
      - Source code needs to be extended to collect measurements
      - Individual components can hardly be replaced by alternative implementations
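Adapting source code to collect per-component measurements can be as simple as wrapping each component in a timer. The sketch below uses only the Python standard library and does not use Granula; the component names and the `measured` helper are hypothetical, and the workloads are placeholders.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per component of the (hypothetical) RDF store.
timings = defaultdict(float)


@contextmanager
def measured(component):
    """Add the elapsed time of the wrapped block to the component's total."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component] += time.perf_counter() - start


# Placeholder workloads standing in for real components.
with measured("query_parsing"):
    sum(range(1000))

with measured("join_processing"):
    sorted(range(1000), reverse=True)

with measured("join_processing"):      # repeated calls accumulate
    sorted(range(1000))

report = dict(timings)
```

This kind of instrumentation gives evaluation depth, but it has to be written into the store's source code, which is the first of the cons listed above.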
  • 126. Simulation-based Glass Box Evaluation
      Evaluation of alternative implementations of a single component by simulating the behaviour of a real RDF store.
      Pros:
      - Performance evaluation of individual components possible
      - Alternative implementations of individual components can be evaluated
      Cons:
      - Evaluation environment (simulator) needs to be implemented
      - Questionable whether performance measurements reflect the behaviour of a real RDF store
      [Figure: Dataset, Queries, alternative Component implementations]
  • 127. Glass Box Evaluation Platform
      An RDF store that
      - allows the exchange of individual components by alternative implementations
      - measures the performance of individual components
      [Figure: Dataset, Queries, exchangeable Graph Cover Creator implementations]
  • 128. Glass Box Evaluation Platform
      Pros:
      - In-depth performance evaluation possible
      - Alternative implementations of individual components can be evaluated
      - Measurements reflect the behaviour of a real RDF store
      Cons:
      - Development of a glass box evaluation platform is difficult
      - Interdependencies might limit the exchangeability of components
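The core idea of such a platform, exchangeable components behind a common interface plus per-component measurements, can be sketched for the graph cover creator. Everything here is illustrative: the subject-hash cover is an assumed reading of "hash cover", the round-robin cover is a made-up alternative, and `evaluate` is a hypothetical harness, not an existing platform's API.

```python
import time
import zlib


def hash_cover(triples, n_nodes):
    """Assumed hash cover: assign each triple by the hash of its subject."""
    chunks = [[] for _ in range(n_nodes)]
    for s, p, o in triples:
        chunks[zlib.crc32(s.encode("utf-8")) % n_nodes].append((s, p, o))
    return chunks


def round_robin_cover(triples, n_nodes):
    """Hypothetical alternative: deal triples out to the nodes in turn."""
    chunks = [[] for _ in range(n_nodes)]
    for i, triple in enumerate(triples):
        chunks[i % n_nodes].append(triple)
    return chunks


def evaluate(cover_creators, triples, n_nodes):
    """Run every exchangeable implementation and record its measurements."""
    report = {}
    for name, create_cover in cover_creators.items():
        start = time.perf_counter()
        chunks = create_cover(triples, n_nodes)
        report[name] = {
            "seconds": time.perf_counter() - start,
            "chunk_sizes": [len(chunk) for chunk in chunks],
        }
    return report


triples = [
    ("w:martin", "f:givenname", "Martin"),
    ("g:wanja", "f:givenname", "Wanja"),
    ("w:daniel", "f:givenname", "Daniel"),
]

report = evaluate({"hash": hash_cover, "round_robin": round_robin_cover},
                  triples, n_nodes=2)
```

Because both creators share the signature `(triples, n_nodes) -> chunks`, the harness can swap them freely; the interdependencies mentioned in the cons arise when later components rely on properties of one particular cover.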
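  • 129. Evaluation Methodology Summary
      Columns: Black box; Glass box; Simulation; Glass box platform
      - Realism: Black box = High; Glass box = High; Simulation = Low; Glass box platform = Medium
      - Modularity: Black box = Low; Glass box = Low; Simulation = High; Glass box platform = High
      - Evaluation depth: Black box = Low; Glass box = High; Simulation = High; Glass box platform = High
      - Difficulty: Black box = Easy; Glass box = Medium; Simulation = Medium; Glass box platform = Hard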
  • 130. Conclusion & Open Challenges
  • 131. Conclusion
      Challenges of RDF stores in the cloud:
      1) How to design the architecture?
      2) How to distribute the data?
      3) How to identify compute nodes that store required data?
      4) How to distribute query processing?
      5) How to achieve fault tolerance?
      6) How to evaluate?
  • 132. Example RDF Stores in the Cloud
      Columns: Virtuoso Clustered Edition; BlazeGraph; GraphDB
      - Architecture: Virtuoso Clustered Edition = Master-slave; BlazeGraph = Master-slave; GraphDB = Master-slave
      - Graph cover strategy: Virtuoso Clustered Edition = Hash cover; BlazeGraph = Distributed B+-tree; GraphDB = Replication of graph on all slaves
      - Index: Virtuoso Clustered Edition = Centralized hash-based index on each compute node; BlazeGraph = Distributed B+-tree; GraphDB = Not necessary
      - Query execution strategy: Virtuoso Clustered Edition = Distributed bind join; BlazeGraph = Centralized join; GraphDB = Centralized join
      - Fault tolerance: Virtuoso Clustered Edition = Mirroring; BlazeGraph = None; GraphDB = Mirroring
  • 133. Example RDF Stores in the Cloud
      Columns: DiploCloud; S2RDF; Trinity.RDF
      - Architecture: DiploCloud = Master-slave; S2RDF = Batch processing framework; Trinity.RDF = Master-slave
      - Graph cover strategy: DiploCloud = Workload-aware; S2RDF = Vertical graph splits; Trinity.RDF = Hash cover
      - Index: DiploCloud = Centralized statistics-based index; S2RDF = None; Trinity.RDF = Distributed chunk-integrated summary graph
      - Query execution strategy: DiploCloud = Centralized join (for small result sets), distributed hash join (otherwise); S2RDF = Distributed joins; Trinity.RDF = Distributed bind join
      - Fault tolerance: DiploCloud = None; S2RDF = Based on batch processing framework; Trinity.RDF = None
  • 134. Challenges Not Presented
      - How to achieve transactional security?
      - How to perform online analytical processing (OLAP) queries?
      - How to process property paths?
      - How to perform distributed reasoning?
      - How to perform distributed stream processing?
  • 135. Thank you for your attention! Daniel Janke, Steffen Staab
  • 136. Image References
      - https://openclipart.org/detail/155101/server
      - https://openclipart.org/detail/213252/gear-icon
      - https://openclipart.org/detail/204067/bpm-mail-symbol
      - https://openclipart.org/detail/169757/check-and-cross-marks
      - https://openclipart.org/detail/153577/stopwatch
  • 137. References
      [Huang2011] Huang, J., Abadi, D.J., Ren, K.: Scalable SPARQL Querying of Large RDF Graphs. PVLDB 4(11), 1123–1134 (2011)
      [Peng2016] Peng, P., Zou, L., Özsu, M.T., Chen, L., Zhao, D.: Processing SPARQL Queries over Distributed RDF Graphs. The VLDB Journal 25(2), 243–268 (2016)
      [Battré2007] Battré, D., Heine, F., Höing, A., Kao, O.: On Triple Dissemination, Forward-Chaining, and Load Balancing in DHT Based RDF Stores. In: Moro, G., Bergamaschi, S., Joseph, S., Morin, J.H., Ouksel, A.M. (eds.) Databases, Information Systems, and Peer-to-Peer Computing. pp. 343–354. Springer Berlin Heidelberg, Berlin, Heidelberg (2007)
      [Osorio1017] Osorio, M., Aranda, C.B.: Storage Balancing in P2P Based Distributed RDF Data Stores. In: Proceedings of the Workshop on Decentralizing the Semantic Web 2017 co-located with 16th International Semantic Web Conference (ISWC 2017) (2017)