SlideShare une entreprise Scribd logo
1  sur  175
Semantic Data Management in
Graph Databases
-Tutorial at ESWC 2014-
Maria-Esther
Vidal
Edna
Ruckhaus
Maribel
Acosta
Cosmin
Basca
USB
1
Alejandro
Flores
2
Network of Friends in a High School
Relationships among artists in Last.fm
http://sixdegrees.hu/last.fm/
Network structure of music genres and
their stylistic origin
http://www.infosysblogs.com/web2/2013/01/network_structure_of_music_gen.html
Network structure of Patent Citations
http://www.infosysblogs.com/web2/2013/07/
Graphs …
3
A significant increase of graph data in the form of social & biological information.
Tasks to be Solved …
Patterns of connections
between people to understand
functioning of society.
Topological properties
of graphs can be used to identify
patterns that reveal phenomena,
anomalies and potentially lead to
a discovery.
Tasks to be Solved … (2)
Degree
4
The importance of the information relies on the relations more or equal
than on the entities.
In-degree
Out-Degree Hubs
http://www.infosysblogs.com/web2/2013/01/network_structure_of_music_gen.html
Basic Graph Operations
Basic Graph operations that need to be efficiently
performed:
 Graph traversal.
 Measuring path length.
 Finding the most interesting central node.
 Graph Invariants.
5
NoSQL-based Approaches
These approaches can handle unstructured,
unpredictable or messy data.
6
Wide-column
stores
BigTable
model of
Google
Cassandra
Document
Stores
Semi-structured
Data
MongoDB
Key-value
stores
Berkeley DB
Graph
Databases
Graph-based
oriented data
Agenda Morning Session
7
Basic Concepts & Background (60 min)
The Graph Data Management Paradigm ( 35 min.)
Coffee Break (30 min.)
Existing Graph Database Engines (75 min.)
Questions & Discussion (15 min.)
Lunch Break (90 min.)
1
2
3
4
Agenda Afternoon Session
8
RDF-Based Engines (45 min)
Empirical Evaluation & Hands-On Description( 45 min.)
Coffee Break (30 min.)
Hands-On (75 min.)
Questions & Discussion (15 min.)
1
2
3
4
BASIC CONCEPTS &
BACKGROUND
9
1
Abstract Data Type Graph
G=(V, E,Σ,L) is a graph:
 V is a finite set of nodes or vertices,
e.g. V={Term, forOffice, Organization,…}
 E is a finite set of edges representing
binary relationship between elements in V,
e.g. E={(forOffice,Term)
(forOffice,Organization),(Office,Organization)…}
 Σis a set of labels,
e.g., Σ={domain, range, sc, type, …}
 L is a function: V x V  Σ,
10
e.g., L={((forOffice,Term),domain), ((forOffice,Organization),range)… }
Abstract Data Type Multi-Graph
G=(V, E,Σ,L) is a multi-graph:
 V is a finite set of nodes or vertices,
e.g. V={Term, forOffice, Organization,…}
 E is a set of edges representing binary
relationship between elements in V,
e.g. E={(forOffice,Term)
(forOffice,Organization),(Office,Organization)…}
 Σis a set of labels,
e.g., Σ={domain, range, sc, type, …}
 L is a function: V x V  PowerSet(Σ),
11
e.g., L={((forOffice,Term),{domain}), ((forOffice,Organization),{range}),
((_id0,AZ),{forOffice, forOrganization})… }
Basic Operations
Given a graph G, the following are operations over G:
 AddNode(G,x): adds node x to the graph G.
 DeleteNode(G,x): deletes the node x from graph G.
 Adjacent(G,x,y): tests if there is an edge from x to y.
 Neighbors(G,x): nodes y s.t. there is an edge from x to y.
 AdjacentEdges(G,x,y): set of labels of edges from x to y.
 Add(G,x,y,l): adds an edge between x and y with label l.
 Delete(G,x,y,l): deletes an edge between x and y with label l.
 Reach(G,x,y): tests if there a path from x to y.
 Path(G,x,y): a (shortest) path from x to y.
 2-hop(G,x): set of nodes y s.t. there is a path of length 2 from x to y, or from
y to x.
 n-hop(G,x): set of nodes y s.t. there is a path of length n from x to y, or from
y to x.
12
Implementation of Graphs
13
Adjacency
List
For each node a list
of neighbors.
If the graph is
directed,
adjacency list of i
contains only the
outgoing nodes of
i.
Cheaper for
obtaining the
neighbors of a
node.
Not suitable for
checking if there
is an edge
between two
nodes.
Incidence
List
Vertices and edges
are stored as records
or objects.
Each vertex stores
incident edges.
Each edge stores
incident nodes.
Incidence
Matrix
Bidimensional
representation of
graph.
Rows represent
Vertices.
Columns
represent edges
An entry of 1
represents that
the Source
Vertex is incident
to the Edge.
Adjacency
Matrix
Bidimensional
representation of
graph.
Rows represent
Source Vertices.
Columns represent
Destination Vertices.
Each entry with 1
represents that
there is an edge
from the source
node to the
destination node.
Compressed
Adjacency
Matrix
Differential
encoding
between two
consecutive
nodes
[Sakr and Pardede 2012]
Adjacency List
14
V1
V2
V3
V4
(V1,{L2}) (V3,{L3}
)
(V1,{L1})
Properties:
 Storage: O(|V|+|E|+|L|)
 Adjacent(G,x,y): O(|V|)
 Neighbors(G,x): O(|V|)
 AdjacentEdges(G,x,y): O(|V|+|E|)
 Add(G,x,y,l): O(|V|+|E|)
 Delete(G,x,y,l): O(|V|+|E|)
L2 L3
L1
V1 V2 V3
V4
Implementation of Graphs
15
Adjacency
List
For each node a list
of neighbors.
If the graph is
directed, adjacency
list of i contains
only the outgoing
nodes of i.
Cheaper for
obtaining the
neighbors of a
node.
Not suitable for
checking if there
is an edge
between two
nodes.
Incidence
List
Vertices and edges
are stored as records
of objects.
Each vertex stores
incident edges.
Each edge stores
incident nodes.
Adjacency
Matrix
Bidimensional
representation of
graph.
Rows represent
Source Vertices.
Columns represent
Destination Vertices.
Each entry with 1
represents that
there is an edge
from the source
node to the
destination node.
Incidence
Matrix
Bi-dimensional
representation
of graph.
Rows
represent
Vertices.
Columns
represent
edges
An entry of 1 represents
that the Source Vertex is
incident to the Edge.
Compressed
Adjacency
Matrix
Differential
encoding
between two
consecutive
nodes
[Sakr and Pardede 2012]
Incidence List
16
Properties:
 Storage: O(|V|+|E|+|L|)
 Adjacent(G,x,y): O(|E|)
 Neighbors(G,x): O(|E|)
 AdjacentEdges(G,x,y): O(|E|)
 Add(G,x,y,l): O(|E|)
 Delete(G,x,y,l): O(|E|)
(source,L2) (source,L3)
(source,L1
)
(destination,L3)
(V4,V1)
(V2,V1)
(V2,V3)
(destination,L
2)
(destination,L
1)
V1
V2
V3
V4
L1
L2
L3
L2 L3
L1
V1 V2 V3
V4
Implementation of Graphs
17
Adjacency
List
For each node a
list of neighbors.
If the graph is
directed,
adjacency list of i
contains only the
outgoing nodes of
i.
Cheaper for
obtaining the
neighbors of a
node.
Not suitable for
checking if there
is an edge
between two
nodes.
Incidence
List
Vertices and edges
are stored as
records of objects.
Each vertex stores
incident edges.
Each edge stores
incident nodes.
Compressed
Adjacency
List
Differential
encoding
between two
consecutive
nodes
Adjacency
Matrix
Bidimensional
graph
representation.
Rows represent
source vertices.
Columns represent
destination vertices.
Each non-null
entry represents
that there is an
edge from the
source node to
the destination
node.
Incidence
Matrix
Bidimensiona
l graph
representatio
n.
Rows
represent
Vertices.
Columns
represent
edges
A non-null entry
represents that the
source vertex is incident
to the Edge.
[Sakr and Pardede 2012]
Compressed Adjacency List
Gap Encoding
18
Node Neighbors
1 20,23,24,25,27
2 30,32,33,34
3 40,42,45,46,47,48
…..
Node Successors
1 38,2,0,0,1
2 56,1,0,0
3 74,1,2,0,0,0
…..
v(x) =
2x if x ³ 0
2 x -1 if x < 0
ì
í
ï
îï
The ordered adjacent list:
A(x)=(a1,a2,a3,…,an)
is encoded as
(v(a1-x),a2-a1-1,a3-a2-1,…, an-an-1 -1)
Where:
Properties:
 Storage: O(|V|+|E|+|L|)
 Adjacent(G,x,y): O(|V|)
 Neighbors(G,x): O(|V|)
 AdjacentEdges(G,x,y):
O(|V|+|E|)
 Add(G,x,y,l): O(|V|+|E|)
 Delete(G,x,y,l): O(|V|+|E|)
Original
AdjacencyList
Compressed
AdjacencyList
Nodes
Adjacency
List
For each node a
list of neighbors.
If the graph is
directed,
adjacency list of i
contains only the
outgoing nodes of
i.
Cheaper for
obtaining the
neighbors of a
node.
Not suitable for
checking if there
is an edge
between two
nodes.
Incidence
List
Vertices and edges
are stored as
records of objects.
Each vertex stores
incident edges.
Each edge stores
incident nodes.
Compressed
Adjacency
List
Differential
encoding
between two
consecutive
nodes
Adjacency
Matrix
Bidimensional
graph
representation.
Rows represent
source vertices.
Columns represent
destination vertices.
Each non-null
entry represents
that there is an
edge from the
source node to
the destination
node.
Incidence
Matrix
Bidimensiona
l graph
representatio
n.
Rows
represent
Vertices.
Columns
represent
edges
A non-null entry
represents that the
source vertex is incident
to the Edge.
Implementation of Graphs
19
[Sakr and Pardede 2012]
20
{L2} {L3}
{L1}
V1 V2 V3 V4
V1
V2
V3
V4
Adjacency Matrix
L2 L3
L1
V1 V2 V3
V4
Properties:
 Storage: O(|V|2)
 Adjacent(G,x,y): O(1)
 Neighbors(G,x): O(|V|)
 AdjacentEdges(G,x,y): O(|E|)
 Add(G,x,y,l): O(|E|)
 Delete(G,x,y,l): O(|E|)
Implementation of Graphs
21
Adjacency
List
For each node a
list of neighbors.
If the graph is
directed,
adjacency list of i
contains only the
outgoing nodes
of i.
Cheaper for
obtaining the
neighbors of a
node.
Not suitable for
checking if there
is an edge
between two
nodes.
Incidence
List
Vertices and edges
are stored as
records of objects.
Each vertex stores
incident edges.
Each edge stores
incident nodes.
Compressed
Adjacency
List
Differential
encoding
between two
consecutive
nodes
Adjacency
Matrix
Bidimensional
graph
representation.
Rows represent
source vertices.
Columns represent
destination
vertices.
Each non-null
entry represents
that there is an
edge from the
source node to
the destination
node.
Incidence
Matrix
Bidimensiona
l graph
representatio
n.
Rows
represent
Vertices.
Columns
represent
edges
A non-null entry
represents that the
source vertex is incident
to the Edge.
[Sakr and Pardede 2012]
22
destination destination
source source
destination
source
L1 L2 L3
V1
V2
V3
V4
Incidence Matrix
L2 L3
L1
V1 V2 V3
V4
Properties:
 Storage: O(|V|x|E|)
 Adjacent(G,x,y): O(|E|)
 Neighbors(G,x): O(|V|x|E|)
 AdjacentEdges(G,x,y): O(|E|)
 Add(G,x,y,l): O(|V|)
 Delete(G,x,y,l): O(|V|)
Traversal Search
Breadth First
Search
Expands shallowest
unexpanded nodes
first.
Unexpanded
nodes are
stored in
queue.
Depth First
Search
Expands deepest
unexpanded nodes
first.
Unexpanded nodes
are stored in a stack.
23
24
7
8
2
1
3
4
5
6
0
2
1
3
4
5
6
0
1 Starting Node
First Level Visited Nodes
Second Level Visited Nodes
Third Level Visited Nodes
Breadth First Search
Notation:
25
7
8
2
1
3
4
5
6
0
2
1
3
4
5
6
0
Depth First Search
1 Starting Node
First Level Visited Nodes
Second Level Visited Nodes
Third Level Visited Nodes
Notation:
THE GRAPH DATA
MANAGEMENT PARADIGM
26
2
The Graph Data Management
Paradigm
Graph database models:
 Data models for schema and instances are
graph models or generalizations of them.
 Data manipulation is expressed as graph-
based operations.
27
Survey of Graph Database Models
28
A graph data model
represents data and
schema:
 Graphs or
 More general, the
notion of graphs:
Hypergraphs,
Digraphs.
[Renzo and Gutiérrez 2008]
Survey of Graph Database Models
29
[Renzo and Gutiérrez 2008]
Types of relationship supported by graph data models:
• Properties,
• Mono- or multi-
valued.
• Groups of real-
world objects.
• Structures to
represent
neighborhoods
of an entity.
Attributes Entities
Neighborhood
relations
• Part-of,
composed-by,
n-ary
associations.
• Subclasses and
superclasses,
• Relations of
instantiations.
• Recursively
specified
relations.
Standard
abstractions
Derivation and
inheritance
Nested
relations
Representative Approaches
30
No Index
Only works
for “toy”
graphs.
Vertices are
on the scale
of 1K.
Node/Edge
Index
Create index
in different
nodes/edges
Hexastore,
RDF3x,
BitMat,Neoj4,
Frequent
Subgraph Index
Indices on
frequently
queried
subgraphs
Reachability and
Neighborhood
Indices
2-hop
labeling
schemas for
index
structures.
Every two
nodes within
distance L.
COFFEE BREAK
31
32
https://www.github.com/AlejandroFloresV/Tutoria
l-GraphDB
GRAPH DATABASES
ENGINES
33
3
Graph Databases Advantages
 Natural modeling of highly connected data.
 Special graph storage structure
 Efficient schemaless graph algorithms.
 Support for query languages.
 Operators to query the graph structure.
34
35
Graph Databases
http://www.neo4j.org
http://www.hypergraphdb.org
http://www.sparsity-technologies.com/
Existing General Graph Database
Engines
36
Sparksee
Java library for
management of
persistent and
temporary
graphs.
Implementation
relies on
bitmaps and
secondary
structures (B+-
tree)
HyperGraphDB
Implements the
hyper graph
data model.
Neo4j
Network
oriented model
where relations
are first-class
objects.
Native disk-
based storage
manager for
graphs.
Framework for
graph traversal.
Graphium
An Abstract
Graph based
database
engine.
Built on top of
Neo4j,
Sparksee, and
HyperGraphDB
Framework for
RDF-based
graphs and
Data Mining
tasks.
Most graph databases implement an API instead of a query language.
Sparksee(1)
 Labeled attribute multi-graph.
 All node and edge objects belong to a type.
 Edges have a direction:
 Tail: corresponds to the source of the edge.
 Head: corresponds to the destination of the edge.
 Node and edge objects can have one or
more attributes.
 There are no restrictions on the
number of edges between two nodes.
 Loops are allowed.
 Attributes are single valued.
37
[Martínez-Basan et al. 2007]
http://www.sparsity-technologies.com/dex_tutorials
Sparksee (2)
 Attributes can have different indexes:
 Basic attributes.
 Indexed attributes.
 Unique attributes.
 Edges can index neighborhoods.
 Neighborhood index.
 The persistent database of Sparksee is a single file.
 Sparksee can manage very large graphs.
38
[Martínez-Basan et al. 2007]
http://www.sparsity-technologies.com/dex_tutorials
Sparksee Architecture
39
[Martínez-Basan et al. 2007]
http://www.sparsity-technologies.com/dex_tutorials
Sparksee Internal Representation
(1)
40
[Martínez-Basan et al. 2007]
Sparksee Approach: Map + Bitmaps  Links
Link: bidirectional association between values and OIDs (keys)
 N is a collection of nodes and E is a collection of directed edges.
 The same value can be associated to multiple OIDs (keys)
 Given a value  a set of OIDs (keys)
 Given an OID  the value
http://www.sparsity-technologies.com/
Sparksee Internal Representation
(2)
41
[Martínez-Basan et al. 2007]
 A Sparksee Graph G=<n,e,o,t,h>
 n= M<string,T> attributes and
properties of nodes.
 e= M<string,T> attributes and
properties of edges.
 o= L<tid,nid> list of objects,
nodes or edges, for each type.
 t= L<nid,eid> list of edges where
each node is the tail.
 h= L<nid,eid> list of edges where
each node is the head.
http://www.sparsity-technologies.com/
Sparksee Internal Representation
(3)
42
[Martínez-Basan et al. 2007]
 A Sparksee Graph is a combination
of Bitmaps:
 Bitmap for each node or edge
type.
 One link for each attribute.
 Two links for each type: Out-
going and in-going edges.
 Maps are B+trees
 A compressed UTF-8 storage
for UNICODE string.
http://www.sparsity-technologies.com/
Sparksee API
43
Operations of the Abstract Data Type Graph:
 Graph construction.
 Definition of node and edge types.
 Creation of node and edge objects.
 Definition and use of attributes.
 Query nodes and edges.
 Management of edges and nodes.
http://www.sparsity-technologies.com/s
Sparksee API (2)
Graph construction
44
SparkseeConfig cfg = new SparkseeConfig();
cfg.setCacheMaxSize(2048); // 2 GB
Sparksee sparksee= new Sparksee(cfg);
Database db = sparksee.create(”Publications.dex",
”PUBLICATIONSDex");
Session sess = db.newSession();
Graph graph = sess.getGraph();
……
sess.close();
db.close();
sparksee.close();
Operations on the graph
database will be performed
on the object graph
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
45
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Example
Sparksee API (3)
Adding nodes to the graph
46
int authorTypeId = graph.newNodeType(”AUTHOR");
int paperTypeId = graph.newNodeType(”PAPER");
int conferenceTypeId = graph.newNodeType(”CONFERENCE");
long author1 = graph.newNode(authorTypeId);
long author2 = graph.newNode(authorTypeId);
long author3 = graph.newNode(authorTypeId);
long paper1 = graph.newNode(paperTypeId);
long paper2 = graph.newNode(paperTypeId);
long paper3 = graph.newNode(paperTypeId);
long paper4 = graph.newNode(paperTypeId);
long paper5 = graph.newNode(paperTypeId);
long conference1 = graph.newNode(conferenceTypeId);
long conference2 = graph.newNode(conferenceTypeId);
http://sparsity-technologies.com/downloads/javadoc-java/com/sparsity/sparksee/gdb/Graph.html
Sparksee API (4)
Adding edges to the graph
47
int wroteTypeId = graph.newEdgeType(”WROTE", true,true);
int isWrittenTypeId = graph.newEdgeType(”Is-Written", true,true);
int citeTypeId = graph.newEdgeType(”CITE", true,true);
int isCitedByTypeId = graph.newEdgeType(”IsCitedBy", true,true);
int conferenceTypeId = graph.newEdgeType(”IsConference", true,true);
long wrote1 = graph.newEdge(wroteTypeId, author1, paper1);
long wrote2 = graph.newEdge(wroteTypeId, author1, paper3);
long wrote3 = graph.newEdge(wroteTypeId, author1, paper5);
long wrote4 = graph.newEdge(wroteTypeId, author2, paper3);
long wrote5 = graph.newEdge(wroteTypeId, author2, paper4);
long wrote4 = graph.newEdge(wroteTypeId, author3, paper2);
long wrote5 = graph.newEdge(wroteTypeId, author3, paper4);
newEdgeType(java.lang.String name, boolean directed,boolean neighbors)
Creates a new edge If TRUE, this creates a directed edge type, otherwise this creates a undirected
edge type.neighbors - [in] If TRUE, this indexes neighbor nodes, otherwise not.
Returns:Unique Sparksee type identifier.
Sparksee API (5)
Indexing
 Basic attributes: There is no index associated with the attribute.
 Indexed attributes: There is an index automatically maintained by the
system associated with the attribute.
 Unique attributes: The same as for indexed attributes but with an added
integrity restriction: two different objects cannot have the same value, with
the exception of the null value.
 Sparksee operations: Accessing the graph through an attribute will
automatically use the defined index, significantly improving the performance
of the operation.
 A specific index can also be defined to improve certain navigational
operations, for example, Neighborhood.
 There is the AttributeKind enum class which includes basic, indexed and
unique attributes.
48
Sparksee API (6)
Adding attributes to nodes of the graph
49
nameAttrId = graph.newAttribute(authorTypeId, "Name",
DataType.String, AttributeKind.Indexed);
conferenceNameAttrId = graph.newAttribute(conferenceTypeId,
”ConferenceName", DataType.String, AttributeKind.Indexed);
Value v = new Value();
graph.setAttribute(author1, nameAttrId, v.setString(Peter Smith"));
graph.setAttribute(author2, nameAttrId, v.setString(”Juan Perez"));
graph.setAttribute(author2, nameAttrId, v.setString(”John Smith"));
graph.setAttribute(conference1, conferenceNameAttrId,
v.setString(“ESWC”)));
graph.setAttribute(conference2, conferenceNameAttrId,
v.setString(“ISWC"));
Sparksee API (7)
Searching in the graph
50
Value v = new Value();
// retrieve all ’AUTHOR’ node objects
Objects authorObjs1 = graph.select(authorTypeId);
...// retrieve Peter Smith from the graph, which is a ”AUTHOR"
Objects authorObjs2 = graph.select(nameAttrId, Condition.Equal,
v.setString(”Peter Smith"));
...// retrieve all ’author' node objects having ”Smith" in the name.
It would retrieve
// Peter Smith and John Smith
Objects authorObjs3 = graph.select(nameAttrId, Condition.Like,
v.setString(”Smith"));
...
authorObjs1.close();
authorObjs2.close();
authorObjs3.close();
Sparksee API (8)
Searching in the graph
51
Graph graph = sess.getGraph();
Value v = new Value();
...
// retrieve all ’author' node objects having a value for the 'name'
attribute
// satisfying the ’^J[^]*n$'
Objects authorObjs = graph.select(nameAttrId,
Condition.RegExp, v.setString(”^J[^]*n$"));
...
authorObjs.close();
Sparksee API (9)
Searching in the graph
52
Graph graph = sess.getGraph();
sess.begin();
Value v = new Value();
int authorTypeId = graph.findType(”AUTHOR");
int nameAttrId = graph.findAttribute(authorTypeId, "NAME");
v.setString(”Peter Smith");
Long peterSmith = graph.findObject(nameAttrId, v);
sess.commit();
Sparksee API (10)
Traversing the graph
 Navigational direction can be restricted through the
edges
 The EdgesDirection enum class:
 Outgoing: Only out-going edges, starting from the source node,
will be valid for the navigation.
 Ingoing: Only in-coming edges will be valid for the navigation.
 Any: Both out-going and in-going edges will be valid for the
navigation.
53
Graph graph = sess.getGraph();
...
Objects authorObjs1 = graph.select(authorTypeId);
...// retrieve Peter Smith from the graph, which is a ”AUTHOR"
Objects node1 = graph.select(nameAttrId, Condition.Equal,
v.setString(”Peter Smith"));
Objects node2 = graph.select(nameAttrId, Condition.Like,
v.setString(”Paper"));
Int wroteTypeId = graph.findType(”WROTE");
Int citeTypeId = graph.findType(”CITE");
...// retrieve all in-comings WROTE
Objects edges = graph.explode(node1, wroteTypeId,
EdgesDirection.Ingoing);
...// retrieve all nodes through CITE
Objects cites = graph.neighbors(node2, citeTypeId,
EdgesDirection.Any);
...
edges.close();
cites.close();
Sparksee API (11)
Traversing the graph
54
Explode-based methods visit
the edges of a given node
Neighbor-based methods visit
the neighbor nodes of a given
node identifier
Sparksee API (12)
Traversing the graph
55
Graph graph = sess.getGraph();
...
Objects node2 = graph.select(nameAttrId, Condition.Like,
v.setString(”Paper"));
int citeTypeId = graph.findType(”CITE");
// 1-hop cites
Objects cites1 = graph.neighbors(node2, citeTypeId,
EdgesDirection.Any);
// cites of cites (2-hop)
Objects cites2 = graph.neighbors(cites1, citeTypeId,
EdgesDirection.Any);
…..
edges.close();
cites.close();
Sparksee API Summary
56
Operation Description
CreateGraph Creates a empty graph
DropGraph Removes a graph
RegisterType Registers a new node or edge type
GetTypes Returns the list of all node or edge types
FindType Returns the id of a node or edge type
NewNode Creates a new empty node of a given node type
AddNode Copies a node
AddEdge Adds a new edge of some edge type
DropObject Removes a node or an edge
Scan Returns all nodes or edges of a given type
Select Returns the nodes or edges of a given type which have an attribute that
evaluates true to a comparison operator over a value
Related Returns all neighbors of a node following edges members of an edge type
Neighbors Returns all neighbors
Neo4J
 Labeled attribute multigraph.
 Nodes and edges can have properties.
 There are no restrictions on the number of edges
between two nodes.
 Loops are allowed.
 Attributes are single valued.
 Different types of indexes: Nodes & relationships.
 Different types of traversal strategies.
 API for Java, Python.
57
[Robinson et al. 2013]
http://neo4j.org/
Neo4J Architecture
58
[Robinson et al. 2013]
http://neo4j.org/
Neo4J Logical View
59
[Robinson et al. 2013]
http://neo4j.org/
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
60
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Example
Neo4J API (1)
Creating a graph and adding nodes
61
db = new GraphDatabaseFactory().newEmbeddedDatabase( DB_PATH );
Node author1 = db.createNode();
Node author2 = db.createNode();
Node author3 = db.createNode();
Node paper1 = db.createNode();
Node paper2 = db.createNode();
Node paper3 = db.createNode();
Node paper4 = db.createNode();
Node paper5 = db.createNode();
Node conference1 = db.createNode();
Node conference2= db.createNode();
Neo4J API (2)
Adding attributes to nodes
author1.setProperty(“firstname”,”Peter”);
author1.setProperty(“lastname”,”Smith”);
author2.setProperty(“firstname”,”Juan”);
author2.setProperty(“lastname”,”Perez”);
author3.setProperty(“firstname”,”John”);
author3.setProperty(“lastname”,”Smith”);
conference1.setProperty(“name”,”ESWC”);
conference2.setProperty(“name”,”ISWC”);
62
Neo4J API (3)
Adding relationships
63
RelationshipType relWrote = DynamicRelationshipType.withName(”WROTE”);
Relationship rel1=author1.createRelationshipTo(paper1,relWrote);
Relationship rel2=author1.createRelationshipTo(paper5,relWrote);
Relationship rel3=author1.createRelationshipTo(paper3,relWrote);
…
RelationshipType relCites = DynamicRelationshipType.withName(”CITES”);
Relationship rel8=paper1.createRelationshipTo(paper2,relCites);
Relationship rel9=paper3.createRelationshipTo(paper1,relCites);
Relationship rel10=paper3.createRelationshipTo(paper4,relCites);
Relationship rel11=paper4.createRelationshipTo(paper2,relCites);
…
RelationshipType relConf = DynamicRelationshipType.withName(”Conference”);
Relationship rel12=paper1.createRelationshipTo(conference1,relConf);
Relationship rel13=paper2.createRelationshipTo(conference2,relConf);
Relationship rel14=paper3.createRelationshipTo(conference1,relConf);
Relationship rel15=paper4.createRelationshipTo(conference1,relConf);
Relationship rel16=paper5.createRelationshipTo(conference2,relConf);
Neo4J API (4)
Creating indexes
64
IndexManager index = db.index();
Index<Node> authorIndex = index.forNodes( ”authors" );
//”authors” index name
Index<Node> paperIndex = index.forNodes( ”papers" );
Index<Node> conferenceIndex = index.forNodes( ”conferences" );
RelationshipIndex wroteIndex = index.forRelationships( ”write" );
RelationshipIndex citeIndex = index.forRelationships( ”cite" );
RelationshipIndex conferenceRelIndex = index.forRelationships(
”conferenceRel" );
author1.setProperty(“firstname”,”Peter”);
author1.setProperty(“lastname”,”Smith”);
authorIndex.add(author1,lastname,author1.getProperty(lastname));
author2.setProperty(“firstname”,”Juan”);
author2.setProperty(“lastname”,”Perez”);
authorIndex.add(author2,lastname,author2.getProperty(lastname));
author3.setProperty(“firstname”,”John ”);
author3.setProperty(“lastname”,”Smith”);
authorIndex.add(author3,lastname,author3.getProperty(lastname));
conference1.setProperty(“name”,”ESWC”);
conferenceIndex.add(conference1,name,conference1.getProperty(name));
conference2.setProperty(“name”,”ISWC”);
conferenceIndex.add(conference2,name,conference2.getProperty(name));
Neo4J API (5)
Indexing nodes
65
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.graphdb.factory.*;
import org.neo4j.graphdb.traversal.*;
import org.neo4j.tooling.*;
import org.neo4j.kernel.*;
import org.neo4j.helpers.collection.*;
g = new GraphDatabaseFactory().
newEmbeddedDatabaseBuilder(PATH).
loadPropertiesFromFile("conf/neo4j.properties").
newGraphDatabase();
globalOP = GlobalGraphOperations.at(g);
indexManager = g.index();
index = indexManager.forNodes("id");
Neo4J API (6) Traversing Graphs
66
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
67
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Query: Papers written by Peter Smith.
Node v;
Relationship rel;
Iterator<Relationship> it;
v = index.get("id","Peter Smith").getSingle();
if (v == null) return;
DynamicRelationshipType wrote = DynamicRelationshipType.
withName(”wrote");
it = v.getRelationships(wrote,Direction.OUTGOING).iterator();
while (it.hasNext()) {
rel = it.next();
System.out.println(rel.getEndNode().getProperty("id"));
}
g.shutdown();
}
Neo4J API (6) Traversing Graphs
68
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
69
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Query: Papers written by Peter Smith.
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
70
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Query: Papers published at ESWC.
import org.neo4j.graphdb.index.IndexHits;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.graphdb.factory.*;
import org.neo4j.graphdb.traversal.*;
import org.neo4j.tooling.*;
import org.neo4j.kernel.*;
import org.neo4j.helpers.collection.*;
g = new GraphDatabaseFactory().
newEmbeddedDatabaseBuilder(args[0]).
loadPropertiesFromFile("conf/neo4j.properties").
newGraphDatabase();
globalOP = GlobalGraphOperations.at(g);
indexManager = g.index();
index = indexManager.forNodes("id");
Neo4J API (6) Traversing Graphs
71
Node v;
Relationship rel;
Iterator<Relationship> it;
v = index.get("id",”ESWC").getSingle();
if (v == null) return;
DynamicRelationshipType conference = DynamicRelationshipType.
withName(“Conference");
it = v.getRelationships(conference,Direction.INCOMING).iterator();
while (it.hasNext()) {
rel = it.next();
System.out.println(rel.getStartNode().getProperty("id"));
}
g.shutdown();
}
Neo4J API (6) Traversing Graphs
72
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
73
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Query: Papers published at ESWC.
Neo4J: Cypher Query Language (1)
 START:
 Specifies one or more starting points in the graph.
 Starting points can be defined as index lookups or just by
an element ID.
 MATCH:
 Pattern matching to match based on the starting point(s)
 WHERE:
 Filtering criteria.
 RETURN:
 What is projected out from the evaluation of the query.
http://www.neo4j.org/learn/cypher
74
Neo4J: Cypher Query Language (2)
Query: Is Peter Smith connected to John Smith.
75
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
m=node:authors(
'firstname:”John" AND lastname:”Smith"’)
MATCH (author)-[R*]->(m)
RETURN count(R)
MATCH (author:authors {firstname:”Peter" AND lastname:”Smith"})-
[R*]->(m:author:authors {firstname:”John" AND lastname:”Smith"})
RETURN count(R)
Neo4J: Cypher Query Language (2)
Query: Papers written by Peter Smith.
76
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->(papers)
RETURN papers
MATCH (author:authors{
firstname:”Peter" AND lastname:”Smith”}-[:WROTE]->(papers)
RETURN papers
Neo4J: Cypher Query Language (3)
Query: Papers cited by a paper written by Peter
Smith.
77
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->()-[:CITES](papers)
RETURN papers
Neo4J: Cypher Query Language (3)
Query: Papers cited by a paper written by Peter
Smith that have at most 20 cites.
78
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->()-[:CITES]->()-[:CITES]->(papers)
WITH COUNT(papers) as cites
WHERE cites < 21
RETURN papers
Neo4J: Cypher Query Language (4)
Query: Papers cited by a paper written by Peter
Smith or cited by papers cited by a paper written
by Peter Smith.
79
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers)
RETURN papers
Neo4J: Cypher Query Language (5)
Query: Number of papers cited by a paper written
by Peter Smith or cited by papers cited by a paper
written by Peter Smith.
80
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers)
RETURN count(papers)
Neo4J: Cypher Query Language (6)
Query: Number of papers cited by a paper written
by Peter Smith or cited by papers cited by a paper
written by Peter Smith, and have been published in
ESWC.
81
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers)
WHERE papers.conference! =‘ESWC’
RETURN count(papers)
Exclamation mark in papers.conference ensures that papers
without the conference attribute are not included in the output.
Neo4J: Cypher Query Language (7)
Query: Number of papers cited by a paper written
by Peter Smith or cited by papers cited by a paper
written by Peter Smith, have been published in
ESWC and have at most 4 co-authors .
82
START author=node:authors(
'firstname:”Peter" AND lastname:”Smith"')
MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers)-
[:Was-Written]->authorFinal
WHERE papers.conference! =‘ESWC’
WITH author, count(authorFinal) as authorFinalCount
WHERE authorFinalCount < 4
RETURN author, authorFinalCount
Exclamation mark in papers.conference ensures that papers
without the conference attribute are not included in the output.
Neo4J API-Cypher
Query: Peter Smith Neighborhood
83
static Neo4j TestGraph;
String Query = "start n=node:node_auto_index(idNode='PeterSmith')" +
"match n-[r]->m return m.idNode";
ExecutionResult result = TestGraph.query(Query);
boolean res = false;
for ( Map<String, Object> row : result ){
for ( Map.Entry<String, Object> column : row.entrySet() ){
print(column.getValue().toString());
res = true;
}
}
if (!res) print(" No result");
Operation Description
GraphDatabaseFactory Creates a empty graph
registerShutdownHook Removes a graph
createNode Creates a new empty node of a given node type
AddNode Copies a node
createRelationshipTo Adds a new edge of some edge type
forNodes Index for a Node
IndexManager Index Manager
forRelationships Index for a Relationship
add Add an object to an index
TraversalDescription Different types of strategies to traverse a graph
Traversal
Branch Selector
preorderDepthFirst, postorderDepthFirst,
preorderBreadthFirst, postorderBreadthFirst
Neo4J API Summary
84
Graphium (http://graphium.ldc.usb.ve)
 Abstract Graph Database Engine built on top of: Neo4j, Sparksee,
and HyperGraphDB.
 Labeled attribute multigraph.
 Nodes and edges can have properties.
 There are no restrictions on the number of edges between two
nodes.
 Loops are allowed.
 Attributes are single valued.
 Different types of traversal strategies.
 Different data mining strategies.
 Java Graph-based and RDF-based API.
85
[ Flores et al. 2013, De Abreu et al. 2013]
Graphium (http://graphium.ldc.usb.ve)
86
[ Flores et al. 2013, De Abreu et al. 2013]
Example: Property Graph Model
• Nodes and edges may have properties
• Properties: Key-value pairs
The Beatles
Let it be
Revolver
Help!
created
Year: 1970
Duration: 35:16
Year: 1965
Year: 1966
Duration: 35:01
Homepage:
thebeatles.com
Origin: Liverpool
Source: “Scaling Up Linked Data”.
EUCLID project.
Graphium Architecture
87
GRAPHIUM
Neo4j Sparksee HyperGraphDB
Graph-based API RDF-based API
Data Mining Traversal API Graph
Invariants
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
88
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Example
Graphium Graph-based API (1)
Creating a graph
89
GraphDB db;
db = new Neo4j();
// or…
db = new DEXDB(); // Sparksee was previously known as DEX
// or…
db = new HyperGraphDB();
Graphium Graph-based API (2)
Adding nodes
db.addNode(”Peter Smith”);
db.addNode(”Juan Perez”);
db.addNode(”John Smith”);
db.addNode(”ESWC”);
db.addNode(”ISWC”);
db.addNode(”Paper1”);
db.addNode(”Paper2”);
db.addNode(”Paper3”);
db.addNode(”Paper4”);
db.addNode(”Paper5”);
90
Graphium Graph-based API (3)
Adding relationships
91
db.addEdge(new Edge(“WROTE”, “Peter Smith”, “Paper1”));
db.addEdge(new Edge(“WROTE”, “Peter Smith”, “Paper3”));
db.addEdge(new Edge(“WROTE”, “Peter Smith”, “Paper5”));
…
db.addEdge(new Edge(“CITES”, “Paper1”, “Paper2”));
db.addEdge(new Edge(“CITES”, “Paper3”, “Paper1”));
db.addEdge(new Edge(“CITES”, “Paper3”, “Paper4”));
db.addEdge(new Edge(“CITES”, “Paper4”, “Paper2”));
…
db.addEdge(new Edge(“Conference”, “Paper1”, “ESWC”));
db.addEdge(new Edge(“Conference”, “Paper2”, “ISWC”));
db.addEdge(new Edge(“Conference”, “Paper3”, “ESWC”));
db.addEdge(new Edge(“Conference”, “Paper4”, “ESWC”));
db.addEdge(new Edge(“Conference”, “Paper5”, “ISWC”));
Graphium Graph-based API (4)
Adjacency: “Papers cited by Paper3”
92
GraphIterator<String> it;
it = db.adj(“Paper3”, “CITES”);
while (it.hasNext()) {
… it.next() …
}
it.close();
Graphium Graph-based API (5)
93
Traversing the graph: “2-Hops from Peter Smith”
GraphIterator<String> it;
it = db.kHops(“Peter Smith”, 2);
while (it.hasNext()) {
… it.next() …
}
it.close();
Graph Mining Tasks
These tasks evaluate the engine performance
while executing the following graph mining:
 Graph summarization.
 Dense subgraph.
 Shortest path.
94
4 5 76
1 2 3
8
9
0
4 5 6 7
1 2 3
Add(8,1)
Add(9,3)
Add(0.3)
Del(4,1)
8 9
0
Corrections
Cost = 1(superedge) + 4(corrections)= 5
Graph Summarization
[Navlakha et al. 2008]
Original
Graph
Summarized
Graph
95
A
B
C
D
E
F
G
H
1
2 1
1
1
3
3
2
3
1
1
1
2
[Khuller et al. 2010]
Dense Subgraph
96
A
B
C
D
E
F
G
H
1
2 1
1
1
3
3
2
3
1
1
1
2
Given a subset of nodes E’ compute the densest subgraph containing
them, i.e., density of E’ is maximized:
density(E') =
weight(a,b)
| E'|(a,b) in E'
å
[Khuller et al. 2010]
Dense Subgraph
97
Shortest Path
• Given two nodes finds the shortest path
between them.
• Implemented using Iterative Depth First Search
(DFID).
• Ten node pairs are taken at random from each
graph, and then the Shortest Path algorithm is
executed on each pair of nodes
98
LUNCH BREAK
99
RDF GRAPH ENGINES
100
4
RDF-Based Graph Database Engines
101
Native storage
Hexastore
RDF3X
BitMat
DBMS-backed
storage
Hexastore*
Bhyper
Graphium
RDF-based
In the beginnings …
102
… there were dinosaurs
In the beginnings …
103
• Traditionally:
– Use Relational DBMS’ to store data
– Early Approach: one large SPO table
(physical)
• Pros: simple schema, fast updates (when no index
present), fast single triple pattern queries
• Cons: not suitable for multi-join queries, results in
multiple self-joins
In the beginnings …
104
• Traditionally:
– Use Relational DBMS’ to store data
– Further Approaches: property tables
(multiple flavors)
• Pros: fast retrieval of subject / object pairs for
given predicates
• Cons: not easy to solve queries where predicate is
unbound & harder to maintain schema (a table for
each known predicate)
RDF-tailored management approaches
105
• Same conceptual approach: one large SPO table
(but with RDF-specific physical organization):
– Triples are indexed in all 3!=6 possible permutations:
SPO, PSO, OSP, OPS, PSO, POS
– Logical extension of the vertical partitioning proposed
in [Abadi et al 2008] (the PSO index = generalization
of the SO table)
– Each index (i.e. SPO) stores partial triple pattern
cardinalities
– Resulting in 6 tree-based indexes, each 3 levels deep
• First proposed in Hexastore
[Weiss et al. 2008]
Hexastore / Conceptual Model
106
• 2 main operations: (input = tp: partially bound triple pattern, i.e.,
<bound_subject> <some predicate> ?o)
• cardinality = getCardinality(tp)
• term_set = getSet(tp)
• Resulting term_set is sorted
Si
pni...p1i p2i
o1ik
...
o1i2
o1i1
o2il
...
o2i2
o2i1
onij
...
oni2
oni1
Hexastore / Conceptual Model
107
• Pros:
• The optimal index permutation & level is used (i.e. <bound_s> ?p ?o : SPO level
1)
• Ability to make use of fast merge-joins
• Cons:
• Higher upkeep cost: a theoretical upper bound of 5x space increase over original
(encoded) data
Si
pni...p1i p2i
o1ik
...
o1i2
o1i1
o2il
...
o2i2
o2i1
onij
...
oni2
oni1
108
Wrote Cite
s
Wrote
Paper1 Paper2
Paper3
Peter Smith
Conference
ESWC
Paper4
Conference
Cite
s
Paper5
ISWC
Conference
Conference
Cite
s
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cite
s
Wrote
An Example (Publications):
logical representation
RDF Subgraph for the Publication Graph
109
conference/eswc/2
012/paper/03
conference/iswc/20
12/paper/05
person/peter-smith
ontoware/
ontology#author
2012
ontoware/
ontology#year
conference/iswc/
semanticweb/
ontology#isPartOf
ISWC
semanticweb/
ontology#hasAcronym
semanticweb/ontology/C
onferenceEvent
w3c/
rdf-syntax#type
ontoware/
ontology#author
conference/eswc/
semanticweb/
ontology#isPartOf
2012
ontoware/
ontology#year
w3c/
rdf-syntax#type
ontoware/
ontology#biblioReference
ESWC
semanticweb/
ontology#hasAcronym
conference/iswc/20
12/paper/01
An Example (Publications):
first step, encode strings
110
http://data.semanticweb.org/person/john-smith 20
http://data.semanticweb.org/person/juan-perez 19
18http://data.semanticweb.org/person/peter-smith
http://data.semanticweb.org/ns/swc/ontology#hasAcronym 17
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 16
15"2012"
14http://swrc.ontoware.org/ontology#biblioReference
13http://swrc.ontoware.org/ontology#year
12http://swrc.ontoware.org/ontology#author
11http://data.semanticweb.org/ns/swc/ontology#isPartOf
http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent 10
http://data.semanticweb.org/conference/eswc/2012 9
http://data.semanticweb.org/conference/iswc/2012 8
7"ISWC2012"
"ESWC2012" 6
5http://data.semanticweb.org/conference/iswc/2012/paper/05
http://data.semanticweb.org/conference/eswc/2012/paper/04 4
3http://data.semanticweb.org/conference/eswc/2012/paper/03
http://data.semanticweb.org/conference/iswc/2012/paper/02 2
1http://data.semanticweb.org/conference/eswc/2012/paper/01
HexastoreMappingDictionary
An Example (Publications):
first step, encode strings
111
http://data.semanticweb.org/person/john-smith 20
http://data.semanticweb.org/person/juan-perez 19
18http://data.semanticweb.org/person/peter-smith
http://data.semanticweb.org/ns/swc/ontology#hasAcronym 17
http://www.w3.org/1999/02/22-rdf-syntax-ns#type 16
15"2012"
14http://swrc.ontoware.org/ontology#biblioReference
13http://swrc.ontoware.org/ontology#year
12http://swrc.ontoware.org/ontology#author
11http://data.semanticweb.org/ns/swc/ontology#isPartOf
http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent 10
http://data.semanticweb.org/conference/eswc/2012 9
http://data.semanticweb.org/conference/iswc/2012 8
7"ISWC2012"
"ESWC2012" 6
5http://data.semanticweb.org/conference/iswc/2012/paper/05
http://data.semanticweb.org/conference/eswc/2012/paper/04 4
3http://data.semanticweb.org/conference/eswc/2012/paper/03
http://data.semanticweb.org/conference/iswc/2012/paper/02 2
1http://data.semanticweb.org/conference/eswc/2012/paper/01
HexastoreMappingDictionary
• Next, index original encoded RDF file …
Hexastore / Physical implementation.
Native storage
112
2
2
4
3
3
4
4
9
8
1
5
2
4
3
S
11
12
1
1
13 1
14 1
9
18
5
2
1
13
1
12
11
1
8
20
15
11
12
1
2
13 1
14 2
9
1918
15
41
11
12
1
2
13 1
14 1
9
2019
15
2
12 1
111
13 1
8
18
15
17 1
116 10
7
17
1
1
16 10
6
P O
• Uses vector-storage [Weiss
et al 2009]
• Each index level:
– partially sorted vector (on disk,
one file per level)
– each record:
<rdf term id, cardinality, pointer to set>
– first level gets special
treatment:
• fully sorted vector: search =
binary search, O(log(N))
• Hint: using an additional hash
can circumvent the binary search
Hexastore / Physical implementation.
Native storage
113
2
2
4
3
3
4
4
9
8
1
5
2
4
3
S
11
12
1
1
13 1
14 1
9
18
5
2
1
13
1
12
11
1
8
20
15
11
12
1
2
13 1
14 2
9
1918
15
41
11
12
1
2
13 1
14 1
9
2019
15
2
12 1
111
13 1
8
18
15
17 1
116 10
7
17
1
1
16 10
6
P O
• Pros:
– Optimized for read-intensive /
static workloads
– Optimal / fast triple pattern
matching
– No comparisons when
scanning since each (partial)
cardinality is known !
• Cons:
– Not suitable for write intensive
workloads (worst case can
result in rewrite of entire
index!)
Hexastore / Physical implementation.
DBMS (managed) storage
114
• Builds on top of sorted
indexes:
– B+Tree, LSM Tree, etc
• Each index level:
– Same sorted index structure
(but not necessarily)
– each record:
<rdf term id list, cardinality>
– Search cost = as provided
by the managed index
structure
2
2
4
3
3
4
4
9
8
1
5
2
4
3
S
1
1
1
1
11
12
1
1
13 1
14 1
1 11 9 -
1 -1812
1 13 5 -
1 14 2 -
2
2
2
1
13
1
12
11
1
2 11 8 -
2 12 20 -
2 13 15 -
3
3
3
3
11
12
1
2
13 1
14 2
3 11 9 -
12 -3 18
3 13 15 -
3 -14 1
4
4
4
4 11
12
1
2
13 1
14 1
4 11 9 -
4 -12 20
4 13 15 -
4 14 2 -
5
5
5 12 1
111
13 1
5 11 8 -
5 12 18 -
5 13 15 -
8
8 17 1
116
8 16 10 -
8 17 7 -
9
9 17
1
1
16
9 16 10 -
9 17 6 -
SP SPO
-3 1912
43 -14
12 19 -4
Hexastore / Physical implementation.
DBMS (managed) storage
115
• Pros:
– Flexible design: each index can
be configured for a different
physical data-structure:
• i.e. PSO=B+Tree, OSP=LSM Tree
– Separation of concerns (the
underlying DBMS manages and
optimizes for access patterns,
cache, etc…)
• Cons:
– Less control over fine-grained
data storage (up to the level
permitted by the DBMS API)
– Usually less potential for native
optimization
2
2
4
3
3
4
4
9
8
1
5
2
4
3
S
1
1
1
1
11
12
1
1
13 1
14 1
1 11 9 -
1 -1812
1 13 5 -
1 14 2 -
2
2
2
1
13
1
12
11
1
2 11 8 -
2 12 20 -
2 13 15 -
3
3
3
3
11
12
1
2
13 1
14 2
3 11 9 -
12 -3 18
3 13 15 -
3 -14 1
4
4
4
4 11
12
1
2
13 1
14 1
4 11 9 -
4 -12 20
4 13 15 -
4 14 2 -
5
5
5 12 1
111
13 1
5 11 8 -
5 12 18 -
5 13 15 -
8
8 17 1
116
8 16 10 -
8 17 7 -
9
9 17
1
1
16
9 16 10 -
9 17 6 -
SP SPO
-3 1912
43 -14
12 19 -4
RDF3x (1)
 RISC style operators.
 Implements the traditional Optimize-then-Execute
paradigm, to identify left-linear plans of star-shaped sub-
queries of basic triple patterns.
 Makes use of secondary memory structures to locally
store RDF data and reduce the overhead of I/O
operations.
 Registers aggregated count or number of occurrences,
and is able to detect if a query will return an empty
answer.
116
[Neumann and Weikum 2008]
RDF3x (2)
 Triple table, each row represents a triple.
 Mapping dictionary, replacing all the literal
strings by an id; two dictionary indices.
 Compressed clustered B+-tree to index all triples:
 Six permutations of subject, predicate and object;
each permutation in a different index.
 Triples are sorted lexicographically in each B+-tree.
117
[Neumann and Weikum 2008]
RDF3x (3)
118
Mapping dictionary
Triple Table
119
RDF3x (4)
[Figure from Neumann et al 2010]
 Compressed B+-trees.
 One B+-tree per pattern.
 Structure is updatable.
120
RDF3x (5)
[Figure from Lei Zou http://hotdb.info/slides/RDF-talk.pdf]
RDF3x (6)
121
Wrote Cites
Wrote
Paper1 Paper2
Paper3
Peter Smith
122
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Wrote
John Smith
Wrote
Wrote
Wrote
Cites
Wrote
Example
RDF Subgraph for the Publication Graph
123
conference/eswc/2
012/paper/03
conference/iswc/20
12/paper/05
person/peter-smith
ontoware/
ontology#author
2012
ontoware/
ontology#year
conference/iswc/
semanticweb/
ontology#isPartOf
ISWC
semanticweb/
ontology#hasAcronym
semanticweb/ontology/C
onferenceEvent
w3c/
rdf-syntax#type
ontoware/
ontology#author
conference/eswc/
semanticweb/
ontology#isPartOf
2012
ontoware/
ontology#year
w3c/
rdf-syntax#type
ontoware/
ontology#biblioReference
ESWC
semanticweb/
ontology#hasAcronym
conference/iswc/20
12/paper/01
RDF3x: Representation of the
Publication Graph
124
OID Resource
0 http://www.w3.org/1999/02/22-rdf-syntax-ns#type
1 http://data.semanticweb.org/ns/swc/ontology#hasAcronym
2 http://data.semanticweb.org/ns/swc/ontology#isPartOf
3 http://swrc.ontoware.org/ontology#author
4 http://swrc.ontoware.org/ontology#year
5 http://swrc.ontoware.org/ontology#biblioReference
6 http://data.semanticweb.org/conference/iswc/2012
7 http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent
8 ISWC2012
9 http://data.semanticweb.org/conference/eswc/2012
10 ESWC2012
11 http://data.semanticweb.org/conference/iswc/2012/paper/05
12 http://data.semanticweb.org/conference/iswc/2012/paper/02
13 http://data.semanticweb.org/conference/eswc/2012/paper/01
14 http://data.semanticweb.org/conference/eswc/2012/paper/03
15 http://data.semanticweb.org/conference/eswc/2012/paper/04
16 http://data.semanticweb.org/author/peter-smith
17 http://data.semanticweb.org/author/juan-perez
18 http://data.semanticweb.org/author/john-smith
19 2012
S P O
6 0 7
9 0 7
6 0 7
9 1 8
11 1 10
12 2 6
13 2 6
14 2 9
15 2 9
11 3 16
12 3 18
13 3 16
14 3 16
14 3 17
15 3 17
15 3 18
11 4 19
12 4 19
13 4 19
14 4 19
15 4 19
13 5 12
14 5 13
14 5 15
15 5 12
RDF3xMappingDictionary
TripleTable
Graphium (http://graphium.ldc.usb.ve)
 Abstract Graph Database Engine built on top of: Neo4j, Sparksee,
and HyperGraphDB.
 Labeled attribute multigraph.
 Nodes and edges can have properties.
 There are no restrictions on the number of edges between two
nodes.
 Loops are allowed.
 Attributes are single valued.
 Different types of traversal strategies.
 Different data mining strategies.
 Java Graph-based and RDF-based API.
125
[ Flores et al. 2013, De Abreu et al. 2013]
Graphium (http://graphium.ldc.usb.ve)
126
[ Flores et al. 2013, De Abreu et al. 2013]
Example: Property Graph Model
• Nodes and edges may have properties
• Properties: Key-value pairs
The Beatles
Let it be
Revolver
Help!
created
Year: 1970
Duration: 35:16
Year: 1965
Year: 1966
Duration: 35:01
Homepage:
thebeatles.com
Origin: Liverpool
Source: “Scaling Up Linked Data”.
EUCLID project.
Graphium (http://graphium.ldc.usb.ve)
127
[ Flores et al. 2013, De Abreu et al. 2013]
Example: RDF Model
• Properties are represented
with nodes and edges
The Beatles
Let it be
Revolver
Help!
created
1970
35:16
duration
1965
year
1966
35:01
duration
Liverpool
thebeatles.com
Graphium RDF-based API (1)
Creating a graph
Just use the command-line tool to load your RDF graph from an NT file, and create the
database using Neo4j or Sparksee.
./create <GDBM (Sparksee or Neo4j)> <NT file> <DB location>
For example…
> ./create Neo4j publications.nt pubDB-neo4j/
> ./create Sparksee publications.nt pubDB-sparksee/
128
Graphium RDF-based API (2)
Loading the graph in Java
GraphRDF g;
g = GraphiumLoader.open( PATH );
…
g.close();
129
Graphium RDF-based API (3)
Searching for nodes…
If the node you are searching for is an URI, use:
If it is a Blank Node, use:
v = g.getVertexURI(“http://data.semanticweb.org/author/peter-smith”);
130
v = g.getVertexBlankNode(“blanknodeid”);
Vertex v;
Graphium RDF-based API (4)
From each vertex you can…
(1) Get the RDF object it represents:
131
// If you don’t know what object it represents,
// use the superclass RDFobject…
RDFobject v_any = v.getAny();
// If you know, use any of the subclasses…
URI v_uri = v.getURI();
BlankNode v_bn = v.getBlankNode();
Literal v_lit = v.getLiteral();
Graphium RDF-based API (5)
From each vertex you can…
(2) Get its in/out relationships:
132
GraphIterator<Edge> it;
// For out-going relationships…
it = v.getEdgesOut();
// For in-coming relationships…
it = v.getEdgesIn();
while (it.hasNext()) {
Edge rel = it.next();
…
}
it.close();
Graphium RDF-based API (6)
133
From each edge you can…
(1) Get its predicate/URI:
(2) Get the start/end of the edge (the subject and object of each triple):
URI predicate = rel.getURI();
Vertex start_v, end_v;
start_v = rel.getStart();
end_v = rel.getEnd();
Graphium RDF-based API (7)
134
You can also traverse the graph…
Vertex v;
…
GraphIterator<Vertex> it = kHop.run(2,v);
while (it.hasNext()) {
… it.next() …
}
it.close();
Graphium RDF-based API (8)
135
You can also traverse the graph using…
BFS…
DFS…
k-Hops…
Or even k-Hops from a set of nodes…
GraphIterator<Vertex> it = BFS.run(v);
GraphIterator<Vertex> it = DFS.run(v);
GraphIterator<Vertex> it = kHop.run(k,v);
GraphIterator<Vertex> it = kHop.run(k,v1,v2,v3,…);
COFFEE BREAK
136
HANDS-ON SESSION
137
5
Hands-on Goals
• Implement simple graph-based queries by using
Neo4j and Sparksee API.
• Implement simple RDF-based queries by using
Graphium API.
• Implement graph invariants by using Graphium
API.
Go to https://www.github.com/AlejandroFloresV/Tutorial-GraphDB
http://www.sparsity-technologies.com/downloads/javadoc-
sparkseejava.pdf
138
Author Cites
Author
Paper1 Paper2
Paper3
Peter Smith
139
Conference
ESWC
Paper4
Conference
Cites
Paper5
ISWC
Conference
Conference
Cites
Conference
Juan Perez
Author
John Smith
Author
Author
Author
Cites
Author
Example
Assignments
140
> make Neo4jCreate
> make Neo4jQuery
> ./publications Neo4jCreate neo4j_db
> ./publications Neo4jQuery neo4j_db
1) Implement the query “Papers written by Peter
Smith” using the Sparksee API and Neo4j API.
Complete the code for Neo4j, located in the files
experiment/Neo4jCreate.java and experiment/Neo4jQuery.java. Then..
> make SparkseeCreate
> make SparkseeQuery
> ./publications SparkseeCreate sparksee_db
> ./publications SparkseeQuery sparksee_db
Now complete the code for Sparksee, located in the files
experiment/SparkseeCreate.java and experiment/SparkseeQuery.java.
Then..
RDF Subgraph for the Publication Graph
141
conference/eswc/2
012/paper/03
conference/iswc/20
12/paper/05
person/peter-smith
ontoware/
ontology#author
2012
ontoware/
ontology#year
conference/iswc/
semanticweb/
ontology#isPartOf
ISWC
semanticweb/
ontology#hasAcronym
semanticweb/ontology/C
onferenceEvent
w3c/
rdf-syntax#type
ontoware/
ontology#author
conference/eswc/
semanticweb/
ontology#isPartOf
2012
ontoware/
ontology#year
w3c/
rdf-syntax#type
ontoware/
ontology#biblioReference
ESWC
semanticweb/
ontology#hasAcronym
conference/iswc/20
12/paper/01
Assignments
2) Implement the following queries using the Graphium
API3:
a) “Papers written by Peter Smith”.
b) “Papers cited by a paper written by Peter Smith that have at most 2 cites.”.
c) “Papers cited by a paper written by Peter Smith or cited by papers cited by
a paper written by Peter Smith.”
d) “Number of papers cited by a paper written by Peter Smith or cited by
papers cited by a paper written by Peter Smith. “
e) “Number of papers cited by a paper written by Peter Smith or cited by
papers cited by a paper written by Peter Smith, and have been published in
ESWC.”
f) “Number of papers cited by a paper written by Peter Smith or cited by
papers cited by a paper written by Peter Smith, have been published in
ESWC and have at most 4 co-authors “.
3 http://graphium.ldc.usb.ve
142
Assignments
143
> ./create <Neo4j or Sparksee> publications.nt graphium_db
2) Implementing queries using the Graphium API.
First, let's create the Graphium DB from the NT file publications.nt:
Assignments
144
> make A
> ./publications A graphium_db
2a) “Papers written by Peter Smith”.
Complete the code in experiment/A.java, and run (in the terminal)...
2b) “Papers cited by a paper written by Peter Smith that
have at most 2 cites”.
Complete the code in experiment/B.java, and run (in the terminal)...
> Make B
> ./publications B graphium_db
Assignments
145
> Make C
> ./publications C graphium_db
2c) “Papers cited by a paper written by Peter Smith or
cited by papers cited by a paper written by Peter
Smith”.
Complete the code in experiment/C.java, and run (in the terminal)...
2d) “Number of papers cited by a paper written by Peter
Smith or cited by papers cited by a paper written by
Peter Smith”.
Complete the code in experiment/D.java, and run (in the terminal)...
> Make D
> ./publications D graphium_db
Assignments
146
> Make E
> ./publications E graphium_db
2e) “Number of papers cited by a paper written by Peter Smith or
cited by papers cited by a paper written by Peter Smith, and
have been published in ESWC”.
Complete the code in experiment/E.java, and run (in the terminal)...
2 f) “Number of papers cited by a paper written by Peter Smith or
cited by papers cited by a paper written by Peter Smith, have
been published in ESWC and have at most 4 co-authors”.
Complete the code in experiment/F.java, and run (in the terminal)...
> Make F
> ./publications F graphium_db
Assignments
3) Implement graph invariants using the
Graphium API3
147
Invariant Description
Vertex and Edge Count number of vertices and edges in the graph.
Graph Density number of edges in the graph divided by the
number of possible edges in a complete digraph.
Reciprocity Reciprocity measures the extend to which a triple
that relates resources A and B is reciprocated by a
another triple that relates B with A too.
In- and Out-degree Distribution Distribution of the number of in-coming and out-
going edges of the vertices of a graph.
In-coming and Out-going h-index h is the maximum number, such that h vertices
have each at least h in-coming neighbors (resp.,
out-going neighbors) in the graph.
3 http://graphium.ldc.usb.ve
Assignments
148
> make Nodes
> ./publications Nodes graphium_db
3) Implement graph invariants using the Graphium
API.
3a) Number of nodes/vertices in the graph
Complete the code in experiment/Nodes.java, and run (in the terminal)...
> make Density
> ./publications Density graphium_db
3b) Graph Density
Complete the code in experiment/Density.java, and run (in the
terminal)...
Assignments
4) Compute graph invariant of different RDF
graphs, using the Chrysalis tool, and upload the
results in the Graphium Chrysalis4 Website.
149
4 http://graphium.ldc.usb.ve/chrysalis/
5 https://sites.google.com/site/dawfederation/data-sets
Go to the DAW5 website and download one of the datasets (.nt files). Then
use the terminal to run:
> ./create <Neo4j or Sparksee> <.nt file path> test_db
> ./chrysalis test_db
Now, go to the Graphium Chrysalis website and upload the file
chrysalis.log to visualize it.
EMPIRICAL EVALUATION
150
6
Experimental Set-Up
Machine:
Model: Sun Fire x4100
Quantity of processors: 2
Processor specification: Dual-Core AMD Opteron™ Processor 2218 64 bits
RAM memory: 16GB
Disk capacity: 2 disks, 135GB e/o
Network interface: 2 + ALOM
Operating System: CentOS 5.5. 64 bits
Engines:
RDF3x version: 0.3.7
Sparksee version: 4.8 Very Large Databases
HyperGraphDB version: 1.2
Neo4J version: 1.9
Timeout: 3,600 secs.
Implementation:
We provide both internal implementation for traversal methods using the API’s of each engine, and the
external implementation using the generic algorithm through the core-graph API methods.
Runs: Up to 10 times for each test. Average is reported. All experiments were run in cold cache.
The graphs were loaded in sif format for the general purpose graph engines and in nt format for RDF-3x.
Only RDF-3x could load graphs from Berlin Benchmark of over 50 million edges.
Download the code and datasets at https://github.com/gpalma/gdbb/
151COLD 2013
152
Graph Name #Nodes #Edges Description
DSJC1000.1 1,000 49,629 Graph Density:0.1
[Johnson91]
DSJC1000.5 1,000 249,826 Graph Density:0.5
[Johnson91]
DSJC1000.9 1,000 449,449 Graph Density:0.9
[Johnson91]
SSCA2-17 131,072 3,907,909 GTgraph 1)
[Johnson91] Johnson, D., Aragon, C., McGeoch, L., and Schevon, C. Optimization by simulated
annealing: an experimental evaluation; part ii, graph coloring and number partitioning. Operations
research 39, 3 (1991), 378–406.
GTgraph: A suite of three synthetic graph generators:
1) SSCA2: generates graphs used as input instances for the DARPA. High Productivity Computing
Systems SSCA#2 graph theory benchmark
http://www.cse.psu.edu/~madduri/software/GTgraph/index.html
Benchmark Graphs
153
Task Description
adjacentXP Given a node X and an edge label P, find adjacent
nodes Y.
adjacentX Given a node X, find adjacent nodes Y.
edgeBetween Given two nodes X and Y, find the labels of the
edges between X and Y.
2-hopX Given a node X, find the 2-hop Y nodes.
3-hopX Given a node X, find the 3-hop Y nodes.
4-hopX Given a node X, find the 4-hop Y nodes.
Benchmark Tasks
The benchmark tasks evaluate the engine performance while executing
the following graph operations:
154
Task SPARQL Query
adjacentXP Select ?y
where {
<http://graphdatabase.ldc.usb.ve/resource/6> <http://graphdatabase.ldc.usb.ve/resource/pr>
?y}
adjacentX Select ?y
where {
<http://graphdatabase.ldc.usb.ve/resource/6> ?p ?y}
edgeBetween Select ?p
where {
<http://graphdatabase.ldc.usb.ve/resource/6> ?p
<http://graphdatabase.ldc.usb.ve/resource/8>
2-hopX Select ?z
where {
<http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1.
?y1 ?p2 ?z}
3-hopX Select ?z
where {
<http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1.
?y1 ?p2 ?y2.
?y2 ?p3 ?z}
4-hopX Select ?z
where {
<http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1.
?y1 ?p2 ?y2.
?y2 ?p3 ?y3.
?y3 ?p4 ?z}
Benchmark Tasks: RDF Engines
The benchmark tasks were implemented in the RDF engines with the
following SPARQL queries:
155
Benchmark of Graph
Graph Name #Nodes #Edges Density #Labels
DSJC1000.1
[Johnson91]
1,000 99,258 0.099 1
DSJC1000.5
[Johnson91]
1,000 499,652 0.50 1
DSJC1000.9
[Johnson91]
1,000 898,898 0.899 1
USA-road-
d.NY
264,346 730,100 0.00001045 7,970
USA-road-
d.FLA
1,070,376 2,687,902 0.00000235 22,704
Berlin10M 2,743,235 9,709,119 0.00000129 40
✔
✔
✔
✔
✔
✔
✔
Dense GraphsSparse GraphsFew LabelsMany LabelsSmall GraphsMedium-Size Graphs
[Johnson91] Johnson, D., Aragon, C., McGeoch, L., and Schevon, C. Optimization by simulated annealing: an experimental
evaluation; part ii, graph coloring and number partitioning. Operations research 39, 3 (1991), 378–406.
USA-road-d* Graphs 9th DIMACS Implementation Challenge - Shortest Paths http://www.dis.uniroma1.it/challenge9/download.shtml
Berlin10M: Berlin Bechmark-http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/
COLD 2013
Graph Creation-Time and Memory
HyperGraphDB timed out after
24 hours (Berlin10M).
RDF-3x is faster than the rest
of the engines.
Secondary Memory grows as
the size of the graph.
In general, RDF-3x requires
less memory than the rest of
the engines.
156COLD 2013
Adjacency Tests
RFD-3x is impacted by the graph density;
it timed out for 4-hops in
DSJC.5 and DSJC.9.
 RDF-3x exploits internal structures in
sparse graphs.
Sparksee outperforms all the engines in
dense graph with few label; internal and
external implementations are competitive.
Number of label negatively impacts
RDF-3x exploits internal structures and overcomes the rest of the graph engines.
Neo4j and Sparksee are competitive.
HyperGraphDB always consumes at least one order of magnitude more time than
the rest of the engines.
157COLD 2013
Traversals-BFS and DFS (Internal and
External)
Internal representations outperform
external representations.
Neo4j exploits internal representation
of graph, neighborhood indices, and
main memory-structures and
overcomes the rest of the engines.
158COLD 2013
Mining Tasks
Graph Summarization seems to be more
affected by graph density, size of the graph
and number of labels.
Sparksee overcomes the rest of engines
when the graphs are dense.
Neo4j overcomes the rest of the engines
in sparse graphs.
Sparksee overcomes the rest of
engines when the graphs are
dense.
Neo4j overcomes the rest of the
engines in sparse graphs.
159COLD 2013
http://graphium.ldc.usb.ve/
160COLD 2013
161
QUESTIONS & DISCUSSION
162
6
Graph Data Storing Features
163
Graph Database Main
Memory
Persistent
Memory
Backend
Storage/Implem
entation
Indexi
ng
Sparksee X X X
HyperGraphDB1 X X X X
Neo4j X X X
Hexastore2 X X X X
RDF3x X X
Graphium3
Graph-based
X X X X
Graphium3
RDF-based
X X X X
2Backend Storage: Tokyo Cabinet
3Backend Implementation: Neo4j and Sparksee
1Backend Storage: Berkeley DB
Graph Data Management Features
164
Graph Database Graph-based API/
RDF-based API
Query
Language
Data
Visuali-
zation
Data
Definition
Data
Manipulation
Standard
Compliance
Sparksee X X X X
HyperGraphDB X X X
Neo4j X X X X X
Hexastore X X X X
RDF3x X
Graphium
Graph-based
X X X
Graphium
RDF-based
X X X
Adjacency
Queries
Node/Edge
Adjacency
K-neighborhood
of a node.
Reachability
Queries
Test whenever
two nodes are
connected by a
path.
Shortest Path.
Pattern
Matching
Queries
Find all sub-graphs of
a data graph.
Summarization
Queries
Aggregate the
results of a
graph-based
query
165
Graph-Based Operations
166
Graph
Database
Node/Edge
adjacency
K-
neighborhood
Fixed-length
paths
RegularSimple
Paths
ShortestPath
Pattern
Matching
DataMining
Tasks
Sparksee X X X X X X
HyperGraphDB X X X X X
Neo4j X X X X X X
Hexastore X X
RDF3x X X
Graphium
Graph-based
X X X X X X X
Graphium
RDF-based
X X X X
Graph-Based Operations
SUMMARY & CLOSING
167
7
Data Storing Features:
 Engines implement a wide variety of
structures to efficiently represent large graphs.
 RDF engines implement special-purpose
structures to efficiently store RDF graphs.
168
Conclusions (1)
Data Management Features:
 General-purpose graph database engines
provide APIs to manage and query data.
 RDF engines support general data management
operations.
169
Conclusions (2)
Graph-Based Operations Support:
 General-purpose graph database engines provide API
methods to implement a wide variety of graph-based
operations.
 Only few offer query languages to express pattern
matching.
 RDF engines are not customized for basic graph-based
operations, however,
 They efficiently implement the pattern matching-
based language SPARQL.
170
Conclusions (3)
 Extension of existing engines to support specific
tasks of graph traversal and mining.
 Definition of benchmarks to study the performance
and quality of existing graph database engines, e.g.,
 Graph traversal, link prediction, pattern discovery,
graph mining.
 Empirical evaluation of the performance and quality
of existing graph database engines
171
Future Directions
Acknowledgements
Maribel Acosta is supported by XLike,
European Community's Seventh
Framework Programme FP7-ICT-2011-7
(XLike, Grant 288342).
Edna Ruckhaus and Maria-Esther Vidal are
partially supported by DID-USB.
172
• P. Anderson, A. Thor, J. Benik, L. Raschid, and M.-E. Vidal. Pang: finding patterns in
anno- tation graphs. In SIGMOD Conference, pages 677–680, 2012.
• R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv.,
40(1), 2008.
• M. Atre, V. Chaoji, M. Zaki, and J. Hendler. Matrix Bit loaded: a scalable lightweight
join query processor for RDF data. In International Conference on World Wide Web,
Raleigh, USA, 2010. ACM.
• D. De Abreu, A. Flores, G. Palma, V. Pestana, J. Piñero, J. Queipo, J. Sánchez, M.E.
Vidal. Choosing Between Graph Databases and RDF Engines for Consuming and
Mining Linked Data. COLD 2013 in Conjunction with ISWC 2013.
• A.Flores, G. Palma, M.E. Vidal, D. De Abreu, V. Pestana, J. Piñero, J. Queipo, J.
Sánchez, GRAPHIUM: Visualizing Performance of Graph and RDF Engines on
Linked Data. Poster & Demo ISWC 2013.
• A. Flores, G. Palma, M.E. Vidal, D. De Abreu, V. Pestana, J. Piñero, J. Queipo, J.
Sánchez, Graphium Chrysalis: Exploiting Graph Database Engines to Analyze RDF
Graphs. Demo ESWC 2014.
173
References
• B.Iordanov.Hypergraphdb:Ageneralizedgraphdatabase.InWAIMWorkshops,pages25–
36, 2010.
• N. Martíınez-Bazan, V. Muntés-Mulero, S. G ómez-Villamor, J. Nin, M.-A. Sá́ nchez-
Martíınez, and J.-L. Larriba-Pey. Dex: high-performance exploration on large graphs
for information retrieval. In CIKM, pages 573–582, 2007.
• T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB,
1(1), 2008.
• M. K. Nguyen, C. Basca, and A. Bernstein. B+hash tree: optimizing query execution
times for on-disk semantic web data structures. In Proceedings Of The 6th
International Workshop On Scalable Semantic Web Knowledge Base Systems
(SSWS2010), 2010.
• I Robinson, J. Webber, and E. Eifrem. Graph Databases. O’Reilly Media,
Incorporated, 2013.
174
References
• S. Sakr and E. Pardede, editors. Graph Data Management: Techniques and
Applications. IGI Global, 2011.
• D.Shimpiand S.Chaudhari. An overview of graph databases .In IJCA Proceedings on
International Conference on Recent Trends in Information Technology and Computer
Science 2012, 2013.
• M.-E. Vidal, A. Martínez, E. Ruckhaus, T. Lampo, and J. Sierra. On the efficiency of
querying and storing rdf documents. In Graph Data Management, pages 354–385.
2011.
• C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for semantic web
data management. Proc. of the 34th Intl Conf. on Very Large Data Bases (VLDB),
2008.
• C Weiss, A Bernstein, On-disk storage techniques for Semantic Web data - Are B-
Trees always the optimal solution?, Proceedings of the 5th International Workshop on
Scalable Semantic Web Knowledge Base Systems, October 2009.
175
References

Contenu connexe

Tendances

An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...Amazon Web Services
 
Accenture Tech Vision 2020 - Overview
Accenture Tech Vision 2020 - OverviewAccenture Tech Vision 2020 - Overview
Accenture Tech Vision 2020 - Overviewaccenture
 
Rethinking Accenture's network
Rethinking Accenture's networkRethinking Accenture's network
Rethinking Accenture's networkaccenture
 
Cloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloudCloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloudSourabh Saxena
 
Assessing Your Company's Cloud Readiness
Assessing Your Company's Cloud ReadinessAssessing Your Company's Cloud Readiness
Assessing Your Company's Cloud ReadinessAmazon Web Services
 
The Industrialist: Trends & Innovations - October 2022
The Industrialist: Trends & Innovations - October 2022The Industrialist: Trends & Innovations - October 2022
The Industrialist: Trends & Innovations - October 2022accenture
 
The Industrialist: Trends & Innovations - July 2022
The Industrialist: Trends & Innovations - July 2022The Industrialist: Trends & Innovations - July 2022
The Industrialist: Trends & Innovations - July 2022accenture
 
INTIENT Patient Solution
INTIENT Patient SolutionINTIENT Patient Solution
INTIENT Patient Solutionaccenture
 
The Transformation Journey with Cloud Technology
The Transformation Journey with Cloud TechnologyThe Transformation Journey with Cloud Technology
The Transformation Journey with Cloud TechnologyAmazon Web Services
 
The Total Economic ImpactTM (TEI) of Neo4j, Featuring Forrester
The Total Economic ImpactTM (TEI) of Neo4j, Featuring ForresterThe Total Economic ImpactTM (TEI) of Neo4j, Featuring Forrester
The Total Economic ImpactTM (TEI) of Neo4j, Featuring ForresterNeo4j
 
Fairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML SystemsFairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML SystemsKrishnaram Kenthapadi
 
A pattern language for microservices (#gluecon #gluecon2016)
A pattern language for microservices (#gluecon #gluecon2016)A pattern language for microservices (#gluecon #gluecon2016)
A pattern language for microservices (#gluecon #gluecon2016)Chris Richardson
 
Network Operations | SlideShare | Accenture
Network Operations | SlideShare | AccentureNetwork Operations | SlideShare | Accenture
Network Operations | SlideShare | AccentureAccenture Operations
 
Cloud Migration Checklist | Microsoft Azure Migration
Cloud Migration Checklist | Microsoft Azure MigrationCloud Migration Checklist | Microsoft Azure Migration
Cloud Migration Checklist | Microsoft Azure MigrationIntellika
 
DDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at ScaleDDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at Scaleinside-BigData.com
 
The Value Multiplier | SlideShare | Accenture
The Value Multiplier | SlideShare | AccentureThe Value Multiplier | SlideShare | Accenture
The Value Multiplier | SlideShare | AccentureAccenture Operations
 
Multi-Cloud Strategy for Unrestricted Possibilities
Multi-Cloud Strategy for Unrestricted PossibilitiesMulti-Cloud Strategy for Unrestricted Possibilities
Multi-Cloud Strategy for Unrestricted PossibilitiesHarsh V Sehgal
 

Tendances (20)

An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
An Overview of Best Practices for Large Scale Migrations - AWS Transformation...
 
Accenture Tech Vision 2020 - Overview
Accenture Tech Vision 2020 - OverviewAccenture Tech Vision 2020 - Overview
Accenture Tech Vision 2020 - Overview
 
Rethinking Accenture's network
Rethinking Accenture's networkRethinking Accenture's network
Rethinking Accenture's network
 
Cloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloudCloud computing and migration strategies to cloud
Cloud computing and migration strategies to cloud
 
Accenture Labs Innovation Stories
Accenture Labs Innovation StoriesAccenture Labs Innovation Stories
Accenture Labs Innovation Stories
 
Assessing Your Company's Cloud Readiness
Assessing Your Company's Cloud ReadinessAssessing Your Company's Cloud Readiness
Assessing Your Company's Cloud Readiness
 
The Industrialist: Trends & Innovations - October 2022
The Industrialist: Trends & Innovations - October 2022The Industrialist: Trends & Innovations - October 2022
The Industrialist: Trends & Innovations - October 2022
 
The Industrialist: Trends & Innovations - July 2022
The Industrialist: Trends & Innovations - July 2022The Industrialist: Trends & Innovations - July 2022
The Industrialist: Trends & Innovations - July 2022
 
INTIENT Patient Solution
INTIENT Patient SolutionINTIENT Patient Solution
INTIENT Patient Solution
 
The Transformation Journey with Cloud Technology
The Transformation Journey with Cloud TechnologyThe Transformation Journey with Cloud Technology
The Transformation Journey with Cloud Technology
 
The Total Economic ImpactTM (TEI) of Neo4j, Featuring Forrester
The Total Economic ImpactTM (TEI) of Neo4j, Featuring ForresterThe Total Economic ImpactTM (TEI) of Neo4j, Featuring Forrester
The Total Economic ImpactTM (TEI) of Neo4j, Featuring Forrester
 
Fairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML SystemsFairness and Privacy in AI/ML Systems
Fairness and Privacy in AI/ML Systems
 
A pattern language for microservices (#gluecon #gluecon2016)
A pattern language for microservices (#gluecon #gluecon2016)A pattern language for microservices (#gluecon #gluecon2016)
A pattern language for microservices (#gluecon #gluecon2016)
 
Network Operations | SlideShare | Accenture
Network Operations | SlideShare | AccentureNetwork Operations | SlideShare | Accenture
Network Operations | SlideShare | Accenture
 
Groundswell. La estrategia 2.0 de la Sexta
Groundswell. La estrategia 2.0 de la SextaGroundswell. La estrategia 2.0 de la Sexta
Groundswell. La estrategia 2.0 de la Sexta
 
Cloud computing
Cloud computingCloud computing
Cloud computing
 
Cloud Migration Checklist | Microsoft Azure Migration
Cloud Migration Checklist | Microsoft Azure MigrationCloud Migration Checklist | Microsoft Azure Migration
Cloud Migration Checklist | Microsoft Azure Migration
 
DDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at ScaleDDN EXA 5 - Innovation at Scale
DDN EXA 5 - Innovation at Scale
 
The Value Multiplier | SlideShare | Accenture
The Value Multiplier | SlideShare | AccentureThe Value Multiplier | SlideShare | Accenture
The Value Multiplier | SlideShare | Accenture
 
Multi-Cloud Strategy for Unrestricted Possibilities
Multi-Cloud Strategy for Unrestricted PossibilitiesMulti-Cloud Strategy for Unrestricted Possibilities
Multi-Cloud Strategy for Unrestricted Possibilities
 

En vedette

HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via CrowdsourcingHARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via CrowdsourcingMaribel Acosta Deibe
 
Conference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic DataConference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic DataAnna Lisa Gentile
 
CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013Lora Aroyo
 
Couchbase Performance Benchmarking 2012
Couchbase Performance Benchmarking 2012Couchbase Performance Benchmarking 2012
Couchbase Performance Benchmarking 2012Altoros
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentMaribel Acosta Deibe
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataEUCLID project
 
Thinking Outside the Table
Thinking Outside the TableThinking Outside the Table
Thinking Outside the TableOntotext
 
Semantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesSemantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesCambridge Semantics
 
Aula 12 - Gestão do Conhecimento
Aula 12 - Gestão do ConhecimentoAula 12 - Gestão do Conhecimento
Aula 12 - Gestão do ConhecimentoFilipo Mór
 
Inglês técnico técnicas de leitura
Inglês técnico   técnicas de leituraInglês técnico   técnicas de leitura
Inglês técnico técnicas de leituraAnselmo Gomes
 

En vedette (13)

HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via CrowdsourcingHARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
HARE: A Hybrid SPARQL Engine to Enhance Query Answers via Crowdsourcing
 
Conference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic DataConference Live: Accessible and Sociable Conference Semantic Data
Conference Live: Accessible and Sociable Conference Semantic Data
 
CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013CrowdSem 2013 Workshop @ISWC2013
CrowdSem 2013 Workshop @ISWC2013
 
Couchbase Performance Benchmarking 2012
Couchbase Performance Benchmarking 2012Couchbase Performance Benchmarking 2012
Couchbase Performance Benchmarking 2012
 
Crowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality AssessmentCrowdsourcing Linked Data Quality Assessment
Crowdsourcing Linked Data Quality Assessment
 
Resource Description and Access (RDA)
Resource Description and Access (RDA)Resource Description and Access (RDA)
Resource Description and Access (RDA)
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
Thinking Outside the Table
Thinking Outside the TableThinking Outside the Table
Thinking Outside the Table
 
AACR2r Parte II: Pontos de acesso (2015)
AACR2r Parte II: Pontos de acesso (2015)AACR2r Parte II: Pontos de acesso (2015)
AACR2r Parte II: Pontos de acesso (2015)
 
Semantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational DatabasesSemantic Graph Databases: The Evolution of Relational Databases
Semantic Graph Databases: The Evolution of Relational Databases
 
Aula 12 - Gestão do Conhecimento
Aula 12 - Gestão do ConhecimentoAula 12 - Gestão do Conhecimento
Aula 12 - Gestão do Conhecimento
 
Skimming and scanning
Skimming and scanningSkimming and scanning
Skimming and scanning
 
Inglês técnico técnicas de leitura
Inglês técnico   técnicas de leituraInglês técnico   técnicas de leitura
Inglês técnico técnicas de leitura
 

Similaire à Semantic Data Management in Graph Databases: ESWC 2014 Tutorial

Semantic Data Management in Graph Databases
Semantic Data Management in Graph DatabasesSemantic Data Management in Graph Databases
Semantic Data Management in Graph DatabasesMaribel Acosta Deibe
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXBenjamin Bengfort
 
Graph convolutional networks in apache spark
Graph convolutional networks in apache sparkGraph convolutional networks in apache spark
Graph convolutional networks in apache sparkEmiliano Martinez Sanchez
 
Social network-analysis-in-python
Social network-analysis-in-pythonSocial network-analysis-in-python
Social network-analysis-in-pythonJoe OntheRocks
 
E5-roughsets unit-V.pdf
E5-roughsets unit-V.pdfE5-roughsets unit-V.pdf
E5-roughsets unit-V.pdfRamya Nellutla
 
Graph Introduction.ppt
Graph Introduction.pptGraph Introduction.ppt
Graph Introduction.pptFaruk Hossen
 
Lecture 5b graphs and hashing
Lecture 5b graphs and hashingLecture 5b graphs and hashing
Lecture 5b graphs and hashingVictor Palmar
 
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdfAdvanced-Concepts-Team
 
Relations and functions
Relations and functions Relations and functions
Relations and functions Leslie Amoguis
 
Asymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAsymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAmrinder Arora
 
relationsandfunctionslessonproper-160929053921.pdf
relationsandfunctionslessonproper-160929053921.pdfrelationsandfunctionslessonproper-160929053921.pdf
relationsandfunctionslessonproper-160929053921.pdfBhanuCharan9
 
Representing and visiting graphs
Representing and visiting graphsRepresenting and visiting graphs
Representing and visiting graphsFulvio Corno
 
19. Java data structures algorithms and complexity
19. Java data structures algorithms and complexity19. Java data structures algorithms and complexity
19. Java data structures algorithms and complexityIntro C# Book
 
Formulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on GraphFormulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on Graphijtsrd
 
19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm Complexity19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm ComplexityIntro C# Book
 
Decision Trees and Bayes Classifiers
Decision Trees and Bayes ClassifiersDecision Trees and Bayes Classifiers
Decision Trees and Bayes ClassifiersAlexander Jung
 

Similaire à Semantic Data Management in Graph Databases: ESWC 2014 Tutorial (20)

Semantic Data Management in Graph Databases
Semantic Data Management in Graph DatabasesSemantic Data Management in Graph Databases
Semantic Data Management in Graph Databases
 
Lausanne 2019 #4
Lausanne 2019 #4Lausanne 2019 #4
Lausanne 2019 #4
 
Graph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkXGraph Analyses with Python and NetworkX
Graph Analyses with Python and NetworkX
 
Graph convolutional networks in apache spark
Graph convolutional networks in apache sparkGraph convolutional networks in apache spark
Graph convolutional networks in apache spark
 
Social network-analysis-in-python
Social network-analysis-in-pythonSocial network-analysis-in-python
Social network-analysis-in-python
 
E5-roughsets unit-V.pdf
E5-roughsets unit-V.pdfE5-roughsets unit-V.pdf
E5-roughsets unit-V.pdf
 
Graph Introduction.ppt
Graph Introduction.pptGraph Introduction.ppt
Graph Introduction.ppt
 
Lecture 5b graphs and hashing
Lecture 5b graphs and hashingLecture 5b graphs and hashing
Lecture 5b graphs and hashing
 
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
2024.03.22 - Mike Heddes - Introduction to Hyperdimensional Computing.pdf
 
Relations and functions
Relations and functions Relations and functions
Relations and functions
 
Asymptotic Notation and Data Structures
Asymptotic Notation and Data StructuresAsymptotic Notation and Data Structures
Asymptotic Notation and Data Structures
 
relationsandfunctionslessonproper-160929053921.pdf
relationsandfunctionslessonproper-160929053921.pdfrelationsandfunctionslessonproper-160929053921.pdf
relationsandfunctionslessonproper-160929053921.pdf
 
Representing and visiting graphs
Representing and visiting graphsRepresenting and visiting graphs
Representing and visiting graphs
 
Functions
FunctionsFunctions
Functions
 
19. Java data structures algorithms and complexity
19. Java data structures algorithms and complexity19. Java data structures algorithms and complexity
19. Java data structures algorithms and complexity
 
Formulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on GraphFormulas for Surface Weighted Numbers on Graph
Formulas for Surface Weighted Numbers on Graph
 
19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm Complexity19. Data Structures and Algorithm Complexity
19. Data Structures and Algorithm Complexity
 
Dijkstra
DijkstraDijkstra
Dijkstra
 
d
dd
d
 
Decision Trees and Bayes Classifiers
Decision Trees and Bayes ClassifiersDecision Trees and Bayes Classifiers
Decision Trees and Bayes Classifiers
 

Dernier

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

Semantic Data Management in Graph Databases: ESWC 2014 Tutorial

  • 1. Semantic Data Management in Graph Databases -Tutorial at ESWC 2014- Maria-Esther Vidal Edna Ruckhaus Maribel Acosta Cosmin Basca USB 1 Alejandro Flores
  • 2. 2 Network of Friends in a High School Relationships among artists in Last.fm http://sixdegrees.hu/last.fm/ Network structure of music genres and their stylistic origin http://www.infosysblogs.com/web2/2013/01/network_structure_of_music_gen.html Network structure of Patent Citations http://www.infosysblogs.com/web2/2013/07/ Graphs …
  • 3. 3 A significant increase of graph data in the form of social & biological information. Tasks to be Solved … Patterns of connections between people to understand functioning of society. Topological properties of graphs can be used to identify patterns that reveal phenomena, anomalies and potentially lead to a discovery.
  • 4. Tasks to be Solved … (2) Degree 4 The importance of the information relies on the relations more or equal than on the entities. In-degree Out-Degree Hubs http://www.infosysblogs.com/web2/2013/01/network_structure_of_music_gen.html
  • 5. Basic Graph Operations Basic Graph operations that need to be efficiently performed:  Graph traversal.  Measuring path length.  Finding the most interesting central node.  Graph Invariants. 5
  • 6. NoSQL-based Approaches These approaches can handle unstructured, unpredictable or messy data. 6 Wide-column stores BigTable model of Google Cassandra Document Stores Semi-structured Data MongoDB Key-value stores Berkeley DB Graph Databases Graph-based oriented data
  • 7. Agenda Morning Session 7 Basic Concepts & Background (60 min) The Graph Data Management Paradigm ( 35 min.) Coffee Break (30 min.) Existing Graph Database Engines (75 min.) Questions & Discussion (15 min.) Lunch Break (90 min.) 1 2 3 4
  • 8. Agenda Afternoon Session 8 RDF-Based Engines (45 min) Empirical Evaluation & Hands-On Description( 45 min.) Coffee Break (30 min.) Hands-On (75 min.) Questions & Discussion (15 min.) 1 2 3 4
  • 10. Abstract Data Type Graph G=(V, E,Σ,L) is a graph:  V is a finite set of nodes or vertices, e.g. V={Term, forOffice, Organization,…}  E is a finite set of edges representing binary relationship between elements in V, e.g. E={(forOffice,Term) (forOffice,Organization),(Office,Organization)…}  Σis a set of labels, e.g., Σ={domain, range, sc, type, …}  L is a function: V x V  Σ, 10 e.g., L={((forOffice,Term),domain), ((forOffice,Organization),range)… }
  • 11. Abstract Data Type Multi-Graph G=(V, E,Σ,L) is a multi-graph:  V is a finite set of nodes or vertices, e.g. V={Term, forOffice, Organization,…}  E is a set of edges representing binary relationship between elements in V, e.g. E={(forOffice,Term) (forOffice,Organization),(Office,Organization)…}  Σis a set of labels, e.g., Σ={domain, range, sc, type, …}  L is a function: V x V  PowerSet(Σ), 11 e.g., L={((forOffice,Term),{domain}), ((forOffice,Organization),{range}), ((_id0,AZ),{forOffice, forOrganization})… }
  • 12. Basic Operations Given a graph G, the following are operations over G:  AddNode(G,x): adds node x to the graph G.  DeleteNode(G,x): deletes the node x from graph G.  Adjacent(G,x,y): tests if there is an edge from x to y.  Neighbors(G,x): nodes y s.t. there is an edge from x to y.  AdjacentEdges(G,x,y): set of labels of edges from x to y.  Add(G,x,y,l): adds an edge between x and y with label l.  Delete(G,x,y,l): deletes an edge between x and y with label l.  Reach(G,x,y): tests if there a path from x to y.  Path(G,x,y): a (shortest) path from x to y.  2-hop(G,x): set of nodes y s.t. there is a path of length 2 from x to y, or from y to x.  n-hop(G,x): set of nodes y s.t. there is a path of length n from x to y, or from y to x. 12
  • 13. Implementation of Graphs 13 Adjacency List For each node a list of neighbors. If the graph is directed, adjacency list of i contains only the outgoing nodes of i. Cheaper for obtaining the neighbors of a node. Not suitable for checking if there is an edge between two nodes. Incidence List Vertices and edges are stored as records or objects. Each vertex stores incident edges. Each edge stores incident nodes. Incidence Matrix Bidimensional representation of graph. Rows represent Vertices. Columns represent edges An entry of 1 represents that the Source Vertex is incident to the Edge. Adjacency Matrix Bidimensional representation of graph. Rows represent Source Vertices. Columns represent Destination Vertices. Each entry with 1 represents that there is an edge from the source node to the destination node. Compressed Adjacency Matrix Differential encoding between two consecutive nodes [Sakr and Pardede 2012]
  • 14. Adjacency List 14 V1 V2 V3 V4 (V1,{L2}) (V3,{L3} ) (V1,{L1}) Properties:  Storage: O(|V|+|E|+|L|)  Adjacent(G,x,y): O(|V|)  Neighbors(G,x): O(|V|)  AdjacentEdges(G,x,y): O(|V|+|E|)  Add(G,x,y,l): O(|V|+|E|)  Delete(G,x,y,l): O(|V|+|E|) L2 L3 L1 V1 V2 V3 V4
  • 15. Implementation of Graphs 15 Adjacency List For each node a list of neighbors. If the graph is directed, adjacency list of i contains only the outgoing nodes of i. Cheaper for obtaining the neighbors of a node. Not suitable for checking if there is an edge between two nodes. Incidence List Vertices and edges are stored as records of objects. Each vertex stores incident edges. Each edge stores incident nodes. Adjacency Matrix Bidimensional representation of graph. Rows represent Source Vertices. Columns represent Destination Vertices. Each entry with 1 represents that there is an edge from the source node to the destination node. Incidence Matrix Bi-dimensional representation of graph. Rows represent Vertices. Columns represent edges An entry of 1 represents that the Source Vertex is incident to the Edge. Compressed Adjacency Matrix Differential encoding between two consecutive nodes [Sakr and Pardede 2012]
  • 16. Incidence List 16 Properties:  Storage: O(|V|+|E|+|L|)  Adjacent(G,x,y): O(|E|)  Neighbors(G,x): O(|E|)  AdjacentEdges(G,x,y): O(|E|)  Add(G,x,y,l): O(|E|)  Delete(G,x,y,l): O(|E|) (source,L2) (source,L3) (source,L1 ) (destination,L3) (V4,V1) (V2,V1) (V2,V3) (destination,L 2) (destination,L 1) V1 V2 V3 V4 L1 L2 L3 L2 L3 L1 V1 V2 V3 V4
  • 17. Implementation of Graphs 17 Adjacency List For each node a list of neighbors. If the graph is directed, adjacency list of i contains only the outgoing nodes of i. Cheaper for obtaining the neighbors of a node. Not suitable for checking if there is an edge between two nodes. Incidence List Vertices and edges are stored as records of objects. Each vertex stores incident edges. Each edge stores incident nodes. Compressed Adjacency List Differential encoding between two consecutive nodes Adjacency Matrix Bidimensional graph representation. Rows represent source vertices. Columns represent destination vertices. Each non-null entry represents that there is an edge from the source node to the destination node. Incidence Matrix Bidimensiona l graph representatio n. Rows represent Vertices. Columns represent edges A non-null entry represents that the source vertex is incident to the Edge. [Sakr and Pardede 2012]
  • 18. Compressed Adjacency List Gap Encoding 18 Node Neighbors 1 20,23,24,25,27 2 30,32,33,34 3 40,42,45,46,47,48 ….. Node Successors 1 38,2,0,0,1 2 56,1,0,0 3 74,1,2,0,0,0 ….. v(x) = 2x if x ³ 0 2 x -1 if x < 0 ì í ï îï The ordered adjacent list: A(x)=(a1,a2,a3,…,an) is encoded as (v(a1-x),a2-a1-1,a3-a2-1,…, an-an-1 -1) Where: Properties:  Storage: O(|V|+|E|+|L|)  Adjacent(G,x,y): O(|V|)  Neighbors(G,x): O(|V|)  AdjacentEdges(G,x,y): O(|V|+|E|)  Add(G,x,y,l): O(|V|+|E|)  Delete(G,x,y,l): O(|V|+|E|) Original AdjacencyList Compressed AdjacencyList Nodes
  • 19. Adjacency List For each node a list of neighbors. If the graph is directed, adjacency list of i contains only the outgoing nodes of i. Cheaper for obtaining the neighbors of a node. Not suitable for checking if there is an edge between two nodes. Incidence List Vertices and edges are stored as records of objects. Each vertex stores incident edges. Each edge stores incident nodes. Compressed Adjacency List Differential encoding between two consecutive nodes Adjacency Matrix Bidimensional graph representation. Rows represent source vertices. Columns represent destination vertices. Each non-null entry represents that there is an edge from the source node to the destination node. Incidence Matrix Bidimensiona l graph representatio n. Rows represent Vertices. Columns represent edges A non-null entry represents that the source vertex is incident to the Edge. Implementation of Graphs 19 [Sakr and Pardede 2012]
  • 20. 20 {L2} {L3} {L1} V1 V2 V3 V4 V1 V2 V3 V4 Adjacency Matrix L2 L3 L1 V1 V2 V3 V4 Properties:  Storage: O(|V|2)  Adjacent(G,x,y): O(1)  Neighbors(G,x): O(|V|)  AdjacentEdges(G,x,y): O(|E|)  Add(G,x,y,l): O(|E|)  Delete(G,x,y,l): O(|E|)
  • 21. Implementation of Graphs 21 Adjacency List For each node a list of neighbors. If the graph is directed, adjacency list of i contains only the outgoing nodes of i. Cheaper for obtaining the neighbors of a node. Not suitable for checking if there is an edge between two nodes. Incidence List Vertices and edges are stored as records of objects. Each vertex stores incident edges. Each edge stores incident nodes. Compressed Adjacency List Differential encoding between two consecutive nodes Adjacency Matrix Bidimensional graph representation. Rows represent source vertices. Columns represent destination vertices. Each non-null entry represents that there is an edge from the source node to the destination node. Incidence Matrix Bidimensiona l graph representatio n. Rows represent Vertices. Columns represent edges A non-null entry represents that the source vertex is incident to the Edge. [Sakr and Pardede 2012]
  • 22. 22 destination destination source source destination source L1 L2 L3 V1 V2 V3 V4 Incidence Matrix L2 L3 L1 V1 V2 V3 V4 Properties:  Storage: O(|V|x|E|)  Adjacent(G,x,y): O(|E|)  Neighbors(G,x): O(|V|x|E|)  AdjacentEdges(G,x,y): O(|E|)  Add(G,x,y,l): O(|V|)  Delete(G,x,y,l): O(|V|)
  • 23. Traversal Search Breadth First Search Expands shallowest unexpanded nodes first. Unexpanded nodes are stored in queue. Depth First Search Expands deepest unexpanded nodes first. Unexpanded nodes are stored in a stack. 23
  • 24. 24 7 8 2 1 3 4 5 6 0 2 1 3 4 5 6 0 1 Starting Node First Level Visited Nodes Second Level Visited Nodes Third Level Visited Nodes Breadth First Search Notation:
  • 25. 25 7 8 2 1 3 4 5 6 0 2 1 3 4 5 6 0 Depth First Search 1 Starting Node First Level Visited Nodes Second Level Visited Nodes Third Level Visited Nodes Notation:
  • 26. THE GRAPH DATA MANAGEMENT PARADIGM 26 2
  • 27. The Graph Data Management Paradigm Graph database models:  Data models for schema and instances are graph models or generalizations of them.  Data manipulation is expressed as graph- based operations. 27
  • 28. Survey of Graph Database Models 28 A graph data model represents data and schema:  Graphs or  More general, the notion of graphs: Hypergraphs, Digraphs. [Renzo and Gutiérrez 2008]
  • 29. Survey of Graph Database Models 29 [Renzo and Gutiérrez 2008] Types of relationship supported by graph data models: • Properties, • Mono- or multi- valued. • Groups of real- world objects. • Structures to represent neighborhoods of an entity. Attributes Entities Neighborhood relations • Part-of, composed-by, n-ary associations. • Subclasses and superclasses, • Relations of instantiations. • Recursively specified relations. Standard abstractions Derivation and inheritance Nested relations
  • 30. Representative Approaches 30 No Index Only works for “toy” graphs. Vertices are on the scale of 1K. Node/Edge Index Create index in different nodes/edges Hexastore, RDF3x, BitMat,Neoj4, Frequent Subgraph Index Indices on frequently queried subgraphs Reachability and Neighborhood Indices 2-hop labeling schemas for index structures. Every two nodes within distance L.
  • 34. Graph Databases Advantages  Natural modeling of highly connected data.  Special graph storage structure  Efficient schemaless graph algorithms.  Support for query languages.  Operators to query the graph structure. 34
  • 36. Existing General Graph Database Engines 36 Sparksee Java library for management of persistent and temporary graphs. Implementation relies on bitmaps and secondary structures (B+- tree) HyperGraphDB Implements the hyper graph data model. Neo4j Network oriented model where relations are first-class objects. Native disk- based storage manager for graphs. Framework for graph traversal. Graphium An Abstract Graph based database engine. Built on top of Neo4j, Sparksee, and HyperGraphDB Framework for RDF-based graphs and Data Mining tasks. Most graph databases implement an API instead of a query language.
  • 37. Sparksee(1)  Labeled attribute multi-graph.  All node and edge objects belong to a type.  Edges have a direction:  Tail: corresponds to the source of the edge.  Head: corresponds to the destination of the edge.  Node and edge objects can have one or more attributes.  There are no restrictions on the number of edges between two nodes.  Loops are allowed.  Attributes are single valued. 37 [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • 38. Sparksee (2)  Attributes can have different indexes:  Basic attributes.  Indexed attributes.  Unique attributes.  Edges can index neighborhoods.  Neighborhood index.  The persistent database of Sparksee is a single file.  Sparksee can manage very large graphs. 38 [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • 39. Sparksee Architecture 39 [Martínez-Basan et al. 2007] http://www.sparsity-technologies.com/dex_tutorials
  • 40. Sparksee Internal Representation (1) 40 [Martínez-Basan et al. 2007] Sparksee Approach: Map + Bitmaps  Links Link: bidirectional association between values and OIDs (keys)  N is a collection of nodes and E is a collection of directed edges.  The same value can be associated to multiple OIDs (keys)  Given a value  a set of OIDs (keys)  Given an OID  the value http://www.sparsity-technologies.com/
  • 41. Sparksee Internal Representation (2) 41 [Martínez-Basan et al. 2007]  A Sparksee Graph G=<n,e,o,t,h>  n= M<string,T> attributes and properties of nodes.  e= M<string,T> attributes and properties of edges.  o= L<tid,nid> list of objects, nodes or edges, for each type.  t= L<nid,eid> list of edges where each node is the tail.  h= L<nid,eid> list of edges where each node is the head. http://www.sparsity-technologies.com/
  • 42. Sparksee Internal Representation (3) 42 [Martínez-Basan et al. 2007]  A Sparksee Graph is a combination of Bitmaps:  Bitmap for each node or edge type.  One link for each attribute.  Two links for each type: Out- going and in-going edges.  Maps are B+trees  A compressed UTF-8 storage for UNICODE string. http://www.sparsity-technologies.com/
  • 43. Sparksee API 43 Operations of the Abstract Data Type Graph:  Graph construction.  Definition of node and edge types.  Creation of node and edge objects.  Definition and use of attributes.  Query nodes and edges.  Management of edges and nodes. http://www.sparsity-technologies.com/s
  • 44. Sparksee API (2) Graph construction 44 SparkseeConfig cfg = new SparkseeConfig(); cfg.setCacheMaxSize(2048); // 2 GB Sparksee sparksee= new Sparksee(cfg); Database db = sparksee.create(”Publications.dex", ”PUBLICATIONSDex"); Session sess = db.newSession(); Graph graph = sess.getGraph(); …… sess.close(); db.close(); sparksee.close(); Operations on the graph database will be performed on the object graph
  • 45. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 45 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Example
  • 46. Sparksee API (3) Adding nodes to the graph 46 int authorTypeId = graph.newNodeType(”AUTHOR"); int paperTypeId = graph.newNodeType(”PAPER"); int conferenceTypeId = graph.newNodeType(”CONFERENCE"); long author1 = graph.newNode(authorTypeId); long author2 = graph.newNode(authorTypeId); long author3 = graph.newNode(authorTypeId); long paper1 = graph.newNode(paperTypeId); long paper2 = graph.newNode(paperTypeId); long paper3 = graph.newNode(paperTypeId); long paper4 = graph.newNode(paperTypeId); long paper5 = graph.newNode(paperTypeId); long conference1 = graph.newNode(conferenceTypeId); long conference2 = graph.newNode(conferenceTypeId); http://sparsity-technologies.com/downloads/javadoc-java/com/sparsity/sparksee/gdb/Graph.html
  • 47. Sparksee API (4) Adding edges to the graph 47 int wroteTypeId = graph.newEdgeType(”WROTE", true,true); int isWrittenTypeId = graph.newEdgeType(”Is-Written", true,true); int citeTypeId = graph.newEdgeType(”CITE", true,true); int isCitedByTypeId = graph.newEdgeType(”IsCitedBy", true,true); int conferenceTypeId = graph.newEdgeType(”IsConference", true,true); long wrote1 = graph.newEdge(wroteTypeId, author1, paper1); long wrote2 = graph.newEdge(wroteTypeId, author1, paper3); long wrote3 = graph.newEdge(wroteTypeId, author1, paper5); long wrote4 = graph.newEdge(wroteTypeId, author2, paper3); long wrote5 = graph.newEdge(wroteTypeId, author2, paper4); long wrote4 = graph.newEdge(wroteTypeId, author3, paper2); long wrote5 = graph.newEdge(wroteTypeId, author3, paper4); newEdgeType(java.lang.String name, boolean directed,boolean neighbors) Creates a new edge If TRUE, this creates a directed edge type, otherwise this creates a undirected edge type.neighbors - [in] If TRUE, this indexes neighbor nodes, otherwise not. Returns:Unique Sparksee type identifier.
  • 48. Sparksee API (5) Indexing  Basic attributes: There is no index associated with the attribute.  Indexed attributes: There is an index automatically maintained by the system associated with the attribute.  Unique attributes: The same as for indexed attributes but with an added integrity restriction: two different objects cannot have the same value, with the exception of the null value.  Sparksee operations: Accessing the graph through an attribute will automatically use the defined index, significantly improving the performance of the operation.  A specific index can also be defined to improve certain navigational operations, for example, Neighborhood.  There is the AttributeKind enum class which includes basic, indexed and unique attributes. 48
  • 49. Sparksee API (6) Adding attributes to nodes of the graph 49 nameAttrId = graph.newAttribute(authorTypeId, "Name", DataType.String, AttributeKind.Indexed); conferenceNameAttrId = graph.newAttribute(conferenceTypeId, ”ConferenceName", DataType.String, AttributeKind.Indexed); Value v = new Value(); graph.setAttribute(author1, nameAttrId, v.setString(Peter Smith")); graph.setAttribute(author2, nameAttrId, v.setString(”Juan Perez")); graph.setAttribute(author2, nameAttrId, v.setString(”John Smith")); graph.setAttribute(conference1, conferenceNameAttrId, v.setString(“ESWC”))); graph.setAttribute(conference2, conferenceNameAttrId, v.setString(“ISWC"));
  • 50. Sparksee API (7) Searching in the graph 50 Value v = new Value(); // retrieve all ’AUTHOR’ node objects Objects authorObjs1 = graph.select(authorTypeId); ...// retrieve Peter Smith from the graph, which is a ”AUTHOR" Objects authorObjs2 = graph.select(nameAttrId, Condition.Equal, v.setString(”Peter Smith")); ...// retrieve all ’author' node objects having ”Smith" in the name. It would retrieve // Peter Smith and John Smith Objects authorObjs3 = graph.select(nameAttrId, Condition.Like, v.setString(”Smith")); ... authorObjs1.close(); authorObjs2.close(); authorObjs3.close();
  • 51. Sparksee API (8) Searching in the graph 51 Graph graph = sess.getGraph(); Value v = new Value(); ... // retrieve all ’author' node objects having a value for the 'name' attribute // satisfying the ’^J[^]*n$' Objects authorObjs = graph.select(nameAttrId, Condition.RegExp, v.setString(”^J[^]*n$")); ... authorObjs.close();
  • 52. Sparksee API (9) Searching in the graph 52 Graph graph = sess.getGraph(); sess.begin(); Value v = new Value(); int authorTypeId = graph.findType(”AUTHOR"); int nameAttrId = graph.findAttribute(authorTypeId, "NAME"); v.setString(”Peter Smith"); Long peterSmith = graph.findObject(nameAttrId, v); sess.commit();
  • 53. Sparksee API (10) Traversing the graph  Navigational direction can be restricted through the edges  The EdgesDirection enum class:  Outgoing: Only out-going edges, starting from the source node, will be valid for the navigation.  Ingoing: Only in-coming edges will be valid for the navigation.  Any: Both out-going and in-going edges will be valid for the navigation. 53
  • 54. Graph graph = sess.getGraph(); ... Objects authorObjs1 = graph.select(authorTypeId); ...// retrieve Peter Smith from the graph, which is a ”AUTHOR" Objects node1 = graph.select(nameAttrId, Condition.Equal, v.setString(”Peter Smith")); Objects node2 = graph.select(nameAttrId, Condition.Like, v.setString(”Paper")); Int wroteTypeId = graph.findType(”WROTE"); Int citeTypeId = graph.findType(”CITE"); ...// retrieve all in-comings WROTE Objects edges = graph.explode(node1, wroteTypeId, EdgesDirection.Ingoing); ...// retrieve all nodes through CITE Objects cites = graph.neighbors(node2, citeTypeId, EdgesDirection.Any); ... edges.close(); cites.close(); Sparksee API (11) Traversing the graph 54 Explode-based methods visit the edges of a given node Neighbor-based methods visit the neighbor nodes of a given node identifier
  • 55. Sparksee API (12) Traversing the graph 55 Graph graph = sess.getGraph(); ... Objects node2 = graph.select(nameAttrId, Condition.Like, v.setString(”Paper")); int citeTypeId = graph.findType(”CITE"); // 1-hop cites Objects cites1 = graph.neighbors(node2, citeTypeId, EdgesDirection.Any); // cites of cites (2-hop) Objects cites2 = graph.neighbors(cites1, citeTypeId, EdgesDirection.Any); ….. edges.close(); cites.close();
  • 56. Sparksee API Summary 56 Operation Description CreateGraph Creates a empty graph DropGraph Removes a graph RegisterType Registers a new node or edge type GetTypes Returns the list of all node or edge types FindType Returns the id of a node or edge type NewNode Creates a new empty node of a given node type AddNode Copies a node AddEdge Adds a new edge of some edge type DropObject Removes a node or an edge Scan Returns all nodes or edges of a given type Select Returns the nodes or edges of a given type which have an attribute that evaluates true to a comparison operator over a value Related Returns all neighbors of a node following edges members of an edge type Neighbors Returns all neighbors
  • 57. Neo4J  Labeled attribute multigraph.  Nodes and edges can have properties.  There are no restrictions on the number of edges between two nodes.  Loops are allowed.  Attributes are single valued.  Different types of indexes: Nodes & relationships.  Different types of traversal strategies.  API for Java, Python. 57 [Robinson et al. 2013] http://neo4j.org/
  • 58. Neo4J Architecture 58 [Robinson et al. 2013] http://neo4j.org/
  • 59. Neo4J Logical View 59 [Robinson et al. 2013] http://neo4j.org/
  • 60. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 60 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Example
  • 61. Neo4J API (1) Creating a graph and adding nodes 61 db = new GraphDatabaseFactory().newEmbeddedDatabase( DB_PATH ); Node author1 = db.createNode(); Node author2 = db.createNode(); Node author3 = db.createNode(); Node paper1 = db.createNode(); Node paper2 = db.createNode(); Node paper3 = db.createNode(); Node paper4 = db.createNode(); Node paper5 = db.createNode(); Node conference1 = db.createNode(); Node conference2= db.createNode();
  • 62. Neo4J API (2) Adding attributes to nodes author1.setProperty(“firstname”,”Peter”); author1.setProperty(“lastname”,”Smith”); author2.setProperty(“firstname”,”Juan”); author2.setProperty(“lastname”,”Perez”); author3.setProperty(“firstname”,”John”); author3.setProperty(“lastname”,”Smith”); conference1.setProperty(“name”,”ESWC”); conference2.setProperty(“name”,”ISWC”); 62
  • 63. Neo4J API (3) Adding relationships 63 RelationshipType relWrote = DynamicRelationshipType.withName(”WROTE”); Relationship rel1=author1.createRelationshipTo(paper1,relWrote); Relationship rel2=author1.createRelationshipTo(paper5,relWrote); Relationship rel3=author1.createRelationshipTo(paper3,relWrote); … RelationshipType relCites = DynamicRelationshipType.withName(”CITES”); Relationship rel8=paper1.createRelationshipTo(paper2,relCites); Relationship rel9=paper3.createRelationshipTo(paper1,relCites); Relationship rel10=paper3.createRelationshipTo(paper4,relCites); Relationship rel11=paper4.createRelationshipTo(paper2,relCites); … RelationshipType relConf = DynamicRelationshipType.withName(”Conference”); Relationship rel12=paper1.createRelationshipTo(conference1,relConf); Relationship rel13=paper2.createRelationshipTo(conference2,relConf); Relationship rel14=paper3.createRelationshipTo(conference1,relConf); Relationship rel15=paper4.createRelationshipTo(conference1,relConf); Relationship rel16=paper5.createRelationshipTo(conference2,relConf);
  • 64. Neo4J API (4) Creating indexes 64 IndexManager index = db.index(); Index<Node> authorIndex = index.forNodes( ”authors" ); //”authors” index name Index<Node> paperIndex = index.forNodes( ”papers" ); Index<Node> conferenceIndex = index.forNodes( ”conferences" ); RelationshipIndex wroteIndex = index.forRelationships( ”write" ); RelationshipIndex citeIndex = index.forRelationships( ”cite" ); RelationshipIndex conferenceRelIndex = index.forRelationships( ”conferenceRel" );
  • 66. import org.neo4j.graphdb.index.IndexHits; import org.neo4j.graphdb.index.Index; import org.neo4j.graphdb.index.IndexManager; import org.neo4j.graphdb.factory.*; import org.neo4j.graphdb.traversal.*; import org.neo4j.tooling.*; import org.neo4j.kernel.*; import org.neo4j.helpers.collection.*; g = new GraphDatabaseFactory(). newEmbeddedDatabaseBuilder(PATH). loadPropertiesFromFile("conf/neo4j.properties"). newGraphDatabase(); globalOP = GlobalGraphOperations.at(g); indexManager = g.index(); index = indexManager.forNodes("id"); Neo4J API (6) Traversing Graphs 66
  • 67. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 67 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Query: Papers written by Peter Smith.
  • 68. Node v; Relationship rel; Iterator<Relationship> it; v = index.get("id","Peter Smith").getSingle(); if (v == null) return; DynamicRelationshipType wrote = DynamicRelationshipType. withName(”wrote"); it = v.getRelationships(wrote,Direction.OUTGOING).iterator(); while (it.hasNext()) { rel = it.next(); System.out.println(rel.getEndNode().getProperty("id")); } g.shutdown(); } Neo4J API (6) Traversing Graphs 68
  • 69. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 69 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Query: Papers written by Peter Smith.
  • 70. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 70 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Query: Papers published at ESWC.
  • 71. import org.neo4j.graphdb.index.IndexHits; import org.neo4j.graphdb.index.Index; import org.neo4j.graphdb.index.IndexManager; import org.neo4j.graphdb.factory.*; import org.neo4j.graphdb.traversal.*; import org.neo4j.tooling.*; import org.neo4j.kernel.*; import org.neo4j.helpers.collection.*; g = new GraphDatabaseFactory(). newEmbeddedDatabaseBuilder(args[0]). loadPropertiesFromFile("conf/neo4j.properties"). newGraphDatabase(); globalOP = GlobalGraphOperations.at(g); indexManager = g.index(); index = indexManager.forNodes("id"); Neo4J API (6) Traversing Graphs 71
  • 72. Node v; Relationship rel; Iterator<Relationship> it; v = index.get("id",”ESWC").getSingle(); if (v == null) return; DynamicRelationshipType conference = DynamicRelationshipType. withName(“Conference"); it = v.getRelationships(conference,Direction.INCOMING).iterator(); while (it.hasNext()) { rel = it.next(); System.out.println(rel.getStartNode().getProperty("id")); } g.shutdown(); } Neo4J API (6) Traversing Graphs 72
  • 73. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 73 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Query: Papers published at ESWC.
  • 74. Neo4J: Cypher Query Language (1)  START:  Specifies one or more starting points in the graph.  Starting points can be defined as index lookups or just by an element ID.  MATCH:  Pattern matching to match based on the starting point(s)  WHERE:  Filtering criteria.  RETURN:  What is projected out from the evaluation of the query. http://www.neo4j.org/learn/cypher 74
  • 75. Neo4J: Cypher Query Language (2) Query: Is Peter Smith connected to John Smith. 75 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') m=node:authors( 'firstname:”John" AND lastname:”Smith"’) MATCH (author)-[R*]->(m) RETURN count(R) MATCH (author:authors {firstname:”Peter" AND lastname:”Smith"})- [R*]->(m:author:authors {firstname:”John" AND lastname:”Smith"}) RETURN count(R)
  • 76. Neo4J: Cypher Query Language (2) Query: Papers written by Peter Smith. 76 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->(papers) RETURN papers MATCH (author:authors{ firstname:”Peter" AND lastname:”Smith”}-[:WROTE]->(papers) RETURN papers
  • 77. Neo4J: Cypher Query Language (3) Query: Papers cited by a paper written by Peter Smith. 77 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->()-[:CITES](papers) RETURN papers
  • 78. Neo4J: Cypher Query Language (3) Query: Papers cited by a paper written by Peter Smith that have at most 20 cites. 78 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->()-[:CITES]->()-[:CITES]->(papers) WITH COUNT(papers) as cites WHERE cites < 21 RETURN papers
  • 79. Neo4J: Cypher Query Language (4) Query: Papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith. 79 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers) RETURN papers
  • 80. Neo4J: Cypher Query Language (5) Query: Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith. 80 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers) RETURN count(papers)
  • 81. Neo4J: Cypher Query Language (6) Query: Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, and have been published in ESWC. 81 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers) WHERE papers.conference! =‘ESWC’ RETURN count(papers) Exclamation mark in papers.conference ensures that papers without the conference attribute are not included in the output.
  • 82. Neo4J: Cypher Query Language (7) Query: Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, have been published in ESWC and have at most 4 co-authors . 82 START author=node:authors( 'firstname:”Peter" AND lastname:”Smith"') MATCH (author)-[:WROTE]->()-[:CITES*1..2](papers)- [:Was-Written]->authorFinal WHERE papers.conference! =‘ESWC’ WITH author, count(authorFinal) as authorFinalCount WHERE authorFinalCount < 4 RETURN author, authorFinalCount Exclamation mark in papers.conference ensures that papers without the conference attribute are not included in the output.
  • 83. Neo4J API-Cypher Query: Peter Smith Neighborhood 83 static Neo4j TestGraph; String Query = "start n=node:node_auto_index(idNode='PeterSmith')" + "match n-[r]->m return m.idNode"; ExecutionResult result = TestGraph.query(Query); boolean res = false; for ( Map<String, Object> row : result ){ for ( Map.Entry<String, Object> column : row.entrySet() ){ print(column.getValue().toString()); res = true; } } if (!res) print(" No result");
  • 84. Operation Description GraphDatabaseFactory Creates a empty graph registerShutdownHook Removes a graph createNode Creates a new empty node of a given node type AddNode Copies a node createRelationshipTo Adds a new edge of some edge type forNodes Index for a Node IndexManager Index Manager forRelationships Index for a Relationship add Add an object to an index TraversalDescription Different types of strategies to traverse a graph Traversal Branch Selector preorderDepthFirst, postorderDepthFirst, preorderBreadthFirst, postorderBreadthFirst Neo4J API Summary 84
  • 85. Graphium (http://graphium.ldc.usb.ve)  Abstract Graph Database Engine built on top of: Neo4j, Sparksee, and HyperGraphDB.  Labeled attribute multigraph.  Nodes and edges can have properties.  There are no restrictions on the number of edges between two nodes.  Loops are allowed.  Attributes are single valued.  Different types of traversal strategies.  Different data mining strategies.  Java Graph-based and RDF-based API. 85 [ Flores et al. 2013, De Abreu et al. 2013]
  • 86. Graphium (http://graphium.ldc.usb.ve) 86 [ Flores et al. 2013, De Abreu et al. 2013] Example: Property Graph Model • Nodes and edges may have properties • Properties: Key-value pairs The Beatles Let it be Revolver Help! created Year: 1970 Duration: 35:16 Year: 1965 Year: 1966 Duration: 35:01 Homepage: thebeatles.com Origin: Liverpool Source: “Scaling Up Linked Data”. EUCLID project.
  • 87. Graphium Architecture 87 GRAPHIUM Neo4j Sparksee HyperGraphDB Graph-based API RDF-based API Data Mining Traversal API Graph Invariants
  • 88. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 88 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Example
  • 89. Graphium Graph-based API (1) Creating a graph 89 GraphDB db; db = new Neo4j(); // or… db = new DEXDB(); // Sparksee was previously known as DEX // or… db = new HyperGraphDB();
  • 90. Graphium Graph-based API (2) Adding nodes db.addNode(”Peter Smith”); db.addNode(”Juan Perez”); db.addNode(”John Smith”); db.addNode(”ESWC”); db.addNode(”ISWC”); db.addNode(”Paper1”); db.addNode(”Paper2”); db.addNode(”Paper3”); db.addNode(”Paper4”); db.addNode(”Paper5”); 90
  • 91. Graphium Graph-based API (3) Adding relationships 91 db.addEdge(new Edge(“WROTE”, “Peter Smith”, “Paper1”)); db.addEdge(new Edge(“WROTE”, “Peter Smith”, “Paper3”)); db.addEdge(new Edge(“WROTE”, “Peter Smith”, “Paper5”)); … db.addEdge(new Edge(“CITES”, “Paper1”, “Paper2”)); db.addEdge(new Edge(“CITES”, “Paper3”, “Paper1”)); db.addEdge(new Edge(“CITES”, “Paper3”, “Paper4”)); db.addEdge(new Edge(“CITES”, “Paper4”, “Paper2”)); … db.addEdge(new Edge(“Conference”, “Paper1”, “ESWC”)); db.addEdge(new Edge(“Conference”, “Paper2”, “ISWC”)); db.addEdge(new Edge(“Conference”, “Paper3”, “ESWC”)); db.addEdge(new Edge(“Conference”, “Paper4”, “ESWC”)); db.addEdge(new Edge(“Conference”, “Paper5”, “ISWC”));
  • 92. Graphium Graph-based API (4) Adjacency: “Papers cited by Paper3” 92 GraphIterator<String> it; it = db.adj(“Paper3”, “CITES”); while (it.hasNext()) { … it.next() … } it.close();
  • 93. Graphium Graph-based API (5) 93 Traversing the graph: “2-Hops from Peter Smith” GraphIterator<String> it; it = db.kHops(“Peter Smith”, 2); while (it.hasNext()) { … it.next() … } it.close();
  • 94. Graph Mining Tasks These tasks evaluate the engine performance while executing the following graph mining:  Graph summarization.  Dense subgraph.  Shortest path. 94
  • 95. 4 5 76 1 2 3 8 9 0 4 5 6 7 1 2 3 Add(8,1) Add(9,3) Add(0.3) Del(4,1) 8 9 0 Corrections Cost = 1(superedge) + 4(corrections)= 5 Graph Summarization [Navlakha et al. 2008] Original Graph Summarized Graph 95
  • 97. A B C D E F G H 1 2 1 1 1 3 3 2 3 1 1 1 2 Given a subset of nodes E’ compute the densest subgraph containing them, i.e., density of E’ is maximized: density(E') = weight(a,b) | E'|(a,b) in E' å [Khuller et al. 2010] Dense Subgraph 97
  • 98. Shortest Path • Given two nodes finds the shortest path between them. • Implemented using Iterative Depth First Search (DFID). • Ten node pairs are taken at random from each graph, and then the Shortest Path algorithm is executed on each pair of nodes 98
  • 101. RDF-Based Graph Database Engines 101 Native storage Hexastore RDF3X BitMat DBMS-backed storage Hexastore* Bhyper Graphium RDF-based
  • 102. In the beginnings … 102 … there were dinosaurs
  • 103. In the beginnings … 103 • Traditionally: – Use Relational DBMS’ to store data – Early Approach: one large SPO table (physical) • Pros: simple schema, fast updates (when no index present), fast single triple pattern queries • Cons: not suitable for multi-join queries, results in multiple self-joins
  • 104. In the beginnings … 104 • Traditionally: – Use Relational DBMS’ to store data – Further Approaches: property tables (multiple flavors) • Pros: fast retrieval of subject / object pairs for given predicates • Cons: not easy to solve queries where predicate is unbound & harder to maintain schema (a table for each known predicate)
  • 105. RDF-tailored management approaches 105 • Same conceptual approach: one large SPO table (but with RDF-specific physical organization): – Triples are indexed in all 3!=6 possible permutations: SPO, PSO, OSP, OPS, PSO, POS – Logical extension of the vertical partitioning proposed in [Abadi et al 2008] (the PSO index = generalization of the SO table) – Each index (i.e. SPO) stores partial triple pattern cardinalities – Resulting in 6 tree-based indexes, each 3 levels deep • First proposed in Hexastore [Weiss et al. 2008]
  • 106. Hexastore / Conceptual Model 106 • 2 main operations: (input = tp: partially bound triple pattern, i.e., <bound_subject> <some predicate> ?o) • cardinality = getCardinality(tp) • term_set = getSet(tp) • Resulting term_set is sorted Si pni...p1i p2i o1ik ... o1i2 o1i1 o2il ... o2i2 o2i1 onij ... oni2 oni1
  • 107. Hexastore / Conceptual Model 107 • Pros: • The optimal index permutation & level is used (i.e. <bound_s> ?p ?o : SPO level 1) • Ability to make use of fast merge-joins • Cons: • Higher upkeep cost: a theoretical upper bound of 5x space increase over original (encoded) data Si pni...p1i p2i o1ik ... o1i2 o1i1 o2il ... o2i2 o2i1 onij ... oni2 oni1
  • 108. 108 Wrote Cite s Wrote Paper1 Paper2 Paper3 Peter Smith Conference ESWC Paper4 Conference Cite s Paper5 ISWC Conference Conference Cite s Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cite s Wrote An Example (Publications): logical representation
  • 109. RDF Subgraph for the Publication Graph 109 conference/eswc/2 012/paper/03 conference/iswc/20 12/paper/05 person/peter-smith ontoware/ ontology#author 2012 ontoware/ ontology#year conference/iswc/ semanticweb/ ontology#isPartOf ISWC semanticweb/ ontology#hasAcronym semanticweb/ontology/C onferenceEvent w3c/ rdf-syntax#type ontoware/ ontology#author conference/eswc/ semanticweb/ ontology#isPartOf 2012 ontoware/ ontology#year w3c/ rdf-syntax#type ontoware/ ontology#biblioReference ESWC semanticweb/ ontology#hasAcronym conference/iswc/20 12/paper/01
  • 110. An Example (Publications): first step, encode strings 110 http://data.semanticweb.org/person/john-smith 20 http://data.semanticweb.org/person/juan-perez 19 18http://data.semanticweb.org/person/peter-smith http://data.semanticweb.org/ns/swc/ontology#hasAcronym 17 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 16 15"2012" 14http://swrc.ontoware.org/ontology#biblioReference 13http://swrc.ontoware.org/ontology#year 12http://swrc.ontoware.org/ontology#author 11http://data.semanticweb.org/ns/swc/ontology#isPartOf http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent 10 http://data.semanticweb.org/conference/eswc/2012 9 http://data.semanticweb.org/conference/iswc/2012 8 7"ISWC2012" "ESWC2012" 6 5http://data.semanticweb.org/conference/iswc/2012/paper/05 http://data.semanticweb.org/conference/eswc/2012/paper/04 4 3http://data.semanticweb.org/conference/eswc/2012/paper/03 http://data.semanticweb.org/conference/iswc/2012/paper/02 2 1http://data.semanticweb.org/conference/eswc/2012/paper/01 HexastoreMappingDictionary
  • 111. An Example (Publications): first step, encode strings 111 http://data.semanticweb.org/person/john-smith 20 http://data.semanticweb.org/person/juan-perez 19 18http://data.semanticweb.org/person/peter-smith http://data.semanticweb.org/ns/swc/ontology#hasAcronym 17 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 16 15"2012" 14http://swrc.ontoware.org/ontology#biblioReference 13http://swrc.ontoware.org/ontology#year 12http://swrc.ontoware.org/ontology#author 11http://data.semanticweb.org/ns/swc/ontology#isPartOf http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent 10 http://data.semanticweb.org/conference/eswc/2012 9 http://data.semanticweb.org/conference/iswc/2012 8 7"ISWC2012" "ESWC2012" 6 5http://data.semanticweb.org/conference/iswc/2012/paper/05 http://data.semanticweb.org/conference/eswc/2012/paper/04 4 3http://data.semanticweb.org/conference/eswc/2012/paper/03 http://data.semanticweb.org/conference/iswc/2012/paper/02 2 1http://data.semanticweb.org/conference/eswc/2012/paper/01 HexastoreMappingDictionary • Next, index original encoded RDF file …
  • 112. Hexastore / Physical implementation. Native storage 112 2 2 4 3 3 4 4 9 8 1 5 2 4 3 S 11 12 1 1 13 1 14 1 9 18 5 2 1 13 1 12 11 1 8 20 15 11 12 1 2 13 1 14 2 9 1918 15 41 11 12 1 2 13 1 14 1 9 2019 15 2 12 1 111 13 1 8 18 15 17 1 116 10 7 17 1 1 16 10 6 P O • Uses vector-storage [Weiss et al 2009] • Each index level: – partially sorted vector (on disk, one file per level) – each record: <rdf term id, cardinality, pointer to set> – first level gets special treatment: • fully sorted vector: search = binary search, O(log(N)) • Hint: using an additional hash can circumvent the binary search
  • 113. Hexastore / Physical implementation. Native storage 113 2 2 4 3 3 4 4 9 8 1 5 2 4 3 S 11 12 1 1 13 1 14 1 9 18 5 2 1 13 1 12 11 1 8 20 15 11 12 1 2 13 1 14 2 9 1918 15 41 11 12 1 2 13 1 14 1 9 2019 15 2 12 1 111 13 1 8 18 15 17 1 116 10 7 17 1 1 16 10 6 P O • Pros: – Optimized for read-intensive / static workloads – Optimal / fast triple pattern matching – No comparisons when scanning since each (partial) cardinality is known ! • Cons: – Not suitable for write intensive workloads (worst case can result in rewrite of entire index!)
  • 114. Hexastore / Physical implementation. DBMS (managed) storage 114 • Builds on top of sorted indexes: – B+Tree, LSM Tree, etc • Each index level: – Same sorted index structure (but not necessarily) – each record: <rdf term id list, cardinality> – Search cost = as provided by the managed index structure 2 2 4 3 3 4 4 9 8 1 5 2 4 3 S 1 1 1 1 11 12 1 1 13 1 14 1 1 11 9 - 1 -1812 1 13 5 - 1 14 2 - 2 2 2 1 13 1 12 11 1 2 11 8 - 2 12 20 - 2 13 15 - 3 3 3 3 11 12 1 2 13 1 14 2 3 11 9 - 12 -3 18 3 13 15 - 3 -14 1 4 4 4 4 11 12 1 2 13 1 14 1 4 11 9 - 4 -12 20 4 13 15 - 4 14 2 - 5 5 5 12 1 111 13 1 5 11 8 - 5 12 18 - 5 13 15 - 8 8 17 1 116 8 16 10 - 8 17 7 - 9 9 17 1 1 16 9 16 10 - 9 17 6 - SP SPO -3 1912 43 -14 12 19 -4
  • 115. Hexastore / Physical implementation. DBMS (managed) storage 115 • Pros: – Flexible design: each index can be configured for a different physical data-structure: • i.e. PSO=B+Tree, OSP=LSM Tree – Separation of concerns (the underlying DBMS manages and optimizes for access patterns, cache, etc…) • Cons: – Less control over fine-grained data storage (up to the level permitted by the DBMS API) – Usually less potential for native optimization 2 2 4 3 3 4 4 9 8 1 5 2 4 3 S 1 1 1 1 11 12 1 1 13 1 14 1 1 11 9 - 1 -1812 1 13 5 - 1 14 2 - 2 2 2 1 13 1 12 11 1 2 11 8 - 2 12 20 - 2 13 15 - 3 3 3 3 11 12 1 2 13 1 14 2 3 11 9 - 12 -3 18 3 13 15 - 3 -14 1 4 4 4 4 11 12 1 2 13 1 14 1 4 11 9 - 4 -12 20 4 13 15 - 4 14 2 - 5 5 5 12 1 111 13 1 5 11 8 - 5 12 18 - 5 13 15 - 8 8 17 1 116 8 16 10 - 8 17 7 - 9 9 17 1 1 16 9 16 10 - 9 17 6 - SP SPO -3 1912 43 -14 12 19 -4
  • 116. RDF3x (1)  RISC style operators.  Implements the traditional Optimize-then-Execute paradigm, to identify left-linear plans of star-shaped sub- queries of basic triple patterns.  Makes use of secondary memory structures to locally store RDF data and reduce the overhead of I/O operations.  Registers aggregated count or number of occurrences, and is able to detect if a query will return an empty answer. 116 [Neumann and Weikum 2008]
  • 117. RDF3x (2)  Triple table, each row represents a triple.  Mapping dictionary, replacing all the literal strings by an id; two dictionary indices.  Compressed clustered B+-tree to index all triples:  Six permutations of subject, predicate and object; each permutation in a different index.  Triples are sorted lexicographically in each B+-tree. 117 [Neumann and Weikum 2008]
  • 119. 119 RDF3x (4) [Figure from Neumann et al 2010]  Compressed B+-trees.  One B+-tree per pattern.  Structure is updatable.
  • 120. 120 RDF3x (5) [Figure from Lei Zou http://hotdb.info/slides/RDF-talk.pdf]
  • 122. Wrote Cites Wrote Paper1 Paper2 Paper3 Peter Smith 122 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Wrote John Smith Wrote Wrote Wrote Cites Wrote Example
  • 123. RDF Subgraph for the Publication Graph 123 conference/eswc/2 012/paper/03 conference/iswc/20 12/paper/05 person/peter-smith ontoware/ ontology#author 2012 ontoware/ ontology#year conference/iswc/ semanticweb/ ontology#isPartOf ISWC semanticweb/ ontology#hasAcronym semanticweb/ontology/C onferenceEvent w3c/ rdf-syntax#type ontoware/ ontology#author conference/eswc/ semanticweb/ ontology#isPartOf 2012 ontoware/ ontology#year w3c/ rdf-syntax#type ontoware/ ontology#biblioReference ESWC semanticweb/ ontology#hasAcronym conference/iswc/20 12/paper/01
  • 124. RDF3x: Representation of the Publication Graph 124 OID Resource 0 http://www.w3.org/1999/02/22-rdf-syntax-ns#type 1 http://data.semanticweb.org/ns/swc/ontology#hasAcronym 2 http://data.semanticweb.org/ns/swc/ontology#isPartOf 3 http://swrc.ontoware.org/ontology#author 4 http://swrc.ontoware.org/ontology#year 5 http://swrc.ontoware.org/ontology#biblioReference 6 http://data.semanticweb.org/conference/iswc/2012 7 http://data.semanticweb.org/ns/swc/ontology#ConferenceEvent 8 ISWC2012 9 http://data.semanticweb.org/conference/eswc/2012 10 ESWC2012 11 http://data.semanticweb.org/conference/iswc/2012/paper/05 12 http://data.semanticweb.org/conference/iswc/2012/paper/02 13 http://data.semanticweb.org/conference/eswc/2012/paper/01 14 http://data.semanticweb.org/conference/eswc/2012/paper/03 15 http://data.semanticweb.org/conference/eswc/2012/paper/04 16 http://data.semanticweb.org/author/peter-smith 17 http://data.semanticweb.org/author/juan-perez 18 http://data.semanticweb.org/author/john-smith 19 2012 S P O 6 0 7 9 0 7 6 0 7 9 1 8 11 1 10 12 2 6 13 2 6 14 2 9 15 2 9 11 3 16 12 3 18 13 3 16 14 3 16 14 3 17 15 3 17 15 3 18 11 4 19 12 4 19 13 4 19 14 4 19 15 4 19 13 5 12 14 5 13 14 5 15 15 5 12 RDF3xMappingDictionary TripleTable
  • 125. Graphium (http://graphium.ldc.usb.ve)  Abstract Graph Database Engine built on top of: Neo4j, Sparksee, and HyperGraphDB.  Labeled attribute multigraph.  Nodes and edges can have properties.  There are no restrictions on the number of edges between two nodes.  Loops are allowed.  Attributes are single valued.  Different types of traversal strategies.  Different data mining strategies.  Java Graph-based and RDF-based API. 125 [ Flores et al. 2013, De Abreu et al. 2013]
  • 126. Graphium (http://graphium.ldc.usb.ve) 126 [ Flores et al. 2013, De Abreu et al. 2013] Example: Property Graph Model • Nodes and edges may have properties • Properties: Key-value pairs The Beatles Let it be Revolver Help! created Year: 1970 Duration: 35:16 Year: 1965 Year: 1966 Duration: 35:01 Homepage: thebeatles.com Origin: Liverpool Source: “Scaling Up Linked Data”. EUCLID project.
  • 127. Graphium (http://graphium.ldc.usb.ve) 127 [ Flores et al. 2013, De Abreu et al. 2013] Example: RDF Model • Properties are represented with nodes and edges The Beatles Let it be Revolver Help! created 1970 35:16 duration 1965 year 1966 35:01 duration Liverpool thebeatles.com
  • 128. Graphium RDF-based API (1) Creating a graph Just use the command-line tool to load your RDF graph from an NT file, and create the database using Neo4j or Sparksee. ./create <GDBM (Sparksee or Neo4j)> <NT file> <DB location> For example… > ./create Neo4j publications.nt pubDB-neo4j/ > ./create Sparksee publications.nt pubDB-sparksee/ 128
  • 129. Graphium RDF-based API (2) Loading the graph in Java GraphRDF g; g = GraphiumLoader.open( PATH ); … g.close(); 129
  • 130. Graphium RDF-based API (3) Searching for nodes… If the node you are searching for is an URI, use: If it is a Blank Node, use: v = g.getVertexURI(“http://data.semanticweb.org/author/peter-smith”); 130 v = g.getVertexBlankNode(“blanknodeid”); Vertex v;
  • 131. Graphium RDF-based API (4) From each vertex you can… (1) Get the RDF object it represents: 131 // If you don’t know what object it represents, // use the superclass RDFobject… RDFobject v_any = v.getAny(); // If you know, use any of the subclasses… URI v_uri = v.getURI(); BlankNode v_bn = v.getBlankNode(); Literal v_lit = v.getLiteral();
  • 132. Graphium RDF-based API (5) From each vertex you can… (2) Get its in/out relationships: 132 GraphIterator<Edge> it; // For out-going relationships… it = v.getEdgesOut(); // For in-coming relationships… it = v.getEdgesIn(); while (it.hasNext()) { Edge rel = it.next(); … } it.close();
  • 133. Graphium RDF-based API (6) 133 From each edge you can… (1) Get its predicate/URI: (2) Get the start/end of the edge (the subject and object of each triple): URI predicate = rel.getURI(); Vertex start_v, end_v; start_v = rel.getStart(); end_v = rel.getEnd();
  • 134. Graphium RDF-based API (7) 134 You can also traverse the graph… Vertex v; … GraphIterator<Vertex> it = kHop.run(2,v); while (it.hasNext()) { … it.next() … } it.close();
  • 135. Graphium RDF-based API (8) 135 You can also traverse the graph using… BFS… DFS… k-Hops… Or even k-Hops from a set of nodes… GraphIterator<Vertex> it = BFS.run(v); GraphIterator<Vertex> it = DFS.run(v); GraphIterator<Vertex> it = kHop.run(k,v); GraphIterator<Vertex> it = kHop.run(k,v1,v2,v3,…);
  • 138. Hands-on Goals • Implement simple graph-based queries by using Neo4j and Sparksee API. • Implement simple RDF-based queries by using Graphium API. • Implement graph invariants by using Graphium API. Go to https://www.github.com/AlejandroFloresV/Tutorial-GraphDB http://www.sparsity-technologies.com/downloads/javadoc- sparkseejava.pdf 138
  • 139. Author Cites Author Paper1 Paper2 Paper3 Peter Smith 139 Conference ESWC Paper4 Conference Cites Paper5 ISWC Conference Conference Cites Conference Juan Perez Author John Smith Author Author Author Cites Author Example
  • 140. Assignments 140 > make Neo4jCreate > make Neo4jQuery > ./publications Neo4jCreate neo4j_db > ./publications Neo4jQuery neo4j_db 1) Implement the query “Papers written by Peter Smith” using the Sparksee API and Neo4j API. Complete the code for Neo4j, located in the files experiment/Neo4jCreate.java and experiment/Neo4jQuery.java. Then.. > make SparkseeCreate > make SparkseeQuery > ./publications SparkseeCreate sparksee_db > ./publications SparkseeQuery sparksee_db Now complete the code for Sparksee, located in the files experiment/SparkseeCreate.java and experiment/SparkseeQuery.java. Then..
  • 141. RDF Subgraph for the Publication Graph 141 conference/eswc/2 012/paper/03 conference/iswc/20 12/paper/05 person/peter-smith ontoware/ ontology#author 2012 ontoware/ ontology#year conference/iswc/ semanticweb/ ontology#isPartOf ISWC semanticweb/ ontology#hasAcronym semanticweb/ontology/C onferenceEvent w3c/ rdf-syntax#type ontoware/ ontology#author conference/eswc/ semanticweb/ ontology#isPartOf 2012 ontoware/ ontology#year w3c/ rdf-syntax#type ontoware/ ontology#biblioReference ESWC semanticweb/ ontology#hasAcronym conference/iswc/20 12/paper/01
  • 142. Assignments 2) Implement the following queries using the Graphium API3: a) “Papers written by Peter Smith”. b) “Papers cited by a paper written by Peter Smith that have at most 2 cites.”. c) “Papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith.” d) “Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith. “ e) “Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, and have been published in ESWC.” f) “Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, have been published in ESWC and have at most 4 co-authors “. 3 http://graphium.ldc.usb.ve 142
  • 143. Assignments 143 > ./create <Neo4j or Sparksee> publications.nt graphium_db 2) Implementing queries using the Graphium API. First, let's create the Graphium DB from the NT file publications.nt:
  • 144. Assignments 144 > make A > ./publications A graphium_db 2a) “Papers written by Peter Smith”. Complete the code in experiment/A.java, and run (in the terminal)... 2b) “Papers cited by a paper written by Peter Smith that have at most 2 cites”. Complete the code in experiment/B.java, and run (in the terminal)... > Make B > ./publications B graphium_db
  • 145. Assignments 145 > Make C > ./publications C graphium_db 2c) “Papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith”. Complete the code in experiment/C.java, and run (in the terminal)... 2d) “Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith”. Complete the code in experiment/D.java, and run (in the terminal)... > Make D > ./publications D graphium_db
  • 146. Assignments 146 > Make E > ./publications E graphium_db 2e) “Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, and have been published in ESWC”. Complete the code in experiment/E.java, and run (in the terminal)... 2 f) “Number of papers cited by a paper written by Peter Smith or cited by papers cited by a paper written by Peter Smith, have been published in ESWC and have at most 4 co-authors”. Complete the code in experiment/F.java, and run (in the terminal)... > Make F > ./publications F graphium_db
  • 147. Assignments 3) Implement graph invariants using the Graphium API3 147 Invariant Description Vertex and Edge Count number of vertices and edges in the graph. Graph Density number of edges in the graph divided by the number of possible edges in a complete digraph. Reciprocity Reciprocity measures the extend to which a triple that relates resources A and B is reciprocated by a another triple that relates B with A too. In- and Out-degree Distribution Distribution of the number of in-coming and out- going edges of the vertices of a graph. In-coming and Out-going h-index h is the maximum number, such that h vertices have each at least h in-coming neighbors (resp., out-going neighbors) in the graph. 3 http://graphium.ldc.usb.ve
  • 148. Assignments 148 > make Nodes > ./publications Nodes graphium_db 3) Implement graph invariants using the Graphium API. 3a) Number of nodes/vertices in the graph Complete the code in experiment/Nodes.java, and run (in the terminal)... > make Density > ./publications Density graphium_db 3b) Graph Density Complete the code in experiment/Density.java, and run (in the terminal)...
  • 149. Assignments 4) Compute graph invariant of different RDF graphs, using the Chrysalis tool, and upload the results in the Graphium Chrysalis4 Website. 149 4 http://graphium.ldc.usb.ve/chrysalis/ 5 https://sites.google.com/site/dawfederation/data-sets Go to the DAW5 website and download one of the datasets (.nt files). Then use the terminal to run: > ./create <Neo4j or Sparksee> <.nt file path> test_db > ./chrysalis test_db Now, go to the Graphium Chrysalis website and upload the file chrysalis.log to visualize it.
  • 151. Experimental Set-Up Machine: Model: Sun Fire x4100 Quantity of processors: 2 Processor specification: Dual-Core AMD Opteron™ Processor 2218 64 bits RAM memory: 16GB Disk capacity: 2 disks, 135GB e/o Network interface: 2 + ALOM Operating System: CentOS 5.5. 64 bits Engines: RDF3x version: 0.3.7 Sparksee version: 4.8 Very Large Databases HyperGraphDB version: 1.2 Neo4J version: 1.9 Timeout: 3,600 secs. Implementation: We provide both internal implementation for traversal methods using the API’s of each engine, and the external implementation using the generic algorithm through the core-graph API methods. Runs: Up to 10 times for each test. Average is reported. All experiments were run in cold cache. The graphs were loaded in sif format for the general purpose graph engines and in nt format for RDF-3x. Only RDF-3x could load graphs from Berlin Benchmark of over 50 million edges. Download the code and datasets at https://github.com/gpalma/gdbb/ 151COLD 2013
  • 152. 152 Graph Name #Nodes #Edges Description DSJC1000.1 1,000 49,629 Graph Density:0.1 [Johnson91] DSJC1000.5 1,000 249,826 Graph Density:0.5 [Johnson91] DSJC1000.9 1,000 449,449 Graph Density:0.9 [Johnson91] SSCA2-17 131,072 3,907,909 GTgraph 1) [Johnson91] Johnson, D., Aragon, C., McGeoch, L., and Schevon, C. Optimization by simulated annealing: an experimental evaluation; part ii, graph coloring and number partitioning. Operations research 39, 3 (1991), 378–406. GTgraph: A suite of three synthetic graph generators: 1) SSCA2: generates graphs used as input instances for the DARPA. High Productivity Computing Systems SSCA#2 graph theory benchmark http://www.cse.psu.edu/~madduri/software/GTgraph/index.html Benchmark Graphs
  • 153. 153 Task Description adjacentXP Given a node X and an edge label P, find adjacent nodes Y. adjacentX Given a node X, find adjacent nodes Y. edgeBetween Given two nodes X and Y, find the labels of the edges between X and Y. 2-hopX Given a node X, find the 2-hop Y nodes. 3-hopX Given a node X, find the 3-hop Y nodes. 4-hopX Given a node X, find the 4-hop Y nodes. Benchmark Tasks The benchmark tasks evaluate the engine performance while executing the following graph operations:
  • 154. 154 Task SPARQL Query adjacentXP Select ?y where { <http://graphdatabase.ldc.usb.ve/resource/6> <http://graphdatabase.ldc.usb.ve/resource/pr> ?y} adjacentX Select ?y where { <http://graphdatabase.ldc.usb.ve/resource/6> ?p ?y} edgeBetween Select ?p where { <http://graphdatabase.ldc.usb.ve/resource/6> ?p <http://graphdatabase.ldc.usb.ve/resource/8> 2-hopX Select ?z where { <http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1. ?y1 ?p2 ?z} 3-hopX Select ?z where { <http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1. ?y1 ?p2 ?y2. ?y2 ?p3 ?z} 4-hopX Select ?z where { <http://graphdatabase.ldc.usb.ve/resource/6> ?p1 ?y1. ?y1 ?p2 ?y2. ?y2 ?p3 ?y3. ?y3 ?p4 ?z} Benchmark Tasks: RDF Engines The benchmark tasks were implemented in the RDF engines with the following SPARQL queries:
  • 155. 155 Benchmark of Graph Graph Name #Nodes #Edges Density #Labels DSJC1000.1 [Johnson91] 1,000 99,258 0.099 1 DSJC1000.5 [Johnson91] 1,000 499,652 0.50 1 DSJC1000.9 [Johnson91] 1,000 898,898 0.899 1 USA-road- d.NY 264,346 730,100 0.00001045 7,970 USA-road- d.FLA 1,070,376 2,687,902 0.00000235 22,704 Berlin10M 2,743,235 9,709,119 0.00000129 40 ✔ ✔ ✔ ✔ ✔ ✔ ✔ Dense GraphsSparse GraphsFew LabelsMany LabelsSmall GraphsMedium-Size Graphs [Johnson91] Johnson, D., Aragon, C., McGeoch, L., and Schevon, C. Optimization by simulated annealing: an experimental evaluation; part ii, graph coloring and number partitioning. Operations research 39, 3 (1991), 378–406. USA-road-d* Graphs 9th DIMACS Implementation Challenge - Shortest Paths http://www.dis.uniroma1.it/challenge9/download.shtml Berlin10M: Berlin Bechmark-http://wifo5-03.informatik.uni-mannheim.de/bizer/berlinsparqlbenchmark/ COLD 2013
  • 156. Graph Creation-Time and Memory HyperGraphDB timed out after 24 hours (Berlin10M). RDF-3x is faster than the rest of the engines. Secondary Memory grows as the size of the graph. In general, RDF-3x requires less memory than the rest of the engines. 156COLD 2013
  • 157. Adjacency Tests RFD-3x is impacted by the graph density; it timed out for 4-hops in DSJC.5 and DSJC.9.  RDF-3x exploits internal structures in sparse graphs. Sparksee outperforms all the engines in dense graph with few label; internal and external implementations are competitive. Number of label negatively impacts RDF-3x exploits internal structures and overcomes the rest of the graph engines. Neo4j and Sparksee are competitive. HyperGraphDB always consumes at least one order of magnitude more time than the rest of the engines. 157COLD 2013
  • 158. Traversals-BFS and DFS (Internal and External) Internal representations outperform external representations. Neo4j exploits internal representation of graph, neighborhood indices, and main memory-structures and overcomes the rest of the engines. 158COLD 2013
  • 159. Mining Tasks Graph Summarization seems to be more affected by graph density, size of the graph and number of labels. Sparksee overcomes the rest of engines when the graphs are dense. Neo4j overcomes the rest of the engines in sparse graphs. Sparksee overcomes the rest of engines when the graphs are dense. Neo4j overcomes the rest of the engines in sparse graphs. 159COLD 2013
  • 161. 161
  • 163. Graph Data Storing Features 163 Graph Database Main Memory Persistent Memory Backend Storage/Implem entation Indexi ng Sparksee X X X HyperGraphDB1 X X X X Neo4j X X X Hexastore2 X X X X RDF3x X X Graphium3 Graph-based X X X X Graphium3 RDF-based X X X X 2Backend Storage: Tokyo Cabinet 3Backend Implementation: Neo4j and Sparksee 1Backend Storage: Berkeley DB
  • 164. Graph Data Management Features 164 Graph Database Graph-based API/ RDF-based API Query Language Data Visuali- zation Data Definition Data Manipulation Standard Compliance Sparksee X X X X HyperGraphDB X X X Neo4j X X X X X Hexastore X X X X RDF3x X Graphium Graph-based X X X Graphium RDF-based X X X
  • 165. Adjacency Queries Node/Edge Adjacency K-neighborhood of a node. Reachability Queries Test whenever two nodes are connected by a path. Shortest Path. Pattern Matching Queries Find all sub-graphs of a data graph. Summarization Queries Aggregate the results of a graph-based query 165 Graph-Based Operations
  • 166. 166 Graph Database Node/Edge adjacency K- neighborhood Fixed-length paths RegularSimple Paths ShortestPath Pattern Matching DataMining Tasks Sparksee X X X X X X HyperGraphDB X X X X X Neo4j X X X X X X Hexastore X X RDF3x X X Graphium Graph-based X X X X X X X Graphium RDF-based X X X X Graph-Based Operations
  • 168. Data Storing Features:  Engines implement a wide variety of structures to efficiently represent large graphs.  RDF engines implement special-purpose structures to efficiently store RDF graphs. 168 Conclusions (1)
  • 169. Data Management Features:  General-purpose graph database engines provide APIs to manage and query data.  RDF engines support general data management operations. 169 Conclusions (2)
  • 170. Graph-Based Operations Support:  General-purpose graph database engines provide API methods to implement a wide variety of graph-based operations.  Only few offer query languages to express pattern matching.  RDF engines are not customized for basic graph-based operations, however,  They efficiently implement the pattern matching- based language SPARQL. 170 Conclusions (3)
  • 171.  Extension of existing engines to support specific tasks of graph traversal and mining.  Definition of benchmarks to study the performance and quality of existing graph database engines, e.g.,  Graph traversal, link prediction, pattern discovery, graph mining.  Empirical evaluation of the performance and quality of existing graph database engines 171 Future Directions
  • 172. Acknowledgements Maribel Acosta is supported by XLike, European Community's Seventh Framework Programme FP7-ICT-2011-7 (XLike, Grant 288342). Edna Ruckhaus and Maria-Esther Vidal are partially supported by DID-USB. 172
  • 173. • P. Anderson, A. Thor, J. Benik, L. Raschid, and M.-E. Vidal. Pang: finding patterns in anno- tation graphs. In SIGMOD Conference, pages 677–680, 2012. • R. Angles and C. Gutiérrez. Survey of graph database models. ACM Comput. Surv., 40(1), 2008. • M. Atre, V. Chaoji, M. Zaki, and J. Hendler. Matrix Bit loaded: a scalable lightweight join query processor for RDF data. In International Conference on World Wide Web, Raleigh, USA, 2010. ACM. • D. De Abreu, A. Flores, G. Palma, V. Pestana, J. Piñero, J. Queipo, J. Sánchez, M.E. Vidal. Choosing Between Graph Databases and RDF Engines for Consuming and Mining Linked Data. COLD 2013 in Conjunction with ISWC 2013. • A.Flores, G. Palma, M.E. Vidal, D. De Abreu, V. Pestana, J. Piñero, J. Queipo, J. Sánchez, GRAPHIUM: Visualizing Performance of Graph and RDF Engines on Linked Data. Poster & Demo ISWC 2013. • A. Flores, G. Palma, M.E. Vidal, D. De Abreu, V. Pestana, J. Piñero, J. Queipo, J. Sánchez, Graphium Chrysalis: Exploiting Graph Database Engines to Analyze RDF Graphs. Demo ESWC 2014. 173 References
  • 174. • B.Iordanov.Hypergraphdb:Ageneralizedgraphdatabase.InWAIMWorkshops,pages25– 36, 2010. • N. Martíınez-Bazan, V. Muntés-Mulero, S. G ómez-Villamor, J. Nin, M.-A. Sá́ nchez- Martíınez, and J.-L. Larriba-Pey. Dex: high-performance exploration on large graphs for information retrieval. In CIKM, pages 573–582, 2007. • T. Neumann and G. Weikum. RDF-3X: a RISC-style engine for RDF. Proc. VLDB, 1(1), 2008. • M. K. Nguyen, C. Basca, and A. Bernstein. B+hash tree: optimizing query execution times for on-disk semantic web data structures. In Proceedings Of The 6th International Workshop On Scalable Semantic Web Knowledge Base Systems (SSWS2010), 2010. • I Robinson, J. Webber, and E. Eifrem. Graph Databases. O’Reilly Media, Incorporated, 2013. 174 References
  • 175. • S. Sakr and E. Pardede, editors. Graph Data Management: Techniques and Applications. IGI Global, 2011. • D.Shimpiand S.Chaudhari. An overview of graph databases .In IJCA Proceedings on International Conference on Recent Trends in Information Technology and Computer Science 2012, 2013. • M.-E. Vidal, A. Martínez, E. Ruckhaus, T. Lampo, and J. Sierra. On the efficiency of querying and storing rdf documents. In Graph Data Management, pages 354–385. 2011. • C. Weiss, P. Karras, and A. Bernstein. Hexastore: Sextuple indexing for semantic web data management. Proc. of the 34th Intl Conf. on Very Large Data Bases (VLDB), 2008. • C Weiss, A Bernstein, On-disk storage techniques for Semantic Web data - Are B- Trees always the optimal solution?, Proceedings of the 5th International Workshop on Scalable Semantic Web Knowledge Base Systems, October 2009. 175 References