SlideShare une entreprise Scribd logo
1  sur  230
Télécharger pour lire hors ligne
An Overview of Graph Data Management and Analysis
M. Tamer ¨Ozsu
University of Waterloo
David R. Cheriton School of Computer Science
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 1 / 96
Graph Data are Very Common
Internet
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
Graph Data are Very Common
Social
networks
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
Graph Data are Very Common
Trade volumes
and
connections
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
Graph Data are Very Common
Biological
networks
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
Graph Data are Very Common
As of September 2011
Music
Brainz
(zitgist)
P20
Turismo
de
Zaragoza
yovisto
Yahoo!
Geo
Planet
YAGO
World
Fact-
book
El
Viajero
Tourism
WordNet
(W3C)
WordNet
(VUA)
VIVO UF
VIVO
Indiana
VIVO
Cornell
VIAF
URI
Burner
Sussex
Reading
Lists
Plymouth
Reading
Lists
UniRef
UniProt
UMBEL
UK Post-
codes
legislation
data.gov.uk
Uberblic
UB
Mann-
heim
TWC LOGD
Twarql
transport
data.gov.
uk
Traffic
Scotland
theses.
fr
Thesau-
rus W
totl.net
Tele-
graphis
TCM
Gene
DIT
Taxon
Concept
Open
Library
(Talis)
tags2con
delicious
t4gm
info
Swedish
Open
Cultural
Heritage
Surge
Radio
Sudoc
STW
RAMEAU
SH
statistics
data.gov.
uk
St.
Andrews
Resource
Lists
ECS
South-
ampton
EPrints
SSW
Thesaur
us
Smart
Link
Slideshare
2RDF
semantic
web.org
Semantic
Tweet
Semantic
XBRL
SW
Dog
Food
Source Code
Ecosystem
Linked Data
US SEC
(rdfabout)
Sears
Scotland
Geo-
graphy
Scotland
Pupils &
Exams
Scholaro-
meter
WordNet
(RKB
Explorer)
Wiki
UN/
LOCODE
Ulm
ECS
(RKB
Explorer)
Roma
RISKS
RESEX
RAE2001
Pisa
OS
OAI
NSF
New-
castle
LAAS
KISTI
JISC
IRIT
IEEE
IBM
Eurécom
ERA
ePrints dotAC
DEPLOY
DBLP
(RKB
Explorer)
Crime
Reports
UK
Course-
ware
CORDIS
(RKB
Explorer)
CiteSeer
Budapest
ACM
riese
Revyu
research
data.gov.
ukRen.
Energy
Genera-
tors
reference
data.gov.
uk
Recht-
spraak.
nl
RDF
ohloh
Last.FM
(rdfize)
RDF
Book
Mashup
Rådata
nå!
PSH
Product
Types
Ontology
Product
DB
PBAC
Poké-
pédia
patents
data.go
v.uk
Ox
Points
Ord-
nance
Survey
Openly
Local
Open
Library
Open
Cyc
Open
Corpo-
rates
Open
Calais
OpenEI
Open
Election
Data
Project
Open
Data
Thesau-
rus
Ontos
News
Portal
OGOLOD
Janus
AMP
Ocean
Drilling
Codices
New
York
Times
NVD
ntnusc
NTU
Resource
Lists
Norwe-
gian
MeSH
NDL
subjects
ndlna
my
Experi-
ment
Italian
Museums
medu-
cator
MARC
Codes
List
Man-
chester
Reading
Lists
Lotico
Weather
Stations
London
Gazette
LOIUS
Linked
Open
Colors
lobid
Resources
lobid
Organi-
sations
LEM
Linked
MDB
LinkedL
CCN
Linked
GeoData
LinkedCT
Linked
User
Feedback
LOV
Linked
Open
Numbers
LODE
Eurostat
(Ontology
Central)
Linked
EDGAR
(Ontology
Central)
Linked
Crunch-
base
lingvoj
Lichfield
Spen-
ding
LIBRIS
Lexvo
LCSH
DBLP
(L3S)
Linked
Sensor Data
(Kno.e.sis)
Klapp-
stuhl-
club
Good-
win
Family
National
Radio-
activity
JP
Jamendo
(DBtune)
Italian
public
schools
ISTAT
Immi-
gration
iServe
IdRef
Sudoc
NSZL
Catalog
Hellenic
PD
Hellenic
FBD
Piedmont
Accomo-
dations
GovTrack
GovWILD
Google
Art
wrapper
gnoss
GESIS
GeoWord
Net
Geo
Species
Geo
Names
Geo
Linked
Data
GEMET
GTAA
STITCH
SIDER
Project
Guten-
berg
Medi
Care
Euro-
stat
(FUB)
EURES
Drug
Bank
Disea-
some
DBLP
(FU
Berlin)
Daily
Med
CORDIS
(FUB)
Freebase
flickr
wrappr
Fishes
of Texas
Finnish
Munici-
palities
ChEMBL
FanHubz
Event
Media
EUTC
Produc-
tions
Eurostat
Europeana
EUNIS
EU
Insti-
tutions
ESD
stan-
dards
EARTh
Enipedia
Popula-
tion (En-
AKTing)
NHS
(En-
AKTing) Mortality
(En-
AKTing)
Energy
(En-
AKTing)
Crime
(En-
AKTing)
CO2
Emission
(En-
AKTing)
EEA
SISVU
educatio
n.data.g
ov.uk
ECS
South-
ampton
ECCO-
TCP
GND
Didactal
ia
DDC Deutsche
Bio-
graphie
data
dcs
Music
Brainz
(DBTune)
Magna-
tune
John
Peel
(DBTune)
Classical
(DB
Tune)
Audio
Scrobbler
(DBTune)
Last.FM
artists
(DBTune)
DB
Tropes
Portu-
guese
DBpedia
dbpedia
lite
Greek
DBpedia
DBpedia
data-
open-
ac-uk
SMC
Journals
Pokedex
Airports
NASA
(Data
Incu-
bator)
Music
Brainz
(Data
Incubator)
Moseley
Folk
Metoffice
Weather
Forecasts
Discogs
(Data
Incubator)
Climbing
data.gov.uk
intervals
Data
Gov.ie
data
bnf.fr
Cornetto
reegle
Chronic-
ling
America
Chem2
Bio2RDF
Calames
business
data.gov.
uk
Bricklink
Brazilian
Poli-
ticians
BNB
UniSTS
UniPath
way
UniParc
Taxono
my
UniProt
(Bio2RDF)
SGD
Reactome
PubMed
Pub
Chem
PRO-
SITE
ProDom
Pfam
PDB
OMIM
MGI
KEGG
Reaction
KEGG
Pathway
KEGG
Glycan
KEGG
Enzyme
KEGG
Drug
KEGG
Com-
pound
InterPro
Homolo
Gene
HGNC
Gene
Ontology
GeneID
Affy-
metrix
bible
ontology
BibBase
FTS
BBC
Wildlife
Finder
BBC
Program
mes BBC
Music
Alpine
Ski
Austria
LOCAH
Amster-
dam
Museum
AGROV
OC
AEMET
US Census
(rdfabout)
Media
Geographic
Publications
Government
Cross-domain
Life sciences
User-generated content
Linked data
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
http://lod-cloud.net/
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 3 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 4 / 96
Graph Types
Property graph
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
Graph Types
RDF graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
Graph Types
Property graph
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Workload: Online queries and
analytic workloads
Query execution: Varies
RDF graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Workload: SPARQL queries
Query execution: subgraph
matching by homomorphism
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 6 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 7 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
Focus here is on the
how algorithms behave
as their input changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
Focus here is on the
how algorithms behave
as their input changes.
The types of workloads
that the approaches are
designed to handle.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
Dynamic
graphs with
high veloc-
ity changes –
not possible to
see the entire
graph at once.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
Dynamic
graphs with
high veloc-
ity changes –
not possible to
see the entire
graph at once.
Dynamic
graphs with un-
known changes
– requires re-
discovery of
the graph (e.g.,
LOD).
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Computation accesses a
portion of the graph
and the results are
computed for a subset
of vertices; e.g., point-
to-point shortest path,
subgraph matching,
reachability, SPARQL.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Computation accesses a
portion of the graph
and the results are
computed for a subset
of vertices; e.g., point-
to-point shortest path,
subgraph matching,
reachability, SPARQL.
Computation accesses
the entire graph and
may require multiple
iterations; e.g., PageR-
ank, clustering, graph
colouring, all pairs
shortest path.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
Sees the en-
tire input
in advance,
which may
change; an-
swers computed
as change oc-
curs.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
Sees the en-
tire input
in advance,
which may
change; an-
swers computed
as change oc-
curs.
Similar to dy-
namic, but
computation
happens in
batches of
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation over the graph
as it exists.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation over the graph
as it is revealed.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation on each snap-
shot from scratch.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Continuously compute the query result/perform analytic computation as
the input changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation after a batch of
input changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
Example Design Points – Not all alternatives make sense
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Dynamic (or batch-dynamic) algorithms do not make sense for static
graphs.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 10 / 96
Graph Processing Systems
System Memory/
Disk
Architecture
Computing
paradigm
Supported
Workloads
Hadoop Disk Parallel/Distributed MapReduce Analytical
Haloop Disk Parallel/Distributed MapReduce Analytical
Pegasus Disk Parallel/Distributed MapReduce Analytical
GraphX Disk Parallel/Distributed
MapReduce
(Spark)
Analytical
Pregel/Giraph Memory Parallel/Distributed Vertex-Centric Analytical
GraphLab Memory Parallel/Distributed Vertex-Centric Analytical
GraphChi Disk Single machine Vertex-Centric Analytical
Stream Disk Single machine Edge-Centric Analytical
Trinity Memory Parallel/Distributed
Flexible using K-V
store on DSM
Online &
Analytical
Titan Disk Parallel/Distributed
K-V store
(Cassandra)
Online
Neo4J Disk Single machine
Procedural/
Linked-list
Online
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 11 / 96
Graph Workloads
Online graph querying
Reachability
Single source shortest-path
Subgraph matching
SPARQL queries
Offline graph analytics
PageRank
Clustering
Strongly connected
components
Diameter finding
Graph colouring
All pairs shortest path
Graph pattern mining
Machine learning algorithms
(Belief propagation, Gaussian
non-negative matrix
factorization)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 12 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 13 / 96
Reachability Queries
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
Reachability Queries
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Can you reach film 1267 from film 2014?
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
Reachability Queries
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Is there a book whose rating is > 4.0 associated with a film that was
directed by Stanley Kubrick?
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
Reachability Queries
Think of Facebook graph and finding friends of friends.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
Subgraph Matching
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r > 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Su
bgraphMatching
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 15 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 16 / 96
PageRank Computation
A web page is important if it is pointed to by other important
pages.
P1 P2
P3
P5P6
P4
r(Pi ) =
Pj ∈BPi
r(Pj )
|FPj
|
r(P2) =
r(P1)
2
+
r(P3)
3
rk+1(Pi ) =
Pj ∈BPi
rk(Pj )
|FPj
|
BPi
: in-neighbours of Pi
FPi
: out-neighbours of Pi
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
PageRank Computation
A web page is important if it is pointed to by other important
pages.
P1 P2
P3
P5P6
P4
rk+1(Pi ) =
Pj ∈BPi
rk(Pj )
|FPj
|
Iteration 0 Iteration 1 Iteration 2
Rank at
Iter. 2
r0(P1) = 1/6 r1(P1) = 1/18 r2(P1) = 1/36 5
r0(P2) = 1/6 r1(P2) = 5/36 r2(P2) = 1/18 4
r0(P3) = 1/6 r1(P3) = 1/12 r2(P3) = 1/36 5
r0(P4) = 1/6 r1(P4) = 1/4 r2(P4) = 17/72 1
r0(P5) = 1/6 r1(P5) = 5/36 r2(P5) = 11/72 3
r0(P6) = 1/6 r1(P6) = 1/6 r2(P6) = 14/72 2
Iterative processing.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
Block-centric
Similar to vertex-centric but on blocks for communication
Connected subgraph of the graph
Blogel [Yan et al., 2014]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
Block-centric
Similar to vertex-centric but on blocks for communication
Connected subgraph of the graph
Blogel [Yan et al., 2014]
MapReduce
Need to save in HDFS intermediate results of each iteration – both
good and bad
Hadoop, Haloop [Bu et al., 2012]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
Block-centric
Similar to vertex-centric but on blocks for communication
Connected subgraph of the graph
Blogel [Yan et al., 2014]
MapReduce
Need to save in HDFS intermediate results of each iteration – both
good and bad
Hadoop, Haloop [Bu et al., 2012]
Modified MapReduce
Based on Spark [Zaharia et al., 2010; Zaharia, 2016]
Keep intermediate states in memory
Provide fault-tolerance by keeping lineage
GraphX [Gonzalez et al., 2014]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
Vertex-Centric Computation
“Think like a vertex”
vertex_scatter(vertex v)
Push local computation to
neighbours on the out-bound
edges
vertex_gather(vertex v)
Gather local computation from
neighbours on the in-bound edges
Continue until all vertices are
inactive
?
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
Vertex-Centric Computation
“Think like a vertex”
vertex_scatter(vertex v)
Push local computation to
neighbours on the out-bound
edges
vertex_gather(vertex v)
Gather local computation from
neighbours on the in-bound edges
Continue until all vertices are
inactive
Vertex state machine
?
Active Inactive
Vote halt
Message received
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
Synchronous Vertex-Centric Computation
Computation
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
Synchronous Vertex-Centric Computation
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
Synchronous Vertex-Centric Computation
Machine 1
Machine 2
Machine 3
Communication
Barrier
Each machine performs
vertex-centric computation
on its graph partition
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
Synchronous Vertex-Centric Computation
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Communication
Barrier
Each machine performs
vertex-centric computation
on its graph partition
Communication
Barrier
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
Synchronous Vertex-Centric Computation
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Communication
Barrier
Each machine performs
vertex-centric computation
on its graph partition
Communication
Barrier
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
Asynchronous Vertex-Centric Computation
No communication barriers. 
Uses the most recent vertex values. 
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
Asynchronous Vertex-Centric Computation
No communication barriers. 
Uses the most recent vertex values. 
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
Asynchronous Vertex-Centric Computation
No communication barriers. 
Uses the most recent vertex values. 
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
Asynchronous Vertex-Centric Computation
No communication barriers. 
Uses the most recent vertex values. 
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
Asynchronous Vertex-Centric Computation
No communication barriers. 
Uses the most recent vertex values. 
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
Asynchronous Vertex-Centric Computation
No communication barriers. 
Uses the most recent vertex values. 
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
64 machines TW UK
Giraph (byte array) 5.8GB 7.0GB
GraphLab (sync) 4.5GB 14GB
TW 16 machines 128 machines
Giraph (byte array) 8.5GB 5.8GB
GraphLab (sync) 11GB 3.3GB
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
No Mutations
Time Memory
Byte array  
Hash map  
With Mutations (DMST)
Time Memory
Byte array  
Hash map  
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
5 Workloads have different resource demands
Algorithm CPU Memory Network
PageRank Medium Medium High
SSSP Low Low Low
WCC Low Medium Medium
DMST High High Medium
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
Block-Centric Computation
Blogel [Yan et al., 2014]: “Think like a block”; also “think like a
graph” [Tian et al., 2013]
Vertex-centric assumes all vertices communicate over the network;
this is not efficient
Read-world graphs have skewed vertex degree distribution
Common in power-law graphs
Problem: imbalanced communication workloads
Real-world graphs have large diameters
Common in road networks, web graphs, terrain meshes
Problem: one superstep per hop ⇒ too many supersteps
Real-world graphs have high average vertex degree
Common in social networks, mobile communication networks
Problem: heavy average communication workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 23 / 96
Blogel Principles
Exploit the partitioning of the graph
Message exchanges only among blocks
Block: a connected subgraph of the graph
Within a block, run a serial in-memory algorithm; no need to follow a
BSP model
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 24 / 96
Benefits of Block-Centric Computation
High-degree vertices inside a block send no messages
Fewer number of supersteps
Fewer number of blocks than vertices
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 25 / 96
Example: Weakly Connected Component
Algorithm exchanges vertex id’s
with neighbours
id(vi ) ← min{vi , vj , . . . , vk}
where vj , . . . , vk are neighbours
of vi
Vertex-centric requires every
vertex sends to its neighbours
until every vertex is reached
Block-centric needs two
iterations:
1 All vertices in partition A
exchange ids; X and Y send
ids to neighbours in partition
B
2 All vertices in partition B
exchange ids
A B
0
X
Y
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 26 / 96
Block Construction
The partitioning algorithm needs to maximize number of vertices that
have all their edges in the same partition
Hash partitioning is not suitable because many vertices will probably
have at least one cut-edge
URL partitioner
For web graphs: based on domain names of web page nodes
2D partitioner
For spatial networks: based on coordinates of node
Graph Voronoi diagram partitioner
For general graphs
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 27 / 96
MapReduce Basics [Li et al., 2014]
For data analysis of very large data sets
Highly dynamic, irregular, schemaless, etc.
SQL too heavy
“Embarrassingly parallel problems”
New, simple parallel programming model
Data structured as (key, value) pairs
E.g. (doc-id, content), (word, count), etc.
Functional programming style with two functions to be given:
Map(k1,v1) → list(k2,v2)
Reduce(k2, list (v2)) → list(v3)
Implemented on a distributed file system (e.g., Google File System)
on very large clusters
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 28 / 96
MapReduce Processing
...
Inputdataset
Map
Map
Map
Map
(k1, v)
(k2, v)
(k2, v)
(k2, v)
(k1, v)
(k1, v)
(k2, v)
Group by k
Group by k
(k1, (v, v, v))
(k1, (v, v, v, v)) Reduce
Reduce
Outputdataset
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 29 / 96
MapReduce Architecture
Scheduler
Master
Input Module
Map Module
Combine Module
Partition Module
Map Process
Worker
Input Module
Map Module
Combine Module
Partition Module
Map Process
Worker
Input Module
Map Module
Combine Module
Partition Module
Map Process
Worker
Group Module
Reduce Module
Output Module
Reduce Process
Worker
Group Module
Reduce Module
Output Module
Reduce Process
Worker
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 30 / 96
Execution Flow with Architecture [Dean and Ghemawat, 2008]
MapReduce: Simplified Data Processing on Large Clusters
split 0
split 1
split 2
split 3
split 4
(1) fork
(3) read
(4) local write
(1) fork
(1) fork
(6) write
worker
worker
worker
Master
User
Program
output
file 0
output
file 1
worker
worker
(2)
assign
map
(2)
assign
reduce
(5) remote
(5) read
Input
files
Map
phasr
Intermediate files
(on local disks)
Reduce
phase
Output
files
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 31 / 96
Hadoop
Most popular MapReduce implementation – developed by Yahoo!
Two components
Processing engine
HDFS: Hadoop Distributed Storage System – others possible
Can be deployed on the same machine or on different machines
Processes
Job tracker: hosted on the master node and implements the schedule
Task tracker: hosted on the worker nodes and accepts tasks from job tracker
and executes them
HDFS
Name node: stores how data are partitioned, monitors the status of data
nodes, and data dictionary
Data node: Stores and manages data chunks assigned to it
Task Tracker Job Tracker Task Tracker
Data Node Name Node Data Node
Worker 1 Name Node Worker n
MapReduce
HDFS
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 32 / 96
HaLoop [Bu et al., 2012]
Overcome MapReduce shortcomings for iterative jobs
Having to save data in HDFS in between each iteration
Checking the fixpoint requires a new job at each iteration
Scheduler change: assign to the same machine the map  reduce
tasks that occur in different iterations but access the same data
Cache invariant data
Cache reduce output to easily check for fixpoint
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 33 / 96
Spark System
MapReduce does not perform well in iterative computations
Workflow model is acyclic
Have to write to HDFS after each iteration and have to read from
HDFS at the beginning of next iteration
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
Spark System
MapReduce does not perform well in iterative computations
Workflow model is acyclic
Have to write to HDFS after each iteration and have to read from
HDFS at the beginning of next iteration
Spark objectives
Better support for iterative programs
Provide a complete ecosystem
Similar abstraction (to MapReduce) for programming
Maintain MapReduce fault-tolerance and scalability
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
Spark System
MapReduce does not perform well in iterative computations
Workflow model is acyclic
Have to write to HDFS after each iteration and have to read from
HDFS at the beginning of next iteration
Spark objectives
Better support for iterative programs
Provide a complete ecosystem
Similar abstraction (to MapReduce) for programming
Maintain MapReduce fault-tolerance and scalability
Fundamental concepts
RDD: Reliable Distributed Datasets
Caching of working set
Maintaining lineage for fault-tolerance
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
Spark Ecosystem [Michiardi, 2015]
Native
Spark
Apps
Spark
SQL
Spark
Streaming
MLlib
(machine
learning)
GraphX
(graph
processing)
Apache Spark
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 35 / 96
Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
Each transform generates a
new RDD that may also be
cached or processed
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
Each transform generates a
new RDD that may also be
cached or processed
Created from HDFS or parallelized arrays;
Partitioned across worker machines;
May be made persistent lazily;
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
Each transform generates a
new RDD that may also be
cached or processed
Created from HDFS or parallelized arrays;
Partitioned across worker machines;
May be made persistent lazily;
Processing done on one of the RDDs;
Done in parallel across workers;
First processing on a RDD is from disk;
Subsequent processing of the same RDD from cache
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
CreateRDD
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
Transform RDD
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
Another transform
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
Cache results
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Action
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
Tasks
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
TasksResults
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
TasksResults
Cache Cache Cache
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
cachedMsgs.filter( .contains(bar)).count
Another Action
accesses cache
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
TasksResults
Cache Cache Cache
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
RDD and Processing
HDFS
lines = spark.textFile(hdfs://...)
lines
Error, msg1
Warn, msg2
Error, msg1
Info, msg8
Warn, msg2
Info, msg8
Error, msg3
Info, msg5
Info, msg5
Error, msg4
Warn, msg9
Error, msg1
errors
errors = lines.filter( .startsWith(ERROR))
Error, msg1
Error, msg1
Error, msg3 Error, msg4
Error, msg1
messages
messages = errors.map .split(‘t ’)(2)
msg1
msg1
msg3 msg4
msg1
Thesearenotyetgenerated
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
RDD and Processing
lines
errors
messages
msg1
msg1
msg3 msg4
msg1
lines
messages.filter( .contains(foo)).count
errors
messages
msg1
msg1
msg3 msg4
msg1
NowtheRDDsarematerialized;
Commandnotyetexecuted
Driver
messages.filter( .contains(foo)).count
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
GraphX [Gonzalez et al., 2014]
Built on top of Spark
Objective is to combine data analytics with graph processing
Unify computation on tables and graphs
Carefully convert graph to tabular representation
Native GraphX API or can accommodate vertex-centric computation
Native
Spark
Apps
Spark
SQL
Spark
Streaming
MLlib
(machine
learning)
GraphX
(graph
processing)
Apache Spark
Vertex-
centric API
AppApp
App App
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 39 / 96
GraphX: Representation of Graphs as Tables
A
B
C
D
E
F
G
H
I
J
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
Edge Table
(RDD)
e-prop:edge prop.
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
Edge Table
(RDD)
e-prop:edge prop.
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
Joining vertices
and edges
Move vertices to edges
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
Edge Table
(RDD)
e-prop:edge prop.
Routing
Table
(RDD)
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
A 1 2
B 1
...
I 1
F 1 2
D 2
E 2
J 2
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
GraphX: Computation Model
Machine1Machine2
Vertex Table Edge Table
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
GraphX: Computation Model
Machine1Machine2
Vertex Table Edge Table
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
First Phase: Join
Vertex table Edge table
Triples View
A v-prop e-prop B v-prop
A v-prop e-prop C v-prop
C v-prop e-prop G v-prop
...
E v-prop e-prop G v-prop
J v-prop e-prop G v-prop
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
GraphX: Computation Model
Machine1Machine2
Vertex Table Edge Table
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
Triples View
A v-prop e-prop B v-prop
A v-prop e-prop C v-prop
C v-prop e-prop G v-prop
...
E v-prop e-prop G v-prop
J v-prop e-prop G v-prop
Second Phase: Compute neighbourhood
Group-by aggregate
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
GraphX: Operators
Table transform operators – inherited from Spark
map(func) Return a new RDD formed by passing each element
of the source through a function func
filter(func) Return a new RDD formed by selecting those
elements of the source on which func returns true
flatMap(func) Similar to map, but each input item can be mapped
to 0 or more output items
mapPartitions(func) Similar to map, but runs separately on each partition
(block) of the RDD, so func must be of type Iterator
sample(repl, fraction,
seed)
Sample a fraction fraction of the data, with or
without replacement (set repl accordingly), using a
given random number generator seed
union(otherDataset)
intersection()
Return a new RDD containing the union/intersection
of the elements in the source RDD and the argument
groupByKey() Operates on a RDD of (K, V) pairs, returns a RDD
of (K, IterableV) pairs
reduceByKey(func, . . .) Operates on a RDD of (K, V) pairs, returns a RDD
of (K, V) pairs where the values for each key are
aggregated using the given reduce function func
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
GraphX: Operators
Table transform operators – inherited from Spark
Graph operators
Graph(vertex coll,
edge coll)
Logically binds together a pair of vertex and edge
property collections into a property graph; verifies
that each vertex occurs only once and edges connect
existing vertices
triplets(vertex coll,
vertex coll, edge coll)
Returns the triplets view of the graph
mrTriplets(map,reduce) MapReduce triplets - encodes the two-stage process
of join to create triplets and group by
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 43 / 96
RDF Introduction
Everything is an uniquely named
resource
http://data.linkedmdb.org/resource/actor/JN29704
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
Properties of resources can be defined
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
Properties of resources can be defined
Relationships with other resources can
be defined
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
Properties of resources can be defined
Relationships with other resources can
be defined
Resource descriptions can be
contributed by different people/groups
and can be located anywhere in the web
Integrated web “database”
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
RDF Data Model
Triple: Subject, Predicate (Property), Object
(s, p, o)
Subject: the entity that is described (URI
or blank node)
Predicate: a feature of the entity (URI)
Object: value of the feature (URI, blank
node or literal)
(s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)
Set of RDF triples is called an RDF graph
U
Subject Object
U B U B L
U: set of URIs
B: set of blank nodes
L: set of literals
Predicate
Subject Predicate Object
http://...imdb.../film/2014 rdfs:label “The Shining”
http://...imdb.../film/2014 movie:releaseDate “1980-05-23”
http://...imdb.../29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 45 / 96
RDF Example Instance
Prefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/
bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/
lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/
Subject Predicate Object
mdb: film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”’
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
URI Literal
URI
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 46 / 96
RDF Graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 47 / 96
RDF Query Model – SPARQL
Query Model - SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of variables), a
SPARQL expression is defined recursively:
an atomic triple pattern, which is an element of
(U ∪ V ) × (U ∪ V ) × (U ∪ V ∪ L)
?x rdfs:label “The Shining”
P FILTER R, where P is a graph pattern expression and R is a built-in
SPARQL condition (i.e., analogous to a SQL predicate)
?x rev:rating ?p FILTER(?p  3.0)
P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern
expressions
Example:
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r  4.0)
}
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 48 / 96
SPARQL Queries
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r  4.0)
}
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r  4.0)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 49 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 50 / 96
Na¨ıve Triple Store Design
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r  4.0)
}
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
Na¨ıve Triple Store Design
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r  4.0)
}
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
SELECT T1 . o b j e c t
FROM T as T1 , T as T2 , T as T3 ,
T as T4 , T as T5
WHERE T1 . p=” r d f s : l a b e l ”
AND T2 . p=” movie : relatedBook ”
AND T3 . p=” movie : d i r e c t o r ”
AND T4 . p=” rev : r a t i n g ”
AND T5 . p=” movie : d i r e c t o r n a m e ”
AND T1 . s=T2 . s
AND T1 . s=T3 . s
AND T2 . o=T4 . s
AND T3 . o=T5 . s
AND T4 . o  4.0
AND T5 . o=” S t a n l e y Kubrick ”
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
Na¨ıve Triple Store Design
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r  4.0)
}
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
SELECT T1 . o b j e c t
FROM T as T1 , T as T2 , T as T3 ,
T as T4 , T as T5
WHERE T1 . p=” r d f s : l a b e l ”
AND T2 . p=” movie : relatedBook ”
AND T3 . p=” movie : d i r e c t o r ”
AND T4 . p=” rev : r a t i n g ”
AND T5 . p=” movie : d i r e c t o r n a m e ”
AND T1 . s=T2 . s
AND T1 . s=T3 . s
AND T2 . o=T4 . s
AND T3 . o=T5 . s
AND T4 . o  4.0
AND T5 . o=” S t a n l e y Kubrick ”
Easy to implement
but
too many self-joins!
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss
et al., 2008]
Strings are mapped to ids using a mapping table
Original triple table
Subject Property Object
mdb: film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
Encoded triple table
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
Mapping table
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss
et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic order
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
B+ tree
Easy querying
through mapping
table
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss
et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic order
Create indexes for permutations of the three columns: SPO, SOP,
PSO, POS, OPS, OSP
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
B+ tree
Easy querying
through mapping
table
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
Exhaustive Indexing–Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
Exhaustive Indexing–Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
Advantages
Eliminates some of the joins – they become range queries
Merge join is easy and fast
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
Exhaustive Indexing–Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
Advantages
Eliminates some of the joins – they become range queries
Merge join is easy and fast
Disadvantages
Space usage
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea
et al., 2013]
Clustered property table: group together the properties that tend to
occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of
property into one property table
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject refs:label movie:director
mob:film/2014 “The Shining” mob:director/8476
mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor name
mdb:actor “Jack Nicholson”
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea
et al., 2013]
Clustered property table: group together the properties that tend to
occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of
property into one property table
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject refs:label movie:director
mob:film/2014 “The Shining” mob:director/8476
mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor name
mdb:actor “Jack Nicholson”
Advantages
Fewer joins
If the data is structured, we have a relational system – similar to
normalized relations
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea
et al., 2013]
Clustered property table: group together the properties that tend to
occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of
property into one property table
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject refs:label movie:director
mob:film/2014 “The Shining” mob:director/8476
mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor name
mdb:actor “Jack Nicholson”
Advantages
Fewer joins
If the data is structured, we have a relational system – similar to
normalized relations
Disadvantages
Potentially a lot of NULLs
Clustering is not trivial
Multi-valued properties are complicated
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
Binary Tables
Grouping by properties: For each property, build a two-column table,
containing both subject and object, ordered by subjects [Abadi et al.,
2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject Object
mdb:film/2014 mdb:director/8476
mdb:film/2685 mdb:director/8476
movie:director
Subject Object
mob:film/2014 “The Shining”
mob:film/2685 “The Clockwork Orange”
refs:label
Subject Object
mdb:actor/29704 “Jack Nicholson”
movie:actor name
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
Binary Tables
Grouping by properties: For each property, build a two-column table,
containing both subject and object, ordered by subjects [Abadi et al.,
2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject Object
mdb:film/2014 mdb:director/8476
mdb:film/2685 mdb:director/8476
movie:director
Subject Object
mob:film/2014 “The Shining”
mob:film/2685 “The Clockwork Orange”
refs:label
Subject Object
mdb:actor/29704 “Jack Nicholson”
movie:actor name
Advantages
Supports multi-valued properties
No NULLs
No clustering
Read only needed attributes (i.e. less I/O)
Good performance for subject-subject joins
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
Binary Tables
Grouping by properties: For each property, build a two-column table,
containing both subject and object, ordered by subjects [Abadi et al.,
2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject Object
mdb:film/2014 mdb:director/8476
mdb:film/2685 mdb:director/8476
movie:director
Subject Object
mob:film/2014 “The Shining”
mob:film/2685 “The Clockwork Orange”
refs:label
Subject Object
mdb:actor/29704 “Jack Nicholson”
movie:actor name
Advantages
Supports multi-valued properties
No NULLs
No clustering
Read only needed attributes (i.e. less I/O)
Good performance for subject-subject joins
Disadvantages
Not useful for subject-object joins
Expensive inserts
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013]
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r  4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Subgraph
M
atching
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013]
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r  4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Subgraph
M
atching
Advantages
Maintains the graph structure
Full set of queries can be handled
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013]
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r  4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Subgraph
M
atching
Advantages
Maintains the graph structure
Full set of queries can be handled
Disadvantages
Graph pattern matching is expensive
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
Two Systems
gStoreSystem Architecture
Offline Online
Storage
Input Input
RDF Parser
RDF Graph
Builder
Encoding
Module
VS*-tree
builder
RDF data
RDF Triples
RDF Graph
Signature Graph
Key-Value
Store
VS*-tree
Store
SPARQL
Parser
SPARQL Query
Encoding
Module
VS*-tree
Query Graph
Filter
Module
Join
Module
Signature Graph
Node Candidate
Results
Fig. 4. System Architecture
itstrings, denoted as vS ig(u). We encode query Q with the
ame encoding method. Consequently, the match between Q
nd G can be verified by simply checking the match between
orresponding encoded bitstrings.
chameleon-db
Structural Index
...
Vertex Index
Spill Index
ClusterIndexStorageSystem
StorageAdvisor
Query
Engine Plan Generation Evaluation
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 57 / 96
gStore
General Approach:
Work directly on the RDF graph and the SPARQL query graph
Use a signature-based encoding of each entity and class vertex to
speed up matching
Filter-and-evaluate
Use a false positive algorithm to prune nodes and obtain a set of
candidates; then do more detailed evaluation on those
Use an index (VS∗-tree) over the data signature graph (has light
maintenance load) for efficient pruning
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 58 / 96
1. Encode Q and G to Get Signature Graphs
Query signature graph Q∗
0100 0000 1000 0000
00010
0000 0100
10000
Data signature graph G∗
0010 1000
0100 0001
00001
1000 0001
00010
0000 0100
10000
0000 1000
10000
0000 0010
10000
0000 1001
00100
0001 0001
01000
0100 1000
01000
1001 1000
01000
0001 0100
01000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 59 / 96
2. Filter-and-Evaluate
Query signature graph Q∗
0100 0000 1000 0000
00010
0000 0100
10000
Data signature graph G∗
0010 1000
0100 0001
00001
1000 0001
00010
0000 0100
10000
0000 1000
10000
0000 0010
10000
0000 1001
00100
0001 0001
01000
0100 1000
01000
1001 1000
01000
0001 0100
01000
Find matches of Q∗ over
signature graph G∗
Verify each match in
RDF graph G
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 60 / 96
How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signatures
Run an inclusion query for each node of Q∗
and get lists of nodes in
G∗
that include that node.
• Given query signature q and a set of data signatures S, find all
data signatures si ∈ S where qsi = q
Does not support second step – expensive
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signatures
Run an inclusion query for each node of Q∗
and get lists of nodes in
G∗
that include that node.
• Given query signature q and a set of data signatures S, find all
data signatures si ∈ S where qsi = q
Does not support second step – expensive
VS-tree (and VS∗
-tree)
Multi-resolution summary graph based on S-tree
Supports both steps efficiently
Grouping by vertices
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
004
009
Possibly large join space!
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
VS-tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
Super edge
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 63 / 96
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
003
008
002
011
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
003
008
002
011
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
003
008
002
011
004
009
Reduced join space!
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and are
more varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,
2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] and
dynamic [Kirchberg et al., 2011]
An experiment [Alu¸c et al., 2014a]
No single system is a sole winner across all queries
No single system is the sole loser across all queries, either
There can be 2–5 orders of magnitude difference in the performance
(i.e., query execution time) between the best and the worst system for
a given query
The winner in one query may timeout in another
Performance difference widens as dataset size increases
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and are
more varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,
2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] and
dynamic [Kirchberg et al., 2011]
An experiment [Alu¸c et al., 2014a]
No single system is a sole winner across all queries
No single system is the sole loser across all queries, either
There can be 2–5 orders of magnitude difference in the performance
(i.e., query execution time) between the best and the worst system for
a given query
The winner in one query may timeout in another
Performance difference widens as dataset size increases
Can existing systems cope with these trends – workload diversity 
dynamism
No! [Alu¸c et al., 2014b]
Fragmented data
Suboptimal pruning by indexes
Unnecessarily large sets of intermediate result tuples
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and are
more varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,
2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] and
dynamic [Kirchberg et al., 2011]
An experiment [Alu¸c et al., 2014a]
No single system is a sole winner across all queries
No single system is the sole loser across all queries, either
There can be 2–5 orders of magnitude difference in the performance
(i.e., query execution time) between the best and the worst system for
a given query
The winner in one query may timeout in another
Performance difference widens as dataset size increases
Our proposal: Idea behind chameleon-db
When designing and implementing an RDF data management system,
assume nothing about the workload upfront
Organize data dynamically and purely based on the workload
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
Characteristics:
Records are not necessarily of fixed length
Records are not grouped into tables
Records do not necessarily share the same set of RDF predicates
Each record represents a very tiny part of the RDF graph
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?x ?y
A
?y ?z
?b
Figure: Query 1
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?x ?y
A
?y ?z
?b
Figure: Query 1
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?x ?y
A
?y ?z
?b
Figure: Query 1
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Group-by-Query Approach
v1
v21
A
v21
v20
B
v1
v98
A
v98
v30
C
v1
v250
A
v250
v40
D
v0
v32
A
v0
v52 C
v0
v80
C
v0
v66C
v0
v47
B
v6
v7
A
v6
v8
B
v6
v9
C
C1 C2 C3 C4 C5
?a
?z
C
?a
?x
A
?a
?y
B
Figure: Query 2
Advantages
Data are physically clustered for the workload
Better pruning by the indexes
Fewer intermediate result tuples
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
Challenges
Physical Data Layout: As the workloads change, the way data are
grouped together may no longer be suitable
Hierarchical Clustering Algorithm [Alu¸c et al., 2015]
Tunable-LSH [Alu¸c et al., 2015]
Indexing: Indexing upfront is not a choice
Query Evaluation: Can we execute queries efficiently even when the
physical layout is constantly changing? [Alu¸c et al., 2015]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 67 / 96
chameleon-db
Prototype system [Alu¸c et al., 2013]
35,000 lines of code in C++ under Linux (plus code for SPARQL 1.0
parser)
Structural Index
...
Vertex Index
Spill Index
ClusterIndexStorageSystem
StorageAdvisor
Query
Engine Plan Generation Evaluation
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 68 / 96
Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 69 / 96
Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries –
SPARQL endpoints
Not all data sites can process
queries
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries –
SPARQL endpoints
Not all data sites can process
queries
Alternatives
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries –
SPARQL endpoints
Not all data sites can process
queries
Alternatives
Data re-distribution + query
decomposition
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture
Graph lecture

Contenu connexe

Tendances

Graph representation
Graph representationGraph representation
Graph representation
Tech_MX
 

Tendances (20)

Creating a Binary tree from a General Tree.pptx
Creating a Binary tree from a General Tree.pptxCreating a Binary tree from a General Tree.pptx
Creating a Binary tree from a General Tree.pptx
 
Graphs in data structure
Graphs in data structureGraphs in data structure
Graphs in data structure
 
Graph representation
Graph representationGraph representation
Graph representation
 
AVL Tree
AVL TreeAVL Tree
AVL Tree
 
Graph Data Structure
Graph Data StructureGraph Data Structure
Graph Data Structure
 
Trees, Binary Search Tree, AVL Tree in Data Structures
Trees, Binary Search Tree, AVL Tree in Data Structures Trees, Binary Search Tree, AVL Tree in Data Structures
Trees, Binary Search Tree, AVL Tree in Data Structures
 
PRIM’S AND KRUSKAL’S ALGORITHM
PRIM’S AND KRUSKAL’S  ALGORITHMPRIM’S AND KRUSKAL’S  ALGORITHM
PRIM’S AND KRUSKAL’S ALGORITHM
 
Graph Data Structure
Graph Data StructureGraph Data Structure
Graph Data Structure
 
Trees data structure
Trees data structureTrees data structure
Trees data structure
 
4.4 hashing
4.4 hashing4.4 hashing
4.4 hashing
 
Abstract Data Types
Abstract Data TypesAbstract Data Types
Abstract Data Types
 
Prim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning treePrim's Algorithm on minimum spanning tree
Prim's Algorithm on minimum spanning tree
 
Trees and Graphs in data structures and Algorithms
Trees and Graphs in data structures and AlgorithmsTrees and Graphs in data structures and Algorithms
Trees and Graphs in data structures and Algorithms
 
Graph in data structure
Graph in data structureGraph in data structure
Graph in data structure
 
Graph mining ppt
Graph mining pptGraph mining ppt
Graph mining ppt
 
Heap Sort in Design and Analysis of algorithms
Heap Sort in Design and Analysis of algorithmsHeap Sort in Design and Analysis of algorithms
Heap Sort in Design and Analysis of algorithms
 
Data Visualization With R
Data Visualization With RData Visualization With R
Data Visualization With R
 
Binary tree and Binary search tree
Binary tree and Binary search treeBinary tree and Binary search tree
Binary tree and Binary search tree
 
Topological Sorting
Topological SortingTopological Sorting
Topological Sorting
 
Graph in data structure
Graph in data structureGraph in data structure
Graph in data structure
 

En vedette

Facebook Technology Stack
Facebook Technology StackFacebook Technology Stack
Facebook Technology Stack
Husain Ali
 
Performance Test Plan - Sample 2
Performance Test Plan - Sample 2Performance Test Plan - Sample 2
Performance Test Plan - Sample 2
Atul Pant
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Spark Summit
 
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ARADHYAYANA
 

En vedette (12)

Graph based data models
Graph based data modelsGraph based data models
Graph based data models
 
OrientDB, the fastest document-based graph database @ Confoo 2014 in Montreal...
OrientDB, the fastest document-based graph database @ Confoo 2014 in Montreal...OrientDB, the fastest document-based graph database @ Confoo 2014 in Montreal...
OrientDB, the fastest document-based graph database @ Confoo 2014 in Montreal...
 
Facebook Technology Stack
Facebook Technology StackFacebook Technology Stack
Facebook Technology Stack
 
OrientDB Distributed Architecture v2.0
OrientDB Distributed Architecture v2.0OrientDB Distributed Architecture v2.0
OrientDB Distributed Architecture v2.0
 
Test Planning
Test PlanningTest Planning
Test Planning
 
File Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & ParquetFile Format Benchmark - Avro, JSON, ORC & Parquet
File Format Benchmark - Avro, JSON, ORC & Parquet
 
Performance Test Plan - Sample 2
Performance Test Plan - Sample 2Performance Test Plan - Sample 2
Performance Test Plan - Sample 2
 
Recommending for the World
Recommending for the WorldRecommending for the World
Recommending for the World
 
Test plan
Test planTest plan
Test plan
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Big data storages
Big data storagesBig data storages
Big data storages
 
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
ER DIAGRAM TO RELATIONAL SCHEMA MAPPING
 

Plus de M. Tamer Özsu (6)

Streaming Graph Processing and Analytics
Streaming Graph Processing and AnalyticsStreaming Graph Processing and Analytics
Streaming Graph Processing and Analytics
 
Vldb19 keynote
Vldb19 keynoteVldb19 keynote
Vldb19 keynote
 
Graph analytics-short
Graph analytics-shortGraph analytics-short
Graph analytics-short
 
Web Data Management in the RDF Age
Web Data Management in the RDF AgeWeb Data Management in the RDF Age
Web Data Management in the RDF Age
 
Web Data Management with RDF
Web Data Management with RDFWeb Data Management with RDF
Web Data Management with RDF
 
gStore: A Graph-based SPARQL Query Engine
gStore: A Graph-based SPARQL Query EnginegStore: A Graph-based SPARQL Query Engine
gStore: A Graph-based SPARQL Query Engine
 

Dernier

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
amitlee9823
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
amitlee9823
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 

Dernier (20)

Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
Mg Road Call Girls Service: 🍓 7737669865 🍓 High Profile Model Escorts | Banga...
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night StandCall Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Attibele ☎ 7737669865 🥵 Book Your One night Stand
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Capstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics ProgramCapstone Project on IBM Data Analytics Program
Capstone Project on IBM Data Analytics Program
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Graph lecture

  • 1. An Overview of Graph Data Management and Analysis M. Tamer ¨Ozsu University of Waterloo David R. Cheriton School of Computer Science © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 1 / 96
  • 2. Graph Data are Very Common Internet © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  • 3. Graph Data are Very Common Social networks © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  • 4. Graph Data are Very Common Trade volumes and connections © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  • 5. Graph Data are Very Common Biological networks © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
  • 6. Graph Data are Very Common As of September 2011 Music Brainz (zitgist) P20 Turismo de Zaragoza yovisto Yahoo! Geo Planet YAGO World Fact- book El Viajero Tourism WordNet (W3C) WordNet (VUA) VIVO UF VIVO Indiana VIVO Cornell VIAF URI Burner Sussex Reading Lists Plymouth Reading Lists UniRef UniProt UMBEL UK Post- codes legislation data.gov.uk Uberblic UB Mann- heim TWC LOGD Twarql transport data.gov. uk Traffic Scotland theses. fr Thesau- rus W totl.net Tele- graphis TCM Gene DIT Taxon Concept Open Library (Talis) tags2con delicious t4gm info Swedish Open Cultural Heritage Surge Radio Sudoc STW RAMEAU SH statistics data.gov. uk St. Andrews Resource Lists ECS South- ampton EPrints SSW Thesaur us Smart Link Slideshare 2RDF semantic web.org Semantic Tweet Semantic XBRL SW Dog Food Source Code Ecosystem Linked Data US SEC (rdfabout) Sears Scotland Geo- graphy Scotland Pupils & Exams Scholaro- meter WordNet (RKB Explorer) Wiki UN/ LOCODE Ulm ECS (RKB Explorer) Roma RISKS RESEX RAE2001 Pisa OS OAI NSF New- castle LAAS KISTI JISC IRIT IEEE IBM Eurécom ERA ePrints dotAC DEPLOY DBLP (RKB Explorer) Crime Reports UK Course- ware CORDIS (RKB Explorer) CiteSeer Budapest ACM riese Revyu research data.gov. ukRen. Energy Genera- tors reference data.gov. uk Recht- spraak. nl RDF ohloh Last.FM (rdfize) RDF Book Mashup Rådata nå! PSH Product Types Ontology Product DB PBAC Poké- pédia patents data.go v.uk Ox Points Ord- nance Survey Openly Local Open Library Open Cyc Open Corpo- rates Open Calais OpenEI Open Election Data Project Open Data Thesau- rus Ontos News Portal OGOLOD Janus AMP Ocean Drilling Codices New York Times NVD ntnusc NTU Resource Lists Norwe- gian MeSH NDL subjects ndlna my Experi- ment Italian Museums medu- cator MARC Codes List Man- chester Reading Lists Lotico Weather Stations London Gazette LOIUS Linked Open Colors lobid Resources lobid Organi- sations LEM Linked MDB LinkedL CCN Linked GeoData LinkedCT Linked User Feedback LOV Linked Open Numbers LODE Eurostat (Ontology Central) Linked EDGAR (Ontology Central) Linked Crunch- base lingvoj Lichfield Spen- ding LIBRIS Lexvo LCSH DBLP (L3S) Linked Sensor Data (Kno.e.sis) Klapp- stuhl- club Good- win Family National Radio- activity JP Jamendo (DBtune) Italian public schools ISTAT Immi- gration iServe IdRef Sudoc NSZL Catalog Hellenic PD Hellenic FBD Piedmont Accomo- dations GovTrack GovWILD Google Art wrapper gnoss GESIS GeoWord Net Geo Species Geo Names Geo Linked Data GEMET GTAA STITCH SIDER Project Guten- berg Medi Care Euro- stat (FUB) EURES Drug Bank Disea- some DBLP (FU Berlin) Daily Med CORDIS (FUB) Freebase flickr wrappr Fishes of Texas Finnish Munici- palities ChEMBL FanHubz Event Media EUTC Produc- tions Eurostat Europeana EUNIS EU Insti- tutions ESD stan- dards EARTh Enipedia Popula- tion (En- AKTing) NHS (En- AKTing) Mortality (En- AKTing) Energy (En- AKTing) Crime (En- AKTing) CO2 Emission (En- AKTing) EEA SISVU educatio n.data.g ov.uk ECS South- ampton ECCO- TCP GND Didactal ia DDC Deutsche Bio- graphie data dcs Music Brainz (DBTune) Magna- tune John Peel (DBTune) Classical (DB Tune) Audio Scrobbler (DBTune) Last.FM artists (DBTune) DB Tropes Portu- guese DBpedia dbpedia lite Greek DBpedia DBpedia data- open- ac-uk SMC Journals Pokedex Airports NASA (Data Incu- bator) Music Brainz (Data Incubator) Moseley Folk Metoffice Weather Forecasts Discogs (Data Incubator) Climbing data.gov.uk intervals Data Gov.ie data bnf.fr Cornetto reegle Chronic- ling America Chem2 Bio2RDF Calames business data.gov. uk Bricklink Brazilian Poli- ticians BNB UniSTS UniPath way UniParc Taxono my UniProt (Bio2RDF) SGD Reactome PubMed Pub Chem PRO- SITE ProDom Pfam PDB OMIM MGI KEGG Reaction KEGG Pathway KEGG Glycan KEGG Enzyme KEGG Drug KEGG Com- pound InterPro Homolo Gene HGNC Gene Ontology GeneID Affy- metrix bible ontology BibBase FTS BBC Wildlife Finder BBC Program mes BBC Music Alpine Ski Austria LOCAH Amster- dam Museum AGROV OC AEMET US Census (rdfabout) Media Geographic Publications Government Cross-domain Life sciences User-generated content Linked data © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96 Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  • 7. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 3 / 96
  • 8. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 4 / 96
  • 9. Graph Types Property graph film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
  • 10. Graph Types RDF graph mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
  • 11. Graph Types Property graph film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) Workload: Online queries and analytic workloads Query execution: Varies RDF graph mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Workload: SPARQL queries Query execution: subgraph matching by homomorphism © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
  • 12. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 6 / 96
  • 13. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 7 / 96
  • 14. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 15. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Focus here is on the dynamism of the graphs in whether or not they change and how they change. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 16. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Focus here is on the dynamism of the graphs in whether or not they change and how they change. Focus here is on the how algorithms behave as their input changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 17. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Focus here is on the dynamism of the graphs in whether or not they change and how they change. Focus here is on the how algorithms behave as their input changes. The types of workloads that the approaches are designed to handle. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 18. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 19. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 20. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. Graphs change and we are interested in their changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 21. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. Graphs change and we are interested in their changes. Dynamic graphs with high veloc- ity changes – not possible to see the entire graph at once. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 22. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Graphs do not change or we are not inter- ested in their changes – only a snapshot is considered. Graphs change and we are interested in their changes. Dynamic graphs with high veloc- ity changes – not possible to see the entire graph at once. Dynamic graphs with un- known changes – requires re- discovery of the graph (e.g., LOD). © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 23. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 24. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Computation accesses a portion of the graph and the results are computed for a subset of vertices; e.g., point- to-point shortest path, subgraph matching, reachability, SPARQL. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 25. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Computation accesses a portion of the graph and the results are computed for a subset of vertices; e.g., point- to-point shortest path, subgraph matching, reachability, SPARQL. Computation accesses the entire graph and may require multiple iterations; e.g., PageR- ank, clustering, graph colouring, all pairs shortest path. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 26. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 27. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 28. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 29. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 30. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. Online algo- rithm with some info about forth- coming input. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 31. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. Online algo- rithm with some info about forth- coming input. Sees the en- tire input in advance, which may change; an- swers computed as change oc- curs. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 32. Classification [Ammar and ¨Ozsu, 2015] Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Sees the en- tire input in advance. Sees the input piece-meal as it executes. One-pass on- line algorithm with limited memory. Online algo- rithm with some info about forth- coming input. Sees the en- tire input in advance, which may change; an- swers computed as change oc- curs. Similar to dy- namic, but computation happens in batches of © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
  • 33. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation over the graph as it exists. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  • 34. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation over the graph as it is revealed. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  • 35. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation on each snap- shot from scratch. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  • 36. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Continuously compute the query result/perform analytic computation as the input changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  • 37. Example Design Points Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Compute the query result/perform analytic computation after a batch of input changes. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
  • 38. Example Design Points – Not all alternatives make sense Graph Dynamism Static Graphs Dynamic Graphs Streaming Graphs Evolving Graphs Algorithm Types Offline Online Streaming Incremental Dynamic Batch Dynamic Workload Types Online Queries Analytics Workloads Dynamic (or batch-dynamic) algorithms do not make sense for static graphs. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 10 / 96
  • 39. Graph Processing Systems System Memory/ Disk Architecture Computing paradigm Supported Workloads Hadoop Disk Parallel/Distributed MapReduce Analytical Haloop Disk Parallel/Distributed MapReduce Analytical Pegasus Disk Parallel/Distributed MapReduce Analytical GraphX Disk Parallel/Distributed MapReduce (Spark) Analytical Pregel/Giraph Memory Parallel/Distributed Vertex-Centric Analytical GraphLab Memory Parallel/Distributed Vertex-Centric Analytical GraphChi Disk Single machine Vertex-Centric Analytical Stream Disk Single machine Edge-Centric Analytical Trinity Memory Parallel/Distributed Flexible using K-V store on DSM Online & Analytical Titan Disk Parallel/Distributed K-V store (Cassandra) Online Neo4J Disk Single machine Procedural/ Linked-list Online © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 11 / 96
  • 40. Graph Workloads Online graph querying Reachability Single source shortest-path Subgraph matching SPARQL queries Offline graph analytics PageRank Clustering Strongly connected components Diameter finding Graph colouring All pairs shortest path Graph pattern mining Machine learning algorithms (Belief propagation, Gaussian non-negative matrix factorization) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 12 / 96
  • 41. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 13 / 96
  • 42. Reachability Queries film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  • 43. Reachability Queries film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) Can you reach film 1267 from film 2014? © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  • 44. Reachability Queries film 2014 (initial release date, “1980-05-23”) (label, “The Shining”) books 0743424425 (rating, 4.7) offers 0743424425amazonOffer geo 2635167 (name, “United Kingdom”) (population, 62348447) actor 29704 (actor name, “Jack Nicholson”) film 3418 (label, “The Passenger”) film 1267 (label, “The Last Tycoon”) director 8476 (director name, “Stanley Kubrick”) film 2685 (label, “A Clockwork Orange”) film 424 (label, “Spartacus”) actor 30013 (relatedBook) (hasOffer) (based near) (actor) (director) (actor) (actor) (actor) (director) (director) Is there a book whose rating is > 4.0 associated with a film that was directed by Stanley Kubrick? © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  • 45. Reachability Queries Think of Facebook graph and finding friends of friends. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
  • 46. Subgraph Matching ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r > 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Su bgraphMatching © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 15 / 96
  • 47. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 16 / 96
  • 48. PageRank Computation A web page is important if it is pointed to by other important pages. P1 P2 P3 P5P6 P4 r(Pi ) = Pj ∈BPi r(Pj ) |FPj | r(P2) = r(P1) 2 + r(P3) 3 rk+1(Pi ) = Pj ∈BPi rk(Pj ) |FPj | BPi : in-neighbours of Pi FPi : out-neighbours of Pi © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
  • 49. PageRank Computation A web page is important if it is pointed to by other important pages. P1 P2 P3 P5P6 P4 rk+1(Pi ) = Pj ∈BPi rk(Pj ) |FPj | Iteration 0 Iteration 1 Iteration 2 Rank at Iter. 2 r0(P1) = 1/6 r1(P1) = 1/18 r2(P1) = 1/36 5 r0(P2) = 1/6 r1(P2) = 5/36 r2(P2) = 1/18 4 r0(P3) = 1/6 r1(P3) = 1/12 r2(P3) = 1/36 5 r0(P4) = 1/6 r1(P4) = 1/4 r2(P4) = 17/72 1 r0(P5) = 1/6 r1(P5) = 5/36 r2(P5) = 11/72 3 r0(P6) = 1/6 r1(P6) = 1/6 r2(P6) = 14/72 2 Iterative processing. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
  • 50. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  • 51. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] Block-centric Similar to vertex-centric but on blocks for communication Connected subgraph of the graph Blogel [Yan et al., 2014] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  • 52. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] Block-centric Similar to vertex-centric but on blocks for communication Connected subgraph of the graph Blogel [Yan et al., 2014] MapReduce Need to save in HDFS intermediate results of each iteration – both good and bad Hadoop, Haloop [Bu et al., 2012] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  • 53. Some Alternative Computational Models for Offline Analytics Vertex-centric (Scatter-Gather) Specify (a) computation at each vertex, and (b) communication with neighbour vertices Synchronous – Pregel [Malewicz et al., 2010], Giraph Asynchronous – GraphLab [Low et al., 2012] Block-centric Similar to vertex-centric but on blocks for communication Connected subgraph of the graph Blogel [Yan et al., 2014] MapReduce Need to save in HDFS intermediate results of each iteration – both good and bad Hadoop, Haloop [Bu et al., 2012] Modified MapReduce Based on Spark [Zaharia et al., 2010; Zaharia, 2016] Keep intermediate states in memory Provide fault-tolerance by keeping lineage GraphX [Gonzalez et al., 2014] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
  • 54. Vertex-Centric Computation “Think like a vertex” vertex_scatter(vertex v) Push local computation to neighbours on the out-bound edges vertex_gather(vertex v) Gather local computation from neighbours on the in-bound edges Continue until all vertices are inactive ? © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
  • 55. Vertex-Centric Computation “Think like a vertex” vertex_scatter(vertex v) Push local computation to neighbours on the out-bound edges vertex_gather(vertex v) Gather local computation from neighbours on the in-bound edges Continue until all vertices are inactive Vertex state machine ? Active Inactive Vote halt Message received © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
  • 56. Synchronous Vertex-Centric Computation Computation © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  • 57. Synchronous Vertex-Centric Computation Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  • 58. Synchronous Vertex-Centric Computation Machine 1 Machine 2 Machine 3 Communication Barrier Each machine performs vertex-centric computation on its graph partition Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  • 59. Synchronous Vertex-Centric Computation Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 Communication Barrier Each machine performs vertex-centric computation on its graph partition Communication Barrier Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  • 60. Synchronous Vertex-Centric Computation Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 Communication Barrier Each machine performs vertex-centric computation on its graph partition Communication Barrier Superstep 1 Superstep 2 Superstep 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
  • 61. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  • 62. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  • 63. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  • 64. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  • 65. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  • 66. Asynchronous Vertex-Centric Computation No communication barriers. Uses the most recent vertex values. Implemented via distributed locking Machine 1 Machine 2 Machine 3 Machine 1 Machine 2 Machine 3 v0 v1 v2 v3 v4 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
  • 67. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 68. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 64 machines TW UK Giraph (byte array) 5.8GB 7.0GB GraphLab (sync) 4.5GB 14GB TW 16 machines 128 machines Giraph (byte array) 8.5GB 5.8GB GraphLab (sync) 11GB 3.3GB © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 69. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 70. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 71. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. No Mutations Time Memory Byte array Hash map With Mutations (DMST) Time Memory Byte array Hash map © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 72. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. 4 Message processing optimizations are very important. © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 73. Summary of an Experiment [Han et al., 2014] A large study comparing Giraph, GraphLab, GPS, Mizan. 1 Giraph scales better across graphs; GraphLab scales better across more machines. 2 Distributed locking for asynchronous execution is not scalable – Performance degrades as more machines are used due to lock contention, termination scheme, lack of message batching 3 Graph storage should be memory and mutation efficient. 4 Message processing optimizations are very important. 5 Workloads have different resource demands Algorithm CPU Memory Network PageRank Medium Medium High SSSP Low Low Low WCC Low Medium Medium DMST High High Medium © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
  • 74. Block-Centric Computation Blogel [Yan et al., 2014]: “Think like a block”; also “think like a graph” [Tian et al., 2013] Vertex-centric assumes all vertices communicate over the network; this is not efficient Read-world graphs have skewed vertex degree distribution Common in power-law graphs Problem: imbalanced communication workloads Real-world graphs have large diameters Common in road networks, web graphs, terrain meshes Problem: one superstep per hop ⇒ too many supersteps Real-world graphs have high average vertex degree Common in social networks, mobile communication networks Problem: heavy average communication workloads © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 23 / 96
  • 75. Blogel Principles Exploit the partitioning of the graph Message exchanges only among blocks Block: a connected subgraph of the graph Within a block, run a serial in-memory algorithm; no need to follow a BSP model © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 24 / 96
  • 76. Benefits of Block-Centric Computation High-degree vertices inside a block send no messages Fewer number of supersteps Fewer number of blocks than vertices © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 25 / 96
  • 77. Example: Weakly Connected Component Algorithm exchanges vertex id’s with neighbours id(vi ) ← min{vi , vj , . . . , vk} where vj , . . . , vk are neighbours of vi Vertex-centric requires every vertex sends to its neighbours until every vertex is reached Block-centric needs two iterations: 1 All vertices in partition A exchange ids; X and Y send ids to neighbours in partition B 2 All vertices in partition B exchange ids A B 0 X Y © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 26 / 96
  • 78. Block Construction The partitioning algorithm needs to maximize number of vertices that have all their edges in the same partition Hash partitioning is not suitable because many vertices will probably have at least one cut-edge URL partitioner For web graphs: based on domain names of web page nodes 2D partitioner For spatial networks: based on coordinates of node Graph Voronoi diagram partitioner For general graphs © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 27 / 96
  • 79. MapReduce Basics [Li et al., 2014] For data analysis of very large data sets Highly dynamic, irregular, schemaless, etc. SQL too heavy “Embarrassingly parallel problems” New, simple parallel programming model Data structured as (key, value) pairs E.g. (doc-id, content), (word, count), etc. Functional programming style with two functions to be given: Map(k1,v1) → list(k2,v2) Reduce(k2, list (v2)) → list(v3) Implemented on a distributed file system (e.g., Google File System) on very large clusters © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 28 / 96
  • 80. MapReduce Processing ... Inputdataset Map Map Map Map (k1, v) (k2, v) (k2, v) (k2, v) (k1, v) (k1, v) (k2, v) Group by k Group by k (k1, (v, v, v)) (k1, (v, v, v, v)) Reduce Reduce Outputdataset © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 29 / 96
  • 81. MapReduce Architecture Scheduler Master Input Module Map Module Combine Module Partition Module Map Process Worker Input Module Map Module Combine Module Partition Module Map Process Worker Input Module Map Module Combine Module Partition Module Map Process Worker Group Module Reduce Module Output Module Reduce Process Worker Group Module Reduce Module Output Module Reduce Process Worker © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 30 / 96
  • 82. Execution Flow with Architecture [Dean and Ghemawat, 2008] MapReduce: Simplified Data Processing on Large Clusters split 0 split 1 split 2 split 3 split 4 (1) fork (3) read (4) local write (1) fork (1) fork (6) write worker worker worker Master User Program output file 0 output file 1 worker worker (2) assign map (2) assign reduce (5) remote (5) read Input files Map phasr Intermediate files (on local disks) Reduce phase Output files © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 31 / 96
  • 83. Hadoop Most popular MapReduce implementation – developed by Yahoo! Two components Processing engine HDFS: Hadoop Distributed Storage System – others possible Can be deployed on the same machine or on different machines Processes Job tracker: hosted on the master node and implements the schedule Task tracker: hosted on the worker nodes and accepts tasks from job tracker and executes them HDFS Name node: stores how data are partitioned, monitors the status of data nodes, and data dictionary Data node: Stores and manages data chunks assigned to it Task Tracker Job Tracker Task Tracker Data Node Name Node Data Node Worker 1 Name Node Worker n MapReduce HDFS © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 32 / 96
  • 84. HaLoop [Bu et al., 2012] Overcome MapReduce shortcomings for iterative jobs Having to save data in HDFS in between each iteration Checking the fixpoint requires a new job at each iteration Scheduler change: assign to the same machine the map reduce tasks that occur in different iterations but access the same data Cache invariant data Cache reduce output to easily check for fixpoint © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 33 / 96
  • 85. Spark System MapReduce does not perform well in iterative computations Workflow model is acyclic Have to write to HDFS after each iteration and have to read from HDFS at the beginning of next iteration © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
  • 86. Spark System MapReduce does not perform well in iterative computations Workflow model is acyclic Have to write to HDFS after each iteration and have to read from HDFS at the beginning of next iteration Spark objectives Better support for iterative programs Provide a complete ecosystem Similar abstraction (to MapReduce) for programming Maintain MapReduce fault-tolerance and scalability © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
  • 87. Spark System MapReduce does not perform well in iterative computations Workflow model is acyclic Have to write to HDFS after each iteration and have to read from HDFS at the beginning of next iteration Spark objectives Better support for iterative programs Provide a complete ecosystem Similar abstraction (to MapReduce) for programming Maintain MapReduce fault-tolerance and scalability Fundamental concepts RDD: Reliable Distributed Datasets Caching of working set Maintaining lineage for fault-tolerance © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
  • 88. Spark Ecosystem [Michiardi, 2015] Native Spark Apps Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph processing) Apache Spark © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 35 / 96
  • 89. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  • 90. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS Each transform generates a new RDD that may also be cached or processed © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  • 91. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS Each transform generates a new RDD that may also be cached or processed Created from HDFS or parallelized arrays; Partitioned across worker machines; May be made persistent lazily; © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  • 92. Spark Programming Model [Zaharia et al., 2010, 2012] HDFS Create RDD · · · RDD Cache? Cache Yes Transform RDD? No Process No Transform Yes HDFS Each transform generates a new RDD that may also be cached or processed Created from HDFS or parallelized arrays; Partitioned across worker machines; May be made persistent lazily; Processing done on one of the RDDs; Done in parallel across workers; First processing on a RDD is from disk; Subsequent processing of the same RDD from cache © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
  • 93. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 94. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) CreateRDD Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 95. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) Transform RDD Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 96. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) Another transform Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 97. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() Cache results Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 98. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Action Driver WorkerWorkerWorker Block 1 Block 2 Block 3 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 99. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Driver WorkerWorkerWorker Block 1 Block 2 Block 3 Tasks © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 100. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Driver WorkerWorkerWorker Block 1 Block 2 Block 3 TasksResults © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 101. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count Driver WorkerWorkerWorker Block 1 Block 2 Block 3 TasksResults Cache Cache Cache © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 102. Example – Log Mining [Zaharia et al., 2010, 2012] Load log messages from a file system, create a new file by filtering the error messages, read this file into memory, then interactively search for various patterns lines = spark.textFile(hdfs://...) errors = lines.filter( .startsWith(ERROR)) messages = errors.map( .split(‘t ’)(2)) cachedMsgs = messages.cache() cachedMsgs.filter( .contains(foo)).count cachedMsgs.filter( .contains(bar)).count Another Action accesses cache Driver WorkerWorkerWorker Block 1 Block 2 Block 3 TasksResults Cache Cache Cache © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
  • 103. RDD and Processing HDFS lines = spark.textFile(hdfs://...) lines Error, msg1 Warn, msg2 Error, msg1 Info, msg8 Warn, msg2 Info, msg8 Error, msg3 Info, msg5 Info, msg5 Error, msg4 Warn, msg9 Error, msg1 errors errors = lines.filter( .startsWith(ERROR)) Error, msg1 Error, msg1 Error, msg3 Error, msg4 Error, msg1 messages messages = errors.map .split(‘t ’)(2) msg1 msg1 msg3 msg4 msg1 Thesearenotyetgenerated © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
  • 104. RDD and Processing lines errors messages msg1 msg1 msg3 msg4 msg1 lines messages.filter( .contains(foo)).count errors messages msg1 msg1 msg3 msg4 msg1 NowtheRDDsarematerialized; Commandnotyetexecuted Driver messages.filter( .contains(foo)).count © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
  • 105. GraphX [Gonzalez et al., 2014] Built on top of Spark Objective is to combine data analytics with graph processing Unify computation on tables and graphs Carefully convert graph to tabular representation Native GraphX API or can accommodate vertex-centric computation Native Spark Apps Spark SQL Spark Streaming MLlib (machine learning) GraphX (graph processing) Apache Spark Vertex- centric API AppApp App App © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 39 / 96
  • 106. GraphX: Representation of Graphs as Tables A B C D E F G H I J © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  • 107. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 A B C D E F G H I J Edge-disjoint partitioning © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  • 108. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  • 109. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. Edge Table (RDD) e-prop:edge prop. A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  • 110. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. Edge Table (RDD) e-prop:edge prop. A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F Joining vertices and edges Move vertices to edges © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  • 111. GraphX: Representation of Graphs as Tables Partition 1 Partition 2 Machine1Machine2 Vertex Table (RDD) v-prop:vertex prop. Edge Table (RDD) e-prop:edge prop. Routing Table (RDD) A B C D E F G H I J Edge-disjoint partitioning A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F A 1 2 B 1 ... I 1 F 1 2 D 2 E 2 J 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
  • 112. GraphX: Computation Model Machine1Machine2 Vertex Table Edge Table A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
  • 113. GraphX: Computation Model Machine1Machine2 Vertex Table Edge Table A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F First Phase: Join Vertex table Edge table Triples View A v-prop e-prop B v-prop A v-prop e-prop C v-prop C v-prop e-prop G v-prop ... E v-prop e-prop G v-prop J v-prop e-prop G v-prop © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
  • 114. GraphX: Computation Model Machine1Machine2 Vertex Table Edge Table A v-prop B v-prop ... I v-prop D v-prop E v-prop F v-prop J v-prop A e-prop B A e-prop C ... F e-prop G A e-prop D A e-prop E ... E e-prop F Triples View A v-prop e-prop B v-prop A v-prop e-prop C v-prop C v-prop e-prop G v-prop ... E v-prop e-prop G v-prop J v-prop e-prop G v-prop Second Phase: Compute neighbourhood Group-by aggregate © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
  • 115. GraphX: Operators Table transform operators – inherited from Spark map(func) Return a new RDD formed by passing each element of the source through a function func filter(func) Return a new RDD formed by selecting those elements of the source on which func returns true flatMap(func) Similar to map, but each input item can be mapped to 0 or more output items mapPartitions(func) Similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator sample(repl, fraction, seed) Sample a fraction fraction of the data, with or without replacement (set repl accordingly), using a given random number generator seed union(otherDataset) intersection() Return a new RDD containing the union/intersection of the elements in the source RDD and the argument groupByKey() Operates on a RDD of (K, V) pairs, returns a RDD of (K, IterableV) pairs reduceByKey(func, . . .) Operates on a RDD of (K, V) pairs, returns a RDD of (K, V) pairs where the values for each key are aggregated using the given reduce function func © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
  • 116. GraphX: Operators Table transform operators – inherited from Spark Graph operators Graph(vertex coll, edge coll) Logically binds together a pair of vertex and edge property collections into a property graph; verifies that each vertex occurs only once and edges connect existing vertices triplets(vertex coll, vertex coll, edge coll) Returns the triplets view of the graph mrTriplets(map,reduce) MapReduce triplets - encodes the two-stage process of join to create triplets and group by © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
  • 117. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 43 / 96
  • 118. RDF Introduction Everything is an uniquely named resource http://data.linkedmdb.org/resource/actor/JN29704 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  • 119. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  • 120. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names Properties of resources can be defined xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName “Jack Nicholson” y:JN29704:BornOnDate “1937-04-22” © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  • 121. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names Properties of resources can be defined Relationships with other resources can be defined xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName “Jack Nicholson” y:JN29704:BornOnDate “1937-04-22” y:TS2014:title “The Shining” y:TS2014:releaseDate “1980-05-23” y:TS2014 JN29704:movieActor © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  • 122. RDF Introduction Everything is an uniquely named resource Prefixes can be used to shorten the names Properties of resources can be defined Relationships with other resources can be defined Resource descriptions can be contributed by different people/groups and can be located anywhere in the web Integrated web “database” xmlns:y=http://data.linkedmdb.org/resource/actor/ y:JN29704 y:JN29704:hasName “Jack Nicholson” y:JN29704:BornOnDate “1937-04-22” y:TS2014:title “The Shining” y:TS2014:releaseDate “1980-05-23” y:TS2014 JN29704:movieActor © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
  • 123. RDF Data Model Triple: Subject, Predicate (Property), Object (s, p, o) Subject: the entity that is described (URI or blank node) Predicate: a feature of the entity (URI) Object: value of the feature (URI, blank node or literal) (s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L) Set of RDF triples is called an RDF graph U Subject Object U B U B L U: set of URIs B: set of blank nodes L: set of literals Predicate Subject Predicate Object http://...imdb.../film/2014 rdfs:label “The Shining” http://...imdb.../film/2014 movie:releaseDate “1980-05-23” http://...imdb.../29704 movie:actor name “Jack Nicholson” . . . . . . . . . © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 45 / 96
  • 124. RDF Example Instance Prefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/ bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/ lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/ Subject Predicate Object mdb: film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23”’ mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn URI Literal URI © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 46 / 96
  • 125. RDF Graph mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 47 / 96
  • 126. RDF Query Model – SPARQL Query Model - SPARQL Protocol and RDF Query Language Given U (set of URIs), L (set of literals), and V (set of variables), a SPARQL expression is defined recursively: an atomic triple pattern, which is an element of (U ∪ V ) × (U ∪ V ) × (U ∪ V ∪ L) ?x rdfs:label “The Shining” P FILTER R, where P is a graph pattern expression and R is a built-in SPARQL condition (i.e., analogous to a SQL predicate) ?x rev:rating ?p FILTER(?p 3.0) P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern expressions Example: SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 48 / 96
  • 127. SPARQL Queries SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 49 / 96
  • 128. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 50 / 96
  • 129. Na¨ıve Triple Store Design SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
  • 130. Na¨ıve Triple Store Design SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn SELECT T1 . o b j e c t FROM T as T1 , T as T2 , T as T3 , T as T4 , T as T5 WHERE T1 . p=” r d f s : l a b e l ” AND T2 . p=” movie : relatedBook ” AND T3 . p=” movie : d i r e c t o r ” AND T4 . p=” rev : r a t i n g ” AND T5 . p=” movie : d i r e c t o r n a m e ” AND T1 . s=T2 . s AND T1 . s=T3 . s AND T2 . o=T4 . s AND T3 . o=T5 . s AND T4 . o 4.0 AND T5 . o=” S t a n l e y Kubrick ” © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
  • 131. Na¨ıve Triple Store Design SELECT ?name WHERE { ?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d . ?d movie : director name ” Stanley Kubrick ” . ?m movie : relatedBook ?b . ?b rev : r a t i n g ? r . FILTER(? r 4.0) } Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2014 movie:actor mdb:actor/29704 mdb:film/2014 movie:actor mdb: actor/30013 mdb:film/2014 movie:music contributor mdb: music contributor/4110 mdb:film/2014 foaf:based near geo:2635167 mdb:film/2014 movie:relatedBook bm:0743424425 mdb:film/2014 movie:language lexvo:iso639-3/eng mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:film/424 movie:director mdb:director/8476 mdb:film/424 rdfs:label “Spartacus” mdb:actor/29704 movie:actor name “Jack Nicholson” mdb:film/1267 movie:actor mdb:actor/29704 mdb:film/1267 rdfs:label “The Last Tycoon” mdb:film/3418 movie:actor mdb:actor/29704 mdb:film/3418 rdfs:label “The Passenger” geo:2635167 gn:name “United Kingdom” geo:2635167 gn:population 62348447 geo:2635167 gn:wikipediaArticle wp:United Kingdom bm:books/0743424425 dc:creator bm:persons/Stephen+King bm:books/0743424425 rev:rating 4.7 bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer lexvo:iso639-3/eng rdfs:label “English” lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn SELECT T1 . o b j e c t FROM T as T1 , T as T2 , T as T3 , T as T4 , T as T5 WHERE T1 . p=” r d f s : l a b e l ” AND T2 . p=” movie : relatedBook ” AND T3 . p=” movie : d i r e c t o r ” AND T4 . p=” rev : r a t i n g ” AND T5 . p=” movie : d i r e c t o r n a m e ” AND T1 . s=T2 . s AND T1 . s=T3 . s AND T2 . o=T4 . s AND T3 . o=T5 . s AND T4 . o 4.0 AND T5 . o=” S t a n l e y Kubrick ” Easy to implement but too many self-joins! © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
  • 132. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Original triple table Subject Property Object mdb: film/2014 rdfs:label “The Shining” mdb:film/2014 movie:initial release date “1980-05-23” mdb:director/8476 movie:director name “Stanley Kubrick” mdb:film/2685 movie:director mdb:director/8476 Encoded triple table Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 Mapping table ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
  • 133. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Triples are indexed in a clustered B+ tree in lexicographic order Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... B+ tree Easy querying through mapping table © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
  • 134. Exhaustive Indexing RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss et al., 2008] Strings are mapped to ids using a mapping table Triples are indexed in a clustered B+ tree in lexicographic order Create indexes for permutations of the three columns: SPO, SOP, PSO, POS, OPS, OSP Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... B+ tree Easy querying through mapping table © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
  • 135. Exhaustive Indexing–Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
  • 136. Exhaustive Indexing–Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director Advantages Eliminates some of the joins – they become range queries Merge join is easy and fast © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
  • 137. Exhaustive Indexing–Query Execution Each triple pattern can be answered by a range query Joins between triple patterns computed using merge join Join order is easy due to extensive indexing Subject Property Object 0 1 2 0 3 4 5 6 7 8 9 5 ... ... ... ID Value 0 mdb: film/2014 1 rdfs:label 2 “The Shining” 3 movie:initial release date 4 “1980-05-23” 5 mdb:director/8476 6 movie:director name 7 “Stanley Kubrick” 8 mdb:film/2685 9 movie:director Advantages Eliminates some of the joins – they become range queries Merge join is easy and fast Disadvantages Space usage © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
  • 138. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject refs:label movie:director mob:film/2014 “The Shining” mob:director/8476 mob:film/2685 “The Clockwork Orange” mob:director/8476 Subject movie:actor name mdb:actor “Jack Nicholson” © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
  • 139. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject refs:label movie:director mob:film/2014 “The Shining” mob:director/8476 mob:film/2685 “The Clockwork Orange” mob:director/8476 Subject movie:actor name mdb:actor “Jack Nicholson” Advantages Fewer joins If the data is structured, we have a relational system – similar to normalized relations © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
  • 140. Property Tables Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea et al., 2013] Clustered property table: group together the properties that tend to occur in the same (or similar) subjects Property-class table: cluster the subjects with the same type of property into one property table Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject refs:label movie:director mob:film/2014 “The Shining” mob:director/8476 mob:film/2685 “The Clockwork Orange” mob:director/8476 Subject movie:actor name mdb:actor “Jack Nicholson” Advantages Fewer joins If the data is structured, we have a relational system – similar to normalized relations Disadvantages Potentially a lot of NULLs Clustering is not trivial Multi-valued properties are complicated © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
  • 141. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject Object mdb:film/2014 mdb:director/8476 mdb:film/2685 mdb:director/8476 movie:director Subject Object mob:film/2014 “The Shining” mob:film/2685 “The Clockwork Orange” refs:label Subject Object mdb:actor/29704 “Jack Nicholson” movie:actor name © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
  • 142. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject Object mdb:film/2014 mdb:director/8476 mdb:film/2685 mdb:director/8476 movie:director Subject Object mob:film/2014 “The Shining” mob:film/2685 “The Clockwork Orange” refs:label Subject Object mdb:actor/29704 “Jack Nicholson” movie:actor name Advantages Supports multi-valued properties No NULLs No clustering Read only needed attributes (i.e. less I/O) Good performance for subject-subject joins © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
  • 143. Binary Tables Grouping by properties: For each property, build a two-column table, containing both subject and object, ordered by subjects [Abadi et al., 2007, 2009] Also called vertical partitioned tables n two column tables (n is the number of unique properties in the data) Subject Property Object mdb:film/2014 rdfs:label “The Shining” mdb:film/2014 movie:director mdb:director/8476 mdb:film/2685 movie:director mdb:director/8476 mdb:film/2685 rdfs:label “A Clockwork Orange” mdb:actor/29704 movie:actor name “Jack Nicholson” . . . . . . . . . Subject Object mdb:film/2014 mdb:director/8476 mdb:film/2685 mdb:director/8476 movie:director Subject Object mob:film/2014 “The Shining” mob:film/2685 “The Clockwork Orange” refs:label Subject Object mdb:actor/29704 “Jack Nicholson” movie:actor name Advantages Supports multi-valued properties No NULLs No clustering Read only needed attributes (i.e. less I/O) Good performance for subject-subject joins Disadvantages Not useful for subject-object joins Expensive inserts © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
  • 144. Graph-based Approach Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013] ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Subgraph M atching © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
  • 145. Graph-based Approach Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013] ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Subgraph M atching Advantages Maintains the graph structure Full set of queries can be handled © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
  • 146. Graph-based Approach Answering SPARQL query ≡ subgraph matching using homomorphism gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013] ?m ?d movie:director ?name rdfs:label ?b movie:relatedBook “Stanley Kubrick” movie:director name ?r rev:rating FILTER(?r 4.0) mdb:film/2014 “1980-05-23” movie:initial release date “The Shining” refs:label bm:books/0743424425 4.7 rev:rating bm:offers/0743424425amazonOffer geo:2635167 “United Kingdom” gn:name 62348447 gn:population mdb:actor/29704 “Jack Nicholson” movie:actor name mdb:film/3418 “The Passenger” refs:label mdb:film/1267 “The Last Tycoon” refs:label mdb:director/8476 “Stanley Kubrick” movie:director name mdb:film/2685 “A Clockwork Orange” refs:label mdb:film/424 “Spartacus” refs:label mdb:actor/30013 movie:relatedBook scam:hasOffer foaf:based near movie:actor movie:director movie:actor movie:actor movie:actor movie:director movie:director Subgraph M atching Advantages Maintains the graph structure Full set of queries can be handled Disadvantages Graph pattern matching is expensive © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
  • 147. Two Systems gStoreSystem Architecture Offline Online Storage Input Input RDF Parser RDF Graph Builder Encoding Module VS*-tree builder RDF data RDF Triples RDF Graph Signature Graph Key-Value Store VS*-tree Store SPARQL Parser SPARQL Query Encoding Module VS*-tree Query Graph Filter Module Join Module Signature Graph Node Candidate Results Fig. 4. System Architecture itstrings, denoted as vS ig(u). We encode query Q with the ame encoding method. Consequently, the match between Q nd G can be verified by simply checking the match between orresponding encoded bitstrings. chameleon-db Structural Index ... Vertex Index Spill Index ClusterIndexStorageSystem StorageAdvisor Query Engine Plan Generation Evaluation © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 57 / 96
  • 148. gStore General Approach: Work directly on the RDF graph and the SPARQL query graph Use a signature-based encoding of each entity and class vertex to speed up matching Filter-and-evaluate Use a false positive algorithm to prune nodes and obtain a set of candidates; then do more detailed evaluation on those Use an index (VS∗-tree) over the data signature graph (has light maintenance load) for efficient pruning © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 58 / 96
  • 149. 1. Encode Q and G to Get Signature Graphs Query signature graph Q∗ 0100 0000 1000 0000 00010 0000 0100 10000 Data signature graph G∗ 0010 1000 0100 0001 00001 1000 0001 00010 0000 0100 10000 0000 1000 10000 0000 0010 10000 0000 1001 00100 0001 0001 01000 0100 1000 01000 1001 1000 01000 0001 0100 01000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 59 / 96
  • 150. 2. Filter-and-Evaluate Query signature graph Q∗ 0100 0000 1000 0000 00010 0000 0100 10000 Data signature graph G∗ 0010 1000 0100 0001 00001 1000 0001 00010 0000 0100 10000 0000 1000 10000 0000 0010 10000 0000 1001 00100 0001 0001 01000 0100 1000 01000 1001 1000 01000 0001 0100 01000 Find matches of Q∗ over signature graph G∗ Verify each match in RDF graph G © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 60 / 96
  • 151. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  • 152. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  • 153. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: Sequential scan of G∗ Both steps are inefficient © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  • 154. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: Sequential scan of G∗ Both steps are inefficient Use S-trees Height-balanced tree over signatures Run an inclusion query for each node of Q∗ and get lists of nodes in G∗ that include that node. • Given query signature q and a set of data signatures S, find all data signatures si ∈ S where qsi = q Does not support second step – expensive © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  • 155. How to Generate Candidate List Two step process: 1 For each node of Q∗ get lists of nodes in G∗ that include that node. 2 Do a multi-way join to get the candidate list Alternatives: Sequential scan of G∗ Both steps are inefficient Use S-trees Height-balanced tree over signatures Run an inclusion query for each node of Q∗ and get lists of nodes in G∗ that include that node. • Given query signature q and a set of data signatures S, find all data signatures si ∈ S where qsi = q Does not support second step – expensive VS-tree (and VS∗ -tree) Multi-resolution summary graph based on S-tree Supports both steps efficiently Grouping by vertices © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
  • 156. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 157. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 158. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 159. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 160. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 161. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 162. S-tree Solution 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 1000 00000100 0000 00010 0000 0100 10000 002 011 003 008 004 009 Possibly large join space! © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
  • 163. VS-tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 Super edge © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 63 / 96
  • 164. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  • 165. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  • 166. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  • 167. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 003 008 002 011 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  • 168. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 003 008 002 011 004 009 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  • 169. Pruning with VS-Tree 1111 1111 0110 1111 1101 1101 0000 1110 0110 1001 1100 1001 1001 1101 0000 1000 0000 0100 0000 0010 0010 1000 0100 0001 1000 0001 0000 1001 0100 1000 1001 1000 0001 0100 0001 0001 005 004 006 001 002 003 007 011 008 009 010 d1 1 d2 1 d2 2 d3 1 d3 2 d3 3 d3 4 G3 G2 G1 11101 10010 10001 01100 10000 00001 01100 00010 10000 01000 01000 10000 10000 10000 10000 00010 00100 01000 01000 01000 01000 1000 00000100 0000 00010 0000 0100 10000 d3 2 d3 3 d3 3 d3 4 d3 1 d3 4 G3 00010 10000 01000 003 008 002 011 004 009 Reduced join space! © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
  • 170. Adaptivity to Workload Applications that rely on RDF data are increasingly popular and are more varied [Verborgh et al., 2014] Data that are being handled are far more heterogeneous [Duan et al., 2011] SPARQL queries are becoming more diverse [Arias et al., 2011] and dynamic [Kirchberg et al., 2011] An experiment [Alu¸c et al., 2014a] No single system is a sole winner across all queries No single system is the sole loser across all queries, either There can be 2–5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query The winner in one query may timeout in another Performance difference widens as dataset size increases © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
  • 171. Adaptivity to Workload Applications that rely on RDF data are increasingly popular and are more varied [Verborgh et al., 2014] Data that are being handled are far more heterogeneous [Duan et al., 2011] SPARQL queries are becoming more diverse [Arias et al., 2011] and dynamic [Kirchberg et al., 2011] An experiment [Alu¸c et al., 2014a] No single system is a sole winner across all queries No single system is the sole loser across all queries, either There can be 2–5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query The winner in one query may timeout in another Performance difference widens as dataset size increases Can existing systems cope with these trends – workload diversity dynamism No! [Alu¸c et al., 2014b] Fragmented data Suboptimal pruning by indexes Unnecessarily large sets of intermediate result tuples © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
  • 172. Adaptivity to Workload Applications that rely on RDF data are increasingly popular and are more varied [Verborgh et al., 2014] Data that are being handled are far more heterogeneous [Duan et al., 2011] SPARQL queries are becoming more diverse [Arias et al., 2011] and dynamic [Kirchberg et al., 2011] An experiment [Alu¸c et al., 2014a] No single system is a sole winner across all queries No single system is the sole loser across all queries, either There can be 2–5 orders of magnitude difference in the performance (i.e., query execution time) between the best and the worst system for a given query The winner in one query may timeout in another Performance difference widens as dataset size increases Our proposal: Idea behind chameleon-db When designing and implementing an RDF data management system, assume nothing about the workload upfront Organize data dynamically and purely based on the workload © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
  • 173. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 Characteristics: Records are not necessarily of fixed length Records are not grouped into tables Records do not necessarily share the same set of RDF predicates Each record represents a very tiny part of the RDF graph © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 174. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?x ?y A ?y ?z ?b Figure: Query 1 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 175. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?x ?y A ?y ?z ?b Figure: Query 1 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 176. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?x ?y A ?y ?z ?b Figure: Query 1 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 177. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 178. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 179. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 180. Group-by-Query Approach v1 v21 A v21 v20 B v1 v98 A v98 v30 C v1 v250 A v250 v40 D v0 v32 A v0 v52 C v0 v80 C v0 v66C v0 v47 B v6 v7 A v6 v8 B v6 v9 C C1 C2 C3 C4 C5 ?a ?z C ?a ?x A ?a ?y B Figure: Query 2 Advantages Data are physically clustered for the workload Better pruning by the indexes Fewer intermediate result tuples © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 66 / 96
  • 181. Challenges Physical Data Layout: As the workloads change, the way data are grouped together may no longer be suitable Hierarchical Clustering Algorithm [Alu¸c et al., 2015] Tunable-LSH [Alu¸c et al., 2015] Indexing: Indexing upfront is not a choice Query Evaluation: Can we execute queries efficiently even when the physical layout is constantly changing? [Alu¸c et al., 2015] © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 67 / 96
  • 182. chameleon-db Prototype system [Alu¸c et al., 2013] 35,000 lines of code in C++ under Linux (plus code for SPARQL 1.0 parser) Structural Index ... Vertex Index Spill Index ClusterIndexStorageSystem StorageAdvisor Query Engine Plan Generation Evaluation © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 68 / 96
  • 183. Outline 1 Introduction – Graph Types 2 Property Graph Processing Classification Online querying Offline analytics 3 RDF Graph Querying Data Warehousing Distributed SPARQL Execution Linked Object Data Querying © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 69 / 96
  • 184. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries – SPARQL endpoints Not all data sites can process queries © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
  • 185. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries – SPARQL endpoints Not all data sites can process queries Alternatives © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
  • 186. Remember the Environment Distributed environment Some of the data sites can process SPARQL queries – SPARQL endpoints Not all data sites can process queries Alternatives Data re-distribution + query decomposition © M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96