Contenu connexe
Plus de M. Tamer Özsu (6)
Graph lecture
- 1. An Overview of Graph Data Management and Analysis
M. Tamer ¨Ozsu
University of Waterloo
David R. Cheriton School of Computer Science
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 1 / 96
- 2. Graph Data are Very Common
Internet
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
- 3. Graph Data are Very Common
Social
networks
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
- 4. Graph Data are Very Common
Trade volumes
and
connections
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
- 5. Graph Data are Very Common
Biological
networks
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
- 6. Graph Data are Very Common
As of September 2011
Music
Brainz
(zitgist)
P20
Turismo
de
Zaragoza
yovisto
Yahoo!
Geo
Planet
YAGO
World
Fact-
book
El
Viajero
Tourism
WordNet
(W3C)
WordNet
(VUA)
VIVO UF
VIVO
Indiana
VIVO
Cornell
VIAF
URI
Burner
Sussex
Reading
Lists
Plymouth
Reading
Lists
UniRef
UniProt
UMBEL
UK Post-
codes
legislation
data.gov.uk
Uberblic
UB
Mann-
heim
TWC LOGD
Twarql
transport
data.gov.
uk
Traffic
Scotland
theses.
fr
Thesau-
rus W
totl.net
Tele-
graphis
TCM
Gene
DIT
Taxon
Concept
Open
Library
(Talis)
tags2con
delicious
t4gm
info
Swedish
Open
Cultural
Heritage
Surge
Radio
Sudoc
STW
RAMEAU
SH
statistics
data.gov.
uk
St.
Andrews
Resource
Lists
ECS
South-
ampton
EPrints
SSW
Thesaur
us
Smart
Link
Slideshare
2RDF
semantic
web.org
Semantic
Tweet
Semantic
XBRL
SW
Dog
Food
Source Code
Ecosystem
Linked Data
US SEC
(rdfabout)
Sears
Scotland
Geo-
graphy
Scotland
Pupils &
Exams
Scholaro-
meter
WordNet
(RKB
Explorer)
Wiki
UN/
LOCODE
Ulm
ECS
(RKB
Explorer)
Roma
RISKS
RESEX
RAE2001
Pisa
OS
OAI
NSF
New-
castle
LAAS
KISTI
JISC
IRIT
IEEE
IBM
Eurécom
ERA
ePrints dotAC
DEPLOY
DBLP
(RKB
Explorer)
Crime
Reports
UK
Course-
ware
CORDIS
(RKB
Explorer)
CiteSeer
Budapest
ACM
riese
Revyu
research
data.gov.
ukRen.
Energy
Genera-
tors
reference
data.gov.
uk
Recht-
spraak.
nl
RDF
ohloh
Last.FM
(rdfize)
RDF
Book
Mashup
Rådata
nå!
PSH
Product
Types
Ontology
Product
DB
PBAC
Poké-
pédia
patents
data.go
v.uk
Ox
Points
Ord-
nance
Survey
Openly
Local
Open
Library
Open
Cyc
Open
Corpo-
rates
Open
Calais
OpenEI
Open
Election
Data
Project
Open
Data
Thesau-
rus
Ontos
News
Portal
OGOLOD
Janus
AMP
Ocean
Drilling
Codices
New
York
Times
NVD
ntnusc
NTU
Resource
Lists
Norwe-
gian
MeSH
NDL
subjects
ndlna
my
Experi-
ment
Italian
Museums
medu-
cator
MARC
Codes
List
Man-
chester
Reading
Lists
Lotico
Weather
Stations
London
Gazette
LOIUS
Linked
Open
Colors
lobid
Resources
lobid
Organi-
sations
LEM
Linked
MDB
LinkedL
CCN
Linked
GeoData
LinkedCT
Linked
User
Feedback
LOV
Linked
Open
Numbers
LODE
Eurostat
(Ontology
Central)
Linked
EDGAR
(Ontology
Central)
Linked
Crunch-
base
lingvoj
Lichfield
Spen-
ding
LIBRIS
Lexvo
LCSH
DBLP
(L3S)
Linked
Sensor Data
(Kno.e.sis)
Klapp-
stuhl-
club
Good-
win
Family
National
Radio-
activity
JP
Jamendo
(DBtune)
Italian
public
schools
ISTAT
Immi-
gration
iServe
IdRef
Sudoc
NSZL
Catalog
Hellenic
PD
Hellenic
FBD
Piedmont
Accomo-
dations
GovTrack
GovWILD
Google
Art
wrapper
gnoss
GESIS
GeoWord
Net
Geo
Species
Geo
Names
Geo
Linked
Data
GEMET
GTAA
STITCH
SIDER
Project
Guten-
berg
Medi
Care
Euro-
stat
(FUB)
EURES
Drug
Bank
Disea-
some
DBLP
(FU
Berlin)
Daily
Med
CORDIS
(FUB)
Freebase
flickr
wrappr
Fishes
of Texas
Finnish
Munici-
palities
ChEMBL
FanHubz
Event
Media
EUTC
Produc-
tions
Eurostat
Europeana
EUNIS
EU
Insti-
tutions
ESD
stan-
dards
EARTh
Enipedia
Popula-
tion (En-
AKTing)
NHS
(En-
AKTing) Mortality
(En-
AKTing)
Energy
(En-
AKTing)
Crime
(En-
AKTing)
CO2
Emission
(En-
AKTing)
EEA
SISVU
educatio
n.data.g
ov.uk
ECS
South-
ampton
ECCO-
TCP
GND
Didactal
ia
DDC Deutsche
Bio-
graphie
data
dcs
Music
Brainz
(DBTune)
Magna-
tune
John
Peel
(DBTune)
Classical
(DB
Tune)
Audio
Scrobbler
(DBTune)
Last.FM
artists
(DBTune)
DB
Tropes
Portu-
guese
DBpedia
dbpedia
lite
Greek
DBpedia
DBpedia
data-
open-
ac-uk
SMC
Journals
Pokedex
Airports
NASA
(Data
Incu-
bator)
Music
Brainz
(Data
Incubator)
Moseley
Folk
Metoffice
Weather
Forecasts
Discogs
(Data
Incubator)
Climbing
data.gov.uk
intervals
Data
Gov.ie
data
bnf.fr
Cornetto
reegle
Chronic-
ling
America
Chem2
Bio2RDF
Calames
business
data.gov.
uk
Bricklink
Brazilian
Poli-
ticians
BNB
UniSTS
UniPath
way
UniParc
Taxono
my
UniProt
(Bio2RDF)
SGD
Reactome
PubMed
Pub
Chem
PRO-
SITE
ProDom
Pfam
PDB
OMIM
MGI
KEGG
Reaction
KEGG
Pathway
KEGG
Glycan
KEGG
Enzyme
KEGG
Drug
KEGG
Com-
pound
InterPro
Homolo
Gene
HGNC
Gene
Ontology
GeneID
Affy-
metrix
bible
ontology
BibBase
FTS
BBC
Wildlife
Finder
BBC
Program
mes BBC
Music
Alpine
Ski
Austria
LOCAH
Amster-
dam
Museum
AGROV
OC
AEMET
US Census
(rdfabout)
Media
Geographic
Publications
Government
Cross-domain
Life sciences
User-generated content
Linked data
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 2 / 96
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch.
http://lod-cloud.net/
- 7. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 3 / 96
- 8. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 4 / 96
- 9. Graph Types
Property graph
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
- 10. Graph Types
RDF graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
- 11. Graph Types
Property graph
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Workload: Online queries and
analytic workloads
Query execution: Varies
RDF graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Workload: SPARQL queries
Query execution: subgraph
matching by homomorphism
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 5 / 96
- 12. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 6 / 96
- 13. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 7 / 96
- 14. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 15. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 16. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
Focus here is on the
how algorithms behave
as their input changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 17. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Focus here is on the
dynamism of the
graphs in whether or
not they change and
how they change.
Focus here is on the
how algorithms behave
as their input changes.
The types of workloads
that the approaches are
designed to handle.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 18. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 19. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 20. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 21. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
Dynamic
graphs with
high veloc-
ity changes –
not possible to
see the entire
graph at once.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 22. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Graphs do not
change or we
are not inter-
ested in their
changes – only
a snapshot is
considered.
Graphs change
and we are
interested in
their changes.
Dynamic
graphs with
high veloc-
ity changes –
not possible to
see the entire
graph at once.
Dynamic
graphs with un-
known changes
– requires re-
discovery of
the graph (e.g.,
LOD).
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 23. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 24. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Computation accesses a
portion of the graph
and the results are
computed for a subset
of vertices; e.g., point-
to-point shortest path,
subgraph matching,
reachability, SPARQL.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 25. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Computation accesses a
portion of the graph
and the results are
computed for a subset
of vertices; e.g., point-
to-point shortest path,
subgraph matching,
reachability, SPARQL.
Computation accesses
the entire graph and
may require multiple
iterations; e.g., PageR-
ank, clustering, graph
colouring, all pairs
shortest path.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 26. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 27. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 28. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 29. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 30. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 31. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
Sees the en-
tire input
in advance,
which may
change; an-
swers computed
as change oc-
curs.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 32. Classification [Ammar and ¨Ozsu, 2015]
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Sees the en-
tire input in
advance.
Sees the input
piece-meal as it
executes.
One-pass on-
line algorithm
with limited
memory.
Online algo-
rithm with
some info
about forth-
coming input.
Sees the en-
tire input
in advance,
which may
change; an-
swers computed
as change oc-
curs.
Similar to dy-
namic, but
computation
happens in
batches of
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 8 / 96
- 33. Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation over the graph
as it exists.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
- 34. Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation over the graph
as it is revealed.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
- 35. Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation on each snap-
shot from scratch.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
- 36. Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Continuously compute the query result/perform analytic computation as
the input changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
- 37. Example Design Points
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Compute the query result/perform analytic computation after a batch of
input changes.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 9 / 96
- 38. Example Design Points – Not all alternatives make sense
Graph Dynamism
Static
Graphs
Dynamic
Graphs
Streaming
Graphs
Evolving
Graphs
Algorithm Types
Offline Online
Streaming Incremental
Dynamic
Batch
Dynamic
Workload Types
Online
Queries
Analytics
Workloads
Dynamic (or batch-dynamic) algorithms do not make sense for static
graphs.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 10 / 96
- 39. Graph Processing Systems
System Memory/
Disk
Architecture
Computing
paradigm
Supported
Workloads
Hadoop Disk Parallel/Distributed MapReduce Analytical
Haloop Disk Parallel/Distributed MapReduce Analytical
Pegasus Disk Parallel/Distributed MapReduce Analytical
GraphX Disk Parallel/Distributed
MapReduce
(Spark)
Analytical
Pregel/Giraph Memory Parallel/Distributed Vertex-Centric Analytical
GraphLab Memory Parallel/Distributed Vertex-Centric Analytical
GraphChi Disk Single machine Vertex-Centric Analytical
Stream Disk Single machine Edge-Centric Analytical
Trinity Memory Parallel/Distributed
Flexible using K-V
store on DSM
Online &
Analytical
Titan Disk Parallel/Distributed
K-V store
(Cassandra)
Online
Neo4J Disk Single machine
Procedural/
Linked-list
Online
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 11 / 96
- 40. Graph Workloads
Online graph querying
Reachability
Single source shortest-path
Subgraph matching
SPARQL queries
Offline graph analytics
PageRank
Clustering
Strongly connected
components
Diameter finding
Graph colouring
All pairs shortest path
Graph pattern mining
Machine learning algorithms
(Belief propagation, Gaussian
non-negative matrix
factorization)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 12 / 96
- 41. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 13 / 96
- 42. Reachability Queries
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
- 43. Reachability Queries
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Can you reach film 1267 from film 2014?
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
- 44. Reachability Queries
film 2014
(initial release date, “1980-05-23”)
(label, “The Shining”)
books 0743424425
(rating, 4.7)
offers 0743424425amazonOffer
geo 2635167
(name, “United Kingdom”)
(population, 62348447) actor 29704
(actor name, “Jack Nicholson”)
film 3418
(label, “The Passenger”)
film 1267
(label, “The Last Tycoon”)
director 8476
(director name, “Stanley Kubrick”)
film 2685
(label, “A Clockwork Orange”)
film 424
(label, “Spartacus”)
actor 30013
(relatedBook)
(hasOffer)
(based near)
(actor)
(director) (actor)
(actor) (actor)
(director) (director)
Is there a book whose rating is > 4.0 associated with a film that was
directed by Stanley Kubrick?
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 14 / 96
- 46. Subgraph Matching
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r > 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Su
bgraphMatching
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 15 / 96
- 47. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 16 / 96
- 48. PageRank Computation
A web page is important if it is pointed to by other important
pages.
P1 P2
P3
P5P6
P4
r(Pi ) =
Pj ∈BPi
r(Pj )
|FPj
|
r(P2) =
r(P1)
2
+
r(P3)
3
rk+1(Pi ) =
Pj ∈BPi
rk(Pj )
|FPj
|
BPi
: in-neighbours of Pi
FPi
: out-neighbours of Pi
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
- 49. PageRank Computation
A web page is important if it is pointed to by other important
pages.
P1 P2
P3
P5P6
P4
rk+1(Pi ) =
Pj ∈BPi
rk(Pj )
|FPj
|
Iteration 0 Iteration 1 Iteration 2
Rank at
Iter. 2
r0(P1) = 1/6 r1(P1) = 1/18 r2(P1) = 1/36 5
r0(P2) = 1/6 r1(P2) = 5/36 r2(P2) = 1/18 4
r0(P3) = 1/6 r1(P3) = 1/12 r2(P3) = 1/36 5
r0(P4) = 1/6 r1(P4) = 1/4 r2(P4) = 17/72 1
r0(P5) = 1/6 r1(P5) = 5/36 r2(P5) = 11/72 3
r0(P6) = 1/6 r1(P6) = 1/6 r2(P6) = 14/72 2
Iterative processing.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 17 / 96
- 50. Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
- 51. Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
Block-centric
Similar to vertex-centric but on blocks for communication
Connected subgraph of the graph
Blogel [Yan et al., 2014]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
- 52. Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
Block-centric
Similar to vertex-centric but on blocks for communication
Connected subgraph of the graph
Blogel [Yan et al., 2014]
MapReduce
Need to save in HDFS intermediate results of each iteration – both
good and bad
Hadoop, Haloop [Bu et al., 2012]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
- 53. Some Alternative Computational Models for Offline
Analytics
Vertex-centric (Scatter-Gather)
Specify (a) computation at each vertex, and (b) communication with
neighbour vertices
Synchronous – Pregel [Malewicz et al., 2010], Giraph
Asynchronous – GraphLab [Low et al., 2012]
Block-centric
Similar to vertex-centric but on blocks for communication
Connected subgraph of the graph
Blogel [Yan et al., 2014]
MapReduce
Need to save in HDFS intermediate results of each iteration – both
good and bad
Hadoop, Haloop [Bu et al., 2012]
Modified MapReduce
Based on Spark [Zaharia et al., 2010; Zaharia, 2016]
Keep intermediate states in memory
Provide fault-tolerance by keeping lineage
GraphX [Gonzalez et al., 2014]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 18 / 96
- 54. Vertex-Centric Computation
“Think like a vertex”
vertex_scatter(vertex v)
Push local computation to
neighbours on the out-bound
edges
vertex_gather(vertex v)
Gather local computation from
neighbours on the in-bound edges
Continue until all vertices are
inactive
?
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
- 55. Vertex-Centric Computation
“Think like a vertex”
vertex_scatter(vertex v)
Push local computation to
neighbours on the out-bound
edges
vertex_gather(vertex v)
Gather local computation from
neighbours on the in-bound edges
Continue until all vertices are
inactive
Vertex state machine
?
Active Inactive
Vote halt
Message received
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 19 / 96
- 58. Synchronous Vertex-Centric Computation
Machine 1
Machine 2
Machine 3
Communication
Barrier
Each machine performs
vertex-centric computation
on its graph partition
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
- 59. Synchronous Vertex-Centric Computation
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Communication
Barrier
Each machine performs
vertex-centric computation
on its graph partition
Communication
Barrier
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
- 60. Synchronous Vertex-Centric Computation
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
Communication
Barrier
Each machine performs
vertex-centric computation
on its graph partition
Communication
Barrier
Superstep 1 Superstep 2 Superstep 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 20 / 96
- 61. Asynchronous Vertex-Centric Computation
No communication barriers.
Uses the most recent vertex values.
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
- 62. Asynchronous Vertex-Centric Computation
No communication barriers.
Uses the most recent vertex values.
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
- 63. Asynchronous Vertex-Centric Computation
No communication barriers.
Uses the most recent vertex values.
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
- 64. Asynchronous Vertex-Centric Computation
No communication barriers.
Uses the most recent vertex values.
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
- 65. Asynchronous Vertex-Centric Computation
No communication barriers.
Uses the most recent vertex values.
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
- 66. Asynchronous Vertex-Centric Computation
No communication barriers.
Uses the most recent vertex values.
Implemented via distributed locking
Machine 1
Machine 2
Machine 3
Machine 1
Machine 2
Machine 3
v0
v1 v2
v3 v4
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 21 / 96
- 67. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 68. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
64 machines TW UK
Giraph (byte array) 5.8GB 7.0GB
GraphLab (sync) 4.5GB 14GB
TW 16 machines 128 machines
Giraph (byte array) 8.5GB 5.8GB
GraphLab (sync) 11GB 3.3GB
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 69. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 70. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 71. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
No Mutations
Time Memory
Byte array
Hash map
With Mutations (DMST)
Time Memory
Byte array
Hash map
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 72. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 73. Summary of an Experiment [Han et al., 2014]
A large study comparing Giraph, GraphLab, GPS, Mizan.
1 Giraph scales better across graphs;
GraphLab scales better across more machines.
2 Distributed locking for asynchronous execution is not scalable –
Performance degrades as more machines are used due to lock
contention, termination scheme, lack of message batching
3 Graph storage should be memory and mutation efficient.
4 Message processing optimizations are very important.
5 Workloads have different resource demands
Algorithm CPU Memory Network
PageRank Medium Medium High
SSSP Low Low Low
WCC Low Medium Medium
DMST High High Medium
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 22 / 96
- 74. Block-Centric Computation
Blogel [Yan et al., 2014]: “Think like a block”; also “think like a
graph” [Tian et al., 2013]
Vertex-centric assumes all vertices communicate over the network;
this is not efficient
Read-world graphs have skewed vertex degree distribution
Common in power-law graphs
Problem: imbalanced communication workloads
Real-world graphs have large diameters
Common in road networks, web graphs, terrain meshes
Problem: one superstep per hop ⇒ too many supersteps
Real-world graphs have high average vertex degree
Common in social networks, mobile communication networks
Problem: heavy average communication workloads
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 23 / 96
- 75. Blogel Principles
Exploit the partitioning of the graph
Message exchanges only among blocks
Block: a connected subgraph of the graph
Within a block, run a serial in-memory algorithm; no need to follow a
BSP model
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 24 / 96
- 76. Benefits of Block-Centric Computation
High-degree vertices inside a block send no messages
Fewer number of supersteps
Fewer number of blocks than vertices
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 25 / 96
- 77. Example: Weakly Connected Component
Algorithm exchanges vertex id’s
with neighbours
id(vi ) ← min{vi , vj , . . . , vk}
where vj , . . . , vk are neighbours
of vi
Vertex-centric requires every
vertex sends to its neighbours
until every vertex is reached
Block-centric needs two
iterations:
1 All vertices in partition A
exchange ids; X and Y send
ids to neighbours in partition
B
2 All vertices in partition B
exchange ids
A B
0
X
Y
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 26 / 96
- 78. Block Construction
The partitioning algorithm needs to maximize number of vertices that
have all their edges in the same partition
Hash partitioning is not suitable because many vertices will probably
have at least one cut-edge
URL partitioner
For web graphs: based on domain names of web page nodes
2D partitioner
For spatial networks: based on coordinates of node
Graph Voronoi diagram partitioner
For general graphs
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 27 / 96
- 79. MapReduce Basics [Li et al., 2014]
For data analysis of very large data sets
Highly dynamic, irregular, schemaless, etc.
SQL too heavy
“Embarrassingly parallel problems”
New, simple parallel programming model
Data structured as (key, value) pairs
E.g. (doc-id, content), (word, count), etc.
Functional programming style with two functions to be given:
Map(k1,v1) → list(k2,v2)
Reduce(k2, list (v2)) → list(v3)
Implemented on a distributed file system (e.g., Google File System)
on very large clusters
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 28 / 96
- 80. MapReduce Processing
...
Inputdataset
Map
Map
Map
Map
(k1, v)
(k2, v)
(k2, v)
(k2, v)
(k1, v)
(k1, v)
(k2, v)
Group by k
Group by k
(k1, (v, v, v))
(k1, (v, v, v, v)) Reduce
Reduce
Outputdataset
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 29 / 96
- 81. MapReduce Architecture
Scheduler
Master
Input Module
Map Module
Combine Module
Partition Module
Map Process
Worker
Input Module
Map Module
Combine Module
Partition Module
Map Process
Worker
Input Module
Map Module
Combine Module
Partition Module
Map Process
Worker
Group Module
Reduce Module
Output Module
Reduce Process
Worker
Group Module
Reduce Module
Output Module
Reduce Process
Worker
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 30 / 96
- 82. Execution Flow with Architecture [Dean and Ghemawat, 2008]
MapReduce: Simplified Data Processing on Large Clusters
split 0
split 1
split 2
split 3
split 4
(1) fork
(3) read
(4) local write
(1) fork
(1) fork
(6) write
worker
worker
worker
Master
User
Program
output
file 0
output
file 1
worker
worker
(2)
assign
map
(2)
assign
reduce
(5) remote
(5) read
Input
files
Map
phasr
Intermediate files
(on local disks)
Reduce
phase
Output
files
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 31 / 96
- 83. Hadoop
Most popular MapReduce implementation – developed by Yahoo!
Two components
Processing engine
HDFS: Hadoop Distributed Storage System – others possible
Can be deployed on the same machine or on different machines
Processes
Job tracker: hosted on the master node and implements the schedule
Task tracker: hosted on the worker nodes and accepts tasks from job tracker
and executes them
HDFS
Name node: stores how data are partitioned, monitors the status of data
nodes, and data dictionary
Data node: Stores and manages data chunks assigned to it
Task Tracker Job Tracker Task Tracker
Data Node Name Node Data Node
Worker 1 Name Node Worker n
MapReduce
HDFS
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 32 / 96
- 84. HaLoop [Bu et al., 2012]
Overcome MapReduce shortcomings for iterative jobs
Having to save data in HDFS in between each iteration
Checking the fixpoint requires a new job at each iteration
Scheduler change: assign to the same machine the map reduce
tasks that occur in different iterations but access the same data
Cache invariant data
Cache reduce output to easily check for fixpoint
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 33 / 96
- 85. Spark System
MapReduce does not perform well in iterative computations
Workflow model is acyclic
Have to write to HDFS after each iteration and have to read from
HDFS at the beginning of next iteration
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
- 86. Spark System
MapReduce does not perform well in iterative computations
Workflow model is acyclic
Have to write to HDFS after each iteration and have to read from
HDFS at the beginning of next iteration
Spark objectives
Better support for iterative programs
Provide a complete ecosystem
Similar abstraction (to MapReduce) for programming
Maintain MapReduce fault-tolerance and scalability
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
- 87. Spark System
MapReduce does not perform well in iterative computations
Workflow model is acyclic
Have to write to HDFS after each iteration and have to read from
HDFS at the beginning of next iteration
Spark objectives
Better support for iterative programs
Provide a complete ecosystem
Similar abstraction (to MapReduce) for programming
Maintain MapReduce fault-tolerance and scalability
Fundamental concepts
RDD: Reliable Distributed Datasets
Caching of working set
Maintaining lineage for fault-tolerance
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 34 / 96
- 88. Spark Ecosystem [Michiardi, 2015]
Native
Spark
Apps
Spark
SQL
Spark
Streaming
MLlib
(machine
learning)
GraphX
(graph
processing)
Apache Spark
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 35 / 96
- 89. Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
- 90. Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
Each transform generates a
new RDD that may also be
cached or processed
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
- 91. Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
Each transform generates a
new RDD that may also be
cached or processed
Created from HDFS or parallelized arrays;
Partitioned across worker machines;
May be made persistent lazily;
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
- 92. Spark Programming Model [Zaharia et al., 2010, 2012]
HDFS
Create RDD
· · ·
RDD
Cache? Cache
Yes
Transform
RDD?
No
Process
No
Transform
Yes
HDFS
Each transform generates a
new RDD that may also be
cached or processed
Created from HDFS or parallelized arrays;
Partitioned across worker machines;
May be made persistent lazily;
Processing done on one of the RDDs;
Done in parallel across workers;
First processing on a RDD is from disk;
Subsequent processing of the same RDD from cache
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 36 / 96
- 93. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 94. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
CreateRDD
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 95. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
Transform RDD
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 96. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
Another transform
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 97. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
Cache results
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 98. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Action
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 99. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
Tasks
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 100. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
TasksResults
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 101. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
TasksResults
Cache Cache Cache
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 102. Example – Log Mining [Zaharia et al., 2010, 2012]
Load log messages from a file system, create a new file by filtering the
error messages, read this file into memory, then interactively search for
various patterns
lines = spark.textFile(hdfs://...)
errors = lines.filter( .startsWith(ERROR))
messages = errors.map( .split(‘t ’)(2))
cachedMsgs = messages.cache()
cachedMsgs.filter( .contains(foo)).count
cachedMsgs.filter( .contains(bar)).count
Another Action
accesses cache
Driver
WorkerWorkerWorker
Block 1 Block 2 Block 3
TasksResults
Cache Cache Cache
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 37 / 96
- 103. RDD and Processing
HDFS
lines = spark.textFile(hdfs://...)
lines
Error, msg1
Warn, msg2
Error, msg1
Info, msg8
Warn, msg2
Info, msg8
Error, msg3
Info, msg5
Info, msg5
Error, msg4
Warn, msg9
Error, msg1
errors
errors = lines.filter( .startsWith(ERROR))
Error, msg1
Error, msg1
Error, msg3 Error, msg4
Error, msg1
messages
messages = errors.map .split(‘t ’)(2)
msg1
msg1
msg3 msg4
msg1
Thesearenotyetgenerated
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
- 104. RDD and Processing
lines
errors
messages
msg1
msg1
msg3 msg4
msg1
lines
messages.filter( .contains(foo)).count
errors
messages
msg1
msg1
msg3 msg4
msg1
NowtheRDDsarematerialized;
Commandnotyetexecuted
Driver
messages.filter( .contains(foo)).count
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 38 / 96
- 105. GraphX [Gonzalez et al., 2014]
Built on top of Spark
Objective is to combine data analytics with graph processing
Unify computation on tables and graphs
Carefully convert graph to tabular representation
Native GraphX API or can accommodate vertex-centric computation
Native
Spark
Apps
Spark
SQL
Spark
Streaming
MLlib
(machine
learning)
GraphX
(graph
processing)
Apache Spark
Vertex-
centric API
AppApp
App App
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 39 / 96
- 107. GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
- 108. GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
- 109. GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
Edge Table
(RDD)
e-prop:edge prop.
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
- 110. GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
Edge Table
(RDD)
e-prop:edge prop.
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
Joining vertices
and edges
Move vertices to edges
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
- 111. GraphX: Representation of Graphs as Tables
Partition 1
Partition 2
Machine1Machine2
Vertex Table
(RDD)
v-prop:vertex prop.
Edge Table
(RDD)
e-prop:edge prop.
Routing
Table
(RDD)
A
B
C
D
E
F
G
H
I
J
Edge-disjoint
partitioning
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
A 1 2
B 1
...
I 1
F 1 2
D 2
E 2
J 2
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 40 / 96
- 112. GraphX: Computation Model
Machine1Machine2
Vertex Table Edge Table
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
- 113. GraphX: Computation Model
Machine1Machine2
Vertex Table Edge Table
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
First Phase: Join
Vertex table Edge table
Triples View
A v-prop e-prop B v-prop
A v-prop e-prop C v-prop
C v-prop e-prop G v-prop
...
E v-prop e-prop G v-prop
J v-prop e-prop G v-prop
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
- 114. GraphX: Computation Model
Machine1Machine2
Vertex Table Edge Table
A v-prop
B v-prop
...
I v-prop
D v-prop
E v-prop
F v-prop
J v-prop
A e-prop B
A e-prop C
...
F e-prop G
A e-prop D
A e-prop E
...
E e-prop F
Triples View
A v-prop e-prop B v-prop
A v-prop e-prop C v-prop
C v-prop e-prop G v-prop
...
E v-prop e-prop G v-prop
J v-prop e-prop G v-prop
Second Phase: Compute neighbourhood
Group-by aggregate
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 41 / 96
- 115. GraphX: Operators
Table transform operators – inherited from Spark
map(func) Return a new RDD formed by passing each element
of the source through a function func
filter(func) Return a new RDD formed by selecting those
elements of the source on which func returns true
flatMap(func) Similar to map, but each input item can be mapped
to 0 or more output items
mapPartitions(func) Similar to map, but runs separately on each partition
(block) of the RDD, so func must be of type Iterator
sample(repl, fraction,
seed)
Sample a fraction fraction of the data, with or
without replacement (set repl accordingly), using a
given random number generator seed
union(otherDataset)
intersection()
Return a new RDD containing the union/intersection
of the elements in the source RDD and the argument
groupByKey() Operates on a RDD of (K, V) pairs, returns a RDD
of (K, IterableV) pairs
reduceByKey(func, . . .) Operates on a RDD of (K, V) pairs, returns a RDD
of (K, V) pairs where the values for each key are
aggregated using the given reduce function func
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
- 116. GraphX: Operators
Table transform operators – inherited from Spark
Graph operators
Graph(vertex coll,
edge coll)
Logically binds together a pair of vertex and edge
property collections into a property graph; verifies
that each vertex occurs only once and edges connect
existing vertices
triplets(vertex coll,
vertex coll, edge coll)
Returns the triplets view of the graph
mrTriplets(map,reduce) MapReduce triplets - encodes the two-stage process
of join to create triplets and group by
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 42 / 96
- 117. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 43 / 96
- 118. RDF Introduction
Everything is an uniquely named
resource
http://data.linkedmdb.org/resource/actor/JN29704
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
- 119. RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
- 120. RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
Properties of resources can be defined
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
- 121. RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
Properties of resources can be defined
Relationships with other resources can
be defined
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
- 122. RDF Introduction
Everything is an uniquely named
resource
Prefixes can be used to shorten the
names
Properties of resources can be defined
Relationships with other resources can
be defined
Resource descriptions can be
contributed by different people/groups
and can be located anywhere in the web
Integrated web “database”
xmlns:y=http://data.linkedmdb.org/resource/actor/
y:JN29704
y:JN29704:hasName “Jack Nicholson”
y:JN29704:BornOnDate “1937-04-22”
y:TS2014:title “The Shining”
y:TS2014:releaseDate “1980-05-23”
y:TS2014
JN29704:movieActor
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 44 / 96
- 123. RDF Data Model
Triple: Subject, Predicate (Property), Object
(s, p, o)
Subject: the entity that is described (URI
or blank node)
Predicate: a feature of the entity (URI)
Object: value of the feature (URI, blank
node or literal)
(s, p, o) ∈ (U ∪ B) × U × (U ∪ B ∪ L)
Set of RDF triples is called an RDF graph
U
Subject Object
U B U B L
U: set of URIs
B: set of blank nodes
L: set of literals
Predicate
Subject Predicate Object
http://...imdb.../film/2014 rdfs:label “The Shining”
http://...imdb.../film/2014 movie:releaseDate “1980-05-23”
http://...imdb.../29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 45 / 96
- 124. RDF Example Instance
Prefixes: mdb=http://data.linkedmdb.org/resource/; geo=http://sws.geonames.org/
bm=http://wifo5-03.informatik.uni-mannheim.de/bookmashup/
lexvo=http://lexvo.org/id/;wp=http://en.wikipedia.org/wiki/
Subject Predicate Object
mdb: film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”’
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
URI Literal
URI
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 46 / 96
- 125. RDF Graph
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 47 / 96
- 126. RDF Query Model – SPARQL
Query Model - SPARQL Protocol and RDF Query Language
Given U (set of URIs), L (set of literals), and V (set of variables), a
SPARQL expression is defined recursively:
an atomic triple pattern, which is an element of
(U ∪ V ) × (U ∪ V ) × (U ∪ V ∪ L)
?x rdfs:label “The Shining”
P FILTER R, where P is a graph pattern expression and R is a built-in
SPARQL condition (i.e., analogous to a SQL predicate)
?x rev:rating ?p FILTER(?p 3.0)
P1 AND/OPT/UNION P2, where P1 and P2 are graph pattern
expressions
Example:
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r 4.0)
}
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 48 / 96
- 127. SPARQL Queries
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r 4.0)
}
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r 4.0)
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 49 / 96
- 128. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 50 / 96
- 129. Na¨ıve Triple Store Design
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r 4.0)
}
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
- 130. Na¨ıve Triple Store Design
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r 4.0)
}
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
SELECT T1 . o b j e c t
FROM T as T1 , T as T2 , T as T3 ,
T as T4 , T as T5
WHERE T1 . p=” r d f s : l a b e l ”
AND T2 . p=” movie : relatedBook ”
AND T3 . p=” movie : d i r e c t o r ”
AND T4 . p=” rev : r a t i n g ”
AND T5 . p=” movie : d i r e c t o r n a m e ”
AND T1 . s=T2 . s
AND T1 . s=T3 . s
AND T2 . o=T4 . s
AND T3 . o=T5 . s
AND T4 . o 4.0
AND T5 . o=” S t a n l e y Kubrick ”
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
- 131. Na¨ıve Triple Store Design
SELECT ?name
WHERE {
?m r d f s : l a b e l ?name . ?m movie : d i r e c t o r ?d .
?d movie : director name ” Stanley Kubrick ” .
?m movie : relatedBook ?b . ?b rev : r a t i n g ? r .
FILTER(? r 4.0)
}
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2014 movie:actor mdb:actor/29704
mdb:film/2014 movie:actor mdb: actor/30013
mdb:film/2014 movie:music contributor mdb: music contributor/4110
mdb:film/2014 foaf:based near geo:2635167
mdb:film/2014 movie:relatedBook bm:0743424425
mdb:film/2014 movie:language lexvo:iso639-3/eng
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:film/424 movie:director mdb:director/8476
mdb:film/424 rdfs:label “Spartacus”
mdb:actor/29704 movie:actor name “Jack Nicholson”
mdb:film/1267 movie:actor mdb:actor/29704
mdb:film/1267 rdfs:label “The Last Tycoon”
mdb:film/3418 movie:actor mdb:actor/29704
mdb:film/3418 rdfs:label “The Passenger”
geo:2635167 gn:name “United Kingdom”
geo:2635167 gn:population 62348447
geo:2635167 gn:wikipediaArticle wp:United Kingdom
bm:books/0743424425 dc:creator bm:persons/Stephen+King
bm:books/0743424425 rev:rating 4.7
bm:books/0743424425 scom:hasOffer bm:offers/0743424425amazonOffer
lexvo:iso639-3/eng rdfs:label “English”
lexvo:iso639-3/eng lvont:usedIn lexvo:iso3166/CA
lexvo:iso639-3/eng lvont:usesScript lexvo:script/Latn
SELECT T1 . o b j e c t
FROM T as T1 , T as T2 , T as T3 ,
T as T4 , T as T5
WHERE T1 . p=” r d f s : l a b e l ”
AND T2 . p=” movie : relatedBook ”
AND T3 . p=” movie : d i r e c t o r ”
AND T4 . p=” rev : r a t i n g ”
AND T5 . p=” movie : d i r e c t o r n a m e ”
AND T1 . s=T2 . s
AND T1 . s=T3 . s
AND T2 . o=T4 . s
AND T3 . o=T5 . s
AND T4 . o 4.0
AND T5 . o=” S t a n l e y Kubrick ”
Easy to implement
but
too many self-joins!
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 51 / 96
- 132. Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss
et al., 2008]
Strings are mapped to ids using a mapping table
Original triple table
Subject Property Object
mdb: film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:initial release date “1980-05-23”
mdb:director/8476 movie:director name “Stanley Kubrick”
mdb:film/2685 movie:director mdb:director/8476
Encoded triple table
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
Mapping table
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
- 133. Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss
et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic order
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
B+ tree
Easy querying
through mapping
table
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
- 134. Exhaustive Indexing
RDF-3X [Neumann and Weikum, 2008, 2009], Hexastore [Weiss
et al., 2008]
Strings are mapped to ids using a mapping table
Triples are indexed in a clustered B+ tree in lexicographic order
Create indexes for permutations of the three columns: SPO, SOP,
PSO, POS, OPS, OSP
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
B+ tree
Easy querying
through mapping
table
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 52 / 96
- 135. Exhaustive Indexing–Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
- 136. Exhaustive Indexing–Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
Advantages
Eliminates some of the joins – they become range queries
Merge join is easy and fast
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
- 137. Exhaustive Indexing–Query Execution
Each triple pattern can be answered by a range query
Joins between triple patterns computed using merge join
Join order is easy due to extensive indexing
Subject Property Object
0 1 2
0 3 4
5 6 7
8 9 5
...
...
...
ID Value
0 mdb: film/2014
1 rdfs:label
2 “The Shining”
3 movie:initial release date
4 “1980-05-23”
5 mdb:director/8476
6 movie:director name
7 “Stanley Kubrick”
8 mdb:film/2685
9 movie:director
Advantages
Eliminates some of the joins – they become range queries
Merge join is easy and fast
Disadvantages
Space usage
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 53 / 96
- 138. Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea
et al., 2013]
Clustered property table: group together the properties that tend to
occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of
property into one property table
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject refs:label movie:director
mob:film/2014 “The Shining” mob:director/8476
mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor name
mdb:actor “Jack Nicholson”
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
- 139. Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea
et al., 2013]
Clustered property table: group together the properties that tend to
occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of
property into one property table
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject refs:label movie:director
mob:film/2014 “The Shining” mob:director/8476
mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor name
mdb:actor “Jack Nicholson”
Advantages
Fewer joins
If the data is structured, we have a relational system – similar to
normalized relations
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
- 140. Property Tables
Grouping by entities; Jena [Wilkinson, 2006], DB2-RDF [Bornea
et al., 2013]
Clustered property table: group together the properties that tend to
occur in the same (or similar) subjects
Property-class table: cluster the subjects with the same type of
property into one property table
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject refs:label movie:director
mob:film/2014 “The Shining” mob:director/8476
mob:film/2685 “The Clockwork Orange” mob:director/8476
Subject movie:actor name
mdb:actor “Jack Nicholson”
Advantages
Fewer joins
If the data is structured, we have a relational system – similar to
normalized relations
Disadvantages
Potentially a lot of NULLs
Clustering is not trivial
Multi-valued properties are complicated
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 54 / 96
- 141. Binary Tables
Grouping by properties: For each property, build a two-column table,
containing both subject and object, ordered by subjects [Abadi et al.,
2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject Object
mdb:film/2014 mdb:director/8476
mdb:film/2685 mdb:director/8476
movie:director
Subject Object
mob:film/2014 “The Shining”
mob:film/2685 “The Clockwork Orange”
refs:label
Subject Object
mdb:actor/29704 “Jack Nicholson”
movie:actor name
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
- 142. Binary Tables
Grouping by properties: For each property, build a two-column table,
containing both subject and object, ordered by subjects [Abadi et al.,
2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject Object
mdb:film/2014 mdb:director/8476
mdb:film/2685 mdb:director/8476
movie:director
Subject Object
mob:film/2014 “The Shining”
mob:film/2685 “The Clockwork Orange”
refs:label
Subject Object
mdb:actor/29704 “Jack Nicholson”
movie:actor name
Advantages
Supports multi-valued properties
No NULLs
No clustering
Read only needed attributes (i.e. less I/O)
Good performance for subject-subject joins
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
- 143. Binary Tables
Grouping by properties: For each property, build a two-column table,
containing both subject and object, ordered by subjects [Abadi et al.,
2007, 2009]
Also called vertical partitioned tables
n two column tables (n is the number of unique properties in the data)
Subject Property Object
mdb:film/2014 rdfs:label “The Shining”
mdb:film/2014 movie:director mdb:director/8476
mdb:film/2685 movie:director mdb:director/8476
mdb:film/2685 rdfs:label “A Clockwork Orange”
mdb:actor/29704 movie:actor name “Jack Nicholson”
. . . . . . . . .
Subject Object
mdb:film/2014 mdb:director/8476
mdb:film/2685 mdb:director/8476
movie:director
Subject Object
mob:film/2014 “The Shining”
mob:film/2685 “The Clockwork Orange”
refs:label
Subject Object
mdb:actor/29704 “Jack Nicholson”
movie:actor name
Advantages
Supports multi-valued properties
No NULLs
No clustering
Read only needed attributes (i.e. less I/O)
Good performance for subject-subject joins
Disadvantages
Not useful for subject-object joins
Expensive inserts
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 55 / 96
- 144. Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013]
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Subgraph
M
atching
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
- 145. Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013]
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Subgraph
M
atching
Advantages
Maintains the graph structure
Full set of queries can be handled
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
- 146. Graph-based Approach
Answering SPARQL query ≡ subgraph matching using homomorphism
gStore [Zou et al., 2011, 2014], chameleon-db [Alu¸c et al., 2013]
?m ?d
movie:director
?name
rdfs:label
?b
movie:relatedBook
“Stanley Kubrick”
movie:director name
?r
rev:rating
FILTER(?r 4.0)
mdb:film/2014
“1980-05-23”
movie:initial release date
“The Shining”
refs:label
bm:books/0743424425
4.7
rev:rating
bm:offers/0743424425amazonOffer
geo:2635167
“United Kingdom”
gn:name
62348447
gn:population
mdb:actor/29704
“Jack Nicholson”
movie:actor name
mdb:film/3418
“The Passenger”
refs:label
mdb:film/1267
“The Last Tycoon”
refs:label
mdb:director/8476
“Stanley Kubrick”
movie:director name
mdb:film/2685
“A Clockwork Orange”
refs:label
mdb:film/424
“Spartacus”
refs:label
mdb:actor/30013
movie:relatedBook
scam:hasOffer
foaf:based near
movie:actor
movie:director
movie:actor
movie:actor movie:actor
movie:director movie:director
Subgraph
M
atching
Advantages
Maintains the graph structure
Full set of queries can be handled
Disadvantages
Graph pattern matching is expensive
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 56 / 96
- 147. Two Systems
gStoreSystem Architecture
Offline Online
Storage
Input Input
RDF Parser
RDF Graph
Builder
Encoding
Module
VS*-tree
builder
RDF data
RDF Triples
RDF Graph
Signature Graph
Key-Value
Store
VS*-tree
Store
SPARQL
Parser
SPARQL Query
Encoding
Module
VS*-tree
Query Graph
Filter
Module
Join
Module
Signature Graph
Node Candidate
Results
Fig. 4. System Architecture
itstrings, denoted as vS ig(u). We encode query Q with the
ame encoding method. Consequently, the match between Q
nd G can be verified by simply checking the match between
orresponding encoded bitstrings.
chameleon-db
Structural Index
...
Vertex Index
Spill Index
ClusterIndexStorageSystem
StorageAdvisor
Query
Engine Plan Generation Evaluation
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 57 / 96
- 148. gStore
General Approach:
Work directly on the RDF graph and the SPARQL query graph
Use a signature-based encoding of each entity and class vertex to
speed up matching
Filter-and-evaluate
Use a false positive algorithm to prune nodes and obtain a set of
candidates; then do more detailed evaluation on those
Use an index (VS∗-tree) over the data signature graph (has light
maintenance load) for efficient pruning
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 58 / 96
- 149. 1. Encode Q and G to Get Signature Graphs
Query signature graph Q∗
0100 0000 1000 0000
00010
0000 0100
10000
Data signature graph G∗
0010 1000
0100 0001
00001
1000 0001
00010
0000 0100
10000
0000 1000
10000
0000 0010
10000
0000 1001
00100
0001 0001
01000
0100 1000
01000
1001 1000
01000
0001 0100
01000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 59 / 96
- 150. 2. Filter-and-Evaluate
Query signature graph Q∗
0100 0000 1000 0000
00010
0000 0100
10000
Data signature graph G∗
0010 1000
0100 0001
00001
1000 0001
00010
0000 0100
10000
0000 1000
10000
0000 0010
10000
0000 1001
00100
0001 0001
01000
0100 1000
01000
1001 1000
01000
0001 0100
01000
Find matches of Q∗ over
signature graph G∗
Verify each match in
RDF graph G
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 60 / 96
- 151. How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
- 152. How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
- 153. How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
- 154. How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signatures
Run an inclusion query for each node of Q∗
and get lists of nodes in
G∗
that include that node.
• Given query signature q and a set of data signatures S, find all
data signatures si ∈ S where qsi = q
Does not support second step – expensive
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
- 155. How to Generate Candidate List
Two step process:
1 For each node of Q∗
get lists of nodes in G∗
that include that node.
2 Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G∗
Both steps are inefficient
Use S-trees
Height-balanced tree over signatures
Run an inclusion query for each node of Q∗
and get lists of nodes in
G∗
that include that node.
• Given query signature q and a set of data signatures S, find all
data signatures si ∈ S where qsi = q
Does not support second step – expensive
VS-tree (and VS∗
-tree)
Multi-resolution summary graph based on S-tree
Supports both steps efficiently
Grouping by vertices
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 61 / 96
- 156. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 157. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 158. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 159. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 160. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 161. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 162. S-tree Solution
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
1000 00000100 0000
00010
0000 0100
10000 002
011
003
008
004
009
Possibly large join space!
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 62 / 96
- 163. VS-tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
Super edge
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 63 / 96
- 164. Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
- 165. Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
- 166. Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
- 167. Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
003
008
002
011
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
- 168. Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
003
008
002
011
004
009
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
- 169. Pruning with VS-Tree
1111 1111
0110 1111 1101 1101
0000 1110 0110 1001 1100 1001 1001 1101
0000 1000
0000 0100 0000 0010
0010 1000
0100 0001
1000 0001
0000 1001
0100 1000
1001 1000
0001 0100
0001 0001
005
004 006
001
002
003
007
011
008
009
010
d1
1
d2
1 d2
2
d3
1 d3
2 d3
3 d3
4
G3
G2
G1
11101
10010
10001 01100
10000 00001 01100
00010
10000
01000
01000
10000
10000
10000
10000
00010
00100
01000
01000
01000
01000
1000 00000100 0000
00010
0000 0100
10000
d3
2
d3
3
d3
3
d3
4
d3
1
d3
4
G3
00010 10000
01000
003
008
002
011
004
009
Reduced join space!
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 64 / 96
- 170. Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and are
more varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,
2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] and
dynamic [Kirchberg et al., 2011]
An experiment [Alu¸c et al., 2014a]
No single system is a sole winner across all queries
No single system is the sole loser across all queries, either
There can be 2–5 orders of magnitude difference in the performance
(i.e., query execution time) between the best and the worst system for
a given query
The winner in one query may timeout in another
Performance difference widens as dataset size increases
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
- 171. Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and are
more varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,
2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] and
dynamic [Kirchberg et al., 2011]
An experiment [Alu¸c et al., 2014a]
No single system is a sole winner across all queries
No single system is the sole loser across all queries, either
There can be 2–5 orders of magnitude difference in the performance
(i.e., query execution time) between the best and the worst system for
a given query
The winner in one query may timeout in another
Performance difference widens as dataset size increases
Can existing systems cope with these trends – workload diversity
dynamism
No! [Alu¸c et al., 2014b]
Fragmented data
Suboptimal pruning by indexes
Unnecessarily large sets of intermediate result tuples
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
- 172. Adaptivity to Workload
Applications that rely on RDF data are increasingly popular and are
more varied [Verborgh et al., 2014]
Data that are being handled are far more heterogeneous [Duan et al.,
2011]
SPARQL queries are becoming more diverse [Arias et al., 2011] and
dynamic [Kirchberg et al., 2011]
An experiment [Alu¸c et al., 2014a]
No single system is a sole winner across all queries
No single system is the sole loser across all queries, either
There can be 2–5 orders of magnitude difference in the performance
(i.e., query execution time) between the best and the worst system for
a given query
The winner in one query may timeout in another
Performance difference widens as dataset size increases
Our proposal: Idea behind chameleon-db
When designing and implementing an RDF data management system,
assume nothing about the workload upfront
Organize data dynamically and purely based on the workload
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 65 / 96
- 181. Challenges
Physical Data Layout: As the workloads change, the way data are
grouped together may no longer be suitable
Hierarchical Clustering Algorithm [Alu¸c et al., 2015]
Tunable-LSH [Alu¸c et al., 2015]
Indexing: Indexing upfront is not a choice
Query Evaluation: Can we execute queries efficiently even when the
physical layout is constantly changing? [Alu¸c et al., 2015]
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 67 / 96
- 182. chameleon-db
Prototype system [Alu¸c et al., 2013]
35,000 lines of code in C++ under Linux (plus code for SPARQL 1.0
parser)
Structural Index
...
Vertex Index
Spill Index
ClusterIndexStorageSystem
StorageAdvisor
Query
Engine Plan Generation Evaluation
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 68 / 96
- 183. Outline
1 Introduction – Graph Types
2 Property Graph Processing
Classification
Online querying
Offline analytics
3 RDF Graph Querying
Data Warehousing
Distributed SPARQL Execution
Linked Object Data Querying
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 69 / 96
- 184. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries –
SPARQL endpoints
Not all data sites can process
queries
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
- 185. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries –
SPARQL endpoints
Not all data sites can process
queries
Alternatives
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96
- 186. Remember the Environment
Distributed environment
Some of the data sites can
process SPARQL queries –
SPARQL endpoints
Not all data sites can process
queries
Alternatives
Data re-distribution + query
decomposition
© M. Tamer ¨Ozsu Croucher ASI (2015/12/16-18) 70 / 96