Virtual data integration with just-in-time query recompilation

Supporting virtual integration of Linked Data
with just-in-time query recompilation
Amsterdam, The Netherlands, September 12 2017
Alessandro Adamou¹, Mathieu d’Aquin², Carlo Allocca¹˒³, Enrico Motta¹
¹ Knowledge Media Institute, The Open University, UK
² Insight Centre for Data Analytics, NUI Galway, Ireland
³ now Samsung Inc.

Outline
•  Motivation
•  Just-in-time query recompilation
•  Implementation
•  Experiments
•  Perspectives

Virtual data integration
•  No ETL process
•  Naturally keeps data up-to-date
•  Unlike data federation, there is still a designated node
•  Favours project economies relying on networking rather
than storage space
•  Serious performance issues!
•  Generally considered less robust
•  Acquiring momentum in industry, per 2016 Gartner
report
•  Maintenance??

Pay-as-you-go integration
1.  Establish mappings between source
schemas and global schema (go)
2.  Obtain feedback on mapping results, e.g.
in terms of precision and recall (pay)
3.  Refine the mappings from step 1
Really “pays off” in virtual data integration
N.W. Paton, K. Christodoulou, A.A.A. Fernandes, B. Parsia, C. Hedeler: Pay-as-you-go data
integration for linked data: opportunities, challenges and architectures. SWIM 2012: 3

•  Target query language other than SPARQL
– Conjunctive, star-shaped queries (good for Web APIs)
{hostname}{/attribute/value}+?{attribute}{&attribute}+
e.g.
http://example.org/api/type/actor/name/clint_eastwood?filmwork
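
To make the shape of the target language concrete, here is a minimal sketch of a parser for the part of a query that follows the API base. The function name `parseTargetQuery` and the returned field names `constraints` and `requested` are illustrative assumptions, not part of the presented system.

```javascript
// Hypothetical helper: parses the part of a target query that follows the
// API base, e.g. "type/actor/name/clint_eastwood?filmwork".
function parseTargetQuery(query) {
  const [path, wanted = ''] = query.split('?');
  const segments = path.split('/').filter(s => s.length > 0);
  // The path must be a non-empty sequence of /attribute/value pairs.
  if (segments.length === 0 || segments.length % 2 !== 0) {
    throw new Error('Expected one or more /attribute/value pairs: ' + path);
  }
  const constraints = {};
  for (let i = 0; i < segments.length; i += 2) {
    constraints[decodeURIComponent(segments[i])] =
      decodeURIComponent(segments[i + 1]);
  }
  // Requested attributes are separated by "&" after the "?".
  const requested = wanted
    .split('&')
    .filter(s => s.length > 0)
    .map(s => decodeURIComponent(s));
  return { constraints, requested };
}

console.log(parseTargetQuery('type/actor/name/clint_eastwood?filmwork'));
```

The attribute-value pairs act as the conjunctive constraints on a single entity (the centre of the star), and the attributes after the question mark are the ones whose values are asked for.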

type/actor/name/clint_eastwood
SELECT DISTINCT ?filmwork WHERE { 
{ dbr:Clint_Eastwood a dbo:Actor
; ^(dbo:director|dbo:starring) ?filmwork
} UNION {
{ ?x owl:sameAs dbr:Clint_Eastwood
} UNION {
?x (movie:actor_name|movie:director_name) ?s
FILTER (str(?s) = "Clint Eastwood") 
}
. ?x foaf:made|^(movie:director|movie:actor) ?filmwork
}}
Equivalent SPARQL query for federated engine
that supports DBpedia and LinkedMDB

SELECT DISTINCT ?filmwork ?eq WHERE {
VALUES(?x) {
( dbr:Clint_Eastwood )
( dbr:Clint_Eastwood_(actor) )
} ?x a dbo:Actor
; ^(dbo:director|dbo:starring) ?filmwork
. ?filmwork owl:sameAs|^owl:sameAs ?eq
}
Alternative approach: DBpedia

SELECT DISTINCT ?filmwork ?eq WHERE {
{ VALUES(?x0) {
( dbr:Clint_Eastwood )
( dbr:Clint_Eastwood_(actor) )
} ?x0 ^owl:sameAs ?x
} UNION { ?x movie:actor_name "Clint Eastwood"
} UNION { ?x movie:director_name "Clint Eastwood"
} . {
{ ?x foaf:made ?filmwork
} UNION { ?filmwork movie:director ?x
} UNION { ?filmwork movie:actor ?x } 
} . ?filmwork owl:sameAs|^owl:sameAs ?eq
}
Alternative approach: LinkedMDB

•  Encode integrator’s knowledge of a
dataset schema into a set of primitives,
which will serve as “compilation units”.
•  Managing the compilation units of a query
using two types of structure:
– Microcompilers
– Query skeletons (or templates)

Microcompiler
Let W be the set of all the attribute-value pairs
and Σ the alphabet of a language; a
microcompiler is a function φ : ℘W → Σ∗ that
transforms sets of attribute-value pairs into a
sequence of symbols in that language.

Microcompiler (JS ex.)
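// Builds the DBpedia fragment that binds ?x_dbp to candidate resource IRIs.
var mc_x_dbp = function (type, name) {
  var pref = 'http://dbpedia.org/resource/';
  // Capitalise the first letter of each underscore-separated word,
  // e.g. clint_eastwood -> Clint_Eastwood
  var idd = name.replace(/(^|_)[a-z]/g,
    function (f) { return f.toUpperCase(); });
  return 'VALUES(?x_dbp){ '
    + '( <' + pref + idd + '> )'
    + '( <' + pref + idd + '_(' + type + ')> )'
    + '} ?x_dbp';
};
Output for mc_x_dbp('actor', 'clint_eastwood'), with whitespace added for readability:
VALUES(?x_dbp) {
( <http://dbpedia.org/resource/Clint_Eastwood> )
( <http://dbpedia.org/resource/Clint_Eastwood_(actor)> )
} ?x_dbp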

Microcompiler (JS ex. II)
// Builds the LinkedMDB fragment that binds ?x_lmdb, reusing mc_x_dbp.
var mc_x_lmdb = function (type, name) {
  if (['actor', 'director'].indexOf(type) >= 0) {
    var sa = mc_x_dbp(type, name) + ' ^owl:sameAs ?x_lmdb';
    var nam = makename(name); // omitted for simplicity
    return '{ ' + sa + ' } UNION '
      + '{ ?x_lmdb movie:actor_name "' + nam + '" } UNION '
      + '{ ?x_lmdb movie:director_name "' + nam + '" }' //…
  }
};
{ VALUES(?x_dbp) {
( <http://dbpedia.org/resource/Clint_Eastwood> )
( <http://dbpedia.org/resource/Clint_Eastwood_(actor)> )
} ?x_dbp ^owl:sameAs ?x_lmdb }
UNION { ?x_lmdb movie:actor_name "Clint Eastwood" }
UNION { ?x_lmdb movie:director_name "Clint Eastwood" }

Query skeleton
A query skeleton, or query template, t is a
member of (Σ∪C)∗, where C is an alphabet
called the set of control symbols.
Example I (DBpedia, {filmwork,name})
<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?
Example II (LinkedMDB, {filmwork,name})
{
{ <[name]> foaf:made ?[filmwork]? }
UNION { ?[filmwork]? movie:director ?x_lmdb }
UNION { ?[filmwork]? movie:actor ?x_lmdb }
} . ?[filmwork]? owl:sameAs|^owl:sameAs ?eq
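
The slides do not spell out how control symbols are resolved, so the following is only a sketch under an assumed reading: `<[attr]>` is replaced by a fragment obtained for a constrained attribute (for instance the output of a microcompiler), and `?[attr]?` becomes a SPARQL variable for a requested attribute. The function `fillSkeleton` and its parameters are hypothetical.

```javascript
// Assumed reading of control symbols (not defined in the slides):
//   <[attr]>  -> fragment produced for a constrained attribute
//   ?[attr]?  -> SPARQL variable for a requested attribute
function fillSkeleton(skeleton, fragments, requested) {
  return skeleton
    .replace(/<\[(\w+)\]>/g, (match, attr) => {
      if (!(attr in fragments)) {
        throw new Error('No fragment for attribute: ' + attr);
      }
      return fragments[attr];
    })
    .replace(/\?\[(\w+)\]\?/g, (match, attr) => {
      if (!requested.includes(attr)) {
        throw new Error('Attribute was not requested: ' + attr);
      }
      return '?' + attr;
    });
}

const skeleton = '<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?';
// Stand-in for what a microcompiler might return for {name}.
const fragments = { name: 'dbr:Clint_Eastwood' };
console.log(fillSkeleton(skeleton, fragments, ['filmwork']));
```

Throwing on an unresolved control symbol is one way to detect that a skeleton cannot be satisfied by the input query; a real compiler might instead skip that skeleton silently.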

JIT framework
[Diagram. Components: source query; data source selection strategy; several groups of microcompilers and query skeletons; compiler; target queries]
compiler: function Φ × ℘(Σ∪C)∗ × ℘W → L

Compilation strategies
•  A manifest is a pair of sets of
microcompilers and query skeletons
•  Grouping into manifests for:
– (a) data sources;
– (b) entity types
•  Data source selection algorithm produces
a set of datasource-query pairs by finding
satisfiable query skeletons (on paper)
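
As an illustration of data source selection over manifests grouped by data source, here is a minimal sketch. The manifest layout (`needs`, `provides`, `text`), the `nytimes` entry, and the satisfiability test are all assumptions made for this example: a skeleton is treated as satisfiable when every attribute it needs is constrained in the input query and at least one attribute it provides is requested.

```javascript
// Hypothetical manifests, one per data source. Field names are illustrative.
const manifests = [
  { source: 'dbpedia', skeletons: [
    { needs: ['name'], provides: ['filmwork'],
      text: '<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?' } ] },
  { source: 'linkedmdb', skeletons: [
    { needs: ['name'], provides: ['filmwork'],
      text: '{ <[name]> foaf:made ?[filmwork]? }' } ] },
  { source: 'nytimes', skeletons: [
    { needs: ['name'], provides: ['webpage'],
      text: '<[name]> nytimes:topicPage ?[webpage]?' } ] }
];

// Returns datasource-query pairs for every skeleton judged satisfiable.
function selectSources(allManifests, constrained, requested) {
  const pairs = [];
  for (const manifest of allManifests) {
    for (const skeleton of manifest.skeletons) {
      const bound = skeleton.needs.every(a => constrained.includes(a));
      const useful = skeleton.provides.some(a => requested.includes(a));
      if (bound && useful) {
        pairs.push({ source: manifest.source, skeleton: skeleton.text });
      }
    }
  }
  return pairs;
}

const selected = selectSources(manifests, ['type', 'name'], ['filmwork']);
console.log(selected.map(p => p.source));
```

With these sample manifests, a query that constrains `type` and `name` and requests `filmwork` selects the first two sources and leaves out the third, since nothing it provides was asked for.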

Big Data for Milton Keynes
as a Smart City
Entity-centric data API based on a simplified version of the
language in this presentation
M. d'Aquin, A. Adamou, E. Daga, S. Liu, K. Thomas, E. Motta: Dealing
with Diversity in a Smart-City Datahub. Semantics for Smarter Cities
@ISWC 2014: 68-82

•  https://datahub.mksmart.org
•  https://github.com/mk-smart/entity-centric-api

Implementation
•  Reference open source implementation
written in Java
–  With support for SPARQL and HTTP
dereferencing of RDF
–  Includes JIT logic, custom experimental VDIS and
HTTP API
•  Accepts microcompilers in JavaScript
•  Apache CouchDB map-reduce for atomically
retrieving candidate compilation units

Experiments
What is the price paid to turn a federated
query engine into a virtual data integration
system using JIT recompilation?

Experiments
1.  Benchmark of FedBench [1] queries translated into our target
language
2.  Take a federated query engine (FedX) [2]
3.  Measure the time taken by FedX to execute the original
FedBench SPARQL query
–  On the live endpoints whenever possible
4.  Take each translated query and recompile it into one or
more SPARQL queries (at most one per data source)
–  Execute each query with FedX
5.  Measure for each:
–  Increase in size of “correct” result set
–  Recompilation overhead
–  Overall turnaround time of queries
[1] http://fedbench.fluidops.net
[2] https://www.fluidops.com/en/company/knowledge/open_source
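
The timing measurements in the steps above can be sketched as a small harness. This is not the authors' tooling: `timeRuns` and `summarize` are hypothetical helpers, the use of the sample standard deviation as the spread is an assumption, and the global `performance.now()` assumes a recent Node.js or a browser.

```javascript
// Mean and sample standard deviation of a list of durations in milliseconds.
function summarize(samplesMs) {
  const n = samplesMs.length;
  const mean = samplesMs.reduce((a, b) => a + b, 0) / n;
  const squares = samplesMs.reduce((a, b) => a + (b - mean) ** 2, 0);
  const variance = n > 1 ? squares / (n - 1) : 0;
  return { mean, std: Math.sqrt(variance) };
}

// Runs an async task several times and summarizes its wall-clock time.
async function timeRuns(run, repetitions) {
  const samples = [];
  for (let i = 0; i < repetitions; i++) {
    const start = performance.now();
    await run();
    samples.push(performance.now() - start);
  }
  return summarize(samples);
}

// Placeholder task standing in for "recompile and execute a query".
const fakeQuery = () => new Promise(resolve => setTimeout(resolve, 10));
timeRuns(fakeQuery, 5).then(stats => console.log(stats));
```

Timing the recompilation step and the query execution step with separate calls would give the overhead and the turnaround time as distinct figures.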

Experiments
Example: FedBench Cross-domain CD3
CD3 (original)
SELECT ?pres ?party ?page WHERE {
?pres rdf:type dbpedia-owl:President .
?pres dbpedia-owl:nationality dbpedia:United_States .
?pres dbpedia-owl:party ?party
. ?x nytimes:topicPage ?page
. ?x owl:sameAs ?pres
}

CD3C:
type/president/country/united_states?party&webpage

Results I
Query   Result set VDI boost   Notes
FedBench Cross-Domain
CD1C    m * 1.387
CD2C    52 new results         Plain FedX yielded no results
CD3C    67 new results         Plain FedX yielded no results, has SERVICE clause
CD4C    m * 4480.0             Some microcompilers perform queries
CD5C    m * 1.0                No increment from recompilation
FedBench Life Sciences
LS1C    m * 1.0                Query could not be expanded
LS2C    m * 1.0                No increment from recompilation
LS3C    70981 results          Plain FedX crashed
FedBench Linked Data
LD5C    m * 3.677
LD9C    4 new results          Plain FedX yielded no results
LD10C   m * 17.0
LD11C   m * 1.65

Results II
Query   Time (ms) - FedX   Time (ms) - FedX+JIT   JIT overhead   Query TAT
FedBench Cross-Domain
CD1C    300 ± 050          420 ± 109              400 ± 020      800 ± 120
CD2C    175 ± 005          475 ± 055              432 ± 009      1500 ± 123
CD3C    158 ± 004          446 ± 076              408 ± 106      1067 ± 048
CD4C    8835 ± 954         420 ± 100              787 ± 165      7480 ± 569
CD5C    851 ± 319          519 ± 145              448 ± 031      548 ± 061
FedBench Life Sciences
LS1C    795 ± 371          892 ± 043              query could not be expanded
LS2C    484 ± 166          420 ± 100              444 ± 061      370 ± 061
LS3C    !ERROR             6653 ± 861             query could not be expanded
FedBench Linked Data
LD5C    795 ± 371          801 ± 078              486 ± 017      1028 ± 099
LD9C    484 ± 166          407 ± 023              390 ± 039      318 ± 061
LD10C   189 ± 036          440 ± 018              416 ± 017      658 ± 101
LD11C   387 ± 067          861 ± 057              406 ± 020      762 ± 095

Discussion
•  Can compile star-shaped input queries into more
complex target queries
•  Overhead is mostly a fixed cost (roughly 390-490 ms in Results II, CD4C excepted)
•  Proves to be mostly efficient when also effective (i.e.
there is query expansion)
•  Still cannot replace query federation optimisation
strategies
•  Manageability? We knew exactly how to proceed…
–  However we worked with ~ |A| · |MS| + |MT| microcompilers
and query skeletons, where it could have been up to |MT +
A| · |MS| + |MT|

Future work
•  Optimisations to abate JIT overhead
•  Application to chain-shaped queries and other
query types
•  Investigate other target languages
•  Investigate templating languages for query
skeletons
•  Cascaded mappings applied at query time (no
knowledge of dataset content or structure)

Thank You