This document discusses supporting virtual integration of Linked Data through just-in-time query recompilation. It presents a technique for compiling input queries into target SPARQL queries over individual data sources. Microcompilers encode knowledge of data schemas, and query skeletons provide templates. Experiments show that the overhead is mostly a standard cost, that recompilation enables queries not otherwise possible, and that it is efficient when it expands queries. Future work includes reducing the overhead and investigating other target languages and templating languages.
Virtual data integration with just-in-time query recompilation
1. Supporting virtual integration of Linked Data with just-in-time query recompilation
Amsterdam, The Netherlands, September 12 2017
Alessandro Adamou1, Mathieu d’Aquin2, Carlo Allocca1,3, Enrico Motta1
1 Knowledge Media Institute, The Open University, UK
2 Insight Centre for Data Analytics, NUI Galway, Ireland
3 now Samsung Inc.
3. Virtual data integration
• No ETL process
• Naturally keeps data up-to-date
• Unlike data federation, there is still a designated node
• Favours project economies relying on networking rather
than storage space
• Serious performance issues!
• Generally considered less robust
• Acquiring momentum in industry, per 2016 Gartner
report
• Maintenance??
4. Pay-as-you-go integration
1. Establish mappings between source
schemas and global schema (go)
2. Obtain feedback on mapping results, e.g.
in terms of precision and recall (pay)
3. Refine the mappings from step 1
Really “pays off” in virtual data integration
N.W. Paton, K. Christodoulou, A. A. A. Fernandes, B. Parsia, C. Hedeler: Pay-as-you-go data
integration for linked data: opportunities, challenges and architectures. SWIM 2012: 3
6. • Target query language other than
SPARQL
– Conjunctive, star-shaped queries (good for
Web APIs)
{hostname}{/attribute/value}+?{attribute}{&attribute}+
e.g.
http://example.org/api/type/actor/name/clint_eastwood?filmwork
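The input language above can be read as a list of attribute/value bindings followed by the attributes being requested. The following is a minimal sketch of such a reading, not the reference implementation: the function name parseInputQuery and the explicit base argument (standing in for {hostname}) are assumptions made for illustration.

```javascript
// Hypothetical parser for the star-shaped input language:
//   {hostname}{/attribute/value}+?{attribute}{&attribute}+
// 'base' plays the role of {hostname}; it is passed explicitly because
// the template alone does not say where the hostname part ends.
function parseInputQuery(url, base) {
  if (url.indexOf(base) !== 0) {
    throw new Error('URL does not start with the expected base: ' + base);
  }
  var rest = url.slice(base.length);
  var qmark = rest.indexOf('?');
  var path = qmark >= 0 ? rest.slice(0, qmark) : rest;
  var query = qmark >= 0 ? rest.slice(qmark + 1) : '';

  // Path segments come in /attribute/value pairs.
  var segments = path.split('/').filter(function (s) { return s !== ''; });
  if (segments.length === 0 || segments.length % 2 !== 0) {
    throw new Error('Expected one or more /attribute/value pairs');
  }
  var bindings = {};
  for (var i = 0; i < segments.length; i += 2) {
    bindings[decodeURIComponent(segments[i])] = decodeURIComponent(segments[i + 1]);
  }

  // The query string lists the attributes whose values are requested.
  var requested = query.split('&')
    .filter(function (s) { return s !== ''; })
    .map(function (s) { return decodeURIComponent(s); });

  return { bindings: bindings, requested: requested };
}

var parsed = parseInputQuery(
  'http://example.org/api/type/actor/name/clint_eastwood?filmwork',
  'http://example.org/api'
);
console.log(JSON.stringify(parsed));
```

For the example URL, the bindings are type=actor and name=clint_eastwood, and the only requested attribute is filmwork.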
7. type/actor/name/clint_eastwood
SELECT DISTINCT ?filmwork WHERE {
{ dbr:Clint_Eastwood a dbo:Actor
; ^(dbo:director|dbo:starring) ?filmwork
} UNION {
{ ?x owl:sameAs dbr:Clint_Eastwood
} UNION {
?x (movie:actor_name|movie:director_name) ?s
FILTER (str(?s) = "Clint Eastwood")
}
. ?x foaf:made|^(movie:director|movie:actor) ?filmwork
}}
Equivalent SPARQL query for federated engine
that supports DBpedia and LinkedMDB
12. • Encode integrator’s knowledge of a
dataset schema into a set of primitives,
which will serve as “compilation units”.
• Managing the compilation units of a query
using two types of structure:
– Microcompilers
– Query skeletons (or templates)
15. Microcompiler (JS ex. II)
var mc_x_lmdb = function (type, name) {
  // This microcompiler only handles the actor and director types
  if (['actor', 'director'].indexOf(type) >= 0) {
    // Reuse the DBpedia microcompiler, then follow owl:sameAs links backwards
    var sa = mc_x_dbp(type, name) + ' ^owl:sameAs ?x_lmdb';
    var nam = makename(name); // omitted for simplicity
    return '{ ' + sa + ' } UNION '
      + '{ ?x_lmdb movie:actor_name "' + nam + '" } UNION '
      + '{ ?x_lmdb movie:director_name "' + nam + '" }' //…
  }
};
type/actor/name/clint_eastwood
{ VALUES(?x_dbp) {
( <http://dbpedia.org/resource/Clint_Eastwood> )
( <http://dbpedia.org/resource/Clint_Eastwood_(actor)> )
} ?x_dbp ^owl:sameAs ?x_lmdb }
UNION {?x_lmdb movie:actor_name "Clint Eastwood" }
UNION {?x_lmdb movie:director_name "Clint Eastwood" }
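To run the LinkedMDB microcompiler end to end, the two helpers it calls need some definition. The slides do not show them, so the versions of makename and mc_x_dbp below are hypothetical stand-ins, written only so that the output resembles the fragment shown above. In particular, deriving DBpedia candidates by string manipulation is an assumption, not the authors' method.

```javascript
// Hypothetical stand-in: turns 'clint_eastwood' into 'Clint Eastwood'.
function makename(name) {
  return name.split('_').map(function (w) {
    return w.charAt(0).toUpperCase() + w.slice(1);
  }).join(' ');
}

// Hypothetical stand-in for the DBpedia microcompiler: it guesses two
// candidate resource IRIs from the name and binds them to ?x_dbp.
function mc_x_dbp(type, name) {
  var base = 'http://dbpedia.org/resource/';
  var local = makename(name).replace(/ /g, '_');
  return 'VALUES(?x_dbp) { '
    + '( <' + base + local + '> ) '
    + '( <' + base + local + '_(' + type + ')> ) '
    + '} ?x_dbp';
}

// The microcompiler from the slide, returning null for unsupported types
// (the slide version implicitly returns undefined in that case).
function mc_x_lmdb(type, name) {
  if (['actor', 'director'].indexOf(type) < 0) {
    return null;
  }
  var sa = mc_x_dbp(type, name) + ' ^owl:sameAs ?x_lmdb';
  var nam = makename(name);
  return '{ ' + sa + ' } UNION '
    + '{ ?x_lmdb movie:actor_name "' + nam + '" } UNION '
    + '{ ?x_lmdb movie:director_name "' + nam + '" }';
}

console.log(mc_x_lmdb('actor', 'clint_eastwood'));
```

The point of the sketch is the composition: one microcompiler reuses the output of another and adds the dataset-specific patterns on top.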
16. Query skeleton
A query skeleton, or query template, t is a
member of (Σ∪C)∗, where C is an alphabet
called the set of control symbols.
Example I (DBpedia, {filmwork,name})
<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?
Example II (LinkedMDB, {filmwork,name})
{
{ <[name]> foaf:made ?[filmwork]? }
UNION { ?[filmwork]? movie:director ?x_lmdb }
UNION { ?[filmwork]? movie:actor ?x_lmdb }
} . ?[filmwork]? owl:sameAs|^owl:sameAs ?eq
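How the control symbols are resolved is not spelled out on this slide, so the following is a sketch under stated assumptions: <[attr]> is treated as a mandatory slot that must be filled with a compiled term, and ?[attr]? as a slot that takes a compiled term when one is available and otherwise becomes the variable ?attr. The function name fillSkeleton is hypothetical.

```javascript
// Hypothetical skeleton filler. 'terms' maps attribute names to the
// SPARQL terms produced for them (e.g. by microcompilers).
function fillSkeleton(skeleton, terms) {
  function has(attr) {
    return Object.prototype.hasOwnProperty.call(terms, attr);
  }
  // Assumed semantics of <[attr]>: a term is required.
  var out = skeleton.replace(/<\[(\w+)\]>/g, function (match, attr) {
    if (!has(attr)) {
      throw new Error('No term for mandatory attribute: ' + attr);
    }
    return terms[attr];
  });
  // Assumed semantics of ?[attr]?: use the term if known, else a variable.
  return out.replace(/\?\[(\w+)\]\?/g, function (match, attr) {
    return has(attr) ? terms[attr] : '?' + attr;
  });
}

var skeleton = '<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?';
console.log(fillSkeleton(skeleton, {
  name: '<http://dbpedia.org/resource/Clint_Eastwood>'
}));
```

With only name bound, the DBpedia skeleton of Example I becomes a triple pattern whose object is the variable ?filmwork.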
18. Compilation strategies
• A manifest is a pair of sets of
microcompilers and query skeletons
• Grouping into manifests for:
– (a) data sources;
– (b) entity types
• Data source selection algorithm produces
a set of datasource-query pairs by finding
satisfiable query skeletons (on paper)
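The selection step can be illustrated with manifests grouped by data source. The slide does not define satisfiability precisely, so this sketch assumes a skeleton is satisfiable when every mandatory slot <[attr]> is bound in the input query and has a microcompiler in the manifest, and every requested attribute appears as a ?[attr]? slot. The manifest layout and the name selectSources are hypothetical, and the sketch returns source/skeleton pairs rather than compiled queries.

```javascript
// Collects the distinct attribute names matched by a slot pattern.
function attributesOf(skeleton, pattern) {
  var re = new RegExp(pattern, 'g');
  var found = [];
  var m;
  while ((m = re.exec(skeleton)) !== null) {
    if (found.indexOf(m[1]) < 0) {
      found.push(m[1]);
    }
  }
  return found;
}

// manifests: { sourceName: { microcompilers: {attr: fn}, skeletons: [string] } }
// query:     { bindings: {attr: value}, requested: [attr] }
function selectSources(manifests, query) {
  var pairs = [];
  Object.keys(manifests).forEach(function (source) {
    var manifest = manifests[source];
    manifest.skeletons.forEach(function (skeleton) {
      var needed = attributesOf(skeleton, '<\\[(\\w+)\\]>');
      var offered = attributesOf(skeleton, '\\?\\[(\\w+)\\]\\?');
      // Assumed condition 1: all mandatory slots can be compiled.
      var inputsOk = needed.every(function (attr) {
        return Object.prototype.hasOwnProperty.call(query.bindings, attr)
          && typeof manifest.microcompilers[attr] === 'function';
      });
      // Assumed condition 2: the skeleton can return what was asked for.
      var outputsOk = query.requested.every(function (attr) {
        return offered.indexOf(attr) >= 0;
      });
      if (inputsOk && outputsOk) {
        pairs.push({ source: source, skeleton: skeleton });
      }
    });
  });
  return pairs;
}

var manifests = {
  dbpedia: {
    microcompilers: { name: function (v) { return '<urn:example:' + v + '>'; } },
    skeletons: ['<[name]> ^(dbo:director|dbo:starring) ?[filmwork]?']
  },
  other: {
    microcompilers: {},
    skeletons: ['<[title]> dc:creator ?[author]?']
  }
};
var query = { bindings: { type: 'actor', name: 'clint_eastwood' }, requested: ['filmwork'] };
console.log(selectSources(manifests, query).map(function (p) { return p.source; }));
```

Grouping manifests by entity type instead of by data source would only change the keys of the manifests object in this sketch.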
21. Implementation
• Reference open source implementation
written in Java
– With support for SPARQL and HTTP
dereferencing of RDF
– Includes JIT logic, custom experimental VDIS and
HTTP API
• Accepts microcompilers in JavaScript
• Apache CouchDB map-reduce for atomically
retrieving candidate compilation units
23. Experiments
What is the price paid to turn a federated
query engine into a virtual data integration
system using JIT recompilation?
24. Experiments
1. Benchmark of FedBench [1] queries translated into our target
language
2. Take a federated query engine (FedX) [2]
3. Measure the time taken by FedX to execute the original
FedBench SPARQL query
– On the live endpoints whenever possible
4. Take each translated query and recompile it into one or
more SPARQL queries (at most one per data source)
– Execute each query with FedX
5. Measure for each:
– Increase in size of “correct” result set
– Recompilation overhead
– Overall turnaround time of queries
[1] http://fedbench.fluidops.net
[2] https://www.fluidops.com/en/company/knowledge/open_source
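The three measurements in step 5 can be sketched as a small harness. This is an illustration only, not the setup used in the experiments: the function measure, its callback arguments, and the synchronous execution model are all assumptions, and the sketch counts distinct results without checking their correctness.

```javascript
// Hypothetical measurement harness.
//   originalCount: size of the result set of the original query
//   recompile():   returns the list of recompiled target queries
//   execute(t):    runs one target query and returns an array of results
function measure(originalCount, recompile, execute) {
  var start = Date.now();
  var targets = recompile();
  var compiled = Date.now();

  // Merge results from all data sources, dropping duplicates.
  var seen = {};
  targets.forEach(function (target) {
    execute(target).forEach(function (result) {
      seen[String(result)] = true;
    });
  });
  var done = Date.now();

  var count = Object.keys(seen).length;
  return {
    overheadMs: compiled - start,   // recompilation overhead
    turnaroundMs: done - start,     // overall turnaround time
    resultCount: count,
    // Growth factor relative to the original result set, if it had any results.
    boost: originalCount > 0 ? count / originalCount : null
  };
}

var report = measure(
  2,
  function () { return ['query-for-source-A', 'query-for-source-B']; },
  function (target) {
    return target === 'query-for-source-A' ? ['a', 'b'] : ['b', 'c'];
  }
);
console.log(report.resultCount, report.boost);
```

When the original query returns nothing, a ratio is undefined, which is why the sketch reports null in that case and leaves the absolute count to speak for itself.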
26. Results I
Query    Result set VDI boost    Notes
FedBench Cross-Domain
CD1C     m * 1.387
CD2C     52 new results          Plain FedX yielded no results
CD3C     67 new results          Plain FedX yielded no results, has SERVICE clause
CD4C     m * 4480.0              Some microcompilers perform queries
CD5C     m * 1.0                 No increment from recompilation
FedBench Life Sciences
LS1C     m * 1.0                 Query could not be expanded
LS2C     m * 1.0                 No increment from recompilation
LS3C     70981 results           Plain FedX crashed
FedBench Linked Data
LD5C     m * 3.677
LD9C     4 new results           Plain FedX yielded no results
LD10C    m * 17.0
LD11C    m * 1.65
29. Discussion
• Can compile star-shaped input queries into more
complex target queries
• Overhead is mostly a standard cost
• Proves to be mostly efficient when also effective (i.e.
there is query expansion)
• Still cannot substitute query federation optimisation
strategies
• Manageability? We knew exactly how to proceed…
– However we worked with ~ |A| · |MS| + |MT| microcompilers
and query skeletons, where it could have been up to |MT +
A| · |MS| + |MT|
30. Future work
• Optimisations to abate JIT overhead
• Application to chain-shaped queries and other
query types
• Investigate other target languages
• Investigate templating languages for query
skeletons
• Cascaded mappings applied at query time (no
knowledge of dataset content or structure)
31. Thank You