Heterogeneous Data Aggregation and Querying at Web Scale Using Semantic Alignment Techniques
1. Franck MICHEL - Université Côte d’Azur, CNRS, Inria, I3S, France
F. Michel
Université Côte d’Azur, CNRS, Inria, I3S laboratory
Défi MASTODONS - Les Big Data en recherche (Big Data in Research), 13 June 2019
2.
More data sources, more data integration opportunities
3.
Hortus Sanitatis.
First Natural History encyclopaedia, 1485.
5.
Data integration examples in Digital Humanities
Archaeological excavation
Conservation biology*
*http://www.lynxeds.com/hmw/plate/family-delphinidae-ocean-dolphins
First Natural History encyclopaedia, 1485.
Knowledge formalization
Controlled vocabularies,
taxonomies,
domain ontologies…
6.
fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE
(federation of distributed data and knowledge in biomedical imaging)
Scientific annual workshops 2012, 2013, 2014, 2015
Issues:
High heterogeneity
Increasing amount/number of sources
Need for cross-factor analysis
Sensitive (privacy, access policies)
Methods:
Knowledge formalization
Semantic alignment
Mediation towards common formats
Distributed querying
7.
How to enable RDF-based integration
of heterogeneous data sources?
8.
RDF-based Data Integration
Two approaches over heterogeneous data sources (e.g. a table with columns ID, NAME):
• Graph materialization (ETL-like), queried with SPARQL
• Virtual graph, queried with SPARQL through query rewriting
9.
Many methods for many types of data sources
• XML: AstroGrid-D, SPARQL2XQuery, XSPARQL
• CSV/TSV/Spreadsheets: XLWrap, Linked CSV, CSVW, RML
• Relational Databases: D2RQ, R2O, Ultrawrap, Triplify, SM, R2RML (Morph-RDB, ontop, Virtuoso)
• Multiple formats: RML, TARQL, Apache Any23, DataLift, SPARQL-Generate
• HTML: RDFa, Microformats, JSON-LD
• JSON: TARQL, JSON-LD, RML
• NoSQL: xR2RML (MongoDB), ontop (MongoDB), [Mugnier et al., 2016] (key-value stores)
• Web APIs: SPARQL Micro-services, Linked REST APIs
M.L. Mugnier, M.C. Rousset, and F. Ulliana. “Ontology-Mediated Queries for NOSQL Databases.” In Proc. AAAI. 2016.
10.
Agenda
xR2RML: Generic translation of
heterogeneous data sources into RDF
SPARQL micro-services:
Bridging Web APIs and the Web of Data
Applications in the biodiversity domain
12.
The generic translation of
heterogeneous data sources into RDF
requires a generic mapping description.
13.
Table TEACHERS:
ID | FNAME | TEACHES
7 | Catherine | Semantic Web
8 | Philippe | Software Engineering
… | … | …
A mapping description translates each row into RDF, e.g. for row 7:
http://example.org/teacher/7 foaf:name "Catherine"
http://example.org/teacher/7 ex:teaches https://www.wikidata.org/entity/Q54837
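The idea of a declarative mapping description can be sketched in a few lines of Python. This is only an illustration, not the xR2RML syntax: the `MAPPING` dictionary, its keys, the `COURSE_URIS` lookup table and the function names are all assumptions made for this example. Only the table content and the target triples come from the slide.

```python
# Hypothetical in-memory copy of the TEACHERS table shown on the slide.
TEACHERS = [
    {"ID": 7, "FNAME": "Catherine", "TEACHES": "Semantic Web"},
    {"ID": 8, "FNAME": "Philippe", "TEACHES": "Software Engineering"},
]

# Assumption: a lookup from course label to an external entity URI.
COURSE_URIS = {"Semantic Web": "https://www.wikidata.org/entity/Q54837"}

# Assumption: a toy mapping description (not the xR2RML vocabulary).
MAPPING = {
    "subject_template": "http://example.org/teacher/{ID}",
    "predicate_object_maps": [
        {"predicate": "foaf:name", "column": "FNAME", "kind": "literal"},
        {"predicate": "ex:teaches", "column": "TEACHES", "kind": "iri"},
    ],
}


def row_to_triples(row, mapping):
    """Apply the mapping to one row and return (subject, predicate, object) triples."""
    subject = "<" + mapping["subject_template"].format(**row) + ">"
    triples = []
    for pom in mapping["predicate_object_maps"]:
        value = row[pom["column"]]
        if pom["kind"] == "iri":
            uri = COURSE_URIS.get(value)
            if uri is None:
                continue  # no known URI for this value: emit nothing
            obj = "<" + uri + ">"
        else:
            obj = '"' + str(value) + '"'
        triples.append((subject, pom["predicate"], obj))
    return triples


def translate(rows, mapping):
    """Translate all rows; the mapping is data, so the code stays source-agnostic."""
    return [t for row in rows for t in row_to_triples(row, mapping)]


for triple in translate(TEACHERS, MAPPING):
    print(" ".join(triple), ".")
```

Because the mapping is plain data, supporting another table only requires another mapping, not new translation code. That is the property the slide attributes to a generic mapping description.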
14.
The xR2RML mapping language
Uniform description of mappings from
most common types of DB to RDF
Extends R2RML, the W3C recommendation
for RDBs, and RML
Rich iteration model to accommodate
nested, hierarchical documents
Flexibility:
• Allow any query language
• Allow any syntax to reference data elements
from query results
http://i3s.unice.fr/~fmichel/xr2rml_specification_v5.html
15.
How to query a data source with SPARQL
using such a mapping description?
16.
SPARQL rewriting techniques for SQL and XQuery
Semantics-preserving 1-to-1 rewriting
Closely coupled with the capabilities of the target query language:
support of joins, unions, nested queries, filtering, string functions, etc.
Optimization:
Enforced on the target query,
or delegated to the DB query-processing engine
SQL: Bizer & Cyganiak, 2006; Unbehauen et al., 2013a; Priyatna et al., 2014; Rodríguez-Muro & Rezk, 2015
XQuery: Bikakis et al., 2015
Optimization: Unbehauen et al., 2013b; Rodríguez-Muro & Rezk, 2015; Elliott et al., 2009; Sequeda & Miranker, 2013
17.
How much of the SPARQL rewriting process can be
done in a DB-agnostic yet optimized manner?
18.
Abstract Query Language (AQL)
Carries enough information for translation towards “any” DB query language.
Early optimizations:
Self-Join Elimination, Self-Union Elimination, Filter propagation
Translation pipeline: SPARQL query + xR2RML mappings => Abstract query => Concrete DB query
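To make two of the early optimizations more concrete, here is a minimal sketch on a toy abstract query tree. It is not the AQL grammar from the related publications: the classes `Atomic`, `UnionOf` and `Filter`, the string-based filter conditions and the MongoDB-like query string are assumptions made for illustration. Self-union elimination is read here as collapsing identical union operands, which assumes that duplicate solutions do not matter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Atomic:
    """A sub-query sent to the data source, with the SPARQL variables it binds."""
    source_query: str
    variables: frozenset
    filters: frozenset = frozenset()


@dataclass(frozen=True)
class UnionOf:
    members: tuple


@dataclass(frozen=True)
class Filter:
    variable: str
    condition: str
    child: object


def push(query, variable, condition):
    """Push one filter condition as deep as possible into the query tree."""
    if isinstance(query, Atomic):
        if variable in query.variables:
            return Atomic(query.source_query, query.variables,
                          query.filters | {condition})
        return Filter(variable, condition, query)  # cannot be pushed: keep it outside
    if isinstance(query, UnionOf):
        # A filter over a union applies to every member of the union.
        return UnionOf(tuple(push(m, variable, condition) for m in query.members))
    # A filter that could not be pushed earlier: keep it, push below it.
    return Filter(query.variable, query.condition,
                  push(query.child, variable, condition))


def propagate_filters(query):
    if isinstance(query, Filter):
        return push(propagate_filters(query.child), query.variable, query.condition)
    if isinstance(query, UnionOf):
        return UnionOf(tuple(propagate_filters(m) for m in query.members))
    return query


def eliminate_self_unions(query):
    """Collapse identical union operands (duplicate solutions are ignored)."""
    if isinstance(query, UnionOf):
        kept = []
        for member in map(eliminate_self_unions, query.members):
            if member not in kept:
                kept.append(member)
        return kept[0] if len(kept) == 1 else UnionOf(tuple(kept))
    if isinstance(query, Filter):
        return Filter(query.variable, query.condition,
                      eliminate_self_unions(query.child))
    return query


# Hypothetical example: the same sub-query appears twice under a filter.
taxa = Atomic("db.taxa.find({})", frozenset({"?name"}))
query = Filter("?name", "?name = 'Delphinus'", UnionOf((taxa, taxa)))
optimized = eliminate_self_unions(propagate_filters(query))
print(optimized)
```

Both rewritings operate on the abstract query only, so they can run before any database-specific translation, which is what makes them "early" optimizations in the sense of the slide.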
19.
Application to MongoDB
AQL-to-MongoDB rewriting is challenging:
Expressiveness gap between SPARQL, AQL and MongoDB:
joins not supported, nested queries hardly supported, limited filter expressions
Semantic ambiguity
20.
Semantic Web vs. NoSQL
Semantic Web | NoSQL
highly connected graphs | isolated documents, joins hardly supported
rich query expressiveness | low expressiveness
reasoning | _
? | high throughput, high availability
_ | horizontal elasticity
Filling the gap between the two worlds is not straightforward.
Yet, NoSQL DBs are a huge, quickly increasing source of data.
Potential for RDF-based data integration and publication in the Web of Data.
21.
The generic approach is suitable when we have
direct access to the data source
(graph materialization or query rewriting, queried with SPARQL).
What if we access the data source via an API?
22.
Agenda
xR2RML: Generic translation of
heterogeneous data sources into RDF
SPARQL micro-services:
Bridging Web APIs and the Web of Data
Applications in the biodiversity domain
23.
Web APIs: APIs all over the web
21,700+ Web APIs are registered on ProgrammableWeb.com (Jun. 2019)
Limitations:
• Standard formats (e.g. JSON, XML)
but proprietary vocabularies
• Documented in web pages but
not machine-processable,
no explicit semantics
• Internal resource identifiers,
no hyperlinks to resources
• Partial view over the database by
means of predefined services
24.
The SPARQL Micro-Service Architecture
Lightweight method to query a Web API with SPARQL
SPARQL client <=> SPARQL micro-service <=> Web API:
(1) SPARQL query (client to micro-service)
(2) Web API query (micro-service to Web API)
(3) Web API response
(4) SPARQL response
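The four steps can be sketched as follows. This is a toy illustration under strong assumptions, not the authors' implementation: the Web API is replaced by a local stub (`fake_web_api`), the JSON field names are invented, and the SPARQL evaluation is reduced to matching a single triple pattern. The photo title and URI are borrowed from the example used elsewhere in the deck.

```python
import json


def fake_web_api(tag):
    """Stand-in for steps (2) and (3): a Web API returning JSON (hypothetical data)."""
    photos = [
        {"id": "53735656", "title": "Brooklyn Bridge sunset", "tags": ["bridge"]},
    ]
    return json.dumps([p for p in photos if tag in p["tags"]])


def json_to_triples(payload):
    """Translate the API's proprietary JSON into RDF-like triples."""
    triples = []
    for photo in json.loads(payload):
        subject = "http://example.org/photo/" + photo["id"]
        triples.append((subject, "schema:name", photo["title"]))
    return triples


def match(triples, pattern):
    """Evaluate one triple pattern; terms starting with '?' are variables."""
    solutions = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break  # same variable bound to two different values
            elif term != value:
                break  # constant term does not match
        else:
            solutions.append(binding)
    return solutions


def micro_service(pattern, tag):
    """(1) receive a query, (2)+(3) call the Web API, (4) return the solutions."""
    payload = fake_web_api(tag)
    triples = json_to_triples(payload)
    return match(triples, pattern)


print(micro_service(("?photo", "schema:name", "?title"), "bridge"))
```

A real micro-service would delegate step (4) to a SPARQL engine evaluating the full client query over the temporary graph. The single-pattern matcher above only stands in for that engine.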
25.
Bridging Web APIs and the Web of Data
Assign dereferenceable
URIs to Web API resources
Expose in the Web of Data
resources locked in a silo
Example resource unlocked by a SPARQL µ-service:
http://example.org/photo/53735656, described with
schema:name "Brooklyn Bridge sunset" and schema:contentUrl
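What "dereferenceable" means can be illustrated with a small sketch: a URI is minted from the internal identifier of a Web API resource, and looking that URI up returns a description of the resource. Everything here is hypothetical except the photo URI and title taken from the slide: the `SILO` dictionary stands in for the Web API, and the file URL is invented.

```python
BASE = "http://example.org/photo/"

# Hypothetical stand-in for the Web API silo, keyed by internal identifier.
SILO = {
    "53735656": {
        "title": "Brooklyn Bridge sunset",
        "file": "http://example.org/files/53735656.jpg",  # invented URL
    },
}


def mint_uri(internal_id):
    """Turn an internal Web API identifier into a Web-wide URI."""
    return BASE + internal_id


def dereference(uri):
    """Look up a minted URI and return triples describing the resource."""
    if not uri.startswith(BASE):
        return []
    record = SILO.get(uri[len(BASE):])
    if record is None:
        return []
    return [
        (uri, "schema:name", record["title"]),
        (uri, "schema:contentUrl", record["file"]),
    ]


for triple in dereference(mint_uri("53735656")):
    print(triple)
```

In a deployment, `dereference` would sit behind an HTTP server answering requests on the minted URIs. The dictionary lookup stands in for the call to the underlying Web API.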
26.
Agenda
xR2RML: Generic translation of
heterogeneous data sources into RDF
SPARQL micro-services:
Bridging Web APIs and the Web of Data
Applications in the biodiversity domain
27.
Use case
TAXREF
French TAXonomic REFerence for fauna, flora and fungi,
maintained by the Muséum National d’Histoire Naturelle.
570,000+ scientific names, 260,000+ taxa
Mainland France and overseas territories
Web site, Web service, downloadable text file
28.
Biodiversity studies (e.g. impact of global warming
on species distributions) require mashing up data
from multiple stakeholders
How to make biodiversity data FAIR?
29.
TAXREF-LD
Linking Open Data cloud diagram, 2019. J.P. McCrae, A. Abele,
P. Buitelaar, R. Cyganiak, A. Jentzsch, V. Andryushechkin and J.
Debattista. http://lod-cloud.net/
http://taxref.mnhn.fr/sparql
Several steps involved…
• Modelling of taxonomic
information as Linked Data
• Write and enact xR2RML
mappings
(JSON => MongoDB => RDF)
• Publish on the Web of Data
30.
Architecture diagram: Web app. (SPARQL, HTML), SPARQL micro-services,
and Linked Data sources: TAXREF-LD, NCBI, TaxonConcept, Agrovoc
31.
SPARQL micro-services to compare TAXREF
information with 7 biodiversity sources:
• FishBase
• Global Biodiversity Information Facility
• World Register of Marine Species
• Pan-European Species directories Infrastructure
• Index Fungorum
• Tropicos
• Sandre – Service d’Administration Nationale des
Données et Référentiels sur l’Eau
32.
http://sms.i3s.unice.fr/demo-sms?param=Delphinapterus+leucas
33.
Take-aways
More data sources => new data
integration scenarios
Need for explicit, machine-processable
data semantics
The Semantic Web provides tools to do that:
RDF, SPARQL, ontologies…
Various methods to translate
heterogeneous data sources to RDF:
• Mapping language-based
• Wrapper-based
More research needed to:
• Allow automatic discovery of data sources,
e.g. data portals, search engines…
• Automatically generate federated queries
• Automate semantic alignment of data
sources represented in RDF
These techniques are a way to achieve
Open Data, Open Science, FAIRness
34.
Related publications
Generic translation to RDF
Michel F., Djimenou L., Faron-Zucker C. & Montagnat J. (2015). Translation of Relational
and Non-Relational Databases into RDF with xR2RML. In Proceedings of WebIST, pp.
443–454. Lisbon, Portugal.
Michel F., Faron-Zucker C. & Montagnat J. (2016). A Generic Mapping-Based Query
Translation from SPARQL to Various Target Database Query Languages. In Proceedings of
WebIST vol. 2, pp. 147–158. Rome, Italy.
Michel F., Faron-Zucker C. & Montagnat J. (2016). A Mapping-based Method to Query
MongoDB Documents with SPARQL. In Proceedings of DEXA vol. 9828, LNCS, pp. 52–67.
Porto, Portugal.
Michel F., Faron-Zucker C. & Montagnat J. (2018). Bridging the Semantic Web and NoSQL
Worlds: Generic SPARQL Query Translation and Application to MongoDB. Transactions
on Large-Scale Data- and Knowledge-Centered Systems (LNCS 11360):125–165.
Biodiversity
Michel F., Gargominy O., Tercerie S. & Faron-Zucker C. (2017). A Model to Represent
Nomenclatural and Taxonomic Information as Linked Data. Application to the French
Taxonomic Register, TAXREF. In Proceedings of the ISWC2017 workshop on Semantics for
Biodiversity (S4BioDiv) vol. 1933. Vienna, Austria.
Michel F., Faron-Zucker C., Tercerie S. & Gargominy O. (2018). Modelling Biodiversity Linked
Data: Pragmatism May Narrow Future Opportunities. In Biodiversity Information Science
and Standards, TDWG 2018 Proceedings vol. 2, p. e26235. Dunedin, New Zealand.
SPARQL micro-services
Michel F., Faron-Zucker C. & Gandon F. (2018). SPARQL Micro-Services: Lightweight
Integration of Web APIs and Linked Data. In Proceedings of the Linked Data on the Web
Workshop (LDOW2018). Lyon, France.
Michel F., Faron-Zucker C., Gargominy O. & Gandon F. (2018). Integration of Web APIs and Linked
Data Using SPARQL Micro-Services—Application to Biodiversity Use Cases. Information
9(12):310.
Michel F., Faron-Zucker C., Corby O. & Gandon F. (2019). Enabling Automatic Discovery and
Querying of Web APIs at Web Scale using Linked Data Standards. In Companion Proceedings of
the 2019 World Wide Web Conference (WWW ’19 Companion). San Francisco, CA, USA.