SFScon22 - Francesco Corcoglioniti - Integrating Dynamically-Computed Data and Web APIs into Virtual Databases Knowledge Graph.pdf

Enabling transparent SQL/SPARQL access to both static and dynamically-computed data

Query languages for databases (e.g., SQL) and knowledge graphs (e.g., SPARQL) provide a concise, declarative, and highly flexible mechanism to access stored data. Yet many use cases also involve dynamically-computed data available through web APIs or other forms of external services. In such settings, data access is less flexible (e.g., due to restrictions on the available input/output methods), less convenient, and sometimes prohibitively slow for users querying data interactively. In this talk, we discuss these problems and present open-source solutions that enable querying dynamically-computed data as a “virtual” (since not fully materialized) relational database via SQL, or as a “virtual” knowledge graph via SPARQL, while providing pre-computation and caching solutions to speed up data access. The core components presented in the talk have been developed in the context of the HIVE “Fusion Grant” project and the OntoCRM project, both involving UNIBZ and Ontopic srl. In both projects, we aim to extend virtual knowledge graphs to dynamically-computed data, with a particular focus on applications in the domains of environmental sustainability and climate risk management.

  1. Integrating Dynamically-Computed Data and Web APIs into “Virtual” Databases and Knowledge Graphs
     Enabling transparent SQL/SPARQL access to both static and dynamically-computed data
     Francesco Corcoglioniti, 2022-11-11
     Postdoc @ KRDB, Free University of Bolzano, supported by the HIVE Fusion Grant project (2021-2022), the OntoCRM project (2022-2024), and Ontopic s.r.l.
  2. Background
     Data is increasingly available via Web APIs
     • access to 3rd-party and/or dynamically-computed data
     • access to data-related services, e.g., text search
     Some API statistics (source: https://nordicapis.com/20-impressive-api-economy-statistics/)
     • 83% of all Internet traffic belongs to API-based services
     • 2M+ API repositories on GitHub
     • 90% of developers use APIs
     • 30% of development time spent on coding APIs
     Complex data access problem for applications operating on data from both databases and APIs
     [Figure: an application accesses RDB sources via SQL and API sources via calls, which results in a complex data access problem]
  3. Simplify API Access via “Virtual” Databases (VDBs) or “Virtual” Knowledge Graphs (VKGs)
     [Figure: the application queries a Virtual Database (VDB) via SQL, or a Virtual Knowledge Graph (VKG) via SPARQL; the VDB/VKG in turn accesses RDB sources via SQL and API sources via calls]
     • unified data access: applications operate on a single DB/KG data source via a declarative data manipulation language (DML)
     • virtual DB/KG: its data is (mostly) kept in the original sources (no ETL)
     • data federation setting: VDB/VKG queries run by orchestrating source sub-queries and API calls
  4. Example Scenario – Extend Open Data Hub (ODH) with Semantic Search
     Answer hybrid queries like:
     • get (plot) IRI, description, rating & location of accommodations ...
     • whose rating is 3 stars or more (structured constraint) and ...
     • whose EN description matches the search string “horse riding” (text constraint)
     Semantic search: improved text search that aims at capturing and leveraging text meaning (vs term matching only)
     • e.g., via BERT-based model from Sentence Transformers library
  5. VDB Specification – SQL/MED
     SQL/MED allows federating multiple sources in a virtual database (VDB)
     • standardized SQL extension supported by some data federation systems like Teiid
     • VDB as a set of schemas mapped to foreign data sources accessed via wrappers/translators
     • we extend Teiid with a new service translator for accessing APIs
     Example using Teiid with our extensions:
       CREATE DATABASE vdb_example OPTIONS ( "... connection options for federated sources ..." );
       USE DATABASE vdb_example;

       CREATE SERVER db_source FOREIGN DATA WRAPPER postgresql;  -- define RDB source with schema 'db'
       CREATE SCHEMA db SERVER db_source;                        -- using 'postgresql' translator to access it

       CREATE SERVER srv_source FOREIGN DATA WRAPPER service;    -- define API source with schema 'srv'
       CREATE SCHEMA srv SERVER srv_source;                      -- using 'service' translator to access it

       IMPORT FOREIGN SCHEMA public FROM SERVER db_source INTO db OPTIONS ( importer.catalog 'public' );

       SET SCHEMA srv;
       -- CREATE FOREIGN TABLE / PROCEDURE statements mapped to API operations (API bindings)
  6. VDB Specification – API Bindings
     API operations as SQL/MED procedures
     • input tuple → 0..n output tuples
     • URL, method, request/response templates
       CREATE FOREIGN PROCEDURE api_semsearch_query ( query VARCHAR )
       RETURNS TABLE ( query VARCHAR, id VARCHAR, score DOUBLE, excerpt VARCHAR )
       OPTIONS (
         "method" 'post',
         "url" 'http://semsearch:8080/query',
         "requestBody" '{"query": "{query}", "n": 100}',
         "responseBody" '{"matches": [{ "id": "{id}", "score": "{score}", "excerpt": "{excerpt}" }] }'
       );
     API data as SQL/MED virtual tables
     • linked to API operations/procedures
     • each procedure defines an access pattern
       CREATE FOREIGN TABLE vt_semsearch_match (
         query VARCHAR NOT NULL,
         id VARCHAR NOT NULL,
         score DOUBLE NOT NULL,
         excerpt VARCHAR NOT NULL,
         PRIMARY KEY (query, id)
       ) OPTIONS ( "select" 'api_semsearch_query' );

       CREATE FOREIGN TABLE vt_semsearch_index (
         id VARCHAR PRIMARY KEY,
         text VARCHAR NOT NULL
       ) OPTIONS (
         "UPDATABLE" 'true',
         "upsert" 'api_semsearch_store',
         "delete" 'api_semsearch_clear'
       );
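The request/response templates of an API binding can be read as a two-way mapping between tuples and JSON. The following Python sketch illustrates one possible interpretation; it is not the service translator's actual implementation. The functions `render_request` and `extract_rows` are hypothetical names, and the rule that a one-element list in the template matches every element of the response list is an assumption made for illustration.

```python
import json
import re

# A template string consisting solely of "{name}" marks an input/output attribute.
PLACEHOLDER = re.compile(r"^\{(\w+)\}$")


def render_request(template: str, inputs: dict) -> str:
    """Replace "{name}" placeholders in a JSON request template with input values."""
    def fill(node):
        if isinstance(node, dict):
            return {key: fill(value) for key, value in node.items()}
        if isinstance(node, list):
            return [fill(value) for value in node]
        if isinstance(node, str):
            match = PLACEHOLDER.match(node)
            return inputs[match.group(1)] if match else node
        return node  # constants such as 100 are passed through unchanged
    # Parsing first and serializing last keeps quotes in input values properly escaped.
    return json.dumps(fill(json.loads(template)))


def extract_rows(template, response, row):
    """Walk a parsed response template and a parsed response together.

    Returns the list of output tuples (dicts) obtained by extending `row`;
    an empty list in the response yields no tuples (assumed semantics).
    """
    if isinstance(template, str):
        match = PLACEHOLDER.match(template)
        return [{**row, match.group(1): response}] if match else [row]
    if isinstance(template, list):
        return [r for item in response for r in extract_rows(template[0], item, row)]
    if isinstance(template, dict):
        rows = [row]
        for key, sub_template in template.items():
            rows = [r for partial in rows
                    for r in extract_rows(sub_template, response[key], partial)]
        return rows
    return [row]
```

With the `api_semsearch_query` templates above, one input tuple `{"query": "horse riding"}` would produce one request body and as many output tuples as there are entries under `matches` in the response, which matches the "input tuple → 0..n output tuples" contract.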
  7. Query Translation & Execution
     Given a VDB defined using SQL/MED + API Bindings and an input query over the VDB
     • Teiid splits the query into sub-queries based on translator capabilities and cost heuristics
     • sub-queries are sent to translators & Teiid handles remaining operations (e.g., federated joins)
     Example SQL query
       SELECT s.score, s.excerpt, a."AccoCategoryId", a."AccoDetail-en-Name", a."AccoDetail-en-City"
       FROM srv.vt_semsearch_match AS s
       JOIN db.v_accommodationsopen AS a ON s.id = a."Id"
       WHERE s.query = 'horse riding'
       ORDER BY s.score DESC
       LIMIT 10
     Execution plan
       LimitNode (limit = 10)
         SortNode (s.score DESC)
           ProjectNode (s.score, ... a."AccoDetail-en-City")
             JoinNode (s.id = a."Id", merge join strategy)
               AccessNode (API)  SELECT id, excerpt, score FROM vt_semsearch_match WHERE query = 'horse riding'
               AccessNode (RDB)  SELECT "Id", "AccoDetail-en-Name", "AccoDetail-en-City" FROM v_accommodationsopen
  8. Query Translation & Execution – Push-down of Projection, Filtering, Sorting, Slicing
     Special input attributes map API capabilities related to standard relational operators
     • filtering: return/process only objects matching some criteria (e.g., attribute = or ≥ constant)
     • projection: include/exclude certain attributes in returned results
     • sorting: sort results according to a certain attribute and direction (ascending/descending)
     • slicing: return only a given page of all possible results
       CREATE FOREIGN PROCEDURE api_station_data_from_to (
         stype VARCHAR NOT NULL,
         sname VARCHAR NOT NULL,
         tname VARCHAR NOT NULL,
         __min_inclusive__mvaliddate DATE NOT NULL,  -- filter push-down (conditions min <= mvaliddate <= max)
         __max_inclusive__mvaliddate DATE NOT NULL,
         __limit__ INTEGER                           -- slicing push-down
       ) RETURNS TABLE ( ... )
       OPTIONS ( ... );
     Partial/complete push-down of these operators whenever possible
     • allows offloading computation to the API (e.g., sorting)
     • allows reducing costs by manipulating & transferring less data
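To make the role of the special input attributes concrete, here is a minimal Python sketch of how a planner could split query conditions into arguments pushed down to the API and residual filters evaluated afterwards. The function `plan_pushdown` and the sample values are hypothetical; only the `__min_inclusive__`, `__max_inclusive__`, and `__limit__` naming comes from the slide, and the actual Teiid-based logic may differ.

```python
def plan_pushdown(params, conditions, limit=None):
    """Split query conditions into API arguments and residual filters.

    params:     set of input parameter names declared by the API procedure
    conditions: list of (attribute, operator, constant) triples
    limit:      optional LIMIT of the query
    """
    range_prefixes = {">=": "__min_inclusive__", "<=": "__max_inclusive__"}
    args, residual = {}, []
    for attribute, operator, constant in conditions:
        if operator == "=" and attribute in params:
            args[attribute] = constant  # plain equality on an input attribute
        elif operator in range_prefixes and range_prefixes[operator] + attribute in params:
            args[range_prefixes[operator] + attribute] = constant  # range filter push-down
        else:
            residual.append((attribute, operator, constant))  # left to the federation engine
    # Slicing is only safe to push down when no filter remains to be applied afterwards,
    # otherwise the API could cut off rows that the residual filter would have kept.
    if limit is not None and "__limit__" in params and not residual:
        args["__limit__"] = limit
    return args, residual
```

For example, with the parameters of `api_station_data_from_to`, a query constraining `mvaliddate` to a date interval would have both bounds pushed down, whereas a condition on an attribute the API cannot filter would stay in `residual` and block the push-down of `LIMIT`.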
  9. Query Translation & Execution – Exploiting Bulk API Operations
     Bulk API operations operate on multiple input tuples, such as lookup by set of IDs or bulk store
     • their use enables better performance due to fewer API calls
     • useful to speed up dependent joins (using the IN operator) between RDBMS and API data
     [Figure: dependent join ⨝ R.A = S.A between RDBMS table R and virtual table S, where S is backed by a bulk API operation with input attribute A]
       1. SELECT A, … FROM R WHERE …  (evaluated on the RDBMS table R)
       2. Extract values of join attribute A: a1, a2, …
       3. SELECT A, … FROM S WHERE A IN (a1, a2, …) AND …  (evaluated on the virtual table S)
       4. Through the API bindings, bulk API calls with multiple input tuples for different values of A: a1, a2, …
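The four steps of the dependent join can be sketched in Python as follows. This is an illustration under stated assumptions, not Teiid's implementation: `dependent_join`, `bulk_lookup`, and `batch_size` are hypothetical names, and `bulk_lookup` stands in for a bulk API operation that takes a list of key values and returns the matching rows.

```python
def dependent_join(r_rows, bulk_lookup, key, batch_size=100):
    """Join RDBMS rows with API rows, calling the bulk API once per batch of keys.

    r_rows:      rows (dicts) already fetched from the RDBMS table R (step 1)
    bulk_lookup: callable taking a list of key values, returning rows of S (step 4)
    key:         name of the join attribute shared by R and S
    """
    # Step 2: extract the distinct values of the join attribute, preserving order.
    keys = list(dict.fromkeys(row[key] for row in r_rows))

    # Steps 3-4: one bulk call per batch instead of one call per value.
    s_by_key = {}
    for start in range(0, len(keys), batch_size):
        for s_row in bulk_lookup(keys[start:start + batch_size]):
            s_by_key.setdefault(s_row[key], []).append(s_row)

    # Final join R.A = S.A, performed by the federation engine.
    return [{**r, **s} for r in r_rows for s in s_by_key.get(r[key], [])]
```

With `n` distinct join values, this issues `ceil(n / batch_size)` API calls rather than `n`, which is where the performance gain comes from.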
  10. Data Materialization
     Data materialization: required by API operations that cannot be invoked at query time
     • operations too expensive to call at query time (e.g., align API and DB identifiers)
     • operations instrumental to the use of external APIs (e.g., text indexing in a search engine)
     Solution #1: materialized views in Teiid (or in whichever data federation system is used)
     Solution #2: dedicated materialization engine for flexibly executing arbitrary materialization rules, each consisting of:
     • identifier – for documentation & diagnostics
     • target – the system-managed computed table (possibly virtual) where data is stored
     • source – arbitrary SQL query (over any tables) that produces the data to store
       rules:
         - id: index_accommodation_texts
           target: vt_semsearch_index
           source: |-
             SELECT "Id" AS id, "AccoDetail-en-Longdesc" AS text
             FROM v_accommodationsopen
             WHERE "AccoDetail-en-Longdesc" IS NOT NULL
         - ... other rules ...
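Evaluating a single materialization rule amounts to running its source query and storing the result in its target table. The Python sketch below shows this with an in-memory SQLite database standing in for the VDB, which is an assumption made purely so the example runs on its own: in the real system the target `vt_semsearch_index` is a virtual table whose upserts become API calls. The helpers `make_demo_db` and `run_rule` and the sample rows are hypothetical.

```python
import sqlite3

# The rule from the slide, expressed as a Python dict.
RULE = {
    "id": "index_accommodation_texts",
    "target": "vt_semsearch_index",
    "source": 'SELECT "Id" AS id, "AccoDetail-en-Longdesc" AS text '
              'FROM v_accommodationsopen '
              'WHERE "AccoDetail-en-Longdesc" IS NOT NULL',
}


def make_demo_db():
    """Create a stand-in database with a source table and an (ordinary) target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute('CREATE TABLE v_accommodationsopen '
                 '("Id" TEXT PRIMARY KEY, "AccoDetail-en-Longdesc" TEXT)')
    conn.execute('CREATE TABLE vt_semsearch_index '
                 '("id" TEXT PRIMARY KEY, "text" TEXT NOT NULL)')
    conn.executemany(
        "INSERT INTO v_accommodationsopen VALUES (?, ?)",
        [("A1", "Farm offering horse riding"), ("A2", None), ("A3", "Hotel near the lake")],
    )
    return conn


def run_rule(conn, rule):
    """Evaluate the rule's source query and upsert its rows into the target table."""
    cursor = conn.execute(rule["source"])
    columns = [description[0] for description in cursor.description]
    rows = cursor.fetchall()
    column_list = ", ".join(f'"{name}"' for name in columns)
    placeholders = ", ".join("?" for _ in columns)
    # INSERT OR REPLACE plays the role of the "upsert" operation of the virtual table.
    conn.executemany(
        f'INSERT OR REPLACE INTO {rule["target"]} ({column_list}) VALUES ({placeholders})',
        rows,
    )
    return len(rows)
```

Because the target is keyed on `id` and written via upsert, re-running the rule is idempotent, a property that is convenient when rules are re-evaluated.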
  11. Data Materialization (cont’d)
     Rules (their SQL source queries) are analyzed to derive a rule dependency graph, which is mapped to an execution plan using fixpoint rule evaluation for strongly connected components
     [Figure, three panels over rules R1 to R5: “Rule / Table Dependencies”, “Rule Dependencies”, “Execution Plan”]
     Execution plan:
       sequence ( parallel ( R1, sequence ( R2, fixpoint ( parallel ( R3, R4 ) ) ) ), R5 )
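The derivation of an execution plan from rule dependencies can be approximated with the Python sketch below. It is a simplified illustration, not the materialization engine's algorithm: the function `build_plan` is hypothetical, strongly connected components are found by plain reachability (adequate for small rule sets), and the result is a sequence of parallel layers, which is coarser than the nested plan shown on the slide. The table names in the usage note are invented for the example.

```python
def build_plan(rules):
    """Derive a layered execution plan from materialization rules.

    rules: dict mapping rule id -> (target_table, set_of_tables_read_by_source)
    Returns the plan rendered as a string.
    """
    # Rule B depends on rule A when B's source query reads the table written by A.
    deps = {b: {a for a, (target, _) in rules.items() if target in reads}
            for b, (_, reads) in rules.items()}

    def reachable(start):
        seen, stack = set(), [start]
        while stack:
            for nxt in deps[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    # Two rules share a strongly connected component when each reaches the other.
    reach = {rule: reachable(rule) for rule in rules}
    component = {rule: frozenset({rule} | {o for o in reach[rule] if rule in reach[o]})
                 for rule in rules}

    def step(members):
        names = sorted(members)
        cyclic = len(names) > 1 or names[0] in deps[names[0]]
        body = names[0] if len(names) == 1 else f"parallel({', '.join(names)})"
        return f"fixpoint({body})" if cyclic else body  # cycles need fixpoint evaluation

    layers, done, pending = [], set(), set(component.values())
    while pending:
        # A component is ready once everything it depends on outside itself has run.
        ready = sorted((c for c in pending if all(deps[r] <= done | c for r in c)), key=min)
        steps = [step(c) for c in ready]
        layers.append(steps[0] if len(steps) == 1 else f"parallel({', '.join(steps)})")
        for c in ready:
            done |= c
        pending -= set(ready)
    return layers[0] if len(layers) == 1 else f"sequence({', '.join(layers)})"
```

For instance, assuming R3 reads the tables written by R2 and R4, R4 reads the table written by R3, and R5 reads the tables written by R1 and R4, the call `build_plan({"R1": ("t1", {"src"}), "R2": ("t2", {"src"}), "R3": ("t3", {"t2", "t4"}), "R4": ("t4", {"t3"}), "R5": ("t5", {"t1", "t4"})})` yields a plan where R1 and R2 run in parallel, R3 and R4 are evaluated to a fixpoint, and R5 runs last.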
  12. VKG over APIs – Ontology-Based Data Access (OBDA) & Ontop
     OBDA builds a VKG on an RDB source
     • an ontology defines the VKG classes and properties (TBox)
     • mappings define how to populate each class/property with RDB data (ABox)
     • query rewriting maps VKG queries (SPARQL) into native queries (SQL) over the source
     • Ontop open-source system
     Idea: build a VDB over APIs, then apply OBDA to convert it into a VKG
     • Ontop + Teiid/service translator
     [Figure: query answering by query rewriting. An ontological query q is rewritten (using the ontology) and unfolded (using the mappings) into an SQL query; its evaluation over the data sources gives a relational answer, which result translation turns into an ontological answer. Source: Diego Calvanese, Francesco Corcoglioniti, Guohui Xiao (unibz), “VKGs for Data Access and Integration” (Huawei – 03/08/202)]
  13. VKG over APIs – Ontology & Mappings Example
     Ontology
       schema:Accommodation a owl:Class ;
         rdfs:subClassOf schema:Place ;
         rdfs:label "Accommodation"@en ; ...
       schema:name a owl:DatatypeProperty ; ...
       hive:Match a owl:Class ...
     Current ontology formalism (OWL 2 QL) reused as is, but now also models data from APIs
     Mappings
       mappingId  Semantic Search
       target     data:match/accommodation/{id}/{query} a hive:Match ;
                    hive:query {query}^^xsd:string ;
                    hive:resource data:accommodation/{id} ;
                    hive:excerpt {excerpt}@en ;
                    hive:score {score}^^xsd:decimal .
       source     SELECT * FROM hiveodh.srv.vt_semsearch_match
     Current VKG mapping formalism reused as is, but data may now come from API virtual tables
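To show what the “Semantic Search” mapping means, the Python sketch below instantiates its target template for one row of `vt_semsearch_match` and lists the resulting triples. This is only a reading aid: Ontop uses mappings for query rewriting and does not need to materialize triples this way. The functions `instantiate` and `match_triples` are hypothetical, and percent-encoding of values placed in IRIs is assumed here in the style of R2RML IRI templates.

```python
import re
from urllib.parse import quote


def instantiate(template, row):
    """Fill {column} placeholders of an IRI template, percent-encoding the values."""
    return re.sub(r"\{(\w+)\}",
                  lambda m: quote(str(row[m.group(1)]), safe=""),
                  template)


def match_triples(row):
    """Return the triples that the 'Semantic Search' mapping denotes for one row."""
    subject = instantiate("data:match/accommodation/{id}/{query}", row)
    return [
        (subject, "rdf:type", "hive:Match"),
        (subject, "hive:query", f'"{row["query"]}"^^xsd:string'),
        (subject, "hive:resource", instantiate("data:accommodation/{id}", row)),
        (subject, "hive:excerpt", f'"{row["excerpt"]}"@en'),
        (subject, "hive:score", f'"{row["score"]}"^^xsd:decimal'),
    ]
```

Each row returned by the semantic search API thus corresponds to one `hive:Match` individual linking a query string to an accommodation, together with its excerpt and score.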
  14. VKG over APIs – Query Rewriting & Evaluation Example
     User-supplied SPARQL query
       SELECT ?h ?posLabel ?rating ?pos {
         [] a hive:Match ;
            hive:query "horse riding"^^xsd:string ;
            hive:resource ?h ;
            hive:excerpt ?excerpt ;
            hive:score ?score .
         ?h a schema:LodgingBusiness ;
            geo:defaultGeometry/geo:asWKT ?pos ;
            schema:name ?name ;
            schema:description ?description ;
            schema:starRating/schema:ratingValue ?rating .
         FILTER (?rating >= 3 && lang(?name) = 'en' && lang(?description) = 'en')
         BIND (CONCAT(?name, " <br><br>...", ?excerpt, "...<br><br>", ?description) AS ?posLabel)
       }
       ORDER BY DESC(?score)
       LIMIT 10
     SQL query rewritten by Ontop (and evaluated on the VDB by Teiid)
       SELECT v1.id, v1.excerpt,               -- fields used
              v2."AccoDetail-en-Name",         -- for deriving
              v2."AccoDetail-en-Longdesc",     -- ?posLabel
              ... complex expression computing rating ...,
              ST_ASTEXT(v2."Geometry")
       FROM hiveodh.srv.vt_semsearch_match v1, hiveodh.db.v_accommodationsopen v2
       WHERE v1."id" = v2."Id"
         AND CAST(v1."query" AS TEXT) = 'horse riding'
         AND ... complex condition on rating >= 3 ...
         AND ... nonnull conditions for output columns ...
       ORDER BY CAST(v1."score" AS DECIMAL) DESC
       LIMIT 10
  15. VKG over APIs – ODH with Semantic Search Demo
     Data sources: DB with ODH tourism data + Semantic search API to index & query accommodation texts
     System: Ontop embedding Teiid + materialization engine
     Demo: https://hive.inf.unibz.it/odh/vkg/
  16. Overall Framework & Ongoing Work
     [Figure: a VDB-based application queries the Virtual DB (VDB) via SQL; a VKG-based application queries the Virtual Knowledge Graph (VKG) via SPARQL, and the VKG queries the VDB via SQL; the VDB accesses RDB sources via SQL and API sources via calls]
     Components
     • Virtual DB (VDB): Teiid + service translator
     • Virtual Knowledge Graph (VKG): Ontop
     • VKG Ontology: formalizes the classes/properties (the “schema”) of the VKG, enabling reasoning
     • VKG Mappings: include virtual tables and are used for query rewriting
     • Materialization Rules: pre-compute results of expensive API calls → VDB/VKG no longer fully “virtual”
     • API Bindings: define how to query/update a virtual table via API calls, if possible → limited access patterns
     Ongoing work:
       1. query rewriting tuned to VDB + APIs
       2. service translator improvements
       3. change data capture tools (e.g., Debezium) for incremental materialization
       4. application to analysis of static + dynamic data in the domain of climate risk management (OntoCRM project)
  17. Thanks for attending!
     Ontop: https://ontop-vkg.org/
     Teiid: https://teiid.io/
     Our extensions: https://hive.inf.unibz.it/