Generating Executable Mappings from RDF Data Cube Data Structure Definitions

Data processing is increasingly the subject of various internal and external regulations, such as GDPR which has recently come into effect. Instead of assuming that such processes avail of data sources (such as files and relational databases), we approach the problem in a more abstract manner and view these processes as taking datasets as input. These datasets are then created by pulling data from various data sources. Taking a W3C Recommendation for prescribing the structure of and for describing datasets, we investigate an extension of that vocabulary for the generation of executable R2RML mappings. This results in a top-down approach where one prescribes the dataset to be used by a data process and where to find the data, and where that prescription is subsequently used to retrieve the data for the creation of the dataset “just in time”. We argue that this approach to the generation of an R2RML mapping from a dataset description is the first step towards policy-aware mappings, where the generation takes into account regulations to generate mappings that are compliant. In this paper, we describe how one can obtain an R2RML mapping from a data structure definition in a declarative manner using SPARQL CONSTRUCT queries, and demonstrate it using a running example. Some of the more technical aspects are also described.

Reference: Christophe Debruyne, Dave Lewis, Declan O'Sullivan: Generating Executable Mappings from RDF Data Cube Data Structure Definitions. OTM Conferences (2) 2018: 333-350


  1. Generating Executable Mappings from RDF Data Cube Data Structure Definitions. Christophe Debruyne, Dave Lewis, Declan O’Sullivan, Trinity College Dublin. 2018-10-23 @ ODBASE. The ADAPT Centre is funded under the SFI Research Centres Programme (Grant 13/RC/2106) and is co-funded under the European Regional Development Fund.
  2. Introduction (www.adaptcentre.ie)
     • Data processing is increasingly the subject of various internal and external regulations, e.g., GDPR.
     • Datasets are created and used for a particular purpose, e.g., sending newsletters or using the purchase history of users to suggest recommendations. In the context of GDPR, these purposes require a user’s informed consent.
     • Can we generate datasets for a particular purpose “just in time” that comply with informed consent?
  3. Introduction
     • R2RML is a convenient way to transform (relational) non-RDF data into RDF to create these datasets.
     • One can create mappings from databases to vocabularies, ontologies, etc. for data processing activities.
     • We, however, chose to adopt the RDF Data Cube Vocabulary (QB) for representing datasets.
  4. Introduction
     • QB is an ontology for multi-dimensional datasets. A Data Structure Definition (DSD) prescribes how a Dataset and its Observations are structured. An Observation is identified by Dimensions and captures a value for a Measure.
     • QB’s foundations are rooted in a schema for statistical datasets, and the ontology is seemingly complicated, but the RDF vocabulary is useful for other types of datasets as well.
     • Our choice was also influenced by projects in the health domain where statistical processing of data is key.*
     *AVERT project: https://www.tcd.ie/medicine/thkc/avert/index.php/
  5. Research Question
     • From: “Can we generate datasets for a particular purpose ‘just in time’ that comply with informed consent?”
     • To: “If we have a DSD for a particular purpose, how can we create an executable R2RML mapping to generate a dataset that complies with that DSD’s structure?”
     • A solution could subsequently be extended to take policies into account so as to generate a mapping that is compliant, in other words “policy-aware”. To be reported.
  6. Approach
     • R2DQB (pronounced R-2-D-cube)
     [Figure: the R2DQB pipeline. Data Structure Definitions (dimensions, measures, attributes) are extended with references to tables, references to columns, transformation functions, … (1); a Mapping Engine produces an R2RML Mapping (2); an R2RML Processor produces a Data Cube Dataset (3); Validation of the dataset (4); Provenance Information is captured (5).]
  7. Approach, Step 1: annotating DSDs
     • May be done in a separate graph (separation of concerns).
     • We chose to reuse R2RML to assess the feasibility in this study. A bespoke vocabulary may be considered in the future.
     (The example is taken from the RDF Data Cube Vocabulary Recommendation.)
  8. The DSD:

         @base <http://www.example.org/> .

         <#refPeriod> a rdf:Property, qb:DimensionProperty;
             rdfs:subPropertyOf sdmx-dimension:refPeriod .
         <#refArea> a rdf:Property, qb:DimensionProperty;
             rdfs:subPropertyOf sdmx-dimension:refArea .
         <#lifeExpectancy> a rdf:Property, qb:MeasureProperty;
             rdfs:subPropertyOf sdmx-measure:obsValue;
             rdfs:range xsd:decimal .
         sdmx-dimension:sex a rdf:Property, qb:DimensionProperty .

         <#dsd-le> a qb:DataStructureDefinition;
             # The dimensions
             qb:component [ qb:dimension <#refArea> ];
             qb:component [ qb:dimension <#refPeriod> ];
             qb:component [ qb:dimension sdmx-dimension:sex ];
             # The measure(s)
             qb:component [ qb:measure <#lifeExpectancy> ] .

     The annotations:

         @base <http://www.example.org/> .

         <#refPeriod> rr:column "period" .
         <#refArea> rr:column "area" .
         <#lifeExpectancy> rr:column "lifeexpectancy" .
         sdmx-dimension:sex rr:column "sex" .
         <#dsd-le> rr:tableName "statssimple" .

     Note: prefixes omitted for brevity.
  9. Approach, Step 2: generating the R2RML mapping
     • Adopting a declarative approach with SPARQL CONSTRUCT queries:
       1. Generating a triples map for each DSD
       2. Generating a subject map for each DSD and a predicate object map for linking observations to the dataset. The subject map is based on dimensions, as observations are identified by those.
       3. Generating predicate object maps from measures
       4. Generating predicate object maps from dimensions
       5. Generating a link between dataset and DSD
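
The five steps above can be sketched outside SPARQL as well. The following is a minimal Python illustration, not the authors' implementation: the dictionary layout, the key names and the `generate_mapping` function are all assumptions made for this sketch, and only the IRIs, table name and column names come from the running example.

```python
# Hypothetical in-memory form of an annotated DSD (layout is illustrative only).
dsd = {
    "iri": "http://www.example.org/#dsd-le",
    "table": "statssimple",
    "dimensions": {
        "http://www.example.org/#refArea": "area",
        "http://www.example.org/#refPeriod": "period",
        "http://purl.org/linked-data/sdmx/2009/dimension#sex": "sex",
    },
    "measures": {"http://www.example.org/#lifeExpectancy": "lifeexpectancy"},
}


def generate_mapping(dsd):
    """Mirror the five generation steps using plain dictionaries."""
    # 1. One triples map per DSD, reading from the annotated table.
    triples_map = {"corresponds_with": dsd["iri"], "logical_table": dsd["table"]}
    # 2. Subject map built from the dimension columns, plus the link from
    #    each observation to its dataset.
    columns = list(dsd["dimensions"].values())
    triples_map["subject_map"] = {
        "class": "qb:Observation",
        "term_type": "rr:BlankNode",
        "template": "-".join("{" + c + "}" for c in columns),
    }
    poms = [{"predicate": "qb:dataSet", "object": dsd["table"]}]
    # 3 and 4. One predicate object map per measure and per dimension.
    for prop, column in {**dsd["measures"], **dsd["dimensions"]}.items():
        poms.append({"predicate": prop, "column": column})
    triples_map["predicate_object_maps"] = poms
    # 5. Link between the dataset and its DSD.
    triples_map["dataset_link"] = {"dataset": dsd["table"], "structure": dsd["iri"]}
    return triples_map


mapping = generate_mapping(dsd)
print(mapping["subject_map"]["template"])
```

The sketch keeps each step as a separate, commented stage so it can be compared with the corresponding CONSTRUCT query.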
  10. Constructing a subject map for observations and a predicate object map for linking observations to a dataset. All queries can be found in the paper.

          CONSTRUCT {
            ?tm rr:subjectMap [
              rr:class qb:Observation ;
              rr:termType rr:BlankNode ;
              rr:template ?x ;
            ] .
            ?tm rr:predicateObjectMap [
              rr:predicate qb:dataSet ;
              rr:object ?ds ;
            ] .
          } WHERE {
            ?tm pam:correspondsWith ?dsd ;
                rr:logicalTable [ rr:tableName ?t ] .
            BIND(IRI(?t) AS ?ds)
            {
              SELECT
                (CONCAT("{", GROUP_CONCAT(?c; SEPARATOR="}-{"), "}") AS ?x) {
                ?dsd qb:component ?component .
                { ?component qb:dimension [ rr:column ?c ] }
                UNION
                # OMITTED FOR CLARITY (SEE PAPER)
              } GROUP BY ?dsd
            }
          }
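
The sub-select in this query groups the annotated columns per DSD and joins them into a template string. A minimal Python sketch of that grouping follows; the `bindings` list and the `subject_templates` function are hypothetical stand-ins for the query's intermediate results. SPARQL does not define the order in which GROUP_CONCAT joins its values, so the sketch sorts the columns, which is an assumption made here only to get a deterministic result.

```python
from collections import defaultdict

# Hypothetical (dsd, column) bindings, as the inner pattern might return them.
bindings = [
    ("#dsd-le", "sex"),
    ("#dsd-le", "area"),
    ("#dsd-le", "period"),
]


def subject_templates(bindings):
    """Group columns per DSD and join them the way the query's CONCAT does."""
    columns = defaultdict(list)
    for dsd, column in bindings:
        columns[dsd].append(column)
    # Sorting is an assumption of this sketch, not something the query does.
    return {
        dsd: "{" + "}-{".join(sorted(cols)) + "}"
        for dsd, cols in columns.items()
    }


templates = subject_templates(bindings)
print(templates["#dsd-le"])
```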
  11. Approach, Step 2: generating the R2RML mapping. Result of the CONSTRUCT query on the previous slide:

          [ pam:correspondsWith <http://www.example.org/#dsd-le> ;
            rr:logicalTable [ rr:tableName "statssimple" ] ;
            rr:predicateObjectMap [
              rr:objectMap [ rr:column "area" ] ;
              rr:predicate <http://www.example.org/#refArea>
            ] ;
            # Omitted
            rr:predicateObjectMap [
              rr:object <statssimple> ;
              rr:predicate qb:dataSet
            ] ;
            rr:subjectMap [
              rr:class qb:Observation ;
              rr:template "{area}-{period}-{sex}" ;
              rr:termType rr:BlankNode
            ]
          ] .
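
When such a mapping is executed, the subject map's template is filled in once per row with that row's column values. The Python sketch below illustrates only that expansion; the rows and their values are invented for illustration, and `expand` is a hypothetical helper rather than part of any R2RML processor.

```python
import re

TEMPLATE = "{area}-{period}-{sex}"

# Hypothetical rows of the "statssimple" table; the values are made up.
rows = [
    {"area": "A1", "period": "2004", "sex": "F", "lifeexpectancy": "80.1"},
    {"area": "A1", "period": "2004", "sex": "M", "lifeexpectancy": "76.3"},
]


def expand(template, row):
    """Replace every {column} placeholder with that column's value."""
    return re.sub(r"\{([^}]+)\}", lambda m: str(row[m.group(1)]), template)


labels = [expand(TEMPLATE, row) for row in rows]
print(labels)
```

Because the template uses every dimension column, two rows receive the same label only if they agree on all dimensions.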
  12. Approach
     Step 3: Executing the R2RML mapping. This is straightforward. We used our own implementation of R2RML, called R2RML-F, which extends the specification with JavaScript functions.
     Step 4: Validating the generated RDF, using the integrity constraints specified by the RDF Data Cube Vocabulary Recommendation.
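
The Recommendation expresses its integrity constraints as SPARQL ASK queries over the generated RDF. As a rough illustration of what two of them check (every observation has a value for each dimension, and no two observations share the same dimension values), here is a Python sketch over plain dictionaries. The observations are invented, and `validate` is a hypothetical helper, not the validator used in the study.

```python
DIMENSIONS = ["area", "period", "sex"]

# Hypothetical observations, keyed by the column names of the running example.
observations = [
    {"area": "A1", "period": "2004", "sex": "F", "lifeexpectancy": 80.1},
    {"area": "A1", "period": "2004", "sex": "F", "lifeexpectancy": 79.9},
    {"area": "A2", "period": "2004", "lifeexpectancy": 77.0},
]


def validate(observations, dimensions):
    """Return (index, problem) pairs for two simple structural checks."""
    problems = []
    seen = set()
    for i, obs in enumerate(observations):
        missing = [d for d in dimensions if d not in obs]
        if missing:
            problems.append((i, "missing dimensions: " + ", ".join(missing)))
            continue
        key = tuple(obs[d] for d in dimensions)
        if key in seen:
            problems.append((i, "duplicate observation"))
        seen.add(key)
    return problems


report = validate(observations, DIMENSIONS)
print(report)
```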
  13. Approach, Step 5: Provenance information
     Keep track of activities and intermediate results with PROV-O. This will become key for a posteriori compliance analysis in future work.
     [Figure: class hierarchy under owl:Thing. PROV-O classes shown: prov:Entity, prov:Activity, prov:Agent, prov:SoftwareAgent. pam: classes shown: pam:DSD_Document, pam:R2RML_Mapping, pam:Validation_Report, pam:Generate_Mapping, pam:Execute_Mapping, pam:Validate_Dataset, pam:Mapping_Generator, pam:R2RML_Processor, pam:Validator.]
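
To make the idea of tracking activities and intermediate results concrete, the sketch below records one PROV-O style statement set per pipeline step. It is an assumption-laden illustration: the `ex:` identifiers and the `record_activity` helper are hypothetical, and only the property names (`prov:used`, `prov:wasGeneratedBy`, `prov:wasAssociatedWith`, `prov:endedAtTime`) are taken from PROV-O.

```python
from datetime import datetime, timezone


def record_activity(log, activity, agent, used, generated):
    """Append PROV-O style (subject, predicate, object) statements to the log."""
    log.append((activity, "prov:wasAssociatedWith", agent))
    for entity in used:
        log.append((activity, "prov:used", entity))
    for entity in generated:
        log.append((entity, "prov:wasGeneratedBy", activity))
    ended = datetime.now(timezone.utc).isoformat()
    log.append((activity, "prov:endedAtTime", ended))
    return log


# Hypothetical identifiers for the three automated steps of the pipeline.
log = []
record_activity(log, "ex:generate-1", "ex:mapping-generator",
                ["ex:dsd-document"], ["ex:r2rml-mapping"])
record_activity(log, "ex:execute-1", "ex:r2rml-processor",
                ["ex:r2rml-mapping"], ["ex:dataset"])
record_activity(log, "ex:validate-1", "ex:validator",
                ["ex:dataset"], ["ex:validation-report"])

for statement in log:
    print(statement)
```

Chaining the steps through shared entities is what would let a later analysis trace a validation report back to the DSD document it started from.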
  14. Features: mapping values onto URIs, and inclusion of data transformation functions
     • Mapping languages such as D2R had so-called translation tables, which mapped elements of one set to elements of another. These are ideal for mapping values to IRIs. R2RML has no such functionality. That is why we chose to adopt R2RML-F, where such “translation tables” can be written as a JavaScript function.
     • R2RML-F also allows transformation functions to be written when the underlying database technology has no support for them.
     Possibility to interlink with external datasets provided by R2RML.
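
A translation table is essentially a lookup from raw values to IRIs. In R2RML-F such a function would be written in JavaScript; the Python sketch below only illustrates the idea. The `SEX_CODES` table and `to_sex_iri` function are hypothetical, and the assumption that the source column holds the values "F" and "M" is made purely for illustration.

```python
# Hypothetical translation table from raw column values to SDMX code IRIs.
SDMX_CODE = "http://purl.org/linked-data/sdmx/2009/code#"
SEX_CODES = {
    "F": SDMX_CODE + "sex-F",
    "M": SDMX_CODE + "sex-M",
}


def to_sex_iri(value):
    """Return the IRI for a raw value, or None when no translation exists."""
    return SEX_CODES.get(value.strip().upper())


print(to_sex_iri("f"))
print(to_sex_iri("unknown"))
```

Returning `None` for unmapped values is a design choice of this sketch; a real mapping would need to decide whether to skip, reject or pass such values through.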
  15. Related Work
     Work on the generation of R2RML mappings is, to the best of our knowledge, limited.
     • Skjaeveland et al. 2015 proposed a method to generate an ontology, rules and a mapping from one description.
     • TabLinker and CSV2DataCube are two tools for generating QB graphs from Excel files (in a certain format) and CSV data, respectively.
     • The Open Cube Toolkit has a built-in R2RML-compliant D2R server, but it relies on a bespoke XML format that maps source and DSD.
  16. Conclusions
     • We argued that datasets are used for a purpose and that datasets should be built to suit that purpose, including any policies they should comply with.
     • Before we can do the latter, we investigated the former by trying to answer the question: “Can we generate an R2RML mapping from a data structure definition?”
     • The answer is yes, and we presented the R2DQB approach showing how. We strove for a declarative approach using SPARQL CONSTRUCT queries. A demonstration of the approach is presented in the paper.
  17. Future work
     • Tackling the problem of policy-aware mapping, which would complement research on post-hoc compliance analysis (e.g., Harsh et al. 2017). To be reported.
     • The Metadata Vocabulary for Tabular Data (W3C Rec.), a vocabulary for describing the “schemas” of tabular data, including constraints. This might be another representation worth considering.