Digital Enterprise Research Institute                                          www.deri.ie




            Capturing interactive data transformation
             operations using provenance workflows

             Tope Omitola, Andre Freitas, Edward Curry, Sean
             O'Riain, Nicholas Gibbins and Nigel Shadbolt



  SWPM Workshop 28.05.2012, Herakleion, Crete


 Copyright 2009 Digital Enterprise Research Institute. All rights reserved.
Outline




           Motivation
           Interactive data transformations (IDTs)
           IDT & Provenance
           Modelling IDTs
           Provenance Representation
           Provenance Capture
           Case Study
           Conclusion
Motivation




           Dataspaces:
                 High number of heterogeneous data sources
                 Complex data transformation environment
                 Need for both repeatable data transformations and once-off
                  transformations
           Traditional ETL approaches for data transformation/integration:
                 Based on scripting/programming
                 Focus on repeatable data transformation processes
Interactive Data Transformation (IDTs)




        Based on user-interaction paradigms for creating data
         transformations
        Maps GUI elements to data transformation operations
        Instant feedback after each iteration
        Complementary to existing ETL tools
        Lowers the barrier for non-programmers to perform data
         transformations (reduces programming effort)
        Example platforms: Google Refine, Potter's Wheel,
         Wrangler
Interactive Data Transformation (IDTs)
Challenges




           How to model IDTs?

           Facilitating the reuse of previous IDTs

           Representing IDTs
           [Slide callout: Provenance]

           Making IDT platforms provenance-aware

           Enabling transportability across IDT and ETL
            platforms
IDT & Provenance




           Provenance supports representation of interactive
            data transformations
           Output: a provenance descriptor which shows the
            relationship between the inputs, the outputs, and
            the applied transformation operations
           Both retrospective and prospective provenance
IDT




           IDT model
           Formal model (Algebra for IDT)
           Provenance representation
           Provenance capture of IDTs
IDT Model: Core Elements




           Schema and instance data
           Set of predefined operations
           GUI elements mapping to predefined operations
           User actions
                 Operation selection
                 Parameter selection
                 Operation composition (workflow)
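
The core elements listed above can be sketched in a few lines of Python. This is a minimal illustration, not the model used by any particular IDT platform: the operation names (`trim`, `uppercase`), the GUI element identifiers (`menu.trim_whitespace`, `menu.to_uppercase`), and the helper functions are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, str]  # instance data: one record, keyed by schema column
Operation = Callable[[List[Row], Dict[str, str]], List[Row]]


def trim_column(rows: List[Row], params: Dict[str, str]) -> List[Row]:
    """Strip surrounding whitespace in one column; input rows are not mutated."""
    col = params["column"]
    return [{**r, col: r[col].strip()} for r in rows]


def uppercase_column(rows: List[Row], params: Dict[str, str]) -> List[Row]:
    """Upper-case one column; input rows are not mutated."""
    col = params["column"]
    return [{**r, col: r[col].upper()} for r in rows]


# Set of predefined operations (hypothetical names).
OPERATIONS: Dict[str, Operation] = {
    "trim": trim_column,
    "uppercase": uppercase_column,
}

# GUI elements mapping to predefined operations (hypothetical identifiers).
GUI_TO_OPERATION: Dict[str, str] = {
    "menu.trim_whitespace": "trim",
    "menu.to_uppercase": "uppercase",
}

Step = Tuple[str, Dict[str, str]]


def select(gui_element: str, **params: str) -> Step:
    """User action: operation selection plus parameter selection."""
    return (GUI_TO_OPERATION[gui_element], params)


def run_workflow(rows: List[Row], steps: List[Step]) -> List[Row]:
    """Operation composition: apply the selected steps in order."""
    for name, params in steps:
        rows = OPERATIONS[name](rows, params)
    return rows


data = [{"name": "  ada "}, {"name": "grace"}]
workflow = [
    select("menu.trim_whitespace", column="name"),
    select("menu.to_uppercase", column="name"),
]
result = run_workflow(data, workflow)
```

Keeping the workflow as plain data (a list of operation names and parameters) is what makes it possible to record and replay it later.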
IDT Model
Formalizing the mapping from IDT to Provenance




           Definition 1: A provenance-based interactive data
            transformation engine consists of a set of
            transformations (or activities) on a set of datasets,
            generating outputs in the form of other datasets or
            events, which may trigger further transformations

           Definition 2: An interactive data transformation
            event consists of the input dataset, the output
            dataset(s), the applied transformation function,
            and the time the transformation took place
Formalizing the mapping from IDT to Provenance




           Definition 3: A run is a function from time to
            dataset(s) and the transformation applied to those
            dataset(s)

           Definition 4: A trace is a sequence of pairs, each
            consisting of a run and the time the run was made
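
One possible reading of Definitions 2 to 4 as Python types is sketched below. The class and function names are hypothetical, timestamps are assumed to be plain integers, and a dataset is assumed to be a tuple of cell values; the slides do not fix any of these choices.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Dataset = Tuple[str, ...]  # assumption: a dataset is a tuple of cell values


@dataclass(frozen=True)
class TransformationEvent:
    """Definition 2: input, output dataset(s), applied function, and time."""
    input_dataset: Dataset
    output_datasets: Tuple[Dataset, ...]
    transformation: str  # name of the applied transformation function
    time: int            # assumption: integer timestamp


# Definition 3: a run maps a time to the dataset and the transformation
# applied to it at that time (modelled here as a dictionary).
Run = Dict[int, Tuple[Dataset, str]]


def build_run(events: List[TransformationEvent]) -> Run:
    return {e.time: (e.input_dataset, e.transformation) for e in events}


# Definition 4: a trace is a sequence of (run, time the run was made) pairs.
Trace = List[Tuple[Run, int]]


def record(trace: Trace, run: Run, made_at: int) -> Trace:
    """Return a new trace with a snapshot of the run appended."""
    return trace + [(dict(run), made_at)]


events = [
    TransformationEvent(("  a ", "b"), (("a", "b"),), "trim", 1),
    TransformationEvent(("a", "b"), (("A", "B"),), "uppercase", 2),
]
run = build_run(events)
trace = record([], run, made_at=3)
```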
Provenance Representation




           Proposed in "Representing Interoperable Provenance
            Descriptions for ETL Workflows"

           Three-layered provenance model:
                 Open Provenance Model Vocabulary Layer
                 Cogs ETL Provenance Vocabulary
                 Domain-Specific Model Layer


           Linked Data standards
Provenance Capture Layers
Provenance Event-Capture Sequence Flow
Case study




        Implementation on the Google Refine (GR) platform
        Example descriptor:

   # The rdf:, rdfs:, xsd:, opmv: and cogs: prefixes are not declared on the slide;
   # the opmv: and cogs: namespace IRIs below are assumed.
   @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
   @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
   @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
   @prefix opmv: <http://purl.org/net/opmv/ns#> .
   @prefix cogs: <http://vocab.deri.ie/cogs#> .
   @prefix grf:  <http://127.0.0.1:3333/project/1402144365904/> .

   # Process, with a mapping to the actual program
   grf:MassCellChange-1092380975 rdf:type opmv:Process, cogs:ColumnOperation, cogs:Transformation ;
       cogs:operationName "MassCellChange"^^xsd:string ;
       cogs:programUsed "com.google.refine.operations.cell.MassEditOperation"^^xsd:string ;
       rdfs:label "Mass edit 1 cells in column ==List of winners=="^^xsd:string .

   # Input artifact (a "/" inside a prefixed local name is escaped as "\/" in Turtle 1.1)
   grf:MassCellChange-1092380975\/1_0 rdf:type opmv:Artifact ;
       rdfs:label "* '''1955 [[Meena Kumari]]'[[Parineeta (1953 film)|Parineeta]]''''' as '''Lolita'''"^^xsd:string .

   # Output artifact
   grf:MassCellChange-1092380975\/1_1 rdf:type opmv:Artifact ;
       rdfs:label "* '''John Wayne'''"^^xsd:string .

   # Workflow structure
   grf:MassCellChange-1092380975\/1_1 opmv:wasDerivedFrom grf:MassCellChange-1092380975\/1_0 .
   grf:MassCellChange-1092380975 opmv:used grf:MassCellChange-1092380975\/1_0 .
   grf:MassCellChange-1092380975\/1_1 opmv:wasGeneratedBy grf:MassCellChange-1092380975 .
   grf:MassCellChange-1092380975\/1_1 opmv:wasGeneratedAt "2011-11-16T11:02:14"^^xsd:dateTime .
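
The workflow-structure statements in the descriptor above follow a fixed pattern per operation, so they can be generated mechanically. The sketch below is an illustration only, not the code of the Google Refine extension: the function name, the `ex:` identifiers, and the timestamp are hypothetical, and the output is a list of Turtle-style statement strings rather than a parsed RDF graph.

```python
from typing import List


def workflow_triples(process: str, inputs: List[str], outputs: List[str],
                     generated_at: str) -> List[str]:
    """Emit the OPMV workflow-structure statements for one operation.

    `process`, `inputs`, and `outputs` are already-prefixed names; the
    opmv: property names are the ones used in the example descriptor.
    """
    triples = []
    # The process used every input artifact.
    for artifact in inputs:
        triples.append(f"{process} opmv:used {artifact} .")
    for out in outputs:
        # Each output artifact is derived from every input artifact ...
        for artifact in inputs:
            triples.append(f"{out} opmv:wasDerivedFrom {artifact} .")
        # ... and was generated by the process at the given time.
        triples.append(f"{out} opmv:wasGeneratedBy {process} .")
        triples.append(
            f'{out} opmv:wasGeneratedAt "{generated_at}"^^xsd:dateTime .'
        )
    return triples


# Hypothetical identifiers and timestamp, for illustration only.
lines = workflow_triples(
    process="ex:op1",
    inputs=["ex:op1_in0"],
    outputs=["ex:op1_out0"],
    generated_at="2012-05-28T10:00:00",
)
```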
Conclusion




           The proposed approach has low impact on the
            existing IDT process
           The provenance representation supports different
            data models
           Preliminary implementation of a Google Refine
            provenance extension
