SlideShare une entreprise Scribd logo
1  sur  19
Télécharger pour lire hors ligne
Detecting Duplicate Records in
 Scientific Workflow Results


Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1
               1University of Manchester

                2University of Newcastle
Scientific Workflows
                  Scientific workflows are increasingly
                  used by scientists as a means for
                  specifying and enacting their
                  experiments.
                  They tend to be data intensive

                  The data sets obtained as a result of
                  their enactment can be stored in
                  public repositories to be queried,
                  analyzed and used to feed the
                  execution of other workflows.
2   IPAW 2012
Duplicates in Workflow Results

      The datasets obtained as a result of workflow execution often
       contain duplicates.
      As a result:
         The analysis and interpretation of workflow results may become
          tedious.
         The presence of duplicates also unnecessarily increases the size
          of workflow results.



3   IPAW 2012
Duplicate Record Detection
      Research in duplicate record detection has been active for
        more than three decades.
          Elmagarmid et al., 2007 conducted a comprehensive survey of
            the topics.

      We do not aim to design yet another algorithm for
        comparing and matching records.
      Rather, we investigate how provenance traces produced as a
        result of workflow executions can be used to guide the
        detection of duplicate records in workflow results.
    Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Du-plicate record detection: A survey. IEEE
    Trans. Knowl. Data Eng., 19(1):1–16,2007.
4   IPAW 2012
Outline

      Data-Driven Workflows and Provenance Trace


      A method for guiding duplicates detection in workflow
       results based on provenance traces.

      Preliminary validation using real-world workflows.




5   IPAW 2012
Preliminaries: Data-Driven Workflows
      A data driven workflow can be defined as a directed graph:

                         wf = N, E
      A node represent an analysis operation, which has a set of
       input and output parameters.
                      op, Iop , Oop  ∈ N
      The edges are dataflow dependencies:

                  op, o, op , i ∈ E
6   IPAW 2012
Preliminaries: Provenance Trace
    The execution of workflows gives rise to provenance trace,
    which we capture using two relations.
      Transformation: to specify that the execution of an
    operation took as input a given ordered set of records and
    generated another ordered set of records.
      op, o1 , ro1 , . . . , op, om , rom     op, i1 , ri1 , . . . , op, in , rin
                                   OutBop     InBop

      Transfer: to specify transfer of records along the edges of
    the workflow.                op , i , r   op, o, r


7   IPAW 2012
Outline

      Data-Driven Workflows and Provenance Trace


      A method for guiding duplicates detection in workflow
       results based on provenance traces.

      Preliminary validation using real-world workflows.




8   IPAW 2012
Provenance-Guided Detection of
    Duplicates: Approach
    To guide the detection of duplicates in workflow results we
      explore the following fact:

      An operation that is known to be deterministic produces
       identical output bindings given the same input binding.

    deterministic op      OutBop    InBop   T    OutBop    InBop    T
                                                    id OutBop , OutBop




9   IPAW 2012
Provenance-Guided Detection of
     Duplicates: Example
     i                                  o        i’                     o’
                 IdentifyProtein                       GetGOTerm

     Ri                                 Ro       R’i                    R’o
     1.  The set of records Ri that are bound to the input parameter
         of the starting operation are compared to identify duplicate
         records.

          The result of this phase is a partition of disjoint sets of
          identical records.
                                   Ri       R1
                                             i         Rn
                                                        i
10   IPAW 2012
Provenance-Guided Detection of
      Duplicates: Example
     i                                     o          i’                             o’
                  IdentifyProtein                               GetGOTerm

     Ri                                   Ro         R’i                             R’o
      2.  The sets of records Ro, R’i and R’o are partitioned into sets
          of identical records based on the partitioning of Ri. For
          example:               1          n
                                    Ro     Ro              Ro
     Ri
      o   ro Ro s.t. ri Ri ,
                         i          IdentifyProtein, o, ro        IdentifyProtein, i, ri



11    IPAW 2012
Provenance-Guided Detection of
     Duplicates: Example
       In the example just described, the operations that compose
        the workflow have exactly one input and one output
        parameter.
          However, the algorithm presented in the paper supports
           operations with multiple input and output parameters.

       Notice that we assumes that the analysis operations that
        compose the workflow are deterministic. This is not always
        the case.
          This raises the question as to how to determine that a given
           operation is deterministic.
12   IPAW 2012
Verifying The Determinism of Analysis
     Operations
       To verify the determinism of operations, we use an approach
       whereby operations are probed.
     1.  Given an operation op, we select examples values that can
           be used by the inputs of op, and invoke op using those
           values multiple times.
     2.     If op produces identical output values given identical input
           values, then it is likely to be deterministic, otherwise, it is
           not deterministic.



13   IPAW 2012
Collection-Based Workflows
     To support duplicates detection in collection based workflows we
     need to be able to:
       Identify when two collections are identical
        Two collections Ri and Rj are identical if they are of the same size and
        there is a bijective mapping:
                                map : Ri          Rj
        that maps each record ri in Ri to a record rj in Rj such that ri and rj are
        identical
       Identify duplicates records between two collections that
        are known to be identical
        Identify a bijective mapping that maps every ri in Ri to an identical
        rj in Rj.
14   IPAW 2012
Outline

       Data-Driven Workflows and Provenance Trace


       A method for guiding duplicates detection in workflow
        results based on provenance traces.

       Preliminary validation using real-world workflows.




15   IPAW 2012
Validation
       The method that we presented in this paper can be applied when the operations
        are deterministic.

       To have an insight on the degree to which the operations that compose the
        workflows are deterministic, we run en experiments

       Datasets: 15 bioinformatics workflows that cover a wide range of analyzes,
        namely biological pathway analysis, sequence alignment, molecular interaction
        analysis

       Process: To identify which of these operations are deterministic, we run each
        of them 3 times using example values that were found either within
        myExperiment or Biocatalogue


16   IPAW 2012
Validation
       After manual analysis of the results, it transpires that 5 operations out of
         the 151 operations that compose the wokflows are not deterministic.

       Note that many of the operations that we analyzed access and use
         underlying data sources in their computation. Therefore updates to such
         sources may break the determinism assumption (Chirigati and Freire,
         2012).

       This suggests that the determinism holds within a window of time
         during which the underlying sources remain the same, and that there is a
         need for monitoring techniques to identify such windows.
     Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical
     Approach . IPAW, 2012.
17   IPAW 2012
Conclusions and Future Work

      we described a method that can be used to guide duplicate
       detection in workflow results.

       Monitoring the determinism of analysis operations


       Extending the method to support duplicate detection across
        the results of different workflows.



18   IPAW 2012
Detecting Duplicate Records in
 Scientific Workflow Results


Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1
               1University of Manchester

                2University of Newcastle

Contenu connexe

Similaire à Detecting Duplicate Records in Scientific Workflow Results

An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...Daniele Dell'Aglio
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScienceKhalid Belhajjame
 
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs Lisong Guo
 
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...ESEM 2014
 
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsSherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsDacong (Tony) Yan
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Anubhav Jain
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and modelsmyGrid team
 
2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)Stian Soiland-Reyes
 
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)Stian Soiland-Reyes
 
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...Jan Aerts
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholdermhaendel
 
HyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologiesHyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologiesMichel Dumontier
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...dgarijo
 

Similaire à Detecting Duplicate Records in Scientific Workflow Results (20)

2013-01-17 Research Object
2013-01-17 Research Object2013-01-17 Research Object
2013-01-17 Research Object
 
An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...An Ontological Formulation and an OPM profile for Causality in Planning Appli...
An Ontological Formulation and an OPM profile for Causality in Planning Appli...
 
Privacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eSciencePrivacy-Preserving Data Analysis Workflows for eScience
Privacy-Preserving Data Analysis Workflows for eScience
 
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs SherLog:  Error Diagnosis Through Connecting Clues from Run-time Logs
SherLog: Error Diagnosis Through Connecting Clues from Run-time Logs
 
ISMB Workshop 2014
ISMB Workshop 2014ISMB Workshop 2014
ISMB Workshop 2014
 
Ikc 2015
Ikc 2015Ikc 2015
Ikc 2015
 
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
178 - A replicated study on duplicate detection: Using Apache Lucene to searc...
 
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time LogsSherLog: Error Diagnosis by Connecting Clues from Run-time Logs
SherLog: Error Diagnosis by Connecting Clues from Run-time Logs
 
Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...Natural Language Processing for Data Extraction and Synthesizability Predicti...
Natural Language Processing for Data Extraction and Synthesizability Predicti...
 
Reproducibility 1
Reproducibility 1Reproducibility 1
Reproducibility 1
 
NETTAB 2013
NETTAB 2013NETTAB 2013
NETTAB 2013
 
The beauty of workflows and models
The beauty of workflows and modelsThe beauty of workflows and models
The beauty of workflows and models
 
2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)2013 06-24 Wf4Ever: Annotating research objects (PDF)
2013 06-24 Wf4Ever: Annotating research objects (PDF)
 
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)2013 06-24 Wf4Ever: Annotating research objects (PPTX)
2013 06-24 Wf4Ever: Annotating research objects (PPTX)
 
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
S Carbon - AmiGO2: document-oriented approach to ontology software and escapi...
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 
Equivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholderEquivalence is in the (ID) of the beholder
Equivalence is in the (ID) of the beholder
 
HyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologiesHyQue: Evaluating scientific Hypotheses using semantic web technologies
HyQue: Evaluating scientific Hypotheses using semantic web technologies
 
icpr_2012
icpr_2012icpr_2012
icpr_2012
 
From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...From Scientific Workflows to Research Objects: Publication and Abstraction of...
From Scientific Workflows to Research Objects: Publication and Abstraction of...
 

Plus de Khalid Belhajjame

Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsKhalid Belhajjame
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsKhalid Belhajjame
 
A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsA Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsKhalid Belhajjame
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsKhalid Belhajjame
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Khalid Belhajjame
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...Khalid Belhajjame
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in SepublicaKhalid Belhajjame
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenanceKhalid Belhajjame
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Khalid Belhajjame
 

Plus de Khalid Belhajjame (18)

Provenance witha purpose
Provenance witha purposeProvenance witha purpose
Provenance witha purpose
 
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based WorkflowsLineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
Lineage-Preserving Anonymization of the Provenance of Collection-Based Workflows
 
Irpb workshop
Irpb workshopIrpb workshop
Irpb workshop
 
Aussois bda-mdd-2018
Aussois bda-mdd-2018Aussois bda-mdd-2018
Aussois bda-mdd-2018
 
Converting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objectsConverting scripts into reproducible workflow research objects
Converting scripts into reproducible workflow research objects
 
A Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its ExtensionsA Sightseeing Tour of Prov and Some of its Extensions
A Sightseeing Tour of Prov and Some of its Extensions
 
Anr cair meeting feb 2016
Anr cair meeting feb 2016Anr cair meeting feb 2016
Anr cair meeting feb 2016
 
Linking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scriptsLinking the prospective and retrospective provenance of scripts
Linking the prospective and retrospective provenance of scripts
 
Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014Introduction to ProvBench @ Provenance Week 2014
Introduction to ProvBench @ Provenance Week 2014
 
Edbt2014 talk
Edbt2014 talkEdbt2014 talk
Edbt2014 talk
 
Credible workshop
Credible workshopCredible workshop
Credible workshop
 
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...Small Is Beautiful:  Summarizing Scientific Workflows  Using Semantic Annotat...
Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotat...
 
Why Workflows Break
Why Workflows BreakWhy Workflows Break
Why Workflows Break
 
D-prov use-case
D-prov use-caseD-prov use-case
D-prov use-case
 
Research Object Model in Sepublica
Research Object Model in SepublicaResearch Object Model in Sepublica
Research Object Model in Sepublica
 
Case studyworkshoponprovenance
Case studyworkshoponprovenanceCase studyworkshoponprovenance
Case studyworkshoponprovenance
 
Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)Intégration incrémentale de données (Valenciennes juin 2010)
Intégration incrémentale de données (Valenciennes juin 2010)
 
Edbt 2010, Belhajjame
Edbt 2010, BelhajjameEdbt 2010, Belhajjame
Edbt 2010, Belhajjame
 

Dernier

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsRoshan Dwivedi
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 

Dernier (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 

Detecting Duplicate Records in Scientific Workflow Results

  • 1. Detecting Duplicate Records in Scientific Workflow Results Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1 1University of Manchester 2University of Newcastle
  • 2. Scientific Workflows   Scientific workflows are increasingly used by scientists as a means for specifying and enacting their experiments.   They tend to be data intensive   The data sets obtained as a result of their enactment can be stored in public repositories to be queried, analyzed and used to feed the execution of other workflows. 2 IPAW 2012
  • 3. Duplicates in Workflow Results   The datasets obtained as a result of workflow execution often contain duplicates.   As a result:   The analysis and interpretation of workflow results may become tedious.   The presence of duplicates also unnecessarily increases the size of workflow results. 3 IPAW 2012
  • 4. Duplicate Record Detection   Research in duplicate record detection has been active for more than three decades.   Elmagarmid et al., 2007 conducted a comprehensive survey of the topics.   We do not aim to design yet another algorithm for comparing and matching records.   Rather, we investigate how provenance traces produced as a result of workflow executions can be used to guide the detection of duplicate records in workflow results. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Du-plicate record detection: A survey. IEEE Trans. Knowl. Data Eng., 19(1):1–16,2007. 4 IPAW 2012
  • 5. Outline   Data-Driven Workflows and Provenance Trace   A method for guiding duplicates detection in workflow results based on provenance traces.   Preliminary validation using real-world workflows. 5 IPAW 2012
  • 6. Preliminaries: Data-Driven Workflows   A data driven workflow can be defined as a directed graph: wf = N, E   A node represent an analysis operation, which has a set of input and output parameters. op, Iop , Oop ∈ N   The edges are dataflow dependencies: op, o, op , i ∈ E 6 IPAW 2012
  • 7. Preliminaries: Provenance Trace The execution of workflows gives rise to provenance trace, which we capture using two relations.   Transformation: to specify that the execution of an operation took as input a given ordered set of records and generated another ordered set of records. op, o1 , ro1 , . . . , op, om , rom op, i1 , ri1 , . . . , op, in , rin OutBop InBop   Transfer: to specify transfer of records along the edges of the workflow. op , i , r op, o, r 7 IPAW 2012
  • 8. Outline   Data-Driven Workflows and Provenance Trace   A method for guiding duplicates detection in workflow results based on provenance traces.   Preliminary validation using real-world workflows. 8 IPAW 2012
  • 9. Provenance-Guided Detection of Duplicates: Approach To guide the detection of duplicates in workflow results we explore the following fact:   An operation that is known to be deterministic produces identical output bindings given the same input binding. deterministic op OutBop InBop T OutBop InBop T id OutBop , OutBop 9 IPAW 2012
  • 10. Provenance-Guided Detection of Duplicates: Example i o i’ o’ IdentifyProtein GetGOTerm Ri Ro R’i R’o 1.  The set of records Ri that are bound to the input parameter of the starting operation are compared to identify duplicate records. The result of this phase is a partition of disjoint sets of identical records. Ri R1 i Rn i 10 IPAW 2012
  • 11. Provenance-Guided Detection of Duplicates: Example i o i’ o’ IdentifyProtein GetGOTerm Ri Ro R’i R’o 2.  The sets of records Ro, R’i and R’o are partitioned into sets of identical records based on the partitioning of Ri. For example: 1 n Ro Ro Ro Ri o ro Ro s.t. ri Ri , i IdentifyProtein, o, ro IdentifyProtein, i, ri 11 IPAW 2012
  • 12. Provenance-Guided Detection of Duplicates: Example   In the example just described, the operations that compose the workflow have exactly one input and one output parameter.   However, the algorithm presented in the paper supports operations with multiple input and output parameters.   Notice that we assumes that the analysis operations that compose the workflow are deterministic. This is not always the case.   This raises the question as to how to determine that a given operation is deterministic. 12 IPAW 2012
  • 13. Verifying The Determinism of Analysis Operations To verify the determinism of operations, we use an approach whereby operations are probed. 1.  Given an operation op, we select examples values that can be used by the inputs of op, and invoke op using those values multiple times. 2.  If op produces identical output values given identical input values, then it is likely to be deterministic, otherwise, it is not deterministic. 13 IPAW 2012
  • 14. Collection-Based Workflows To support duplicates detection in collection based workflows we need to be able to:   Identify when two collections are identical Two collections Ri and Rj are identical if they are of the same size and there is a bijective mapping: map : Ri Rj that maps each record ri in Ri to a record rj in Rj such that ri and rj are identical   Identify duplicates records between two collections that are known to be identical Identify a bijective mapping that maps every ri in Ri to an identical rj in Rj. 14 IPAW 2012
  • 15. Outline   Data-Driven Workflows and Provenance Trace   A method for guiding duplicates detection in workflow results based on provenance traces.   Preliminary validation using real-world workflows. 15 IPAW 2012
  • 16. Validation   The method that we presented in this paper can be applied when the operations are deterministic.   To have an insight on the degree to which the operations that compose the workflows are deterministic, we run en experiments   Datasets: 15 bioinformatics workflows that cover a wide range of analyzes, namely biological pathway analysis, sequence alignment, molecular interaction analysis   Process: To identify which of these operations are deterministic, we run each of them 3 times using example values that were found either within myExperiment or Biocatalogue 16 IPAW 2012
  • 17. Validation   After manual analysis of the results, it transpires that 5 operations out of the 151 operations that compose the wokflows are not deterministic.   Note that many of the operations that we analyzed access and use underlying data sources in their computation. Therefore updates to such sources may break the determinism assumption (Chirigati and Freire, 2012).   This suggests that the determinism holds within a window of time during which the underlying sources remain the same, and that there is a need for monitoring techniques to identify such windows. Fernando Chirigati and Juliana Freire. Towards Integrating Workflow and Database Provenance: A Practical Approach . IPAW, 2012. 17 IPAW 2012
  • 18. Conclusions and Future Work  we described a method that can be used to guide duplicate detection in workflow results.   Monitoring the determinism of analysis operations   Extending the method to support duplicate detection across the results of different workflows. 18 IPAW 2012
  • 19. Detecting Duplicate Records in Scientific Workflow Results Khalid Belhajjame1, Paolo Missier2, and Carole A. Goble1 1University of Manchester 2University of Newcastle