SlideShare une entreprise Scribd logo
1  sur  24
Provenance of workflow data products

               Paolo Missier
            School of Computing,
           Newcastle University, UK




  TAPP’11 workshop
   Heraklion, Crete
   June 20-21, 2011
Workflow provenance
Taverna type system:
- strings + nested lists
- “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]


      Dataflow model:

      - data-driven execution
      - services activate when input is ready


      Raw provenance:
      A detailed trace of workflow execution
      - tasks performed, data transformations
      - inputs used, outputs produced


                                   Linköping, Sweden -- January 2010
Workflow provenance
Taverna type system:
- strings + nested lists
- “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]


      Dataflow model:

      - data-driven execution
      - services activate when input is ready


      Raw provenance:
      A detailed trace of workflow execution
      - tasks performed, data transformations
      - inputs used, outputs produced


                                   Linköping, Sweden -- January 2010
Workflow provenance
                                  Taverna type system:
                                  - strings + nested lists
                                  - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]

        lister                          Dataflow model:
                  get pathways
                   by genes1
                                        - data-driven execution
                                        - services activate when input is ready
                 merge pathways



     gene_id                            Raw provenance:
                                        A detailed trace of workflow execution
    concat gene pathway ids
                                        - tasks performed, data transformations
        output
                                        - inputs used, outputs produced

pathway_genes

                                                                     Linköping, Sweden -- January 2010
Implicit iteration in Taverna
                           c = [c1 ... ck]

a = [a1 ... an]   X1   X2     X3         b = [b1 ... bm]

                       P
                       Y




                                                           Linköping, Sweden -- January 2010
Implicit iteration in Taverna
                                    c = [c1 ... ck]
                  (0,1)        (1,1)     (0,1)
a = [a1 ... an]           X1    X2     X3         b = [b1 ... bm]

                                P
                                Y




                                                                    Linköping, Sweden -- January 2010
Implicit iteration in Taverna
                                    c = [c1 ... ck]
                  (0,1)        (1,1)     (0,1)
a = [a1 ... an]           X1    X2     X3         b = [b1 ... bm]

                                P
                                Y



       y = [ [y11 ... y1n],
               ...
             [ym1 ... ymn] ]




                                                                    Linköping, Sweden -- January 2010
Implicit iteration in Taverna
                                    c = [c1 ... ck]
                  (0,1)        (1,1)     (0,1)
a = [a1 ... an]           X1    X2     X3         b = [b1 ... bm]

                                P
                                Y



       y = [ [y11 ... y1n],
               ...
             [ym1 ... ymn] ]


How y is computed at P:
      let I = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ] // cross product

      I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ] // same product but with c interleaved

y = (map (map P) I’) = [(map P [ <a1,c, b1> ... <a1,c, bm>]), ...,
                        (map P [ <an,c, b1> ... <an,c, bm>]) ] =
                       [ [y11 ... y1n], ... [yn1 ... ynm] ]
                                                                    Linköping, Sweden -- January 2010
Implicit iteration in Taverna
                                    c = [c1 ... ck]
                  (0,1)        (1,1)     (0,1)
a = [a1 ... an]           X1    X2     X3         b = [b1 ... bm]

                                P
                                Y



       y = [ [y11 ... y1n],                              bottom line:
               ...                            yij depends only on values ai, c, bj
             [ym1 ... ymn] ]


How y is computed at P:
      let I = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ] // cross product

      I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ] // same product but with c interleaved

y = (map (map P) I’) = [(map P [ <a1,c, b1> ... <a1,c, bm>]), ...,
                        (map P [ <an,c, b1> ... <an,c, bm>]) ] =
                       [ [y11 ... y1n], ... [yn1 ... ynm] ]
                                                                         Linköping, Sweden -- January 2010
Fine-grained (precise?) provenance

                                        []


                            [0]              [1]        [2]




                    ...                        ...                           ...


                                                            []


                            [0]                             [1]                    [2]

                    [0,0]       [0,1]              [1,0]         [1,1]      [2,1]     [2,2]



                                  []


        [0]                       [1]                             [2]

    [0,0]   [0,1]         [1,0]        [1,1]           [2,1]        [2,2]




                                                       []


                          [0]                        [1]                      [2]

                [0,0]       [0,1]              [1,0]        [1,1]        [2,1]      [2,2]     4
The Open Provenance Model aims to capture the causal       make this dependency explicit, it is required to assert that
                                             Which provenance model/language?
 endencies between the artifacts, processes, and agents.
erefore, a provenance graph is defined as a directed
                                                            artifact A2 was derived from another artifact A1 . This
                                                            edge gives us a dataflow oriented view of provenance.
 ph, whose nodes are artifacts, processes and agents,           It is also recognized that we may not be aware of the
          • Let’s toedge represents Open depen- exact artifact generated bya starting point was
 icted in Figure 1. An
                          one at the
                                        a causal
                                                   Provenance Modela as another process P . Process P
d whose edges belong look of the following categories
                                                            some
                                                                   artifact that process P2 used, but that there
                                                                                                          1            2
 cy, between its source, denoting the effect, and its des-   is then said to have been triggered by P1 . In contrast to
ation, denoting the cause.                                  edge was derived from, a was triggered by edge allows for
                                                            a process oriented view of past executions to be adopted.
                                                            (Since these edges summarize some activities for which all
                                                        Core OPM not being exposed, it was felt that it was not
                                                            details are
                                                     •  agnostic wrtassociate a role with them.) types
                                                            necessary to
                                                                            Artifact, Processor
                                                                As far as conventions are concerned, we note that causal-
                                                     •  roles: edges use past tense to indicaterelationsrefer to past
                                                            ity annotations on binary that they
                                                            execution. Causal relationships are defined as follows.
                                                     • extensions by subclassing
                                                           Definition 4 (Causal Relationship). A causal relation-
                                                        • node represented by an arc and denotes the presence of a
                                                           ship is types,
                                                        • relation types between arc source of the arc (the effect)
                                                           causal dependency
                                                           and the destination of the
                                                                                      the
                                                                                          (the cause).

                                                               Five causal relationships are recognized: a process used an
                                                               artifact, an artifact was generated by a process, a process
                                                               was triggered by a process, an artifact was derived from
                                                               an artifact, and a process was controlled by an agent. By
                                                               means of annotations (see Section 8), we allow edges to be
                                                               further subtyped from these five categories.
                                                Formal (temporal) semantics hopefully available soon
ure 1: Edges in the Open Provenance Model: sources are effects,
  destinations causes
                                                                   Multiple notions of causal dependencies were consid-
                                                               ered for OPM. A very strong notion of causal dependency
 The first two edges express that a process used an arti-       would express that a set of entities was necessary and suffi-
t and that an artifact was generated by a process. Since       cient to explain the existence of another entity. It was felt
 rocess may have used several artifacts, it is important       that such a notion was not practical, since, with an open
 dentify the roles under which these artifacts were used.      world assumption, one could always argue that additional
      5
oles are denoted by letter ‘R’ in Figure 1.) Likewise,         factors may have influenced an outcome (e.g. electricity
                                                               was used, temperature range allowed computer to work,
OPM relations in the workflow context
                   Single User’s View (Alice)

                                                     Workflow Specification (WA)
                             in             out             in                   out
                       X             A                 Y               B                Z
                                                in
                                                                                              Process space
                                                           Data Binding & Enactment

                                                       Data Storage and Binding

                    X / [x1, x2]!                     SA                    Z / [z1, z2]!
                                                                                              Data space

                                                           Execution & Trace Capture

                                            read            Provenance Trace (TA)


                        x2
                             read
                                      a2
                                                write
                                                           y2          b2          z2
                                                                                              Provenance space
                                                      idep
                             x1            a1                   y1          b1          z1

                                                                ddep



                           ans(X) :- ddep*(X, z1).! Provenance Queries

Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable
     Collaborative Science. Proc.s 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
Data and invocation dependencies
    - read, write are natural observables for a workflow run
    - possible additional relations (recorded or inferred):


    •  invocation dependencies:
    Explicit or via:

     a2 depends on a1 because a1 has written data d, a2 has read d


    •  data dependencies:
    Explicit or via:
     d2 depends on d1
    ! because some actor invocation a read d1 prior to writing d2




7
Provenance queries
    •  Closure queries:
    •  operate on the transitive closure ddep* over ddep:




    But also:
    - queries on the workflow structure
    - queries on the data structures (e.g. collections)

    and importantly:
    use workflow graphs to justify/explain the provenance graph for one
    workflow run:
                            TA trace instance of WA:
                            h: TA ➔ WA homomorphism
                            h(x1 ➔ a1) = h(x2 ➔ a2) = X➔A,
                            h(a1 ➔ y1) = h(a2 ➔ y2) = A➔Y
                            ...
8
OPM extensions, principled




                                  core OPM / PIL(*)
                                        -used
                                  -wasGeneratedBy
                                  -wasDerivedFrom




9   (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
OPM extensions, principled




                                                       Processor
                           Data                          types
                           types
                                   core OPM / PIL(*)
                                         -used
                                   -wasGeneratedBy
                                   -wasDerivedFrom




                         (Additional      who
                          context)       when
                                         where
                                           ...




9   (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
OPM extensions, principled




                                                       Processor
                           Data                          types
                           types
                                   core OPM / PIL(*)
                                         -used
       Nested                      -wasGeneratedBy
       Ordered Lists               -wasDerivedFrom                 Operators on lists:
                                                                   - create, insert,
                                                                     delete, select...
                                                                   - map, fold

                         (Additional      who
                          context)       when
                                         where
                                           ...




9   (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
OPM extensions, principled

                        Graph models?
                                                                   Graph
    Relations?                                                     matching
    (sets of tuples)                                               queries?


                                                                     Relational
                                                       Processor     queries?
                           Data                          types
                           types
                                   core OPM / PIL(*)
                                         -used
       Nested                      -wasGeneratedBy
       Ordered Lists               -wasDerivedFrom                 Operators on lists:
                                                                   - create, insert,
                                                                     delete, select...
                                                                   - map, fold

                         (Additional      who
                          context)       when
                                         where
                                           ...




9   (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
Role                   used in relation:              Proposed extensions
                                                             Context of use
       element                Contained                       L Contained(element) x
       list
      Role                    Used in relation:
                              used                            P Used(list) L
                                                               Context of use
       position
      element                 Used i ned
                              C onta                          P Used(position) p
                                                               L C onta i ned (element) x
       term
      list                    Used
                              Used                            ListUsed (list) L
                                                               P comprehensions, see Sec. 4.2
       generator
      position                Used                             P Used (position) p
       filter
      term                    Used                             List comprehensions, see Sec. 4.2
       function               Used                            map, see Sec. 4.3
      generator
       operand
      filter
      function                Used                             map, see Sec. 4.3
      operand
                   Table 2: New roles for Used and Contained relations

                 Table 2: New roles
     Causal relation                        for Used and Contained relations
                                                 Example
     Contained(R) ⊆ [τ ] × A × [Int]          L Contained([i1 . . . in ]) x
                                              x was inserted into L at position [i1 . . . in ]
     Causal relation
     wasSelectedFrom(R) ⊆ [τ ] × [τ ] × [Int]  Example
                                              L wasSelectedFrom([i1 . . . in ]) L
     C onta i ned (R) [ ] × A × [Int]         L at onta i ned ([i 1. . . . nn ]) x selected from L
                                               L C position [i1 . . i i ] was
     wasRemovedFrom(R) ⊆ [τ ] × [τ ] × [Int]  L was inserted into L 1 . .position [i 1 . . . in ]
                                               x wasRemovedFrom([i at . in ]) L
     wasSelected F rom (R) [ ] × [ ] × [Int]  L at position [iF romi([i 1was in ]) L from L
                                               L wasSelected 1 . . . n ] . . . deleted
     wasSameAs ⊆ A × A                        L wasSameAs x[i 1 . . . in ] was selected from L
                                               L at position
     wasRemoved F rom (R) [ ] × [ ] × [Int] inferred (various F rom ([i 1 . . . in ]) L
                                               L wasRemoved contexts)
                                               L at position [i 1 . . . in ] was deleted from L
     wasSame A s     A × A                     L wasSame A s x
                      Table 3: Specialised OPM dependencycontexts)
                                               inferred (various relations.

10
OPM fragments for elementary operations
     empty list

                    wasGenerated by                                            insertion
              L                              !                                                       wasDerivedFrom


                                                                                     wasGenerated by                           used(list)
                                                                               L'                       P:ins                                  L
     unit                                                                                                                 us
                                                                                                                            ed
                                                                                                                                 (p




                                                                                                        used(element)
                                   contained                                                                                        os
                                                                                                                                      itio
                                                                                                                                          n)




                                                                                           co
                                                                                                                                               p




                                                                                            nt
                                                                                            ai
                                                                                              ne
              wasGenerated by                          used(element)




                                                                                                 d
        L                           P:unit                                 x

                                                                                                                   x

     selection                                                         p
                                                                                                                                               p
                                                        ition)
                                                    pos                        deletion                                                on)
                                               d   (
                                                                                                                               po  siti
                                           use                                                                              d(
                                                                                                                        use
                 wasGenerated by                   used(list)
        L'                         P:sel                               L             wasGeneratedBy                              used(list)
                                                                                L'                            P:del                            L



                              WasSelectedFrom
                                                                                                     WasRemovedFrom




11
templates for the two operations. Additionally, however, the following
                    Operator composition → graph composition
y holds whenever insertion is followed by selection, if no other intervening
on changes the state operators may translate into inferences on the graphs
       • Equalities on of the list:

                          sel (ins x L p) p = x

w translates into the following OPM inference rule:
      sameAs(L1,L) :-
          pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L,P1),
          pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1,P2).


                                      7




  12
templates for the two operations. Additionally, however, the following
                    Operator composition → graph composition
y holds whenever insertion is followed by selection, if no other intervening
on changes the state operators may translate into inferences on the graphs
       • Equalities on of the list:

                              sel (ins x L p) p = x

w translates into the following OPM inference rule:
      sameAs(L1,L) :-
          pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L,P1),
          pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1,P2).


                                               7used(position)                                p



                                                                              wasDerivedFrom


                   wasGeneratedBy           used(list)           wasGenerated by                        used(list)
              L'                    P:sel                 L'                       P:ins                                     L
                                                                                                   us
                                                                                                     ed
                                                                                                          (p




                                                                                   used(element)
                                                                                                            os
                             WasSelectedFrom                                                                     itio
                                                                    co                                                  n)
                                                                     nt
                                                                       ai
                                                                         ne
                                                                          d
                                       wasSam
                                               eAs
                                                                                              x

  12
nd p a position in the list L = map f L. The following equality follows from
he definition of map: Approach applies to map, fold (reduce)...

                                                                sel (map f x) p = f (sel x p)                                                                                                       (10)

This equality is captured by the inferred dependency (y wasSameAs y ) in the
                                                            WasSelectedFrom


 nhanced OPM template of Fig. 7 (inference rule omitted for simplicity).
                                                  wasGenerated by             used(list)        wasGenerated by                                   used(list)
                                 y                                  Q2:sel                 L'                                 P:map

                                     WasSelectedFrom                                                                                          us
                                                                                                                                                   ed                       L
                                                                                                                                                     (fu
                                                                                                                                                        nc
                                                                                                     use                                                   tio
                                                                                                         d(p                                                  n)
                                                                                                            o    siti
                                                                                                                     on)
                                                                                                                                                                            f
                                      wasSameAs




                        wasGenerated by                                used(list)                   wasGenerated by                                                             used(list)
        y                                                Q2:sel                            L'                                                        P:map
                                                                                                            io n)
                                                                                                        unct
                                                                                                   ed(f                                                                     p




                                                                                                                                             t)
                                                                                                us




                                                                                                                                     (lis
                                                                                                                                                                            us




                                                                                                                                    ed
                                                                                                                                                                                ed              L




                                                                                                                                   us
                                                                                                                                                                                  (fu
                                              wasGenerated by              used(operand)        wasGenerated by                                                                      nc
                                 y'                              R:apply                   x                use               Q1:sel              used(position)                        tio
                                                                                                                        d(p                                                                n)
                                                                                                                           osi
                                                                                                                                  tion
                                                                                                                                         )
                                                                                                                                                                                                f
            wasSameAs




                                                                                                                     n        )
                                                                                                           (fu nctio
                                                                                                    used                                                                                        p




                                                                                                                                                                      t)
                                                                                                                                                                      lis
                                                                                                                                                                e  d(
                                                                                                                                                             us
   13
Take-home message
     • OPM (PIL) a candidate starting point for workflow-based
       provenance
     • extension mechanisms are provided, but they must be used
       sensibly
       – data types
       – processor types

     • Provenance of nested ordered lists used as a prototypical example
       – semantics of provenance graphs and graph composition grounded in the
         semantics of lists



     • Can this approach be useful for other interesting data types?
       – sets of tuples / relational algebra




14

Contenu connexe

En vedette

Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Paolo Missier
 
Paper presentation @IPAW'08
Paper presentation @IPAW'08Paper presentation @IPAW'08
Paper presentation @IPAW'08Paolo Missier
 
Paper presentation @ SEBD '09
Paper presentation @ SEBD '09Paper presentation @ SEBD '09
Paper presentation @ SEBD '09Paolo Missier
 
Paper talk: Idcc 11
Paper talk: Idcc 11  Paper talk: Idcc 11
Paper talk: Idcc 11 Paolo Missier
 
Paper presentation: Taverna, reloaded
Paper presentation: Taverna, reloadedPaper presentation: Taverna, reloaded
Paper presentation: Taverna, reloadedPaolo Missier
 
Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)Paolo Missier
 
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud Paolo Missier
 

En vedette (9)

Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07Invited talk @Roma La Sapienza, April '07
Invited talk @Roma La Sapienza, April '07
 
Paper presentation @IPAW'08
Paper presentation @IPAW'08Paper presentation @IPAW'08
Paper presentation @IPAW'08
 
Tapp11 presentation
Tapp11 presentationTapp11 presentation
Tapp11 presentation
 
Paper presentation @ SEBD '09
Paper presentation @ SEBD '09Paper presentation @ SEBD '09
Paper presentation @ SEBD '09
 
Paper talk: Idcc 11
Paper talk: Idcc 11  Paper talk: Idcc 11
Paper talk: Idcc 11
 
Paper presentation: Taverna, reloaded
Paper presentation: Taverna, reloadedPaper presentation: Taverna, reloaded
Paper presentation: Taverna, reloaded
 
Охота на Работу!EXCLUSIVE
Охота на Работу!EXCLUSIVEОхота на Работу!EXCLUSIVE
Охота на Работу!EXCLUSIVE
 
Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)Repro pdiff-talk (invited, Humboldt University, Berlin)
Repro pdiff-talk (invited, Humboldt University, Berlin)
 
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A CloudScalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
 

Plus de Paolo Missier

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsPaolo Missier
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Paolo Missier
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...Paolo Missier
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...Paolo Missier
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Paolo Missier
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewPaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...Paolo Missier
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Paolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcarePaolo Missier
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data SciencePaolo Missier
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Paolo Missier
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...Paolo Missier
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...Paolo Missier
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...Paolo Missier
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...Paolo Missier
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff UniversityPaolo Missier
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Paolo Missier
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...Paolo Missier
 

Plus de Paolo Missier (20)

Towards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance recordsTowards explanations for Data-Centric AI using provenance records
Towards explanations for Data-Centric AI using provenance records
 
Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...Interpretable and robust hospital readmission predictions from Electronic Hea...
Interpretable and robust hospital readmission predictions from Electronic Hea...
 
Data-centric AI and the convergence of data and model engineering: opportunit...
Data-centric AI and the convergence of data and model engineering:opportunit...Data-centric AI and the convergence of data and model engineering:opportunit...
Data-centric AI and the convergence of data and model engineering: opportunit...
 
Realising the potential of Health Data Science: opportunities and challenges ...
Realising the potential of Health Data Science:opportunities and challenges ...Realising the potential of Health Data Science:opportunities and challenges ...
Realising the potential of Health Data Science: opportunities and challenges ...
 
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
 
A Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overviewA Data-centric perspective on Data-driven healthcare: a short overview
A Data-centric perspective on Data-driven healthcare: a short overview
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Tracking trajectories of multiple long-term conditions using dynamic patient...
Tracking trajectories of  multiple long-term conditions using dynamic patient...Tracking trajectories of  multiple long-term conditions using dynamic patient...
Tracking trajectories of multiple long-term conditions using dynamic patient...
 
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcareDigital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
 
Data Provenance for Data Science
Data Provenance for Data ScienceData Provenance for Data Science
Data Provenance for Data Science
 
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
 
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...Quo vadis, provenancer? Cui prodest? our own trajectory: provenance of data...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
 
Data Science for (Health) Science: tales from a challenging front line, and h...
Data Science for (Health) Science:tales from a challenging front line, and h...Data Science for (Health) Science:tales from a challenging front line, and h...
Data Science for (Health) Science: tales from a challenging front line, and h...
 
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Analytics of analytics pipelines:from optimising re-execution to general Dat...Analytics of analytics pipelines:from optimising re-execution to general Dat...
Analytics of analytics pipelines: from optimising re-execution to general Dat...
 
ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...ReComp: optimising the re-execution of analytics pipelines in response to cha...
ReComp: optimising the re-execution of analytics pipelines in response to cha...
 
ReComp, the complete story: an invited talk at Cardiff University
ReComp, the complete story:  an invited talk at Cardiff UniversityReComp, the complete story:  an invited talk at Cardiff University
ReComp, the complete story: an invited talk at Cardiff University
 
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
Efficient Re-computation of Big Data Analytics Processes in the Presence of C...
 
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...Decentralized, Trust-less Marketplacefor Brokered IoT Data Tradingusing Blo...
Decentralized, Trust-less Marketplace for Brokered IoT Data Trading using Blo...
 

provenance of lists - TAPP'11 Mini-tutorial

  • 1. Provenance of workflow data products Paolo Missier School of Computing, Newcastle University, UK TAPP’11 workshop Heraklion, Crete June 20-21, 2011
  • 2. Workflow provenance Taverna type system: - strings + nested lists - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ] Dataflow model: - data-driven execution - services activate when input is ready Raw provenance: A detailed trace of workflow execution - tasks performed, data transformations - inputs used, outputs produced Linköping, Sweden -- January 2010
  • 3. Workflow provenance Taverna type system: - strings + nested lists - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ] Dataflow model: - data-driven execution - services activate when input is ready Raw provenance: A detailed trace of workflow execution - tasks performed, data transformations - inputs used, outputs produced Linköping, Sweden -- January 2010
  • 4. Workflow provenance Taverna type system: - strings + nested lists - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ] lister Dataflow model: get pathways by genes1 - data-driven execution - services activate when input is ready merge pathways gene_id Raw provenance: A detailed trace of workflow execution concat gene pathway ids - tasks performed, data transformations output - inputs used, outputs produced pathway_genes Linköping, Sweden -- January 2010
  • 5. Implicit iteration in Taverna c = [c1 ... ck] a = [a1 ... an] X1 X2 X3 b = [b1 ... bm] P Y Linköping, Sweden -- January 2010
  • 6. Implicit iteration in Taverna c = [c1 ... ck] (0,1) (1,1) (0,1) a = [a1 ... an] X1 X2 X3 b = [b1 ... bm] P Y Linköping, Sweden -- January 2010
  • 7. Implicit iteration in Taverna c = [c1 ... ck] (0,1) (1,1) (0,1) a = [a1 ... an] X1 X2 X3 b = [b1 ... bm] P Y y = [ [y11 ... y1n], ... [ym1 ... ymn] ] Linköping, Sweden -- January 2010
  • 8. Implicit iteration in Taverna c = [c1 ... ck] (0,1) (1,1) (0,1) a = [a1 ... an] X1 X2 X3 b = [b1 ... bm] P Y y = [ [y11 ... y1n], ... [ym1 ... ymn] ] How y is computed at P: let I = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ] // cross product I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ] // same product but with c interleaved y = (map (map P) I’) = [(map P [ <a1,c, b1> ... <a1,c, bm>]), ..., (map P [ <an,c, b1> ... <an,c, bm>]) ] = [ [y11 ... y1n], ... [yn1 ... ynm] ] Linköping, Sweden -- January 2010
  • 9. Implicit iteration in Taverna c = [c1 ... ck] (0,1) (1,1) (0,1) a = [a1 ... an] X1 X2 X3 b = [b1 ... bm] P Y y = [ [y11 ... y1n], bottom line: ... yij depends only on values ai, c, bj [ym1 ... ymn] ] How y is computed at P: let I = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ] // cross product I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ] // same product but with c interleaved y = (map (map P) I’) = [(map P [ <a1,c, b1> ... <a1,c, bm>]), ..., (map P [ <an,c, b1> ... <an,c, bm>]) ] = [ [y11 ... y1n], ... [yn1 ... ynm] ] Linköping, Sweden -- January 2010
  • 10. Fine-grained (precise?) provenance [] [0] [1] [2] ... ... ... [] [0] [1] [2] [0,0] [0,1] [1,0] [1,1] [2,1] [2,2] [] [0] [1] [2] [0,0] [0,1] [1,0] [1,1] [2,1] [2,2] [] [0] [1] [2] [0,0] [0,1] [1,0] [1,1] [2,1] [2,2] 4
  • 11. The Open Provenance Model aims to capture the causal make this dependency explicit, it is required to assert that Which provenance model/language? endencies between the artifacts, processes, and agents. erefore, a provenance graph is defined as a directed artifact A2 was derived from another artifact A1 . This edge gives us a dataflow oriented view of provenance. ph, whose nodes are artifacts, processes and agents, It is also recognized that we may not be aware of the • Let’s toedge represents Open depen- exact artifact generated bya starting point was icted in Figure 1. An one at the a causal Provenance Modela as another process P . Process P d whose edges belong look of the following categories some artifact that process P2 used, but that there 1 2 cy, between its source, denoting the effect, and its des- is then said to have been triggered by P1 . In contrast to ation, denoting the cause. edge was derived from, a was triggered by edge allows for a process oriented view of past executions to be adopted. (Since these edges summarize some activities for which all Core OPM not being exposed, it was felt that it was not details are • agnostic wrtassociate a role with them.) types necessary to Artifact, Processor As far as conventions are concerned, we note that causal- • roles: edges use past tense to indicaterelationsrefer to past ity annotations on binary that they execution. Causal relationships are defined as follows. • extensions by subclassing Definition 4 (Causal Relationship). A causal relation- • node represented by an arc and denotes the presence of a ship is types, • relation types between arc source of the arc (the effect) causal dependency and the destination of the the (the cause). Five causal relationships are recognized: a process used an artifact, an artifact was generated by a process, a process was triggered by a process, an artifact was derived from an artifact, and a process was controlled by an agent. By means of annotations (see Section 8), we allow edges to be further subtyped from these five categories. Formal (temporal) semantics hopefully available soon ure 1: Edges in the Open Provenance Model: sources are effects, destinations causes Multiple notions of causal dependencies were consid- ered for OPM. A very strong notion of causal dependency The first two edges express that a process used an arti- would express that a set of entities was necessary and suffi- t and that an artifact was generated by a process. Since cient to explain the existence of another entity. It was felt rocess may have used several artifacts, it is important that such a notion was not practical, since, with an open dentify the roles under which these artifacts were used. world assumption, one could always argue that additional 5 oles are denoted by letter ‘R’ in Figure 1.) Likewise, factors may have influenced an outcome (e.g. electricity was used, temperature range allowed computer to work,
  • 12. OPM relations in the workflow context Single User’s View (Alice) Workflow Specification (WA) in out in out X A Y B Z in Process space Data Binding & Enactment Data Storage and Binding X / [x1, x2]! SA Z / [z1, z2]! Data space Execution & Trace Capture read Provenance Trace (TA) x2 read a2 write y2 b2 z2 Provenance space idep x1 a1 y1 b1 z1 ddep ans(X) :- ddep*(X, z1).! Provenance Queries Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc.s 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
  • 13. Data and invocation dependencies - read, write are natural observables for a workflow run - possible additional relations (recorded or inferred): •  invocation dependencies: Explicit or via: a2 depends on a1 because a1 has written data d, a2 has read d •  data dependencies: Explicit or via: d2 depends on d1 ! because some actor invocation a read d1 prior to writing d2 7
  • 14. Provenance queries •  Closure queries: •  operate on the transitive closure ddep* over ddep: But also: - queries on the workflow structure - queries on the data structures (e.g. collections) and importantly: use workflow graphs to justify/explain the provenance graph for one workflow run: TA trace instance of WA: h: TA ➔ WA homomorphism h(x1 ➔ a1) = h(x2 ➔ a2) = X➔A, h(a1 ➔ y1) = h(a2 ➔ y2) = A➔Y ... 8
  • 15. OPM extensions, principled core OPM / PIL(*) -used -wasGeneratedBy -wasDerivedFrom 9 (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
  • 16. OPM extensions, principled Processor Data types types core OPM / PIL(*) -used -wasGeneratedBy -wasDerivedFrom (Additional who context) when where ... 9 (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
  • 17. OPM extensions, principled Processor Data types types core OPM / PIL(*) -used Nested -wasGeneratedBy Ordered Lists -wasDerivedFrom Operators on lists: - create, insert, delete, select... - map, fold (Additional who context) when where ... 9 (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
  • 18. OPM extensions, principled Graph models? Graph Relations? matching (sets of tuples) queries? Relational Processor queries? Data types types core OPM / PIL(*) -used Nested -wasGeneratedBy Ordered Lists -wasDerivedFrom Operators on lists: - create, insert, delete, select... - map, fold (Additional who context) when where ... 9 (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
  • 19. Role used in relation: Proposed extensions Context of use element Contained L Contained(element) x list Role Used in relation: used P Used(list) L Context of use position element Used i ned C onta P Used(position) p L C onta i ned (element) x term list Used Used ListUsed (list) L P comprehensions, see Sec. 4.2 generator position Used P Used (position) p filter term Used List comprehensions, see Sec. 4.2 function Used map, see Sec. 4.3 generator operand filter function Used map, see Sec. 4.3 operand Table 2: New roles for Used and Contained relations Table 2: New roles Causal relation for Used and Contained relations Example Contained(R) ⊆ [τ ] × A × [Int] L Contained([i1 . . . in ]) x x was inserted into L at position [i1 . . . in ] Causal relation wasSelectedFrom(R) ⊆ [τ ] × [τ ] × [Int] Example L wasSelectedFrom([i1 . . . in ]) L C onta i ned (R) [ ] × A × [Int] L at onta i ned ([i 1. . . . nn ]) x selected from L L C position [i1 . . i i ] was wasRemovedFrom(R) ⊆ [τ ] × [τ ] × [Int] L was inserted into L 1 . .position [i 1 . . . in ] x wasRemovedFrom([i at . in ]) L wasSelected F rom (R) [ ] × [ ] × [Int] L at position [iF romi([i 1was in ]) L from L L wasSelected 1 . . . n ] . . . deleted wasSameAs ⊆ A × A L wasSameAs x[i 1 . . . in ] was selected from L L at position wasRemoved F rom (R) [ ] × [ ] × [Int] inferred (various F rom ([i 1 . . . in ]) L L wasRemoved contexts) L at position [i 1 . . . in ] was deleted from L wasSame A s A × A L wasSame A s x Table 3: Specialised OPM dependencycontexts) inferred (various relations. 10
  • 20. OPM fragments for elementary operations empty list wasGenerated by insertion L ! wasDerivedFrom wasGenerated by used(list) L' P:ins L unit us ed (p used(element) contained os itio n) co p nt ai ne wasGenerated by used(element) d L P:unit x x selection p p ition) pos deletion on) d ( po siti use d( use wasGenerated by used(list) L' P:sel L wasGeneratedBy used(list) L' P:del L WasSelectedFrom WasRemovedFrom 11
  • 21. templates for the two operations. Additionally, however, the following Operator composition → graph composition y holds whenever insertion is followed by selection, if no other intervening on changes the state operators may translate into inferences on the graphs • Equalities on of the list: sel (ins x L p) p = x w translates into the following OPM inference rule: sameAs(L1,L) :- pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L,P1), pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1,P2). 7 12
  • 22. templates for the two operations. Additionally, however, the following Operator composition → graph composition y holds whenever insertion is followed by selection, if no other intervening on changes the state operators may translate into inferences on the graphs • Equalities on of the list: sel (ins x L p) p = x w translates into the following OPM inference rule: sameAs(L1,L) :- pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L,P1), pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1,P2). 7used(position) p wasDerivedFrom wasGeneratedBy used(list) wasGenerated by used(list) L' P:sel L' P:ins L us ed (p used(element) os WasSelectedFrom itio co n) nt ai ne d wasSam eAs x 12
  • 23. nd p a position in the list L = map f L. The following equality follows from he definition of map: Approach applies to map, fold (reduce)... sel (map f x) p = f (sel x p) (10) This equality is captured by the inferred dependency (y wasSameAs y ) in the WasSelectedFrom nhanced OPM template of Fig. 7 (inference rule omitted for simplicity). wasGenerated by used(list) wasGenerated by used(list) y Q2:sel L' P:map WasSelectedFrom us ed L (fu nc use tio d(p n) o siti on) f wasSameAs wasGenerated by used(list) wasGenerated by used(list) y Q2:sel L' P:map io n) unct ed(f p t) us (lis us ed ed L us (fu wasGenerated by used(operand) wasGenerated by nc y' R:apply x use Q1:sel used(position) tio d(p n) osi tion ) f wasSameAs n ) (fu nctio used p t) lis e d( us 13
  • 24. Take-home message • OPM (PIL) a candidate starting point for workflow-based provenance • extension mechanisms are provided, but they must be used sensibly – data types – processor types • Provenance of nested ordered lists used as a prototypical example – semantics of provenance graphs and graph composition grounded in the semantics of lists • Can this approach be useful for other interesting data types? – sets of tuples / relational algebra 14

Notes de l'éditeur

  1. \n
  2. \n
  3. \n
  4. \n
  5. \n
  6. \n
  7. \n
  8. \n
  9. \n
  10. \n
  11. \n
  12. \n
  13. \n
  14. \n
  15. \n
  16. \n
  17. \n
  18. \n
  19. \n
  20. \n
  21. \n
  22. \n
  23. \n
  24. \n
  25. \n
  26. \n
  27. \n
  28. \n
  29. \n
  30. \n
  31. \n
  32. \n
  33. \n
  34. \n