Incremental workflow improvement
                    through analysis of its data provenance


                           Paolo Missier
                        School of Computing,
                       Newcastle University, UK


     In collaboration with the eScience Central group at Newcastle:
Prof. Paul Watson, Dr. Simon Woodman, Dr Hugo Hiden, Dr. Jacek Cala




   TAPP’11 workshop
    Heraklion, Crete
    June 20-21, 2011
Motivation
    Despite the growing momentum around provenance as a premier
                            form of metadata,
    successful exploitation stories trail advances in models and technology




    • We are getting very good at recording provenance of data:
      – multiple data models (OPM, Provenir, Janus, Karma, PML,...)
      – provenance-aware system / service architectures ...
          • PASS, Karma
      – ... and workflows
          • Kepler, Taverna, Galaxy, VisTrails,...

    • But, what are systems/applications really doing with it?
      – deliver value to users? e.g., in e-science, on the Web
         • scientific reproducibility, quality, trust
      – optimize system analysis, performance?
         • enable partial re-run of resource-intensive processes
Broad goal and specific research context
     A systematic study on methods and applications of mining / learning
         techniques applied to large corpora of provenance metadata


    • An opportunistic starting point:
      optimization of resource-intensive, repetitive, provenance-aware
      e-science workflows

      Cloud-based workflows come with a clear cost model
       – valuable testbed readily available:
           • real e-science applications
           • large provenance graphs
       – dynamic optimization requires many runs

    • Reference framework:
      adaptive, self-managing software systems
         – systems that can dynamically adapt their behaviour in response to
           changing conditions in the inputs or in their environment [1,2]

    [Figure: excerpt captioned "Figure 1. A standard autonomic computing monitor-analyze-plan-execute (MAPE) loop. The autonomic manager continually monitors sensor readings, analyzes and plans management decisions, and executes the decisions via effectors." The diagram shows an autonomic manager (Monitor, Analyze, Plan, Execute around a shared Knowledge component) connected through sensor readings and effectors to a managed element, with a block labelled Reinforcement Learning on top.]

    [1] IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM, 2011.
    [2] Huebscher, M. C., and McCann, J. A. A survey of autonomic computing - degrees, models, and
        applications. ACM Computing Surveys (CSUR) 40, 3 (2008), 1–28.
Research
    Hypothesis: Provenance traces recorded for past runs of a workflow can be
    used to make future runs more efficient

    Approach: Add adaptive control to an existing workflow, with provenance
    analysis at its core → new recommender task





    Applicable for instance to a “Panel of Experts” (PoE) pattern:
     – N experts are activated on the same inputs; the best outputs are
       selected

    Provenance is used for incremental correlation of the inputs to the
    experts’ success rate
      - Provenance of run i indicates which experts performed well on
        their input
      - Adaptively pruning the process space: on run i+1, use provenance
        of output computed by runs 1..i to select/prioritize invocation
        of experts

    [Figure: Panel of Experts architecture, with components: input queue, recommender, expert select, Expert 1 ... Expert n, results merge, output queue, online provenance analysis, provenance case base.]
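
    The Panel of Experts pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the eScience Central implementation: `run_panel`, `priority`, `max_experts`, and the toy experts are hypothetical names, and "best output" is assumed to mean the highest value of a caller-supplied scoring function.

```python
from typing import Callable, Dict, List, Optional, Tuple


def run_panel(
    experts: Dict[str, Callable[[list], object]],
    inputs: list,
    score: Callable[[object], float],
    priority: Optional[List[str]] = None,
    max_experts: Optional[int] = None,
) -> Tuple[str, object, Dict[str, float]]:
    """Run experts on the same inputs and keep the best-scoring output.

    priority: optional ordering of expert names (e.g. derived from provenance).
    max_experts: optional cap; experts beyond it in the ordering are pruned.
    """
    order = priority if priority is not None else list(experts)
    if max_experts is not None:
        order = order[:max_experts]
    outputs = {name: experts[name](inputs) for name in order}
    scores = {name: score(out) for name, out in outputs.items()}
    best = max(scores, key=scores.get)
    return best, outputs[best], scores


# Toy experts and a toy quality criterion (closeness to 3.0), both made up.
experts = {
    "mean": lambda xs: sum(xs) / len(xs),
    "first": lambda xs: xs[0],
    "max": lambda xs: max(xs),
}
closeness = lambda y: -abs(y - 3.0)

best_name, best_output, all_scores = run_panel(experts, [1.0, 2.0, 6.0], closeness)

# Pruned run: only the two highest-priority experts are invoked.
pruned_name, _, pruned_scores = run_panel(
    experts, [1.0, 2.0, 6.0], closeness, priority=["first", "max", "mean"], max_experts=2
)
```

    The pruned call shows the trade-off the slide is after: fewer expert invocations, at the risk of missing the expert that would have produced the best output.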



Case study: DiscoveryBus workflow
    • QSAR: Quantitative structure-activity relationships
      – at the forefront of Chemical Engineering research
    • OpenQsar project (http://www.openqsar.com):
      – Establish correlations between the structure of molecular compounds and some of
        their associated activities (toxicity, solubility, etc.)

    • DiscoveryBus: a workflow implementation of OpenQSAR
       – eScience Central cloud-based workflow system
       – each dataset DS(i) is a family of structurally homogeneous molecules
       – Feature Selection extracts a few relevant features from DS(i)
       – Each learning scheme MB1...MBh generates a different predictive
         model for molecular activity

    [Figure: DiscoveryBus dataflow: DS(i), Feature Selection, FS(i)1 ... FS(i)k, Model Discovery, MB1 ... MBh, M(i)11 ... M(i)1h, ..., M(i)k1 ... M(i)kh.]

    Repetitive and resource-intensive:
    Workflow execution repeated over about 10K different input datasets
Role of provenance in DiscoveryBus
    • Connection to Panel of Experts:
      – Experts: model builders MBh
      – Expert outcome: quality of the generated model (predictive power, stability)

    • Optimization goal:
      – to prioritize invocation of the MBh based on their past performance
        on inputs similar to FS(i)j

    [Figure: DiscoveryBus dataflow: DS(i), Feature Selection, FS(i)1 ... FS(i)k, Model Discovery, MB1 ... MBh, M(i)11 ... M(i)kh.]

      Provenance correlates the quality of output models M(i)jh to intermediate
      feature sets FS(i)j:

      \[
        M^{(i)}_{jh} \xrightarrow{\text{wasGeneratedBy}} MB_h
          \xrightarrow{\text{used}} FS^{(i)}_j
          \xrightarrow{\text{wasDerivedFrom}} DS^{(i)}
      \]
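
    The provenance chain above can be followed programmatically to find, for any output model, the builder that produced it and the feature set it was built from. A minimal sketch, assuming the provenance graph is available as (subject, relation, object) triples; the relation names come from the slide, while the node identifiers, the "<builder>/<call id>" naming scheme, and the helper names `follow` and `trace_model` are hypothetical.

```python
# Provenance as (subject, relation, object) triples. The relation names follow
# the slide; every node identifier here is a made-up example.
EDGES = [
    ("M_11", "wasGeneratedBy", "MB1/call-a"),
    ("MB1/call-a", "used", "FS_1"),
    ("M_12", "wasGeneratedBy", "MB2/call-b"),
    ("MB2/call-b", "used", "FS_1"),
    ("FS_1", "wasDerivedFrom", "DS"),
]


def follow(edges, node, relation):
    """Return every object linked from `node` by an edge labelled `relation`."""
    return [o for s, r, o in edges if s == node and r == relation]


def trace_model(edges, model):
    """Walk model -> builder invocation -> feature set -> dataset.

    Raises ValueError if any step does not have exactly one outgoing edge.
    """
    (invocation,) = follow(edges, model, "wasGeneratedBy")
    (feature_set,) = follow(edges, invocation, "used")
    (dataset,) = follow(edges, feature_set, "wasDerivedFrom")
    builder = invocation.split("/")[0]  # assumed naming: "<builder>/<call id>"
    return builder, feature_set, dataset


traced = trace_model(EDGES, "M_12")
```

    Pairing each traced (builder, feature set) with the measured quality of the model gives the observations a recommender can learn from.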


The recommender
    • One Quality Matrix QMFS is associated with each Feature Set FS
    • QMFS[MBh] encodes the success history of model builder MBh in the workflow
      every time FS is used as input:

      \[
        QM_{FS}[MB_h] = (G, B)
      \]

    • G (resp. B): number of times MBh has been observed to produce a good
      (resp. bad) model when applied to input FS
    • QMFS is updated every time FS appears in the provenance graph
    • The builders’ historical success rate G induces a dynamic partial order
      on the MBh

    • For each run, the Recommender:
       • intercepts FS in the flow
       • returns partial order from QMFS

    [Figure: Panel of Experts architecture, with components: input queue, recommender, expert select, Expert 1 ... Expert n, results merge, output queue, online provenance analysis, provenance case base.]
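
    The quality matrix and the recommender steps above can be sketched as follows. This is an illustration under assumptions, not the authors' code: the class and method names are hypothetical, the ordering is assumed to use the ratio G / (G + B), and the abstention rule (`min_observations`) is a made-up stand-in for "sufficiently populated". Feature sets are represented as frozensets so they can serve as dictionary keys.

```python
from collections import defaultdict


class Recommender:
    """One quality matrix per feature set: qm[fs][builder] = [G, B]."""

    def __init__(self, min_observations=1):
        # Assumed abstention rule: fewer recorded outcomes than this means
        # the matrix is "not sufficiently populated".
        self.min_observations = min_observations
        self.qm = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def update(self, fs, builder, good):
        """Record one outcome read from provenance: good or bad model."""
        self.qm[fs][builder][0 if good else 1] += 1

    def recommend(self, fs):
        """Return builders ordered by G / (G + B), or None to abstain."""
        row = self.qm.get(fs)
        if row is None or sum(g + b for g, b in row.values()) < self.min_observations:
            return None

        def success_rate(builder):
            g, b = row[builder]
            return g / (g + b)

        # Ties in success rate are broken by name only to keep output stable.
        return sorted(row, key=lambda mb: (-success_rate(mb), mb))


rec = Recommender(min_observations=2)
fs = frozenset({"feature_a", "feature_b"})  # made-up feature names
for builder, good in [("MB1", True), ("MB1", False), ("MB2", True), ("MB3", False)]:
    rec.update(fs, builder, good)

ranking = rec.recommend(fs)
unseen = rec.recommend(frozenset({"feature_z"}))
```

    Builders with equal success rates are genuinely unordered, which is why the slide speaks of a partial order; the sort above imposes an arbitrary but stable order among them.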


Early experimental results
    [Figure: chart of recommender accuracy over the history of runs; its axis labels, legend, and tick values did not survive extraction.]

    • max_attempts is the accuracy/resources trade-off parameter
       – max_attempts = n: only the first n out of H model builders are invoked
    • Chart shows net accuracy over the entire available history of runs
       – success rate / number of recommendations given
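
    The role of max_attempts can be illustrated with a small calculation. The slide does not define a successful recommendation precisely, so this sketch assumes one: a recommendation succeeds if at least one of the first max_attempts builders in the ranking actually produced a good model. Abstentions are assumed not to count as recommendations. The function name and the example history are hypothetical.

```python
def net_accuracy(history, max_attempts):
    """Fraction of given recommendations that succeeded.

    history: list of (ranking, good_builders) pairs, one per run.
      ranking is the recommended order, or None if the recommender abstained;
      good_builders is the set of builders that actually yielded a good model.
    """
    given = 0
    hits = 0
    for ranking, good_builders in history:
        if ranking is None:
            continue  # abstentions are not counted as recommendations
        given += 1
        invoked = ranking[:max_attempts]  # only the first n builders are run
        if any(mb in good_builders for mb in invoked):
            hits += 1
    return hits / given if given else None


# Made-up history of four runs, one of which is an abstention.
history = [
    (["MB2", "MB1", "MB3"], {"MB1"}),
    (["MB1", "MB2", "MB3"], {"MB1", "MB3"}),
    (None, {"MB2"}),
    (["MB3", "MB2", "MB1"], {"MB1"}),
]
accuracy_by_n = {n: net_accuracy(history, n) for n in (1, 2, 3)}
```

    On this toy history the accuracy rises from 1/3 to 2/3 to 1 as max_attempts grows from 1 to 3, at the cost of invoking more builders per run.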




Limitations and ongoing work

    • Approach suffers when FS space is sparse
       – most FS seen only once

    • Recommender abstains when QMFS not sufficiently populated

    • This is the main hurdle to successful optimization

    [Chart; axis labels not recoverable]

    • Strategy: increase space density by clustering the FS
      • needs a distance metric over the set of FS
      • hierarchical clustering should provide a way to experiment with
        accuracy/efficiency trade-offs
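
The clustering strategy could look roughly like the sketch below. It rests on assumptions that the slides do not confirm: each FS is represented, purely for illustration, as a set of string tokens; Jaccard distance stands in for the distance metric still to be chosen; and single-linkage merging with a distance `threshold` stands in for the hierarchical clustering. All function names are hypothetical.

```python
from typing import FrozenSet, List


def jaccard_distance(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Assumed distance metric: 1 - |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster(items: List[FrozenSet[str]],
            threshold: float) -> List[List[FrozenSet[str]]]:
    """Single-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops when the closest
    pair is further apart than threshold.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters
```

Raising `threshold` yields fewer, denser clusters (more history per cluster, coarser matching); lowering it keeps FS apart (finer matching, sparser space). That is one way to experiment with the accuracy/efficiency trade-off.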




Short summary
     • What happens when you try to apply mining/learning techniques to
       large corpora of provenance metadata?
       – any interesting applications / use cases?
       – which techniques apply?
       – are there significant research challenges, or an orchard of low-hanging fruit?


     • privacy in provenance mining
     • which provenance models lend themselves well to mining
     • ...





Tapp11 presentation

  • 1. Incremental workflow improvement through analysis of its data provenance Paolo Missier School of Computing, Newcastle University, UK In collaboration with the eScience Central group at Newcastle: Prof. Paul Watson, Dr. Simon Woodman, Dr Hugo Hiden, Dr. Jacek Cala TAPP’11 workshop Heraklion, Crete June 20-21, 2011
  • 2. Motivation Despite the growing momentum around provenance as a premiere form of metadata, success exploitation stories trail models and technology advances 2
  • 3. Motivation Despite the growing momentum around provenance as a premiere form of metadata, success exploitation stories trail models and technology advances • We are getting very good at recording provenance of data: – multiple data models (OPM, Provenir, Janus, Karma, PML,...) – provenance-aware system / service architectures ... • PASS, Karma – ... and workflows • Kepler, Taverna, Galaxy, VisTrails,... • But, what are systems/applications really doing with it? – deliver value to users? i.e., in e-science, in the Web • scientific reproducibility, quality, trust – optimize system analysis, performance? • enable partial re-run of resource-intensive processes 2
  • 4. Broad goal and specific research context
      A systematic study on methods and applications of mining / learning techniques applied to large corpora of provenance metadata
        – Cloud-based workflows come with a clear cost model
        – valuable testbed readily available:
          • real e-science applications
          • large provenance graphs
        – dynamic optimization requires many runs
      • Reference framework: adaptive, self-managing software systems [1,2]
        – systems that can dynamically adapt their behaviour in response to changing conditions in the inputs or in their environment
      [Figure 1: a standard autonomic computing monitor-analyze-plan-execute (MAPE) loop. The autonomic manager continually monitors sensor readings, analyzes and plans management decisions, and executes the decisions via effectors on the managed element. The MAPE components have access to a central knowledge base.]
      [1] IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM, 2011.
      [2] Huebscher, M. C., and McCann, J. A. A survey of autonomic computing - degrees, models, and applications. ACM Computing Surveys (CSUR) 40, 3 (2008), 1–28.
  • 5. Broad goal and specific research context
      A systematic study on methods and applications of mining / learning techniques applied to large corpora of provenance metadata
      • An opportunistic starting point: optimization of resource-intensive, repetitive, provenance-aware e-science workflows
        – Cloud-based workflows come with a clear cost model
        – valuable testbed readily available:
          • real e-science applications
          • large provenance graphs
        – dynamic optimization requires many runs
      • Reference framework: adaptive, self-managing software systems [1,2]
        – systems that can dynamically adapt their behaviour in response to changing conditions in the inputs or in their environment
      [Figure 1: a standard autonomic computing monitor-analyze-plan-execute (MAPE) loop. The autonomic manager continually monitors sensor readings, analyzes and plans management decisions, and executes the decisions via effectors on the managed element. The MAPE components have access to a central knowledge base.]
      [1] IBM. An architectural blueprint for autonomic computing. Tech. rep., IBM, 2011.
      [2] Huebscher, M. C., and McCann, J. A. A survey of autonomic computing - degrees, models, and applications. ACM Computing Surveys (CSUR) 40, 3 (2008), 1–28.
  • 6. Research Hypothesis: Provenance traces recorded for past runs of a workflow can be used to make future runs more efficient.
      Approach: Add adaptive control to an existing workflow, with provenance analysis at its core → new recommender task
  • 7. Research Hypothesis: Provenance traces recorded for past runs of a workflow can be used to make future runs more efficient.
      Approach: Add adaptive control to an existing workflow, with provenance analysis at its core → new recommender task
      Applicable for instance to a “Panel of Experts” (PoE) pattern:
        – N experts are activated on the same inputs, best outputs are selected
      Provenance used for incremental correlation of the inputs to the experts’ success rate:
        - Provenance of run i indicates which experts performed well on their input
        - Adaptively pruning the process space: on run i+1, use provenance of results computed by runs 1..n to select/prioritize invocation of experts
      [Figure: Panel of Experts with recommender. Labels: input queue, expert select, Expert 1 ... Expert n, results merge, output queue, recommender, online provenance analysis, provenance case base.]
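The Panel of Experts pattern on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `run_panel` function, the toy experts, and the `score` function are hypothetical names introduced here, not part of the workflow system described in the talk.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: an expert maps an input value to an output value.
Expert = Callable[[float], float]


def run_panel(experts: Dict[str, Expert], x: float,
              score: Callable[[float], float]) -> Tuple[str, float]:
    """Activate every expert on the same input and keep the best-scoring output."""
    outputs = {name: expert(x) for name, expert in experts.items()}
    best = max(outputs, key=lambda name: score(outputs[name]))
    return best, outputs[best]


# Toy experts (assumptions for illustration only).
experts = {
    "double": lambda v: 2 * v,
    "square": lambda v: v * v,
    "negate": lambda v: -v,
}

# Here "best" simply means the largest output.
winner, value = run_panel(experts, 3.0, score=lambda out: out)
print(winner, value)
```

Every expert runs on every input in this baseline, which is exactly the cost that the provenance-driven pruning described on the slide aims to reduce.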
  • 8. Case study: DiscoveryBus workflow
      • QSAR: Quantitative structure-activity relationships
        – at the forefront of Chemical Engineering research
      • OpenQsar project (http://www.openqsar.com):
        – Establish correlations between the structure of molecular compounds and some of their associated activities (toxicity, solubility, etc.)
      • DiscoveryBus: a workflow implementation of OpenQSAR
        – eScience Central cloud-based workflow system
        – datasets $DS^{(i)}$ are a family of structurally homogeneous molecules
        – Feature Selection extracts a few relevant features from $DS^{(i)}$
        – Each learning scheme $MB_1 \ldots MB_h$ generates a different predictive model for molecular activity
      Repetitive and resource-intensive: workflow execution is repeated over about 10K different input datasets
      [Figure: DiscoveryBus workflow. $DS^{(i)}$ → Feature Selection → $FS^{(i)}_1 \ldots FS^{(i)}_k$ → Model Discovery ($MB_1 \ldots MB_h$) → models $M^{(i)}_{11} \ldots M^{(i)}_{1h}, \ldots, M^{(i)}_{k1} \ldots M^{(i)}_{kh}$]
  • 9. Role of provenance in DiscoveryBus
      • Connection to Panel of Experts:
        – Experts: model builders $MB_i$
        – Experts' outcome: quality of generated model (predictive power, stability)
      • Optimization goal:
        – to prioritize invocation of the $MB_i$ based on their past performance on inputs similar to $FS^{(i)}_j$
      Provenance correlates quality of output models $M^{(i)}_{jh}$ to intermediate feature sets $FS^{(i)}_j$:
      $$M^{(i)}_{jh} \xrightarrow{\text{wasGeneratedBy}} MB_h \xrightarrow{\text{used}} FS^{(i)}_j \xrightarrow{\text{wasDerivedFrom}} DS^{(i)}$$
      [Figure: DiscoveryBus workflow, as on the previous slide.]
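The provenance chain on this slide can be followed mechanically once a trace is available as a set of labelled edges. The sketch below assumes a hypothetical edge-list encoding of one run's trace; the node names and the `follow` and `lineage` helpers are illustrative and do not reflect the storage format of any particular provenance system.

```python
# Hypothetical provenance trace for one run: (source, relation, target) edges.
# Each builder node stands for a single invocation in this toy trace.
EDGES = [
    ("M_11", "wasGeneratedBy", "MB_1"),
    ("MB_1", "used", "FS_1"),
    ("M_12", "wasGeneratedBy", "MB_2"),
    ("MB_2", "used", "FS_1"),
    ("FS_1", "wasDerivedFrom", "DS"),
]


def follow(edges, node, relation):
    """Return the first target reached from `node` via `relation`, or None."""
    for src, rel, dst in edges:
        if src == node and rel == relation:
            return dst
    return None


def lineage(edges, model):
    """Walk model -> model builder -> feature set -> dataset."""
    builder = follow(edges, model, "wasGeneratedBy")
    feature_set = follow(edges, builder, "used")
    dataset = follow(edges, feature_set, "wasDerivedFrom")
    return builder, feature_set, dataset


print(lineage(EDGES, "M_12"))
```

Pairing the recovered builder and feature set with a quality score for the model gives the kind of observation a recommender can accumulate over many runs.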
  • 10. The recommender
      • One Quality Matrix $QM_{FS}$ is associated with each Feature Set $FS$
      • $QM_{FS}[MB_i]$ encodes the success history of model builder $MB_i$ in the workflow every time $FS$ is used as input: $QM_{FS}[MB_h] = (G, B)$
        • $G$ (resp. $B$): number of times $MB_h$ has been observed to produce a good (resp. bad) model when applied to input $FS$
      • $QM_{FS}$ is updated every time $FS$ appears in the provenance graph
      • The builders' historical success rate induces a dynamic partial order on the $MB_i$
      • For each run, the Recommender:
        • intercepts $FS$ in the flow
        • returns the partial order from $QM_{FS}$
      [Figure: Panel of Experts with recommender. Labels: input queue, expert select, Expert 1 ... Expert n, results merge, output queue, recommender, online provenance analysis, provenance case base.]
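A minimal sketch of the quality-matrix bookkeeping described on this slide follows. Several details are assumptions made for illustration: the `Recommender` class and its method names are hypothetical, a feature set is keyed as a `frozenset` of feature names, the success rate is taken to be G / (G + B), and the partial order is linearised by sorting (ties keep insertion order).

```python
from collections import defaultdict


class Recommender:
    """Keeps one quality matrix per feature set: qm[fs][builder] = [G, B]."""

    def __init__(self):
        self.qm = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def update(self, fs, builder, good):
        """Record one observed outcome of `builder` applied to feature set `fs`."""
        self.qm[fs][builder][0 if good else 1] += 1

    def recommend(self, fs):
        """Return builders ordered by success rate; empty list if fs is unseen."""
        row = self.qm.get(fs)
        if not row:
            return []

        def rate(builder):
            good, bad = row[builder]
            return good / (good + bad)  # assumed definition of success rate

        return sorted(row, key=rate, reverse=True)


rec = Recommender()
fs = frozenset({"logP", "mass"})  # hypothetical feature names
for builder, good in [("MB1", True), ("MB1", True), ("MB2", True),
                      ("MB2", False), ("MB3", False)]:
    rec.update(fs, builder, good)

print(rec.recommend(fs))
print(rec.recommend(frozenset({"unseen"})))
```

Returning an empty list for an unseen feature set is one simple way to let the caller fall back to invoking every builder when there is no history to draw on.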
  • 11. Early experimental results
      [Chart: experimental results; axis labels and values not recoverable from the extraction.]
      • max_attempts is the accuracy/resources trade-off parameter
        – max_attempts = n: only the first n out of H model builders are invoked
      • Chart shows net accuracy over the entire available history of runs
        – success rate / number of recommendations given
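The effect of the `max_attempts` parameter can be illustrated with a small sketch. The builder names, the quality values, and the `best_within_budget` helper are all hypothetical; the recommended order is deliberately imperfect so that the trade-off is visible.

```python
def best_within_budget(ranked_builders, max_attempts, build):
    """Invoke only the first `max_attempts` builders in the recommended order
    and return the (builder, quality) pair of the best model obtained."""
    results = {b: build(b) for b in ranked_builders[:max_attempts]}
    best = max(results, key=results.get)
    return best, results[best]


# Hypothetical model quality per builder for one input, and a recommended
# order that does not place the truly best builder first.
quality = {"MB1": 0.9, "MB2": 0.4, "MB3": 0.7, "MB4": 0.5}
ranked = ["MB3", "MB1", "MB2", "MB4"]

for n in (1, 2, 4):
    print(n, best_within_budget(ranked, n, quality.get))
```

With a budget of 1 the best model is missed; with a budget of 2 it is found at half the cost of running all four builders.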
  • 12. Limitations and ongoing work
      • Approach suffers when the FS space is sparse
        – most FS are seen only once
      • Recommender abstains when $QM_{FS}$ is not sufficiently populated
      • This is the main hurdle to successful optimization
      [Chart: axis labels and values not recoverable from the extraction.]
      • Strategy: increase space density by clustering the FS
        • needs a distance metric over the set of FS
        • hierarchical clustering should provide a way to experiment with accuracy/efficiency trade-offs
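One way to experiment with the clustering strategy is sketched below. The slide does not name a distance metric, so Jaccard distance between sets of feature names is an assumption made here, as are the single-linkage merge rule, the `cluster` helper, and the example feature names.

```python
def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def cluster(feature_sets, threshold):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until the closest pair is farther apart than `threshold`."""
    clusters = [[fs] for fs in feature_sets]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(jaccard_distance(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


# Hypothetical feature sets.
A = frozenset({"logP", "mass", "charge"})
B = frozenset({"logP", "mass"})
C = frozenset({"rings", "bonds"})
D = frozenset({"rings", "bonds", "charge"})

tight = cluster([A, B, C, D], threshold=0.5)
loose = cluster([A, B, C, D], threshold=0.9)
print(len(tight), len(loose))
```

Raising the threshold yields fewer, denser clusters, so more history is shared per cluster at the price of treating less similar feature sets as equivalent.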
  • 13. Short summary
      • What happens when you try to apply mining/learning techniques to large corpora of provenance metadata?
        – any interesting applications / use cases?
        – which techniques apply?
        – are there significant research challenges, or an orchard of low-hanging fruit?
          • privacy in provenance mining
          • which provenance models lend themselves well to mining
          • ...
