SlideShare une entreprise Scribd logo
1  sur  26
Télécharger pour lire hors ligne
WorkflowSim: A Toolkit for Simulating
  Scientific Workflows in Distributed
             Environments
         Weiwei Chen, Ewa Deelman
Outline
§  Introduction
  –  Scientific workflows?
  –  Distributed environments?
§  Challenge
  –  Large scale, system overhead
§  Solution
  –  Workflow Overhead and Failure Model
§  Validation and Application
  –  Overhead Robustness
  –  Fault Tolerant Clustering

                           2
Scientific Applications
§  Scientists often need to
  –  Integrate diverse components and data
  –  Automate data processing steps
  –  Reproduce/analyze/share previous results
  –  Track the provenance of data products
  –  Execute processing steps efficiently
  –  Reliably execute applications

       Scientific Workflows provide
       solutions to these problems


                         3
Scientific Workflows
§  DAG model (Directed Acyclic Graph)
  –  Node: computational activities
  –  Directed edge: data dependencies
  –  Task: a process that users would like to execute
  –  Job: a single unit for execution with one of more tasks
  –  Task clustering: the process of grouping tasks to jobs




                             4
Workflow Management System
§  Common Features of WMS
  –  Maps abstract workflows to executable workflows
  –  Handles data dependencies
  –  Replica selection, transfers, registration, cleanup
  –  Task clustering, workflow partitioning, scheduling
  –  Reliability and fault tolerance
  –  Monitoring and troubleshooting
§  Existing WMS
  –  Pegasus, Askalon, Taverna, Kepler, Triana



                           5
Workflow Simulation
§  Benefits
  –  Save efforts in system setup, execution
  –  Repeat experimental results
  –  Control system environments (failures)
§  Trace based Workflow Simulation
  –  Import trace from a completed execution
  –  Vary workflow structures and system environments
§  Challenges (CloudSim, GridSim, etc.)
  –  System overhead and failures
  –  Multiple levels of computational activities(task/job)
  –  A hierarchy of management components
                           6
Comparison
                          CloudSim           WorkflowSim
                         Task, Bag of
  Execution Model                           Task, Job, DAG
                            Tasks
Failure and Monitoring        No                 Yes

                                          Data Transfer Delay
                         Data Transfer
  Overhead Model                         Workflow Engine Delay
                            Delay
                                          Clustering Delay …

    Site selection          Single             Multiple
                                         Scheduling, Job retry
    Optimization
                          Scheduling         Clustering,
    Techniques
                                            Partitioning …
     WorkflowSim is an extension of
     CloudSim, but it is workflow aware
                             7
System Architecture
§  Submit Host
  –  Workflow Mapper
  –  Clustering Engine
  –  Workflow Engine
  –  Local Scheduler
§  Execution Site
  –  Remote Scheduler
  –  Worker Nodes
  –  Failure Generator
  –  Failure Monitor

                         8
Workflow Overhead
–  Workflow Engine       –  Postscript Delay
   Delay                 –  Clustering Delay
–  Queue Delay           –  Data Transfer Delay




                     9
Example: Clustering Delay
                         n is the number of tasks per
                         level. k is number of jobs per
                         level.
                          !"#$%&'()*!!"#$%|!!! ! ! !
                                              =   =
                          !"#$%&'()*!!"#$%|!!! ! ! !



                         mProjectPP, mDiffFit, and
                         mBackground are the major
                         jobs of Montage

Overhead is not a constant variable.
It has diverse distribution and patterns

                 10
Validation
           !"#$%&'#$!!"#$%&&!!"#$%&'   Ideal Case: Accuracy=1.0
!!!"#$!% =
             !"#$!!"#$%&&!!"#$%&'      k: Maximum jobs per level




                                        Overheads have
                                        the biggest impact




                               11
Application: Overhead Robustness
§  Overhead robustness: the influence of
    overheads on the workflow runtime for DAG
    scheduling heuristics.
§  Inaccurate estimation (under- or over-estimated)
    of workflow overheads influences the overall
    runtime of workflows.
§  Research Merits:
     –  Sensitivity of heuristics
     –  Overhead friendly heuristics



                         12
Application: Overhead Robustness
§  Increase or Reduce Overhead by a Factor
    (Weight)
  –  Under estimation and over estimation
§  Heuristics Evaluated
  –  FCFS: First Come First Serve
  –  MCT: Minimum Completion Time
  –  MinMin: The job with the minimum completion time is
     selected and assigned to the fastest resource.
  –  MaxMin: The job with the maximum completion time
     and assigns it to its best available resource.



                           13
Application: Overhead Robustness
                           MCT<MinMin MCT<MinMin
         MCT>MinMin

MCT<MinMin




    Accurate estimation of Clustering
    Delay is more important

                      14
Workflow Failure
§  Failures have significant influence on the
    performance
§  Classifying Transient Failures
   –  Task Failure: A task fails, other tasks within the
      same job may not fail
   –  Job Failure: A job fails, all of its tasks fail




                            15
System Architecture
§  Submit Host
  –  Job Retry
  –  Reclustering
§  Execution Site
  –  Failure Generator
  –  Failure Monitor




                         16
Application: Fault Tolerant Clustering
§  Task clustering can reduce execution overhead
§  A job composed of multiple tasks may have a
    greater risk of suffering from failures
§  Reclustering and Job Retry are proposed
  –  No Optimization (NOOP) retries the failed jobs.
  –  Dynamic Clustering (DC) decreases the clusters.size if
     the measured job failure rate is high.




                           17
Application: Fault Tolerant Clustering
–  Selective Reclustering (SR) retries the failed tasks in a job
–  Dynamic Reclustering (DR) retires the failed tasks in a job and also
   decreases the clusters.size if the measured job failure rate is high.




                               18
Performance
                       §  Reclustering (DR/DC/SR) reduces the influence of
                           failures significantly compared to NOOP
                       §  DR outperforms other techniques.
The lower the better




                                                 19
Conclusion
§  WorkflowSim assists researchers to evaluate
    their workflow optimization techniques with
    better accuracy and wider support.
  –  Modeling Overhead and Failures
  –  Distributed and Hierarchical components
  –  Workflow Techniques
§  It is necessary to consider both data
    dependencies, workflow failures and system
    overhead.



                           20
Acknowledgements
§    Pegasus Team: http://pegasus.isi.edu
§    FutureGrid & XSEDE
§    Funded by NSF grants IIS-0905032.
§    More info: wchen@isi.edu
§    Available on request
§    Q&A




                               21
Overhead Distribution




                   Montage




          22
Overhead Distribution




                   CyberShake




          23
Broadband




            24
Task Failure Model and Job Failure Model




                n                                       4d
      k* =                           −d + d 2 −
                r                                    ln(1− α )
                              k* =                               ,   if   n >> r
                                                2t
     *       (kt + d)
    ttotal
        =                    *   n(k *t + d)
              1− β           t =
                             total          *
                                 rk(1− α )k




                        25                               25
Related Papers


§    Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012.
§    Improving Scientific Workflow Performance using Policy Based Data
      Placement, IEEE Policy 2012.
§    Fault Tolerant Clustering in Scientific Workflows, SWF, IEEE Services 2012
§    Workflow Overhead Analysis and Optimizations, WORKS 2011.
§    Partitioning and Scheduling Workflows across Multiple Sites with Storage
      Constraints, PPAM 2011
§    Pegasus: a Framework for Mapping Complex Scientific Workflows onto
      Distributed Systems, Scientific Programming Journal 2005




                                          26

Contenu connexe

Tendances

Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingSayed Chhattan Shah
 
SYSTEM NETWORK ADMINISTRATIONS GOALS and TIPS
SYSTEM NETWORK ADMINISTRATIONS GOALS and TIPSSYSTEM NETWORK ADMINISTRATIONS GOALS and TIPS
SYSTEM NETWORK ADMINISTRATIONS GOALS and TIPSProf Ansari
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon
 
Linux Internals - Kernel/Core
Linux Internals - Kernel/CoreLinux Internals - Kernel/Core
Linux Internals - Kernel/CoreShay Cohen
 
Yaygın Linux Komutları ve Windows Karşılıkları
Yaygın Linux Komutları ve Windows KarşılıklarıYaygın Linux Komutları ve Windows Karşılıkları
Yaygın Linux Komutları ve Windows KarşılıklarıMert Hakki Bingol
 
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal JoshiRedis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal JoshiRedis Labs
 
Kernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisKernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisBuland Singh
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Amy W. Tang
 
OOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM appsOOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM appsSematext Group, Inc.
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop QuébecMathieu Dumoulin
 
Cloud computing : Cloud sim
Cloud computing : Cloud sim Cloud computing : Cloud sim
Cloud computing : Cloud sim Khalid EDAIG
 
Z OS IBM Utilities
Z OS IBM UtilitiesZ OS IBM Utilities
Z OS IBM Utilitieskapa rohit
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File SystemRutvik Bapat
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Databricks
 
Temel Linux Kullanımı ve Komutları
Temel Linux Kullanımı ve KomutlarıTemel Linux Kullanımı ve Komutları
Temel Linux Kullanımı ve KomutlarıAhmet Gürel
 
Comparaison des solutions Paas
Comparaison des solutions PaasComparaison des solutions Paas
Comparaison des solutions Paasyacine sebihi
 
Free rtos seminar
Free rtos seminarFree rtos seminar
Free rtos seminarCho Daniel
 

Tendances (20)

Linux Internals - Interview essentials - 1.0
Linux Internals - Interview essentials - 1.0Linux Internals - Interview essentials - 1.0
Linux Internals - Interview essentials - 1.0
 
Linux Internals - Interview essentials 4.0
Linux Internals - Interview essentials 4.0Linux Internals - Interview essentials 4.0
Linux Internals - Interview essentials 4.0
 
Introduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed ComputingIntroduction to Parallel and Distributed Computing
Introduction to Parallel and Distributed Computing
 
SYSTEM NETWORK ADMINISTRATIONS GOALS and TIPS
SYSTEM NETWORK ADMINISTRATIONS GOALS and TIPSSYSTEM NETWORK ADMINISTRATIONS GOALS and TIPS
SYSTEM NETWORK ADMINISTRATIONS GOALS and TIPS
 
HBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ SalesforceHBaseCon 2015: HBase Performance Tuning @ Salesforce
HBaseCon 2015: HBase Performance Tuning @ Salesforce
 
Linux Internals - Kernel/Core
Linux Internals - Kernel/CoreLinux Internals - Kernel/Core
Linux Internals - Kernel/Core
 
Yaygın Linux Komutları ve Windows Karşılıkları
Yaygın Linux Komutları ve Windows KarşılıklarıYaygın Linux Komutları ve Windows Karşılıkları
Yaygın Linux Komutları ve Windows Karşılıkları
 
Introduction to OpenCL
Introduction to OpenCLIntroduction to OpenCL
Introduction to OpenCL
 
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal JoshiRedis: Swiss Army Knife @HackerRank: Kamal Joshi
Redis: Swiss Army Knife @HackerRank: Kamal Joshi
 
Kernel_Crash_Dump_Analysis
Kernel_Crash_Dump_AnalysisKernel_Crash_Dump_Analysis
Kernel_Crash_Dump_Analysis
 
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
Espresso: LinkedIn's Distributed Data Serving Platform (Paper)
 
OOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM appsOOPs, OOMs, oh my! Containerizing JVM apps
OOPs, OOMs, oh my! Containerizing JVM apps
 
Presentation Hadoop Québec
Presentation Hadoop QuébecPresentation Hadoop Québec
Presentation Hadoop Québec
 
Cloud computing : Cloud sim
Cloud computing : Cloud sim Cloud computing : Cloud sim
Cloud computing : Cloud sim
 
Z OS IBM Utilities
Z OS IBM UtilitiesZ OS IBM Utilities
Z OS IBM Utilities
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
Dr. Elephant for Monitoring and Tuning Apache Spark Jobs on Hadoop with Carl ...
 
Temel Linux Kullanımı ve Komutları
Temel Linux Kullanımı ve KomutlarıTemel Linux Kullanımı ve Komutları
Temel Linux Kullanımı ve Komutları
 
Comparaison des solutions Paas
Comparaison des solutions PaasComparaison des solutions Paas
Comparaison des solutions Paas
 
Free rtos seminar
Free rtos seminarFree rtos seminar
Free rtos seminar
 

Similaire à Workflowsim escience12

High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of Viewaragozin
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCOlga Lavrentieva
 
Velocity 2018 preetha appan final
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan finalpreethaappan
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...DataWorks Summit
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Steve Min
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)camunda services GmbH
 
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays
 
Java scalability considerations yogesh deshpande
Java scalability considerations   yogesh deshpandeJava scalability considerations   yogesh deshpande
Java scalability considerations yogesh deshpandeIndicThreads
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopHortonworks
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudRevolution Analytics
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesDataWorks Summit
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomFacultad de Informática UCM
 
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...Nane Kratzke
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryDoKC
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Lu Wei
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010Cloudera, Inc.
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliencyMasashi Narumoto
 
Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0 Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0 Sumanth Chinthagunta
 

Similaire à Workflowsim escience12 (20)

High Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of ViewHigh Performance Computing - Cloud Point of View
High Performance Computing - Cloud Point of View
 
Взгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPCВзгляд на облака с точки зрения HPC
Взгляд на облака с точки зрения HPC
 
Velocity 2018 preetha appan final
Velocity 2018   preetha appan finalVelocity 2018   preetha appan final
Velocity 2018 preetha appan final
 
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
Dr Elephant: LinkedIn's Self-Service System for Detecting and Treating Hadoop...
 
Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)Apache Spark Overview part1 (20161107)
Apache Spark Overview part1 (20161107)
 
Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)Camunda BPM 7.2: Performance and Scalability (English)
Camunda BPM 7.2: Performance and Scalability (English)
 
IEEE CLOUD \'11
IEEE CLOUD \'11IEEE CLOUD \'11
IEEE CLOUD \'11
 
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
apidays Paris 2022 - Of graphQL, DX friction, and surgical monolithectomy, Fr...
 
Java scalability considerations yogesh deshpande
Java scalability considerations   yogesh deshpandeJava scalability considerations   yogesh deshpande
Java scalability considerations yogesh deshpande
 
Apache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with HadoopApache Hadoop YARN - The Future of Data Processing with Hadoop
Apache Hadoop YARN - The Future of Data Processing with Hadoop
 
NoSQL and ACID
NoSQL and ACIDNoSQL and ACID
NoSQL and ACID
 
Speed up R with parallel programming in the Cloud
Speed up R with parallel programming in the CloudSpeed up R with parallel programming in the Cloud
Speed up R with parallel programming in the Cloud
 
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on KubernetesApache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
Apache Druid Auto Scale-out/in for Streaming Data Ingestion on Kubernetes
 
OpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroomOpenMP tasking model: from the standard to the classroom
OpenMP tasking model: from the standard to the classroom
 
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
Smuggling Multi-Cloud Support into Cloud-native Applications using Elastic Co...
 
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster RecoveryStop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
Stop Worrying and Keep Querying, Using Automated Multi-Region Disaster Recovery
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
 
HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010HP - Jerome Rolia - Hadoop World 2010
HP - Jerome Rolia - Hadoop World 2010
 
Designing apps for resiliency
Designing apps for resiliencyDesigning apps for resiliency
Designing apps for resiliency
 
Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0 Single Page Applications with AngularJS 2.0
Single Page Applications with AngularJS 2.0
 

Dernier

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdfChristopherTHyatt
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Dernier (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Evaluating the top large language models.pdf
Evaluating the top large language models.pdfEvaluating the top large language models.pdf
Evaluating the top large language models.pdf
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Workflowsim escience12

  • 1. WorkflowSim: A Toolkit for Simulating Scientific Workflows in Distributed Environments Weiwei Chen, Ewa Deelman
  • 2. Outline §  Introduction –  Scientific workflows? –  Distributed environments? §  Challenge –  Large scale, system overhead §  Solution –  Workflow Overhead and Failure Model §  Validation and Application –  Overhead Robustness –  Fault Tolerant Clustering 2
  • 3. Scientific Applications §  Scientists often need to –  Integrate diverse components and data –  Automate data processing steps –  Reproduce/analyze/share previous results –  Track the provenance of data products –  Execute processing steps efficiently –  Reliably execute applications Scientific Workflows provide solutions to these problems 3
  • 4. Scientific Workflows §  DAG model (Directed Acyclic Graph) –  Node: computational activities –  Directed edge: data dependencies –  Task: a process that users would like to execute –  Job: a single unit for execution with one of more tasks –  Task clustering: the process of grouping tasks to jobs 4
  • 5. Workflow Management System §  Common Features of WMS –  Maps abstract workflows to executable workflows –  Handles data dependencies –  Replica selection, transfers, registration, cleanup –  Task clustering, workflow partitioning, scheduling –  Reliability and fault tolerance –  Monitoring and troubleshooting §  Existing WMS –  Pegasus, Askalon, Taverna, Kepler, Triana 5
  • 6. Workflow Simulation §  Benefits –  Save efforts in system setup, execution –  Repeat experimental results –  Control system environments (failures) §  Trace based Workflow Simulation –  Import trace from a completed execution –  Vary workflow structures and system environments §  Challenges (CloudSim, GridSim, etc.) –  System overhead and failures –  Multiple levels of computational activities(task/job) –  A hierarchy of management components 6
  • 7. Comparison CloudSim WorkflowSim Task, Bag of Execution Model Task, Job, DAG Tasks Failure and Monitoring No Yes Data Transfer Delay Data Transfer Overhead Model Workflow Engine Delay Delay Clustering Delay … Site selection Single Multiple Scheduling, Job retry Optimization Scheduling Clustering, Techniques Partitioning … WorkflowSim is an extension of CloudSim, but it is workflow aware 7
  • 8. System Architecture §  Submit Host –  Workflow Mapper –  Clustering Engine –  Workflow Engine –  Local Scheduler §  Execution Site –  Remote Scheduler –  Worker Nodes –  Failure Generator –  Failure Monitor 8
  • 9. Workflow Overhead –  Workflow Engine –  Postscript Delay Delay –  Clustering Delay –  Queue Delay –  Data Transfer Delay 9
  • 10. Example: Clustering Delay n is the number of tasks per level. k is number of jobs per level. !"#$%&'()*!!"#$%|!!! ! ! ! = = !"#$%&'()*!!"#$%|!!! ! ! ! mProjectPP, mDiffFit, and mBackground are the major jobs of Montage Overhead is not a constant variable. It has diverse distribution and patterns 10
  • 11. Validation !"#$%&'#$!!"#$%&&!!"#$%&' Ideal Case: Accuracy=1.0 !!!"#$!% = !"#$!!"#$%&&!!"#$%&' k: Maximum jobs per level Overheads have the biggest impact 11
  • 12. Application: Overhead Robustness §  Overhead robustness: the influence of overheads on the workflow runtime for DAG scheduling heuristics. §  Inaccurate estimation (under- or over-estimated) of workflow overheads influences the overall runtime of workflows. §  Research Merits: –  Sensitivity of heuristics –  Overhead friendly heuristics 12
  • 13. Application: Overhead Robustness §  Increase or Reduce Overhead by a Factor (Weight) –  Under estimation and over estimation §  Heuristics Evaluated –  FCFS: First Come First Serve –  MCT: Minimum Completion Time –  MinMin: The job with the minimum completion time is selected and assigned to the fastest resource. –  MaxMin: The job with the maximum completion time and assigns it to its best available resource. 13
  • 14. Application: Overhead Robustness MCT<MinMin MCT<MinMin MCT>MinMin MCT<MinMin Accurate estimation of Clustering Delay is more important 14
  • 15. Workflow Failure §  Failures have significant influence on the performance §  Classifying Transient Failures –  Task Failure: A task fails, other tasks within the same job may not fail –  Job Failure: A job fails, all of its tasks fail 15
  • 16. System Architecture §  Submit Host –  Job Retry –  Reclustering §  Execution Site –  Failure Generator –  Failure Monitor 16
  • 17. Application: Fault Tolerant Clustering §  Task clustering can reduce execution overhead §  A job composed of multiple tasks may have a greater risk of suffering from failures §  Reclustering and Job Retry are proposed –  No Optimization (NOOP) retries the failed jobs. –  Dynamic Clustering (DC) decreases the clusters.size if the measured job failure rate is high. 17
  • 18. Application: Fault Tolerant Clustering –  Selective Reclustering (SR) retries the failed tasks in a job –  Dynamic Reclustering (DR) retires the failed tasks in a job and also decreases the clusters.size if the measured job failure rate is high. 18
  • 19. Performance §  Reclustering (DR/DC/SR) reduces the influence of failures significantly compared to NOOP §  DR outperforms other techniques. The lower the better 19
  • 20. Conclusion §  WorkflowSim assists researchers to evaluate their workflow optimization techniques with better accuracy and wider support. –  Modeling Overhead and Failures –  Distributed and Hierarchical components –  Workflow Techniques §  It is necessary to consider both data dependencies, workflow failures and system overhead. 20
  • 21. Acknowledgements §  Pegasus Team: http://pegasus.isi.edu §  FutureGrid & XSEDE §  Funded by NSF grants IIS-0905032. §  More info: wchen@isi.edu §  Available on request §  Q&A 21
  • 22. Overhead Distribution Montage 22
  • 23. Overhead Distribution CyberShake 23
  • 24. Broadband 24
  • 25. Task Failure Model and Job Failure Model n 4d k* = −d + d 2 − r ln(1− α ) k* = , if n >> r 2t * (kt + d) ttotal = * n(k *t + d) 1− β t = total * rk(1− α )k 25 25
  • 26. Related Papers §  Integration of Workflow Partitioning and Resource Provisioning, CCGrid 2012. §  Improving Scientific Workflow Performance using Policy Based Data Placement, IEEE Policy 2012. §  Fault Tolerant Clustering in Scientific Workflows, SWF, IEEE Services 2012 §  Workflow Overhead Analysis and Optimizations, WORKS 2011. §  Partitioning and Scheduling Workflows across Multiple Sites with Storage Constraints, PPAM 2011 §  Pegasus: a Framework for Mapping Complex Scientific Workflows onto Distributed Systems, Scientific Programming Journal 2005 26