Scalable and reproducible workflows with Pachyderm

This presentation introduces Pachyderm as a tool for scalable and reproducible workflows in the life sciences. Pachyderm is an open-source workflow engine and distributed data processing tool that leverages the container ecosystem.

  1. Scalable and reproducible workflows with Pachyderm. 2 October 2017. Jon Ander Novella de Miguel, Pharmaceutical Bioinformatics research group, Uppsala, Sweden
  2. APPROACHES TO TACKLE BIOLOGICAL COMPUTATIONS: • Data growth in biomedicine • Scalable methods for Big Data Analytics enabled by Cloud Computing
  3. METABOLITE DATA: • Mass spectrometry can offer high metabolite coverage
  4. CHALLENGES: • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation
  5. WORKFLOW DEFINITIONS: • Stitching together many different software tools is tedious • Time-intensive and parameter-heavy steps are involved • Examples: Taverna, Nextflow, SciPipe
  6. CHALLENGES: • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation
  7. ISOLATION OF SCIENTIFIC SOFTWARE: • Containers wrap an app with its own operating environment • Portability and environmental consistency • Useful in science • Is Vagrant already old-fashioned?
  8. CONTAINER ORCHESTRATION TOOLS: • Deployment, scaling and management of containers in a cluster • Kubernetes: big and active community • Automatic healing and machine decoupling [1] https://www.kubernetes.io
  9. CHALLENGES: • Workflow definitions • Isolation of scientific software • Reproducibility • Parallelisation
  10. WHAT IS PACHYDERM? • A workflow system based on Kubernetes • A distributed data processing tool based on containers • Enables reproducibility, provenance, parallelization and isolation “You can focus on being productive, while Pachyderm will scale up and analyze for you” [2] https://www.pachyderm.io
  11. PACHYDERM FILE SYSTEM (PFS): PFS offers version control for data. The main primitives are: • Repositories: versioned collections of data • Commits: new data • Files: data storage primitives [3] https://www.pachyderm.io/pfs.html
  12. PACHYDERM PIPELINE SYSTEM (PPS): • Tasks executed by Kubernetes pods • Parallelization: spreading data • Incrementality and glob patterns • Directed Acyclic Graph [4] https://www.pachyderm.io/pps.html
  13. GOALS OF THE DAY: • Reproduce a metabolomics workflow with Pachyderm • Learn how to distribute processing using containers • Experience the power of data versioning • Learn how to use containers in a cloud-like distributed processing environment
  14. AN OPENMS BASED WORKFLOW: • OpenMS: software for metabolite and proteome data analysis and management • Detection of mass traces and their aggregation into features • Four pre-processing steps [Diagram labels: File Filter, Feature Finder, Feature Linker, Text Exporter, CSV]
  15. METHODS: • Kubernetes cluster backed by a Vagrant box (VM) • https://github.com/CARAMBA-Clinic/COST-CHARME/blob/master/README.md • Execution of the workflow engine in a cloud-like environment via Jupyter • Downstream analysis in RStudio
  16. WORKFLOW IN PACHYDERM: • Four interconnected tasks/processes • Intermediate data handled by repositories • Results are also stored in a repository
  17. REPRODUCIBLE RESULT: • Pachyderm gives us a reproducible and scalable data processing platform • Can you write your own container and distribute its computation?
  18. THANKS! ANY QUESTIONS? “Provenance and reproducibility enable a rigorous and efficient data science” Jon Ander Novella de Miguel, Department of Pharmaceutical Biosciences, Jon.Novella@farmbio.uu.se
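
The four-task workflow described in slides 12, 14 and 16 can be sketched as a chain of pipeline specifications, where each task reads from the repository written by the previous one. This is a minimal illustration and not the presenter's actual configuration: the pipeline names, the input repository `raw-spectra`, the container images and the shell commands are all hypothetical placeholders, and the spec key names are an assumption that should be checked against the Pachyderm version in use. The glob pattern `/*` treats each top-level file as a separate unit of work that can be processed in parallel, while `/` hands the whole repository to a single task; using `/` for the linking step is likewise an assumption.

```python
import json

# Stage order follows the labels on slide 14. Names, images and commands are
# hypothetical placeholders, not the presenter's actual configuration.
STAGES = [
    # (pipeline name, input repo, glob pattern)
    ("file-filter", "raw-spectra", "/*"),      # one unit of work per input file
    ("feature-finder", "file-filter", "/*"),   # still one unit per file
    ("feature-linker", "feature-finder", "/"), # assumed to need all files at once
    ("text-exporter", "feature-linker", "/*"),
]


def pipeline_spec(name, input_repo, glob):
    """Return one pipeline spec as a plain dict.

    The key names ("pipeline", "transform", "input", "pfs") are an assumption
    about the spec layout and may differ between Pachyderm versions.
    """
    return {
        "pipeline": {"name": name},
        "transform": {
            "image": f"example.org/{name}:latest",  # placeholder image
            # Placeholder command: read from /pfs/<input repo>, write to /pfs/out.
            "cmd": ["/bin/sh", "-c", f"cp -r /pfs/{input_repo}/. /pfs/out/"],
        },
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# Each pipeline writes to a repo named after itself, so chaining the input
# repos in this way yields a linear directed acyclic graph.
specs = [pipeline_spec(*stage) for stage in STAGES]

for spec in specs:
    print(json.dumps(spec, indent=2))
```

Each printed document could be saved to its own JSON file and submitted with the Pachyderm command-line client. Because every intermediate result lives in a versioned repository, adding new data to the first repository would only trigger processing for the new units of work.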
