C-SCALE Tutorial: Snakemake

Sebastian Luna-Valero, Cloud Community Support Specialist at EGI Foundation
This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101017529.
Copernicus - eoSC AnaLytics Engine
C-SCALE tutorial: Snakemake
Sebastian Luna-Valero, EGI Foundation
sebastian.luna.valero@egi.eu
C-SCALE tutorial: Snakemake | 29th November 2022 | Online
Outline
• Why workflows?
• Why snakemake?
• Let’s build a workflow!
Why workflows?
Credits: https://github.com/c-scale-community/use-case-hisea
Goals:
● from raw data to figures
○ with “one click”
● re-run with new config
○ spatial scale
○ temporal scale
● re-run half-way through
○ recover from issues
● dependency management
○ between tasks
○ software packages
Why workflows?
When to build a workflow?
● Re-run the same analysis over and over again, with different input parameters
● Ability to re-run the work partially; recover from intermediate failures
● Combine heterogeneous tools in the same analysis
○ Python, R, Julia, Docker, Bash, etc.
Why snakemake?
• Mature workflow management system.
• Great community around it.
• Easy to learn? :)
• A Snakemake workflow scales without modification from single-core workstations and multi-core servers to batch systems (e.g. Slurm).
• Snakemake integrates with the package manager Conda and the container engine Singularity, so that defining the software stack becomes part of the workflow itself.
• Further information: https://snakemake.readthedocs.io/
Let’s build a workflow!
• Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that
define how to create output files from input files.
• $ snakemake --cores 1
• The application of a rule to generate a set of output files is called a job.
Snakefile:
    rule count_countries:
        input:
            "european-countries.txt"
        output:
            "number-of-countries.txt"
        shell:
            "wc --lines european-countries.txt > number-of-countries.txt"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

Let’s build a workflow!
• Snakemake follows the GNU Make paradigm: workflows are defined in terms of rules that
define how to create output files from input files.
• $ snakemake --cores 1
• Snakemake only re-runs jobs if one of the input files is newer than one of the output files
or one of the input files will be updated by another job.
Snakefile:
    rule count_countries:
        input:
            "european-countries.txt"
        output:
            "number-of-countries.txt"
        shell:
            "wc --lines european-countries.txt > number-of-countries.txt"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Belgium

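The re-run condition on this slide can be sketched in plain Python. This is a simplified model based only on file modification times, not Snakemake's actual implementation; the function names `needs_rerun` and `demo` are hypothetical.

```python
import os
import tempfile
import time


def needs_rerun(inputs, outputs):
    """Return True if any output is missing or older than any input."""
    if not all(os.path.exists(path) for path in outputs):
        return True
    newest_input = max(os.path.getmtime(path) for path in inputs)
    oldest_output = min(os.path.getmtime(path) for path in outputs)
    return newest_input > oldest_output


def demo():
    """Create an input and an output in a temp dir, then 'edit' the input."""
    with tempfile.TemporaryDirectory() as workdir:
        source = os.path.join(workdir, "european-countries.txt")
        result = os.path.join(workdir, "number-of-countries.txt")
        with open(source, "w") as handle:
            handle.write("Netherlands\nGreece\n")
        with open(result, "w") as handle:
            handle.write("2\n")
        # Set explicit timestamps so the demo does not depend on clock resolution.
        now = time.time()
        os.utime(source, (now - 10, now - 10))
        os.utime(result, (now, now))
        before = needs_rerun([source], [result])
        # Simulate adding "Belgium": the input becomes newer than the output.
        os.utime(source, (now + 10, now + 10))
        after = needs_rerun([source], [result])
        return before, after


if __name__ == "__main__":
    print(demo())
```

In this sketch the first check reports that nothing needs to run, and the second check reports a re-run once the input file has a newer timestamp than the output.
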
Let’s build a workflow!
• Generalize the rule:
• $ snakemake --cores 1
• $ wc --lines european-countries.txt > number-of-countries.txt
Snakefile:
    rule count_countries:
        input:
            "european-countries.txt"
        output:
            "number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

Let’s build a workflow!
• Adding more than one input file:
• $ snakemake --cores 1
• $ wc --lines european-countries.txt other-countries.txt > number-of-countries.txt
Snakefile:
    rule count_countries:
        input:
            "european-countries.txt",
            "other-countries.txt"
        output:
            "number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

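To see why `{input}` with two files expands to the `wc` command shown on this slide, here is a rough Python illustration. It assumes the placeholders are filled with space-separated file lists; `render_command` is a hypothetical helper, not a Snakemake function.

```python
def render_command(template, inputs, outputs):
    """Fill {input} and {output} with space-separated file lists."""
    return template.format(input=" ".join(inputs), output=" ".join(outputs))


command = render_command(
    "wc --lines {input} > {output}",
    ["european-countries.txt", "other-countries.txt"],
    ["number-of-countries.txt"],
)
print(command)
```

The printed string is the same shell command that appears in the bullet list of this slide.
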
Let’s build a workflow!
• It’s better to organize your working directory:
• $ snakemake --cores 1
Snakefile:
    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Connecting rules! Targets can be rule names or output files.
• $ snakemake --cores 1 <target>
Snakefile:
    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        shell:
            "touch pre-processing.done"

    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Updating intermediate files (however: #1978 and #2011)
• $ snakemake --cores 1 <target>
Snakefile:
    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        shell:
            "touch pre-processing.done"

    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria
Belgium

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Dependencies between the rules are determined automatically, creating a directed acyclic graph (DAG)
• $ snakemake --cores 1 --dag | dot -Tsvg > dag.svg
Snakefile:
    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        shell:
            "touch pre-processing.done"

    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

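The idea behind the DAG can be illustrated with a small Python sketch: a rule depends on whichever rule produces one of its input files, and a topological sort gives a valid execution order. The `RULES` dictionary and `build_dag` are hypothetical stand-ins for the Snakefile above, and the standard-library `graphlib` module (Python 3.9+) is used here in place of Snakemake's own scheduler.

```python
from graphlib import TopologicalSorter

# Hypothetical in-memory description of the two rules from the Snakefile.
RULES = {
    "count_countries": {
        "input": ["countries/european-countries.txt", "countries/other-countries.txt"],
        "output": ["stats/number-of-countries.txt"],
    },
    "pre_processing": {
        "input": ["stats/number-of-countries.txt"],
        "output": ["pre-processing.done"],
    },
}


def build_dag(rules):
    """Map each rule to the set of rules that produce its input files."""
    producer = {out: name for name, rule in rules.items() for out in rule["output"]}
    return {
        name: {producer[path] for path in rule["input"] if path in producer}
        for name, rule in rules.items()
    }


dag = build_dag(RULES)
# TopologicalSorter expects a mapping from node to its predecessors.
order = list(TopologicalSorter(dag).static_order())
print(order)
```

Here `count_countries` comes before `pre_processing`, because the latter reads the file the former writes.
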
Let’s build a workflow!
• Python
• $ snakemake --cores 1 <target>
Snakefile:
    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        shell:
            "python myscript.py --input stats/number-of-countries.txt"

    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

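The slides do not show `myscript.py`, so the following is only a guess at what such a script might contain: it accepts an `--input` option and reads the total from the `wc --lines` output. All names here (`parse_args`, `total_lines`) are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the --input option passed by the shell command of the rule."""
    parser = argparse.ArgumentParser(description="Hypothetical pre-processing step")
    parser.add_argument("--input", required=True, help="file written by count_countries")
    return parser.parse_args(argv)


def total_lines(wc_output):
    """Extract the line count from `wc --lines` output.

    With several input files, wc prints one line per file plus a final
    "total" line; with a single file it prints only that file's count.
    """
    last_line = wc_output.strip().splitlines()[-1]
    return int(last_line.split()[0])


# Example call with an explicit argument list instead of the real command line.
args = parse_args(["--input", "stats/number-of-countries.txt"])
sample = (
    "  7 countries/european-countries.txt\n"
    "  2 countries/other-countries.txt\n"
    "  9 total\n"
)
print(args.input, total_lines(sample))
```

A real script would also need to create `pre-processing.done`, since the rule declares it as its output.
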
Let’s build a workflow!
• Containers
• $ snakemake --cores 1 <target>
Snakefile:
    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        shell:
            "udocker run example"

    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

Let’s build a workflow!
• Built-in support for Singularity (see docs for more details)
Snakefile:
    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        container:
            "docker://repo/image"
        script:
            "scripts/plot.R"

    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

Let’s build a workflow!
• Configuration
• $ snakemake --cores 1
Snakefile:
    configfile: "config.yaml"

    rule count_countries:
        input:
            expand("{input}", input=config['european']),
            expand("{input}", input=config['other'])
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat config.yaml
european: 'countries/european-countries.txt'
other: 'countries/other-countries.txt'

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Logging
• $ snakemake --cores 1
Snakefile:
    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        log:
            "logs/count_countries.log"
        shell:
            "wc --lines {input} > {output} 2> {log}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Benchmarking
• $ snakemake --cores 1
Snakefile:
    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        benchmark:
            "benchmarks/count_countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Modularization
rules/count_countries.smk:
    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        shell:
            "wc --lines {input} > {output}"

Snakefile:
    include: "rules/count_countries.smk"

    rule pre_processing:
        input:
            "stats/number-of-countries.txt"
        output:
            "pre-processing.done"
        shell:
            "touch pre-processing.done"

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Integration with conda
• $ snakemake --cores 1 --use-conda --conda-frontend mamba
Snakefile:
    rule count_countries:
        input:
            "countries/european-countries.txt",
            "countries/other-countries.txt"
        output:
            "stats/number-of-countries.txt"
        conda:
            "envs/count_countries.yaml"
        shell:
            "wc --lines {input} > {output}"

$ cat envs/count_countries.yaml
name: count_countries
channels:
  - conda-forge
  - defaults
dependencies:
  - coreutils

$ cat european-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

$ cat other-countries.txt
US
Canada

Let’s build a workflow!
• Other examples
• https://github.com/c-scale-community/c-scale-tutorial-snakemake
• https://github.com/c-scale-community/use-case-hisea/pull/41/files
Let’s build a workflow!
• Advanced features
  • Built-in functionality for scatter-gather jobs
  • Cluster execution: snakemake --cluster qsub (see SLURM docs)
  • Self-contained HTML reports
  • Accessing remote storage:
    • Amazon S3, Google Cloud Storage, Microsoft Azure Blob Storage
    • SFTP, HTTP, FTP, Dropbox, XRootD, WebDAV, GFAL, GridFTP, iRODS, etc.
• Best practices
  • https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html
• FAQs: https://snakemake.readthedocs.io/en/stable/project_info/faq.html
Thank you for your attention.
contact@c-scale.eu
https://c-scale.eu
@C_SCALE_EU
Let’s build a workflow!
• Wildcards example:
• $ snakemake --cores 1 stats/number-of-european-countries.txt
• $ snakemake --cores 1 stats/number-of-other-countries.txt
Snakefile:
    rule count_countries:
        input:
            "countries/{category}-countries.txt"
        output:
            "stats/number-of-{category}-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat list-of-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

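Wildcard resolution can be pictured as pattern matching on the requested target: the output pattern is matched against the target, and the captured value is filled into the input pattern. The sketch below is a simplified Python model of that idea, not Snakemake's code; `pattern_to_regex` and `resolve` are hypothetical helpers, and each wildcard is assumed to match one or more arbitrary characters.

```python
import re


def pattern_to_regex(pattern):
    """Turn "stats/number-of-{category}-countries.txt" into a compiled regex."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for index, part in enumerate(parts):
        # Odd positions hold wildcard names, even positions hold literal text.
        regex += f"(?P<{part}>.+)" if index % 2 else re.escape(part)
    return re.compile(regex)


def resolve(target, output_pattern, input_pattern):
    """Match a requested target and fill the same wildcards into the input."""
    match = pattern_to_regex(output_pattern).fullmatch(target)
    if match is None:
        return None
    return input_pattern.format(**match.groupdict())


print(resolve(
    "stats/number-of-european-countries.txt",
    "stats/number-of-{category}-countries.txt",
    "countries/{category}-countries.txt",
))
```

Requesting `stats/number-of-european-countries.txt` therefore leads to `countries/european-countries.txt` as the input, which is why one rule can serve both commands on this slide.
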
Let’s build a workflow!
• Many to many with glob_wildcards:
• $ snakemake --cores 1
Snakefile:
    CATEGORIES, = glob_wildcards("countries/{category}-countries.txt")
    print(CATEGORIES)

    rule all:
        input:
            expand("stats/number-of-{category}-countries.txt", category=CATEGORIES)

    rule count_countries:
        input:
            "countries/{category}-countries.txt"
        output:
            "stats/number-of-{category}-countries.txt"
        shell:
            "wc --lines {input} > {output}"

$ cat list-of-countries.txt
Netherlands
Greece
Spain
Portugal
Italy
Poland
Austria

[Diagram: input-1, input-2, ..., input-n mapped to output-1, output-2, ..., output-n]

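As a rough picture of what `glob_wildcards` and `expand` do together in this Snakefile, the following Python sketch scans a directory for matching files and builds one target per match. It is a simplification for a single wildcard; `find_categories`, `expand_targets` and `demo` are hypothetical names, and the demo files are created in a temporary directory.

```python
import re
import tempfile
from pathlib import Path


def find_categories(directory):
    """Collect the {category} part of every <category>-countries.txt file."""
    names = []
    for path in sorted(Path(directory).glob("*-countries.txt")):
        match = re.fullmatch(r"(.+)-countries\.txt", path.name)
        if match:
            names.append(match.group(1))
    return names


def expand_targets(pattern, categories):
    """Produce one target path per category value."""
    return [pattern.format(category=category) for category in categories]


def demo():
    """Create two input files and list the targets derived from them."""
    with tempfile.TemporaryDirectory() as workdir:
        countries = Path(workdir) / "countries"
        countries.mkdir()
        for category in ("european", "other"):
            (countries / f"{category}-countries.txt").write_text("placeholder\n")
        categories = find_categories(countries)
        return expand_targets("stats/number-of-{category}-countries.txt", categories)


print(demo())
```

Adding a third file to the input directory would add a third target without any change to the rules, which is the many-to-many pattern this slide describes.
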
Let’s build a workflow!
• Dependencies between the rules are determined automatically, creating a DAG (directed acyclic graph) of jobs that can be automatically parallelized.
• Snakemake only re-runs jobs if one of the input files is newer than one of the output files or one of the input files will be updated by another job.
  • https://github.com/snakemake/snakemake/issues/1978
• Snakemake works backwards from requested output, and not from available input.
• Targets
  • rule names can be targets
  • output files can be targets
  • if no target is given at the command line, Snakemake will define the first rule of the Snakefile as the target. Hence, it is best practice to have a rule all at the top of the workflow which has all typically desired target files as input files.
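Working backwards from the requested output can be sketched as a small recursive search: start at the target, find the rule that produces it, then do the same for each of that rule's inputs until only files already on disk remain. The code below is an illustrative model under simplifying assumptions (no wildcards, no timestamps, a first rule `all` that only lists inputs); `RULES`, `SOURCE_FILES`, `plan` and `plan_default` are hypothetical.

```python
# Hypothetical in-memory version of a Snakefile; "all" is listed first.
RULES = [
    {"name": "all", "input": ["pre-processing.done"], "output": []},
    {
        "name": "pre_processing",
        "input": ["stats/number-of-countries.txt"],
        "output": ["pre-processing.done"],
    },
    {
        "name": "count_countries",
        "input": ["countries/european-countries.txt", "countries/other-countries.txt"],
        "output": ["stats/number-of-countries.txt"],
    },
]
# Files assumed to exist already, so no rule is needed to create them.
SOURCE_FILES = {"countries/european-countries.txt", "countries/other-countries.txt"}


def plan(target, rules, available):
    """Return the rule names needed to build target, dependencies first."""
    if target in available:
        return []
    for rule in rules:
        if target in rule["output"]:
            jobs = []
            for path in rule["input"]:
                for job in plan(path, rules, available):
                    if job not in jobs:
                        jobs.append(job)
            return jobs + [rule["name"]]
    raise LookupError(f"no rule produces {target!r}")


def plan_default(rules, available):
    """With no explicit target, build the inputs of the first rule."""
    jobs = []
    for target in rules[0]["input"]:
        for job in plan(target, rules, available):
            if job not in jobs:
                jobs.append(job)
    return jobs


print(plan_default(RULES, SOURCE_FILES))
```

Note that the search never looks at which input files happen to be lying around; a rule is only scheduled when something downstream asks for its output.
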