How to write bioinformatics software
that people will use & cite
A/Prof Torsten Seemann
@torstenseemann
Bioinfosummer 2016 - Adelaide AU, Fri 2 Dec
Who am I ?
Doherty Applied
Microbial Genomics
Microbial genomics and bioinformatics
Public health and clinical microbiology
Before bioinformatics
● Undergraduate
○ Science / Engineering - Computer Science + Electrical En
● Honours
○ Computer Science - Digital image compression
● PhD
○ Computer Science - Digital image processing
● Never studied any biology
An opportunity
First “fully Aussie” bacterial genome
● Leptospira hardjobovis str. L550
● 2 chromosomes
● 4 Mbp
● $1M project
● Sanger sequencing
● Led by Dieter Bulach
First Illumina instrument in Australia
● Dept Microbiology
Monash University, 2008
● 36 bp single end reads
● 2 weeks to run
● 2 lanes for 1.6 Mbp genome
Things have improved a bit since then
[Figure labels: Q32, 36 bp, Q20]
Why am I here?
Bioinformatics software and me
Installed >1000 packages manually
Authored >100 packages into Brew
Written and maintain >10 packages
How to get a bioinformatics headache
1. See tweet about new published tool
2. Read abstract - sounds awesome!
3. Fail to find link to source code - eventually Google it
4. Attempt to compile and install it
5. Google for 30 min for fixes
6. Finally get it built
7. Run it on tiny data set
8. Get a vague error
9. Delete and never revisit it again
Should I stay for this talk ?
YES
It will help you write good tools
YES
It will help you identify bad tools
Should you write a tool?
Should you write a new tool?
● NO
○ It already exists
○ You are unable to maintain it
○ You won’t really use it
● YES
○ YOU need the tool
○ YOU will use the tool
○ YOU want others to use the tool
○ Desire to give back to the community
Eating my own dog food
Lessons from the Prokka experience
● Nearly all feedback is positive
● People all over the world are grateful
● Warm fuzzy feeling inside
● Increase your public profile
● But maintenance burden and guilt
Discoverability
Choosing a home base
University page
Personal home page
Naming
● Try to be unique
○ Google to check for conflicts
○ Consider how internationals will pronounce it
○ Be creative!
● Avoid dodgy acronyms
○ Try not to win a JABBA Award
○ “Just Another Bogus Bioinformatics Acronym”
Don’t be this person
First impressions count
● Keep It Simple Stupid
● First page of documentation
○ What does it do?
○ How do I install it?
○ How do I run it?
● Try to keep it in one place
○ Otherwise it becomes inconsistent or gets missed
Usability
A lesson from history
Print something useful if no parameters
% biotool
Please use --help for instructions
Always have a --help flag
% biotool -h
% biotool --help
Usage: biotool [options] seq.fa
--help Show this help
--version Print version and exit
--top N Keep top N sequences
Always have a --version flag
% biotool -v
% biotool -V
% biotool --version
biotool 1.3
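A minimal sketch of the three slides above (a hint when run with no parameters, a --help flag, a --version flag), using Python's argparse. The tool name and version come from the slides; the description text, the default of 10 for --top, and the helper names build_parser and parse_args are assumptions made for illustration.

```python
import argparse

VERSION = "1.3"  # version string from the slide example


def build_parser():
    """Build a parser for the hypothetical 'biotool' command."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",  # assumed
    )
    # argparse adds -h/--help automatically.
    parser.add_argument("-V", "--version", action="version",
                        version=f"%(prog)s {VERSION}")
    # The default of 10 is an assumption, not from the slides.
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="keep top N sequences (default: %(default)s)")
    parser.add_argument("fasta", nargs="?", metavar="seq.fa",
                        help="input FASTA file")
    return parser


def parse_args(argv):
    """Parse argv; print a hint and return None if no arguments were given."""
    if not argv:
        print("Please use --help for instructions")
        return None
    return build_parser().parse_args(argv)


args = parse_args(["--top", "5", "seq.fa"])
print(args.top, args.fasta)
```

In a real tool, parse_args would be called with sys.argv[1:] rather than a hard-coded list.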
Always raise an error when things go wrong
% biotool seq.fa
ERROR: cannot open file ‘seq.fa’
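One way to raise a clear error, sketched in Python. The helper name open_input is hypothetical; the idea is simply to name the file, say what went wrong, write the message to stderr, and exit with a non-zero code.

```python
import sys


def open_input(path):
    """Open a file for reading, or report a clear error and exit non-zero."""
    try:
        return open(path)
    except OSError as exc:
        # Errors go to stderr so they do not pollute piped output.
        print(f"ERROR: cannot open file '{path}': {exc.strerror}",
              file=sys.stderr)
        sys.exit(1)
```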
Check that dependencies are installed
% biotool seq.fa
Checking BLAST... ok
Checking SAMtools... NOT FOUND!
Please install ‘samtools’ and add
it to your PATH.
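A dependency check along these lines can be sketched with shutil.which from the Python standard library, which looks a command up on PATH. The function names and the exact message wording are assumptions; the tool names passed in would be whatever your program actually calls.

```python
import shutil
import sys


def missing_tools(tools):
    """Return the names in `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def check_dependencies(tools):
    """Print a status line per tool; exit with an error if any is missing."""
    missing = missing_tools(tools)
    for tool in tools:
        status = "NOT FOUND!" if tool in missing else "ok"
        print(f"Checking {tool}... {status}", file=sys.stderr)
    if missing:
        print(f"Please install {', '.join(missing)} and add to your PATH.",
              file=sys.stderr)
        sys.exit(1)
```

Calling check_dependencies(["blastn", "samtools"]) at start-up fails early, before any data has been processed.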
Always let users control output filenames
% biotool seq.fa
Processing ‘seq.fa’
Wrote result to ‘filt.seq.fa.out’
# ARGH!
% biotool --out seq.filt.fa seq.fa
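A small sketch of a user-controlled output name in Python. The --out flag is from the slide; the -o short form, the "-" convention for stdout, and the helper open_output are assumptions added for illustration.

```python
import argparse
import sys


def open_output(path):
    """Return a writable handle: stdout for None or '-', else the named file."""
    if path is None or path == "-":
        return sys.stdout
    return open(path, "w")


parser = argparse.ArgumentParser(prog="biotool")
parser.add_argument("-o", "--out", metavar="FILE", default=None,
                    help="write results to FILE (default: stdout)")
parser.add_argument("fasta", metavar="seq.fa")

args = parser.parse_args(["--out", "seq.filt.fa", "seq.fa"])
print(args.out)
```

Defaulting to stdout means the tool never invents a filename on the user's behalf.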
KISS - run with minimum parameters
% biotool seq.fa
ERROR: missing -x parameter
% biotool -x 3 seq.fa
ERROR: missing -y parameter
% biotool -x 3 -y 7 seq.fa
ERROR: need -n name
# ARGH!
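The usual way to avoid this is to give every option a sensible default, so that only the input file is required. A sketch with argparse follows; -x, -y and -n are the hypothetical options from the slide, and the default values and the run_name helper are made up for illustration.

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser(prog="biotool")
# The default values 3 and 7 are invented for this example.
parser.add_argument("-x", type=int, default=3, help="(default: %(default)s)")
parser.add_argument("-y", type=int, default=7, help="(default: %(default)s)")
parser.add_argument("-n", metavar="NAME", default=None,
                    help="run name (default: derived from the input file name)")
parser.add_argument("fasta", metavar="seq.fa")


def run_name(args):
    """Use -n if given, otherwise fall back to the input file's base name."""
    return args.n if args.n is not None else Path(args.fasta).stem


args = parser.parse_args(["seq.fa"])  # only the input file is required
print(args.x, args.y, run_name(args))
```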
Standards
Use the standard getopt interface
Short options ( -h ) and long options ( --help )
● C #include <getopt.h>
● C++ boost::program_options
● Python import argparse
● Perl use Getopt::Long
● R library(argparse)
Command line interface
Unix exit codes
● A non-negative integer (0 to 255)
● Loose standards
○ 0 = success
○ 1 = general failure
○ 2 = error with command line
○ 3..127 = user defined specific failures
● Result is in the shell $? variable
Accessing exit codes in the shell
% ls /tmp/fake
ls: cannot access '/tmp/fake': No such file or directory
% echo $?
2
% ls /proc/cpuinfo
/proc/cpuinfo
% echo $?
0
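On the tool side, the same convention might look like the sketch below: the function returns an exit code following the loose standard above (0, 1, 2), and a caller reads it back the way the shell's $? does. The constant names and the run function are hypothetical.

```python
import subprocess
import sys

EXIT_OK = 0       # success
EXIT_FAILURE = 1  # general failure
EXIT_USAGE = 2    # error with the command line


def run(argv):
    """Return an exit code instead of calling sys.exit deep inside the code."""
    if len(argv) != 1:
        print("Usage: biotool seq.fa", file=sys.stderr)
        return EXIT_USAGE
    try:
        with open(argv[0]) as handle:
            handle.read()
    except OSError:
        print(f"ERROR: cannot open file '{argv[0]}'", file=sys.stderr)
        return EXIT_FAILURE
    return EXIT_OK


# A parent process sees the child's exit code, just like $? in the shell.
child = subprocess.run([sys.executable, "-c", "import sys; sys.exit(3)"])
print(child.returncode)
```

A script's entry point would then end with sys.exit(run(sys.argv[1:])).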
Using stdin, stderr and stdout
● Stdin (0) command < input
● Stdout (1) command > output
● Stderr (2) command 2> errors
● All command < input > output 2> errors
● Allows piping!
sort input | command1 1> output 2> errors
This makes your tool useful in streaming
% zcat seq.fastq.gz |
cutadapt -a adapters.fa |
qualtrim -Q 20 |
bwa mem -t 8 ref.fa |
samtools sort --threads 4 > seq.bam
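A sketch of a tool written to sit in such a pipeline: data is read from one stream and written to another, and messages go to a third. The function filter_fasta and its min_len parameter are hypothetical, and the FASTA parsing is deliberately simplistic.

```python
import io
import sys


def filter_fasta(src, dst, min_len, log=sys.stderr):
    """Copy FASTA records of at least min_len bases from src to dst."""
    kept = total = 0
    header, seq = None, []

    def flush():
        # Emit the record collected so far, if it is long enough.
        nonlocal kept, total
        if header is None:
            return
        total += 1
        sequence = "".join(seq)
        if len(sequence) >= min_len:
            kept += 1
            dst.write(f"{header}\n{sequence}\n")

    for line in src:
        line = line.rstrip("\n")
        if line.startswith(">"):
            flush()
            header, seq = line, []
        elif line:
            seq.append(line)
    flush()
    # Progress messages go to the log stream, never to the data stream.
    print(f"Kept {kept} of {total} sequences", file=log)
    return kept


out = io.StringIO()
filter_fasta(io.StringIO(">a\nACGT\n>b\nAC\n"), out, min_len=3)
print(out.getvalue(), end="")
```

In a real tool the call would be filter_fasta(sys.stdin, sys.stdout, ...), which is what lets it be chained with | in the shell.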
Use standards compliant files *
● Feature coordinates
○ BED, GFF
● Columnar data (put headings!)
○ TSV
○ CSV
● Structured data
○ JSON
○ YAML
* XML excepted
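As an illustration, the Python standard library can write both columnar and structured output without any hand-rolled formatting. The rows below are invented example data, and the column names are an assumption, not a defined standard.

```python
import csv
import io
import json

# Made-up feature records, purely for illustration.
rows = [
    {"seqid": "contig1", "start": 100, "end": 250, "strand": "+"},
    {"seqid": "contig2", "start": 5, "end": 90, "strand": "-"},
]

# Columnar data: TSV with a heading line.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["seqid", "start", "end", "strand"],
                        delimiter="\t", lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
tsv_text = buf.getvalue()

# Structured data: JSON.
json_text = json.dumps(rows, indent=2)

print(tsv_text, end="")
print(json_text)
```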
Installation
Keeping your audience
“Each equation in a book
will halve your audience”
“Each difficulty encountered in installation
will halve your number of users”
Traditional systems level packaging
● Debian / DEB
apt-get install blast
dpkg -i blast-2.2.5-amd64.deb
● Redhat / RPM
yum install blast
rpm -i blast-2.2.5-x86_64.rpm
● Various others
Cross platform solutions: Linux, Mac, Windows
● Brew
brew install blast
● Conda
conda install blast
● Others
○ GUIX, ...
○ Docker, AMI images
Language specific repositories
● Python - PIP
pip install ariba
● Perl - CPAN
cpanm Bio::Roary
● R - CRAN
install.packages("edgeseq3")
Marketing
Publish it
● Preprint archive
○ PeerJ, bioRxiv
● Method focussed journal
○ Bioinformatics, BMC Bioinformatics
● Software focussed journal
○ Journal of Open Source Software
Plug it
● Twitter
○ Ask someone popular you know to retweet it
● Blog
○ Start a general blog and slot
● Conferences
○ Tell people about it
Support your users
● Reply to emails
● Monitor your “Issues” web site
● Monitor Biostars and SeqAnswers
● Have a mailing list
● Update your documentation
● Fix bugs
Conclusions
Take home messages
● Make it as painless as possible to install
● Keep documentation clear and simple
● Get people to use it before you publish
● People are not judging your coding skills
● But they will curse you if you waste their time
● Most users are grateful - leads to free beer
● A good tool is worth much more than a paper
Acknowledgments
● Gary Glonek
● David Adelson
● Bernard Pope - VLSCI
● Dieter Bulach - VLSCI
● Anna Syme - VLSCI
● David Powell - Monash University
● Anders Goncalves da Silva - University of Melbourne
References
1. https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-15
2. http://berniepope.id.au/scientific_software_etiquette.html
3. http://thegenomefactory.blogspot.com.au/
The end.
