Six ways to Sunday:
approaches to computational
reproducibility in non-model
system sequence analysis.
C. Titus Brown
ctb@msu.edu
May 21, 2014
Hello!
Assistant Professor; Microbiology; Computer Science;
etc.
More information at:
• ged.msu.edu/
• github.com/ged-lab/
• ivory.idyll.org/blog/
• @ctitusbrown
The challenges of non-
model sequencing
• Missing or low quality genome reference.
• Evolutionarily distant.
• Most extant computational tools focus on model
organisms –
o Assume low polymorphism (internal variation)
o Assume reference genome
o Assume somewhat reliable functional annotation
o Assume more significant compute infrastructure
…and cannot easily or directly be used on critters of interest.
Shotgun sequencing & assembly
http://eofdreams.com/library.html;
http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/;
http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
Shotgun sequencing
analysis goals:
• Assembly (what is the text?)
o Produces new genomes & transcriptomes.
o Gene discovery for enzymes, drug targets, etc.
• Counting (how many copies of each book?)
o Measure gene expression levels, protein-DNA
interactions
• Variant calling (how does each edition vary?)
o Discover genetic variation: genotyping, linkage
studies…
o Allele-specific expression analysis.
Assembly
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
Lamprey mRNAseq:
Shared low-level fragments may not reach the threshold for assembly.
Introducing k-mers
CCGATTGCACTGGACCGA (<- read)
CCGATTGCAC
CGATTGCACT
GATTGCACTG
ATTGCACTGG
TTGCACTGGA
TGCACTGGAC
GCACTGGACC
CACTGGACCG
ACTGGACCGA
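A minimal sketch of the k-mer breakdown above, assuming k = 10 as on the slide. The function name `kmers` is just for illustration. A read of length L yields L - k + 1 k-mers, so this 18-base read yields 9.

```python
def kmers(read, k):
    """Return every length-k substring of read, in left-to-right order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"

# Print each k-mer shifted under its position in the read.
print(read)
for offset, kmer in enumerate(kmers(read, 10)):
    print(" " * offset + kmer)
```

Reads shorter than k simply produce an empty list.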
K-mers give you an
implicit alignment
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTGGACCGATGCACGGTACCG
CATGGACCGATTGCACTGGACCGATGCACGGACCG
(with no accounting for mismatches or indels)
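One way to see the "implicit alignment" is to let shared k-mers vote on how far one read is shifted relative to the other. This is only a sketch: the function names are made up for illustration, k = 10 is an assumption, and (as noted above) exact k-mer matching ignores mismatches and indels.

```python
from collections import Counter


def kmer_positions(read, k):
    """Map each k-mer to the first position where it occurs in read."""
    positions = {}
    for i in range(len(read) - k + 1):
        positions.setdefault(read[i:i + k], i)
    return positions


def implied_offset(read_a, read_b, k):
    """Return (offset, votes) for the most common shift of read_b
    relative to read_a, or None if the reads share no k-mers."""
    index = kmer_positions(read_a, k)
    votes = Counter()
    for j in range(len(read_b) - k + 1):
        i = index.get(read_b[j:j + k])
        if i is not None:
            votes[j - i] += 1
    if not votes:
        return None
    return votes.most_common(1)[0]


read_a = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"
read_b = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
print(implied_offset(read_a, read_b, 10))
```

For the first two reads on the slide, the shared k-mers all agree that the second read starts 6 bases before the first.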
De Bruijn graphs –
assemble on overlaps
J.R. Miller et al. / Genomics (2010)
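A toy version of the idea, assuming the common textbook construction: each k-mer becomes an edge from its first k-1 bases to its last k-1 bases. The function name is hypothetical, and real assemblers use far more compact representations.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a De Bruijn graph: nodes are (k-1)-mers, and every k-mer
    in any read adds an edge from its prefix to its suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


# Two overlapping reads collapse onto the same path in the graph.
graph = de_bruijn_edges(["ACGT", "CGTC"], 3)
for node in sorted(graph):
    print(node, "->", sorted(graph[node]))
```

Because overlapping reads contribute the same k-mers, the overlap never has to be computed explicitly; it falls out of the shared nodes.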
The problem with k-mers
CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC
CATGGACCGATTGCACTCGACCGATGCACGGTACCG
Each sequencing error results in up to k novel k-mers!
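A quick check of that claim, as a sketch: every k-mer overlapping the bad base is new, so an error in the middle of a read adds k of them, and an error near a read end adds fewer. The helper names are hypothetical and k = 10 is an assumption. The "true" read is the error-free version of the second read above.

```python
def kmer_set(read, k):
    """Return the set of distinct k-mers in read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def novel_kmers(true_read, observed_read, k):
    """k-mers in observed_read that never occur in true_read."""
    return kmer_set(observed_read, k) - kmer_set(true_read, k)


true_read = "CATGGACCGATTGCACTGGACCGATGCACGGTACCG"
error_read = "CATGGACCGATTGCACTCGACCGATGCACGGTACCG"  # one G -> C substitution

# The error sits far from both ends, so k = 10 novel k-mers appear.
print(len(novel_kmers(true_read, error_read, 10)))

# An error in the very last base only touches the final k-mer.
end_error = true_read[:-1] + "T"
print(len(novel_kmers(true_read, end_error, 10)))
```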
Conway T C , Bromage A J Bioinformatics 2011;27:479-486
Assembly graphs scale with data size, not
information.
Practical memory
measurements (soil)
Velvet measurements (Adina Howe)
Data set size and cost
• $1000 gets you ~100 million “reads”, or about 10-40 GB of
data, in about a week.
• > 1000 labs doing this regularly.
• Each data set analysis is ~custom.
• Analyses are data intensive and memory intensive.
Efficient data structures &
algorithms
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Shotgun sequencing is massively redundant; can we
eliminate redundancy while retaining information?
Analog: JPEG lossy compression
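A sketch of read abundance normalization, following the rule in the diginorm preprint: keep a read only if the median count of its k-mers, among reads kept so far, is still below a cutoff. Assumptions are loud here: an exact in-memory Counter stands in for khmer's memory-efficient counting, and the function name and parameter values are for illustration only.

```python
from collections import Counter
from statistics import median


def normalize_by_median(reads, k, cutoff):
    """Stream over reads once, discarding reads whose k-mers have
    already been seen often enough (median count >= cutoff)."""
    counts = Counter()
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read is shorter than k; nothing to count
        if median(counts[kmer] for kmer in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to counts
            kept.append(read)
    return kept


# Ten copies of one read and three of another: the redundant copies go,
# the low-coverage read survives intact.
reads = ["ACGTACGGTTCA"] * 10 + ["TTGACCATGGCA"] * 3
kept = normalize_by_median(reads, k=4, cutoff=3)
print(len(reads), "reads in,", len(kept), "reads kept")
```

Like JPEG, this is lossy: the discarded reads are gone, but the sequence they covered is still represented.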
[Figure labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration]
Sparse collections of k-mers can be
stored efficiently in Bloom filters
Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
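For readers who have not met Bloom filters, a generic textbook sketch follows. It is not khmer's implementation: the class, the SHA-256-based hashing, and the sizes are all assumptions made for illustration. The filter can report a k-mer that was never added (a false positive), but it never misses one that was.

```python
import hashlib


class BloomFilter:
    """Fixed-size bit array; each item sets num_hashes bit positions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


read = "CCGATTGCACTGGACCGA"
seen = BloomFilter(num_bits=10_000, num_hashes=3)
for i in range(len(read) - 10 + 1):
    seen.add(read[i:i + 10])

print("CCGATTGCAC" in seen)  # True: added k-mers are always found
```

Memory use is fixed by `num_bits` no matter how many k-mers are added; only the false positive rate grows.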
Data structures &
algorithms papers
• “These are not the k-mers you are looking for…”,
Zhang et al., arXiv 1309.2975, in review.
• “Scaling metagenome sequence assembly with
probabilistic de Bruijn graphs”, Pell et al., PNAS 2012.
• “A Reference-Free Algorithm for Computational
Normalization of Shotgun Sequencing Data”, Brown
et al., arXiv 1203.4802, under revision.
Data analysis papers
• “Tackling soil diversity with the assembly of large,
complex metagenomes”, Howe et al., PNAS, 2014.
• Assembling novel ascidian genomes &
transcriptomes, Lowe et al., in prep.
• A de novo lamprey transcriptome from large scale
multi-tissue mRNAseq, Scott et al., in prep.
Lab approach – not
intentional, but working out.
Novel data
structures and
algorithms
Implement at
scale
Apply to real
biological
problems
This leads to good things.
Efficient online
counting of k-mers
Trimming reads
on abundance
Efficient De
Bruijn graph
representations
Read
abundance
normalization
Streaming
algorithms for
assembly,
variant calling,
and error
correction
Cloud assembly
protocols
Efficient graph
labeling &
exploration
Data set
partitioning
approaches
Assembly-free
comparison of
data sets
HMM-guided
assembly
Efficient search
for target genes
Current research
(khmer software)
Testing & version control
– the not so secret sauce
• High test coverage - grown over time.
• Stupidity driven testing – we write tests for bugs after
we find them and before we fix them.
• Pull requests & continuous integration – does your
proposed merge break tests?
• Pull requests & code review – does new code meet
our minimal coding etc requirements?
o Note: spellchecking!!!
On the “novel research” side:
• Novel data structures and algorithms;
• Permit low(er) memory data analysis;
• Liberate analyses from specialized hardware.
Running entirely w/in cloud
Complete data; AWS m1.xlarge
~40 hours
(See PyCon 2014 talk; video and blog post.)
MEMORY
On the “novel research” side:
• Novel data structures and algorithms;
• Permit low(er) memory data analysis;
• Liberate analyses from specialized hardware.
This last bit? => reproducibility.
Reproducibility!
Scientific progress relies on reproducibility of
analysis. (Aristotle, Nature, 322 BCE.)
“There is no such thing as ‘reproducible science’.
There is only ‘science’, and ‘not science.’” –
someone on Twitter (Fernando Perez?)
Disclaimer
Not a researcher of reproducibility!
Merely a practitioner.
Please take my points below as an argument
and not as research conclusions.
(But I’m right.)
My usual intro:
We practice open science!
Everything discussed here:
• Code: github.com/ged-lab/ ; BSD license
• Blog: http://ivory.idyll.org/blog (‘titus brown blog’)
• Twitter: @ctitusbrown
• Grants on Lab Web site:
http://ged.msu.edu/research.html
• Preprints available.
Everything is > 80% reproducible.
My lab & the diginorm paper.
• All our code was already on github;
• Much of our data analysis was already in the cloud;
• Our figures were already made in IPython Notebook
• Our paper was already in LaTeX
IPython Notebook: data +
code => IPython Notebook
…why not push a bit more and make it easily
reproducible?
This involved writing a tutorial. And that’s it.
To reproduce our paper:
git clone <khmer> && python setup.py install
git clone <pipeline>
cd pipeline
wget <data> && tar xzf <data>
make && cd ../notebook && make
cd ../ && make
Now standard in lab --
All our papers now have:
• Source hosted on github;
• Data hosted there or on AWS;
• Long running data analysis =>
‘make’
• Graphing and data digestion
=> IPython Notebook (also in
github)
Qingpeng Zhang
Research process
Generate new
results; encode
in Makefile
Summarize in
IPython
Notebook
Push to github
Discuss, explore
Literate graphing &
interactive exploration
The process
• We start with pipeline reproducibility
• Baked into lab culture; default “use git; write scripts”
Community of practice!
• Use standard open source approaches, so OSS
developers learn it easily.
• Enables easy collaboration w/in lab
• Valuable learning tool!
Growing & refining the
process
• Now moving to Ubuntu Long-Term Support + install
instructions.
• Everything is as automated as is convenient.
• Students expected to communicate with me in IPython
Notebooks.
• Trying to avoid building (or even using) new tools.
• Avoid maintenance burden as much as possible.
1. Use standard OS; provide
install instructions
• Providing install and execution instructions for Ubuntu
Long-Term Support release 14.04: supported through 2017 and
beyond.
• Avoid pre-configured virtual machines!
o Locks you into specific cloud homes.
o Challenges remixability and extensibility.
2. Automate
• Literate graphing now easy with knitr and IPython
Notebook.
• Build automation with make, or whatever. To first
order, it does not matter what tools you use.
• Explicit is better than implicit. Make it easy to
understand what you’re doing and how to extend it.
Myths of reproducible
research
(Opinions from personal experience.)
Myth 1: Partial
reproducibility is hard.
“Here’s my script.” => Methods
More generally,
• Many scientists cannot replicate any part of their
analysis without a lot of manual work.
• Automating this is a win for reasons that have
nothing to do with reproducibility… efficiency!
See: Software Carpentry.
Myth 2: Incomplete
reproducibility is useless
Paraphrase: “We can’t possibly reproduce the
experimental data exactly, so we shouldn’t bother
with anything else, either.”
(Analogous arg re software testing & code coverage.)
• …I really have a hard time arguing the paraphrase
honestly…
• Being able to reanalyze your raw data? Interesting.
• Knowing how you made your figures? Really useful.
Myth 3: We need new
platforms
• Techies always want to build something (which is fun!)
but don’t want to do science (which is hard!)
• We probably do need new platforms, but stop thinking
that building them does a service.
• Platforms need to be use driven. Seriously.
• If you write good software for scientific inquiry and make
it easy to use reproducibly, that will drive virtuous practice.
Myth 4. Virtual Machine
reproducibility is an end solution.
• Good start! Better than nothing!
But:
• Limits understanding & reuse.
• Limits remixing: often cannot install other software!
• “Chinese Room” argument: could be just a lookup
table.
Myth 5: We can use GUIs
for reproducible research
(OK, this is partly just to make people think ;)
• Almost all data analysis takes place within a larger
pipeline; the GUI must consume the entire pipeline in
order to be reproducible.
• IFF GUI wraps command line, that’s a decent
compromise (e.g. Galaxy) but handicaps
researchers using novel approaches.
• By the time it’s in a GUI, it’s no longer research.
Our current efforts?
• Semantic versioning of our own code: stable
command-line interface.
• Writing easy-to-teach tutorials and protocols for
common analysis pipelines.
• Automate ‘em for testing purposes.
• Encourage their use, inclusion, and adaptation by
others.
khmer-protocols
khmer-protocols:
• Provide standard “cheap”
assembly protocols for the cloud.
• Entirely copy/paste; ~2-6 days
from raw reads to assembly,
annotations, and differential
expression analysis. ~$150 per
data set (on Amazon rental
computers)
• Open, versioned, forkable,
citable….
Read cleaning
Diginorm
Assembly
Annotation
RSEM differential
expression
Literate testing
• Our shell-command tutorials for bioinformatics can
now be executed in an automated fashion –
commands are extracted automatically into shell
scripts.
• See: github.com/ged-lab/literate-resting/.
• Tremendously improves peace of mind and
confidence moving forward!
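Roughly what "extracted automatically" can look like, as a sketch only. This is not the literate-resting code. It assumes the tutorials are reStructuredText and that commands live in indented literal blocks introduced by a line ending in "::". The function names and the sample tutorial are made up.

```python
def extract_commands(rst_text):
    """Collect the indented lines of every '::' literal block."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block:
            if not stripped:
                continue  # blank lines do not end a literal block
            if line[0] in " \t":
                commands.append(stripped)
                continue
            in_block = False  # dedented text closes the block
        if stripped.endswith("::"):
            in_block = True
    return commands


def to_script(commands):
    """Join commands into a shell script that stops at the first failure."""
    return "set -e\n" + "\n".join(commands) + "\n"


tutorial = """Make a working directory::

   mkdir work
   cd work

Then list it::

   ls

Done.
"""
print(to_script(extract_commands(tutorial)))
```

Running the generated script on a fresh machine then doubles as a test that the tutorial still works.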
Leigh Sheneman
Doing things right
=> #awesomesauce
Protocols in English
for running analyses in
the cloud
Literate reSTing =>
shell scripts
Tool
competitions
Benchmarking
Education
Acceptance
tests
Concluding thoughts
• We are not doing anything particularly neat on the
computational side... No “magic sauce.”
• Much of our effort is now driven by sheer utility:
o Automation reduces our maintenance burden.
o Extensibility makes revisions much easier!
o Explicit instructions are good for training.
• Some effort needed at the beginning, but once
practices are established, “virtuous cycle” takes
over.
What bits should people
adopt?
• Version control!
• Literate graphing!
• Automated “build” from data => results!
• Make data available as early in your pipeline as
possible.
More concluding
thoughts
• Nobody would care that we were doing things
reproducibly if our science wasn’t decent.
• Make sure students realize that faffing about on
infrastructure isn’t science.
• Research is about doing science. Reproducibility
(like other good practices) is much easier to
proselytize if you can link it to progress in science.
Biology & sequence analysis is in a
perfect place for reproducibility
We are lucky! A good opportunity!
• Big Data: laptops are too small;
• Excel doesn’t scale any more;
• Few tools in use; most of them are $$ or UNIX;
• Little in the way of entrenched research practice;
Thanks!
Talk is on slideshare: slideshare.net/c.titus.brown
E-mail or tweet me:
ctb@msu.edu
@ctitusbrown

Contenu connexe

Tendances

Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data AnalyticsEdureka!
 
Webinar: Deep Learning with H2O
Webinar: Deep Learning with H2OWebinar: Deep Learning with H2O
Webinar: Deep Learning with H2OSri Ambati
 
PE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File FeaturesPE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File FeaturesAntiy Labs
 
GDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapGDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapJiang Jun
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2Oodsc
 
Getting Started with Numenta Technology
Getting Started with Numenta Technology Getting Started with Numenta Technology
Getting Started with Numenta Technology Numenta
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineeringnathanmarz
 
TensorFlow in Context
TensorFlow in ContextTensorFlow in Context
TensorFlow in ContextAltoros
 
Introducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applicationsIntroducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applicationsRokesh Jankie
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceDavid De Roure
 
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기 Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기 Mario Cho
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTKAshish Jaiman
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in RAnqi Fu
 

Tendances (15)

Python for Big Data Analytics
Python for Big Data AnalyticsPython for Big Data Analytics
Python for Big Data Analytics
 
Webinar: Deep Learning with H2O
Webinar: Deep Learning with H2OWebinar: Deep Learning with H2O
Webinar: Deep Learning with H2O
 
MAVRL Workshop 2014 - pymatgen-db & custodian
MAVRL Workshop 2014 - pymatgen-db & custodianMAVRL Workshop 2014 - pymatgen-db & custodian
MAVRL Workshop 2014 - pymatgen-db & custodian
 
PE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File FeaturesPE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File Features
 
GDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit RecapGDG-Shanghai 2017 TensorFlow Summit Recap
GDG-Shanghai 2017 TensorFlow Summit Recap
 
Scalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2OScalable Data Science and Deep Learning with H2O
Scalable Data Science and Deep Learning with H2O
 
Getting Started with Numenta Technology
Getting Started with Numenta Technology Getting Started with Numenta Technology
Getting Started with Numenta Technology
 
The Epistemology of Software Engineering
The Epistemology of Software EngineeringThe Epistemology of Software Engineering
The Epistemology of Software Engineering
 
TensorFlow in Context
TensorFlow in ContextTensorFlow in Context
TensorFlow in Context
 
Introducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applicationsIntroducing TensorFlow: The game changer in building "intelligent" applications
Introducing TensorFlow: The game changer in building "intelligent" applications
 
Knowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems ScienceKnowledge Infrastructure for Global Systems Science
Knowledge Infrastructure for Global Systems Science
 
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기 Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
Koss Lab 세미나 오픈소스 인공지능(AI) 프레임웍파헤치기
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
Deep Learning with CNTK
Deep Learning with CNTKDeep Learning with CNTK
Deep Learning with CNTK
 
Big Data Science with H2O in R
Big Data Science with H2O in RBig Data Science with H2O in R
Big Data Science with H2O in R
 

En vedette

Amadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei DissidentiAmadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei DissidentiMaurizio Repetto
 
Ekonomik i̇stikrar
Ekonomik i̇stikrarEkonomik i̇stikrar
Ekonomik i̇stikrarsadettin
 
06 Outsource To India Open Source Development
06 Outsource To India Open Source Development06 Outsource To India Open Source Development
06 Outsource To India Open Source DevelopmentoutsourceToIndia
 
Digital Footprints: Using the Internet to enhance your career prospects
Digital Footprints: Using the Internet to enhance your career prospectsDigital Footprints: Using the Internet to enhance your career prospects
Digital Footprints: Using the Internet to enhance your career prospectsJudith Baines
 
إنسان النهضة
إنسان النهضةإنسان النهضة
إنسان النهضةAhmad Darwish
 
Etwinning edinburgh april 2016
Etwinning edinburgh april 2016Etwinning edinburgh april 2016
Etwinning edinburgh april 2016sarahstead
 
2013 beacon-congress-social-media
2013 beacon-congress-social-media2013 beacon-congress-social-media
2013 beacon-congress-social-mediac.titus.brown
 
Approximate Thin Plate Spline Mappings
Approximate Thin Plate Spline MappingsApproximate Thin Plate Spline Mappings
Approximate Thin Plate Spline MappingsArchzilon Eshun-Davies
 
Experience development c120617-1-img
Experience development c120617-1-imgExperience development c120617-1-img
Experience development c120617-1-imgSham Yemul
 
KILLED DO NOT VIEW
KILLED DO NOT VIEWKILLED DO NOT VIEW
KILLED DO NOT VIEWavlainich
 
Feud In Federation
Feud In FederationFeud In Federation
Feud In Federationpyacoub
 

En vedette (20)

RealTimePostproduction
RealTimePostproductionRealTimePostproduction
RealTimePostproduction
 
Amadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei DissidentiAmadeus Sindaco Viene Attaccato Da Dei Dissidenti
Amadeus Sindaco Viene Attaccato Da Dei Dissidenti
 
Tecno1r Eso
Tecno1r EsoTecno1r Eso
Tecno1r Eso
 
Ekonomik i̇stikrar
Ekonomik i̇stikrarEkonomik i̇stikrar
Ekonomik i̇stikrar
 
06 Outsource To India Open Source Development
06 Outsource To India Open Source Development06 Outsource To India Open Source Development
06 Outsource To India Open Source Development
 
Heartwave Appeal 9.6.09
Heartwave Appeal 9.6.09Heartwave Appeal 9.6.09
Heartwave Appeal 9.6.09
 
Digital Footprints: Using the Internet to enhance your career prospects
Digital Footprints: Using the Internet to enhance your career prospectsDigital Footprints: Using the Internet to enhance your career prospects
Digital Footprints: Using the Internet to enhance your career prospects
 
2014 pycon-talk
2014 pycon-talk2014 pycon-talk
2014 pycon-talk
 
إنسان النهضة
إنسان النهضةإنسان النهضة
إنسان النهضة
 
2013 arizona-swc
2013 arizona-swc2013 arizona-swc
2013 arizona-swc
 
Mythbusters: Employment Law Edition
Mythbusters: Employment Law EditionMythbusters: Employment Law Edition
Mythbusters: Employment Law Edition
 
Etwinning edinburgh april 2016
Etwinning edinburgh april 2016Etwinning edinburgh april 2016
Etwinning edinburgh april 2016
 
TPSI by Competitive Analytics
TPSI by Competitive AnalyticsTPSI by Competitive Analytics
TPSI by Competitive Analytics
 
Kindle vs Sony
Kindle vs SonyKindle vs Sony
Kindle vs Sony
 
2013 beacon-congress-social-media
2013 beacon-congress-social-media2013 beacon-congress-social-media
2013 beacon-congress-social-media
 
Rising seas
Rising seasRising seas
Rising seas
 
Approximate Thin Plate Spline Mappings
Approximate Thin Plate Spline MappingsApproximate Thin Plate Spline Mappings
Approximate Thin Plate Spline Mappings
 
Experience development c120617-1-img
Experience development c120617-1-imgExperience development c120617-1-img
Experience development c120617-1-img
 
KILLED DO NOT VIEW
KILLED DO NOT VIEWKILLED DO NOT VIEW
KILLED DO NOT VIEW
 
Feud In Federation
Feud In FederationFeud In Federation
Feud In Federation
 

Similaire à 2014 manchester-reproducibility

2013 ucar best practices
2013 ucar best practices2013 ucar best practices
2013 ucar best practicesc.titus.brown
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012c.titus.brown
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudJan Aerts
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringTao Xie
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithmsc.titus.brown
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...Bonnie Hurwitz
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017philippbayer
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyYannick Pouliot
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksBICA Labs
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sangerChris Dwan
 
Reproducible research concepts and tools
Reproducible research concepts and toolsReproducible research concepts and tools
Reproducible research concepts and toolsC. Tobin Magle
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introductionchristian.perez
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformaticsStephen Turner
 
MSR 2009
MSR 2009MSR 2009
MSR 2009swy351
 
It summit 150604 cb_wcl_ld_kmh_v6_to_publish
It summit 150604 cb_wcl_ld_kmh_v6_to_publishIt summit 150604 cb_wcl_ld_kmh_v6_to_publish
It summit 150604 cb_wcl_ld_kmh_v6_to_publishkevin_donovan
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdbjixuan1989
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchRachel Berryman
 

Similaire à 2014 manchester-reproducibility (20)

2013 ucar best practices
2013 ucar best practices2013 ucar best practices
2013 ucar best practices
 
2014 toronto-torbug
2014 toronto-torbug2014 toronto-torbug
2014 toronto-torbug
 
Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017HPCAC - the state of bioinformatics in 2017
HPCAC - the state of bioinformatics in 2017
 
Databases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems ImmunologyDatabases, Web Services and Tools For Systems Immunology
Databases, Web Services and Tools For Systems Immunology
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Data Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural NetworksData Science, Machine Learning and Neural Networks
Data Science, Machine Learning and Neural Networks
 
2016 05 sanger
2016 05 sanger2016 05 sanger
2016 05 sanger
 
Reproducible research concepts and tools
Reproducible research concepts and toolsReproducible research concepts and tools
Reproducible research concepts and tools
 
Cloud computing and Hadoop introduction
Cloud computing and Hadoop introductionCloud computing and Hadoop introduction
Cloud computing and Hadoop introduction
 
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
2018 ABRF Tools for improving rigor and reproducibility in bioinformatics
 
MSR 2009
MSR 2009MSR 2009
MSR 2009
 
It summit 150604 cb_wcl_ld_kmh_v6_to_publish
It summit 150604 cb_wcl_ld_kmh_v6_to_publishIt summit 150604 cb_wcl_ld_kmh_v6_to_publish
It summit 150604 cb_wcl_ld_kmh_v6_to_publish
 
From a student to an apache committer practice of apache io tdb
From a student to an apache committer  practice of apache io tdbFrom a student to an apache committer  practice of apache io tdb
From a student to an apache committer practice of apache io tdb
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 

Plus de c.titus.brown

Plus de c.titus.brown (20)

2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 


2014 manchester-reproducibility

  • 1. Six ways to Sunday: approaches to computational reproducibility in non-model system sequence analysis. C. Titus Brown ctb@msu.edu May 21, 2014
  • 2. Hello! Assistant Professor; Microbiology; Computer Science; etc. More information at: • ged.msu.edu/ • github.com/ged-lab/ • ivory.idyll.org/blog/ • @ctitusbrown
  • 3. The challenges of non-model sequencing • Missing or low quality genome reference. • Evolutionarily distant. • Most extant computational tools focus on model organisms – o Assume low polymorphism (internal variation) o Assume reference genome o Assume somewhat reliable functional annotation o More significant compute infrastructure …and cannot easily or directly be used on critters of interest.
  • 4. Shotgun sequencing & assembly http://eofdreams.com/library.html; http://www.theshreddingservices.com/2011/11/paper-shredding-services-small-business/; http://schoolworkhelper.net/charles-dickens%E2%80%99-tale-of-two-cities-summary-analysis/
  • 5. Shotgun sequencing analysis goals: • Assembly (what is the text?) o Produces new genomes & transcriptomes. o Gene discovery for enzymes, drug targets, etc. • Counting (how many copies of each book?) o Measure gene expression levels, protein-DNA interactions • Variant calling (how does each edition vary?) o Discover genetic variation: genotyping, linkage studies… o Allele-specific expression analysis.
  • 6. Assembly. Fragments: "It was the best of times, it was the wor" | ", it was the worst of times, it was the" | "isdom, it was the age of foolishness" | "mes, it was the age of wisdom, it was th". Assembled: "It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness" …but for lots and lots of fragments!
  • 7. Shared low-level fragments may not reach the threshold for assembly. Lamprey mRNAseq:
  • 8. Introducing k-mers CCGATTGCACTGGACCGA (<- read) CCGATTGCAC CGATTGCACT GATTGCACTG ATTGCACTGG TTGCACTGGA TGCACTGGAC GCACTGGACC ACTGGACCGA
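The decomposition on slide 8 can be sketched in a few lines of Python. This is only an illustration, not khmer's implementation, and the function name `kmers` is made up for the example. A read of length L contains L - k + 1 overlapping k-mers, so the 18-base read with k = 10 yields 9.

```python
def kmers(read, k):
    """Return every overlapping substring of length k, in read order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


read = "CCGATTGCACTGGACCGA"  # the read shown on slide 8
for kmer in kmers(read, 10):
    print(kmer)
```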
  • 9. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG
  • 10. K-mers give you an implicit alignment CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTGGACCGATGCACGGTACCG CATGGACCGATTGCACTGGACCGATGCACGGACCG (with no accounting for mismatches or indels)
  • 11. De Bruijn graphs – assemble on overlaps J.R. Miller et al. / Genomics (2010)
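As a toy sketch of "assemble on overlaps": in a de Bruijn graph each k-mer contributes an edge from its first k-1 characters to its last k-1 characters, so reads that share k-mers end up sharing paths. The function name and the use of text fragments instead of DNA are assumptions for illustration. Real assemblers use far more compact representations.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])  # edge: prefix -> suffix
    return graph


# Two overlapping fragments of the same text, in the spirit of slide 6.
reads = ["it was the best", "the best of times"]
graph = de_bruijn(reads, 5)
print(sorted(graph["the "]))
```

Both fragments contain the 5-mer "the b", so they contribute the same edge and the graph joins them without any explicit pairwise alignment.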
  • 12. The problem with k-mers CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC CATGGACCGATTGCACTCGACCGATGCACGGTACCG Each sequencing error results in k novel k-mers!
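The claim on slide 12 can be checked directly. The sketch below takes the first sequence from the slide, introduces a single substitution (G to C at index 11, the "ACTGG" to "ACTCG" change visible in the second sequence), and counts the k-mers that exist only because of the error. The helper name `kmer_set` is an assumption for the example.

```python
def kmer_set(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


k = 10
true_seq = "CCGATTGCACTGGACCGATGCACGGTACCGTATAGCC"  # first sequence on slide 12
# Simulate one sequencing error: substitute C for the G at index 11.
error_seq = true_seq[:11] + "C" + true_seq[12:]

novel = kmer_set(error_seq, k) - kmer_set(true_seq, k)
print(len(novel))  # 10: every k-mer overlapping the erroneous base is new
```

An error closer than k bases to either end of the read overlaps fewer k-mers, so k is the upper bound per error rather than an exact count in every case.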
  • 13. Assembly graphs scale with data size, not information. Figure: Conway T C, Bromage A J, Bioinformatics 2011;27:479-486. © The Author 2011. Published by Oxford University Press.
  • 14. Practical memory measurements (soil) Velvet measurements (Adina Howe)
  • 15. Data set size and cost • $1000 gets you ~100 million “reads”, or about 10-40 GB of data, in about a week. • > 1000 labs doing this regularly. • Each data set analysis is ~custom. • Analyses are data intensive and memory intensive.
  • 16. Efficient data structures & algorithms • Efficient online counting of k-mers • Trimming reads on abundance • Efficient De Bruijn graph representations • Read abundance normalization
  • 17. Shotgun sequencing is massively redundant; can we eliminate redundancy while retaining information? Analog: JPEG lossy compression. [Diagram labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration]
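One way to discard redundant reads while keeping the information is the read abundance normalization ("digital normalization") named on slide 16 and described in the Brown et al. preprint cited later in this talk. A read is kept only if the median abundance of its k-mers, counted over the reads kept so far, is still below a coverage cutoff. The sketch below is a simplification under stated assumptions: it uses an exact in-memory `Counter` instead of khmer's memory-efficient counting, and the function name and default values are illustrative.

```python
from collections import Counter
from statistics import median


def normalize_by_median(reads, k=20, cutoff=20):
    """Yield reads whose median k-mer abundance so far is below cutoff."""
    counts = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue  # read is shorter than k
        if median(counts[km] for km in kmers) < cutoff:
            counts.update(kmers)  # only kept reads contribute to the counts
            yield read


# 50 copies of one read plus a single rare read.
reads = ["ACGTTGCATC"] * 50 + ["TTGACCGTAA"]
kept = list(normalize_by_median(reads, k=4, cutoff=5))
print(len(kept))
```

The highly redundant read is kept only until its k-mers reach the cutoff, while the rare read passes through untouched. This is a single streaming pass, so it never needs the full data set in memory.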
  • 18. Sparse collections of k-mers can be stored efficiently in Bloom filters Pell et al., 2012, PNAS; doi: 10.1073/pnas.1121464109
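A minimal Bloom filter sketch shows why this works: membership is stored as a handful of bits per item in a fixed-size bit array, so memory does not grow with k-mer length, at the price of occasional false positives. This is a generic textbook construction using SHA-256 from the standard library. It is not the specific data structure from Pell et al., and the class name and sizes are assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size probabilistic set: no false negatives, some false positives."""

    def __init__(self, num_bits=10_000, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


bf = BloomFilter()
read = "CCGATTGCACTGGACCGA"
for i in range(len(read) - 10 + 1):
    bf.add(read[i:i + 10])
print("CCGATTGCAC" in bf)  # True: an added k-mer is always reported present
```

A k-mer that was never added is usually reported absent, but may be reported present by chance. The false positive rate depends on the number of bits, the number of hash functions, and how many items have been added.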
  • 19. Data structures & algorithms papers • “These are not the k-mers you are looking for…”, Zhang et al., arXiv 1309.2975, in review. • “Scaling metagenome sequence assembly with probabilistic de Bruijn graphs”, Pell et al., PNAS 2012. • “A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data”, Brown et al., arXiv 1203.4802, under revision.
  • 20. Data analysis papers • “Tackling soil diversity with the assembly of large, complex metagenomes”, Howe et al., PNAS, 2014. • Assembling novel ascidian genomes & transcriptomes, Lowe et al., in prep. • A de novo lamprey transcriptome from large scale multi-tissue mRNAseq, Scott et al., in prep.
  • 21. Lab approach – not intentional, but working out. • Novel data structures and algorithms • Implement at scale • Apply to real biological problems
  • 22. This leads to good things. • Efficient online counting of k-mers • Trimming reads on abundance • Efficient De Bruijn graph representations • Read abundance normalization (khmer software)
  • 23. • Efficient online counting of k-mers • Trimming reads on abundance • Efficient De Bruijn graph representations • Read abundance normalization. Current research: • Streaming algorithms for assembly, variant calling, and error correction • Cloud assembly protocols • Efficient graph labeling & exploration • Data set partitioning approaches • Assembly-free comparison of data sets • HMM-guided assembly • Efficient search for target genes (khmer software)
  • 24. Testing & version control – the not so secret sauce • High test coverage - grown over time. • Stupidity driven testing – we write tests for bugs after we find them and before we fix them. • Pull requests & continuous integration – does your proposed merge break tests? • Pull requests & code review – does new code meet our minimal coding etc requirements? o Note: spellchecking!!!
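The "write the test after finding the bug and before fixing it" habit might look like the sketch below. Everything here is hypothetical: the `reverse_complement` helper and the bug report it refers to are invented for illustration and are not taken from khmer.

```python
import unittest


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (hypothetical helper, not from khmer)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


class TestReverseComplement(unittest.TestCase):
    def test_lowercase_input(self):
        # Regression test for an imagined bug: lowercase reads raised KeyError.
        # It was written to fail first and kept after the .upper() fix went in.
        self.assertEqual(reverse_complement("acgtn"), "NACGT")
```

Saved as a test module, this runs under `python -m unittest`, which is the kind of check a continuous integration service can apply to every pull request.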
  • 25. On the “novel research” side: • Novel data structures and algorithms; • Permit low(er) memory data analysis; • Liberate analyses from specialized hardware.
  • 26. Running entirely w/in cloud. Complete data; AWS m1.xlarge; ~40 hours. (See PyCon 2014 talk; video and blog post.) [Plot of memory use]
  • 27. On the “novel research” side: • Novel data structures and algorithms; • Permit low(er) memory data analysis; • Liberate analyses from specialized hardware. This last bit? => reproducibility.
  • 28. Reproducibility! Scientific progress relies on reproducibility of analysis. (Aristotle, Nature, 322 BCE.) “There is no such thing as ‘reproducible science’. There is only ‘science’, and ‘not science.’” – someone on Twitter (Fernando Perez?)
  • 29. Disclaimer Not a researcher of reproducibility! Merely a practitioner. Please take my points below as an argument and not as research conclusions. (But I’m right.)
  • 30. My usual intro: We practice open science! Everything discussed here: • Code: github.com/ged-lab/ ; BSD license • Blog: http://ivory.idyll.org/blog (‘titus brown blog’) • Twitter: @ctitusbrown • Grants on Lab Web site: http://ged.msu.edu/research.html • Preprints available. Everything is > 80% reproducible.
  • 32. My lab & the diginorm paper. • All our code was already on github; • Much of our data analysis was already in the cloud; • Our figures were already made in IPython Notebook • Our paper was already in LaTeX
  • 33. IPython Notebook: data + code => IPython Notebook
  • 34. My lab & the diginorm paper. • All our code was already on github; • Much of our data analysis was already in the cloud; • Our figures were already made in IPython Notebook • Our paper was already in LaTeX …why not push a bit more and make it easily reproducible? This involved writing a tutorial. And that’s it.
  • 35. To reproduce our paper:
        git clone <khmer> && python setup.py install
        git clone <pipeline>
        cd pipeline
        wget <data> && tar xzf <data>
        make && cd ../notebook && make
        cd ../ && make
  • 36. Now standard in lab -- All our papers now have: • Source hosted on github; • Data hosted there or on AWS; • Long running data analysis => ‘make’ • Graphing and data digestion => IPython Notebook (also in github) Qingpeng Zhang
  • 37. Research process • Generate new results; encode in Makefile • Summarize in IPython Notebook • Push to github • Discuss, explore
  • 39. The process • We start with pipeline reproducibility • Baked into lab culture; default “use git; write scripts” Community of practice! • Use standard open source approaches, so OSS developers learn it easily. • Enables easy collaboration w/in lab • Valuable learning tool!
  • 40. Growing & refining the process • Now moving to Ubuntu Long-Term Support + install instructions. • Everything is as automated as is convenient. • Students expected to communicate with me in IPython Notebooks. • Trying to avoid building (or even using) new tools. • Avoid maintenance burden as much as possible.
  • 41. 1. Use standard OS; provide install instructions • Providing install, execute for Ubuntu Long-Term Support release 14.04: supported through 2017 and beyond. • Avoid pre-configured virtual machines! o Locks you into specific cloud homes. o Challenges remixability and extensibility.
  • 42. 2. Automate • Literate graphing now easy with knitr and IPython Notebook. • Build automation with make, or whatever. To first order, it does not matter what tools you use. • Explicit is better than implicit. Make it easy to understand what you’re doing and how to extend it.
  • 43. Myths of reproducible research (Opinions from personal experience.)
  • 44. Myth 1: Partial reproducibility is hard. “Here’s my script.” => Methods More generally, • Many scientists cannot replicate any part of their analysis without a lot of manual work. • Automating this is a win for reasons that have nothing to do with reproducibility… efficiency! See: Software Carpentry.
  • 45. Myth 2: Incomplete reproducibility is useless Paraphrase: “We can’t possibly reproduce the experimental data exactly, so we shouldn’t bother with anything else, either.” (Analogous arg re software testing & code coverage.) • …I really have a hard time arguing the paraphrase honestly… • Being able to reanalyze your raw data? Interesting. • Knowing how you made your figures? Really useful.
  • 46. Myth 3: We need new platforms • Techies always want to build something (which is fun!) but don’t want to do science (which is hard!) • We probably do need new platforms, but stop thinking that building them does a service. • Platforms need to be use driven. Seriously. • If you write good software for scientific inquiry and make it easy to use reproducibly, that will drive virtuousity.
  • 47. Myth 4. Virtual Machine reproducibility is an end solution. • Good start! Better than nothing! But: • Limits understanding & reuse. • Limits remixing: often cannot install other software! • “Chinese Room” argument: could be just a lookup table.
  • 48. Myth 5: We can use GUIs for reproducible research (OK, this is partly just to make people think ;) • Almost all data analysis takes place within a larger pipeline; the GUI must consume entire pipeline in order to be reproducible. • IFF GUI wraps command line, that’s a decent compromise (e.g. Galaxy) but handicaps researchers using novel approaches. • By the time it’s in a GUI, it’s no longer research.
  • 49. Our current efforts? • Semantic versioning of our own code: stable command-line interface. • Writing easy-to-teach tutorials and protocols for common analysis pipelines. • Automate ‘em for testing purposes. • Encourage their use, inclusion, and adaptation by others.
  • 51. khmer-protocols: • Provide standard “cheap” assembly protocols for the cloud. • Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 per data set (on Amazon rental computers) • Open, versioned, forkable, citable…. [Pipeline stages: Read cleaning; Diginorm; Assembly; Annotation; RSEM differential expression]
  • 52. Literate testing • Our shell-command tutorials for bioinformatics can now be executed in an automated fashion – commands are extracted automatically into shell scripts. • See: github.com/ged-lab/literate-resting/. • Tremendously improves peace of mind and confidence moving forward! Leigh Sheneman
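The extraction step can be sketched as follows. This is not the literate-resting tool itself. It assumes, for illustration only, that tutorial commands live in indented reStructuredText literal blocks introduced by a paragraph ending in `::`; the real tool may use different markers. The function name and sample tutorial are made up.

```python
def extract_commands(rst_text):
    """Collect the indented lines of reST literal blocks (introduced by '::')."""
    commands = []
    in_block = False
    for line in rst_text.splitlines():
        stripped = line.strip()
        if in_block:
            if not stripped:
                continue                   # blank lines may occur inside a block
            if line.startswith((" ", "\t")):
                commands.append(stripped)  # indented line: part of the block
                continue
            in_block = False               # first unindented line ends the block
        if stripped.endswith("::"):
            in_block = True
    return commands


tutorial = """\
Set up a working directory::

   mkdir -p work
   cd work

Then continue with the analysis.
"""
# Prepend a header so the script stops at the first failing command.
print("\n".join(["#!/bin/bash", "set -e"] + extract_commands(tutorial)))
```

Running the generated script on a fresh machine then acts as an automated check that the written tutorial still works end to end.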
  • 53. Doing things right => #awesomesauce • Protocols in English for running analyses in the cloud • Literate reSTing => shell scripts • Tool competitions • Benchmarking • Education • Acceptance tests
  • 54. Concluding thoughts • We are not doing anything particularly neat on the computational side... No “magic sauce.” • Much of our effort is now driven by sheer utility: o Automation reduces our maintenance burden. o Extensibility makes revisions much easier! o Explicit instructions are good for training. • Some effort needed at the beginning, but once practices are established, “virtuous cycle” takes over.
  • 55. What bits should people adopt? • Version control! • Literate graphing! • Automated “build” from data => results! • Make available data as early in your pipeline as possible.
  • 56. More concluding thoughts • Nobody would care that we were doing things reproducibly if our science wasn’t decent. • Make sure students realize that faffing about on infrastructure isn’t science. • Research is about doing science. Reproducibility (like other good practices) is much easier to proselytize if you can link it to progress in science.
  • 57. Biology & sequence analysis is in a perfect place for reproducibility We are lucky! A good opportunity! • Big Data: laptops are too small; • Excel doesn’t scale any more; • Few tools in use; most of them are $$ or UNIX; • Little in the way of entrenched research practice;
  • 58. Thanks! Talk is on slideshare: slideshare.net/c.titus.brown E-mail or tweet me: ctb@msu.edu @ctitusbrown

Editor's notes

  1. A sketch showing the relationship between the number of sequence reads and the number of edges in the graph. Because the underlying genome is fixed in size, the number of edges due to the underlying genome plateaus as the number of sequence reads increases, once every part of the genome is covered. Conversely, since errors tend to be random and more or less unique, their number scales linearly with the number of sequence reads. Once enough sequence reads are present to have enough coverage to clearly distinguish true edges (which come from the underlying genome), the true edges will usually be outnumbered by spurious edges (which arise from errors) by a substantial factor.
  2. Slow, but powerful.
  3. Acceptance testing other people’s software