Making de novo assembly cheap & easy:
standardized protocols for mRNAseq and
metagenome assembly and analysis
C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
Jan 2014
ctb@msu.edu
My lab’s focus
• De novo assembly and efficient/effective use of NGS, especially for non-model organisms.
• Open source software engineering.
• Training and education in NGS.
There is quite a bit of life left to sequence & assemble.

http://pacelab.colorado.edu/
Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.
First problem: lots of data!
So, we want to go from raw data:
@SRR606249.17/1                  <- name
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1   <- quality score
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
…to “assembled” original sequence.

UMD assembly primer (cbcb.umd.edu)
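The raw records above are in FASTQ format: four lines per read (name, sequence, a "+" separator, and a quality string). A minimal Python sketch of reading such records follows. It assumes the common Phred+33 quality encoding, which the slide does not state, and the names `parse_fastq` and `phred_scores` are illustrative rather than taken from any particular library.

```python
import io


def parse_fastq(handle):
    """Yield (name, sequence, quality) from a stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@"):
            raise ValueError(f"bad record header: {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string into integer scores (Phred+33 is an assumption)."""
    return [ord(char) - offset for char in quality]


EXAMPLE = """@SRR606249.17/1
GAGTATGTTCTCATAGAGGTTGGTANNNNT
+
B@BDDFFFHHHHHJIJJJJGHIJHJ####1
@SRR606249.17/2
CGAANNNNNNNNNNNNNNNNNCCTGGCTCA
+
CCCF#################22@GHIJJJ
"""

records = list(parse_fastq(io.StringIO(EXAMPLE)))
for name, seq, qual in records:
    # Report each read's name, length, and lowest quality score.
    print(name, len(seq), min(phred_scores(qual)))
```

Under this encoding, the runs of "#" in the quality strings line up with the "N" (uncalled) bases and decode to very low scores.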
Practical memory measurements

Velvet measurements (Adina Howe)
Shotgun sequencing & de novo
assembly:
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness

It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
Why are big data sets difficult?
Need to resolve errors: the more coverage there is, the
more errors there are.
Memory usage ~ “real” variation + number of errors
Number of errors ~ size of data set
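A toy Python sketch of why this happens, assuming an assembler that stores every distinct k-mer it sees (as de Bruijn graph assemblers do). Extra coverage of error-free reads adds no new k-mers, while each substitution error can add up to k new ones. The genome, read length, and error placement below are made up for illustration.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def distinct_kmers(reads, k):
    """Count distinct k-mers across all reads (a proxy for graph memory)."""
    seen = set()
    for read in reads:
        seen |= kmers(read, k)
    return len(seen)


def mutate(read, pos):
    """Substitute the base at pos with a different base (a simulated error)."""
    swap = {"A": "C", "C": "G", "G": "T", "T": "A"}
    return read[:pos] + swap[read[pos]] + read[pos + 1:]


GENOME = "ATGGCGTACGTTAGCCTAGGCTTAACGGATCCGATTACAGGCATGCAAGT"  # hypothetical
K = 8
READ_LEN = 20


def sample_reads(passes, with_errors):
    """Tile the genome with overlapping reads, repeated `passes` times."""
    reads = []
    for p in range(passes):
        for start in range(0, len(GENOME) - READ_LEN + 1, 5):
            read = GENOME[start:start + READ_LEN]
            if with_errors:
                # One error per read, at a position that shifts with each pass.
                read = mutate(read, (start + 3 * p) % READ_LEN)
            reads.append(read)
    return reads


clean_1 = distinct_kmers(sample_reads(1, False), K)
clean_4 = distinct_kmers(sample_reads(4, False), K)
noisy_1 = distinct_kmers(sample_reads(1, True), K)
noisy_4 = distinct_kmers(sample_reads(4, True), K)
print("error-free:", clean_1, clean_4)
print("with errors:", noisy_1, noisy_4)
```

The error-free count is capped by the genome itself, while the count with errors keeps growing as more reads are added.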
The scaling problem
• We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware.
• Since ~2008:
  • The field has engaged in lots of engineering optimization…
  • …but the data generation rate has consistently outstripped Moore’s Law.
Our solution: Digital
normalization
Digital normalization
Contig assembly now scales a lot better.

Most samples can be assembled in < 50 GB of
memory.
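The slides show digital normalization only as figures. As commonly described, the idea is to stream through the reads, estimate each read's coverage from the median abundance of its k-mers seen so far, and keep the read only if that estimate is below a cutoff C. The Python sketch below illustrates that idea under stated assumptions: it is not the khmer implementation, an exact in-memory `Counter` stands in for the memory-efficient k-mer counting a real tool would need, and `digital_normalization`, `k`, and `cutoff` are illustrative names.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return the list of k-length substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only if its estimated coverage so far is below cutoff."""
    counts = Counter()  # exact counts; a stand-in for a compact counter
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # Median k-mer abundance serves as the read's coverage estimate.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)  # only kept reads add to the counts
    return kept


common = "ACGTACGTAGCTAGCTAGGATCCA"  # hypothetical high-coverage read
rare = "TTTTGGGGCCCCAAAATTTTGGGG"    # hypothetical low-coverage read
kept = digital_normalization([common] * 100 + [rare], k=8, cutoff=5)
print(len(kept), "of 101 reads kept")
```

Redundant high-coverage reads are discarded once the cutoff is reached, while reads from low-coverage regions are retained, which is what shrinks the data set before assembly.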
Diginorm is widely useful, becoming widely used:
1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341)
2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep)
3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al., 2013; pmid
Second problem: too many choices!
Read trimming and filtering (x100)
Assembly (x10)
Quantification (x20)
Annotation (x20)
Science! (x 10,000)
What programs and options do you use??
Third problem: training
• I teach:
  • Summer NGS course (two weeks, KBS); heavily oversubscribed.
  • Many ad hoc workshops
  • Fall BEACON course (intro computational science)
• Others teach:
  • Summer/fall workshops (Robin Buell)
  • Various genomics/bioinformatics courses (Shin-han Shiu, Rob Britton, ???)
Overall training results:
• We can fairly easily get people over the initial “technical” hump (here are some programs, here’s how to use them).
• We can begin to teach people the way to think about the problem.
• People have a really tough time connecting generic instruction to their own research, however!
(And people need to learn how to analyze their own
Three problems:
1. Assembly memory & compute requirements?
2. It’s a complex process; what are good defaults?
3. Training is limited in opportunity, difficult for students, not always effective.
Solution? khmer-protocols
• Effort to provide standard “cheap” assembly protocols for Illumina mRNAseq & metagenomes in the cloud.
• Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set.
• Open, versioned, forkable, citable.
Pipeline stages: Read cleaning → Diginorm → Assembly → Annotation → RSEM differential expression
“Eel Pond” mRNAseq protocol
1. Adapter trim & quality filter
2. Diginorm to C=20
3. Trim high-coverage reads at low-abundance k-mers
4. Assemble with Trinity
5. Group transcripts
6. Annotate x database
7. RSEM (Map QC reads to count)
8. EBSeq (Differential expression analysis)
9. Extracting differentially expressed genes & graphing
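The protocol's step of trimming high-coverage reads at low-abundance k-mers can be illustrated with a short Python sketch. The assumptions are labeled: a k-mer seen only rarely inside an otherwise high-coverage read is treated as a likely sequencing error, and the read is cut just before the last base of the first such k-mer. The function name, thresholds, and exact cut position are illustrative choices, not the protocol's actual scripts or settings.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return the list of k-length substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def trim_low_abundance(reads, k=20, min_abundance=2, high_coverage=20):
    """Trim high-coverage reads at their first low-abundance k-mer."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    trimmed = []
    for read in reads:
        abundances = [counts[km] for km in kmers(read, k)]
        if not abundances or median(abundances) < high_coverage:
            trimmed.append(read)  # low-coverage read: leave untouched
            continue
        cut = len(read)
        for i, abundance in enumerate(abundances):
            if abundance < min_abundance:
                cut = i + k - 1  # drop the last base of this k-mer onward
                break
        trimmed.append(read[:cut])
    return trimmed


good = "ACGTTGCAAGGCTTAACCGG"   # hypothetical read, seen ten times
bad = "ACGTTGCAAGGCTTATCCGG"    # same read with one error at position 15
lonely = "TTTTTGGGGGCCCCC"      # hypothetical low-coverage read
result = trim_low_abundance([good] * 10 + [bad, lonely],
                            k=5, min_abundance=2, high_coverage=5)
print(result[-2])  # the erroneous read after trimming
```

Leaving low-coverage reads untouched matters for mRNAseq and metagenomes, where a rare k-mer may come from a genuinely rare transcript or organism rather than from an error.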
“Kalamazoo” metagenome protocol
1. Adapter trim & quality filter
2. Diginorm to C=10
3. Trim high-coverage reads at low-abundance k-mers
4. Diginorm to C=5
5. Too big to assemble? Partition graph, split into "groups", reinflate groups (optional)
6. Small enough to assemble? Assemble!!!
7. Prokka
8. Map reads to assembly
9. Annotate contigs with abundances
Show: Web site

http://khmer-protocols.readthedocs.org/
Show: mRNAseq output
Differential expression graph
Show: mRNAseq spreadsheet
Show: BLAST server
Soon: Galaxy integration
What khmer-protocols is:
• Starting point.
• Defensible initial solution to get initial results. Works on ~80% or more of samples, guesstimated.
• Great (?) way to learn
• 100% reproducible; methods section on computational analysis is more or less written for you.
• Fairly fast and inexpensive (comparatively) (~$100/data set)
What khmer-protocols is not:
• The One True Solution.
• The Best Solution.
• Proprietary.
• Closed.
• Slow and expensive (comparatively).
Speed up/efficiency?
[Figure: paired bar charts of walltime to complete assemblies (total walltime, hrs) and RAM needed to complete assemblies (total memory used, GB), comparing DN and RAW samples for occ oases, occ trinity, ocu oases, and ocu trinity.]
Elijah Lowe
Diginorm increases sensitivity (very
slightly :)

Evaluation by homology against a reference gene

37 extra from diginorm, vs 17 lost;

64 extra from diginorm, vs 15 lost;
Elijah Lowe
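Counts of "extra" and "lost" genes like those above are the kind of figure a set comparison produces: reference genes recovered only from the diginorm assembly versus only from the raw-data assembly. A minimal Python sketch, assuming each assembly has already been reduced to a collection of matched reference gene IDs; the IDs and the function name are hypothetical, and the slide does not describe how the evaluation was actually computed.

```python
def compare_recovery(raw_hits, diginorm_hits):
    """Summarize which reference genes each assembly recovered."""
    raw_hits, diginorm_hits = set(raw_hits), set(diginorm_hits)
    return {
        "extra": len(diginorm_hits - raw_hits),   # found only with diginorm
        "lost": len(raw_hits - diginorm_hits),    # found only in raw assembly
        "shared": len(raw_hits & diginorm_hits),  # found in both
    }


# Hypothetical gene IDs, for illustration only.
raw = ["geneA", "geneB", "geneC", "geneD"]
diginorm = ["geneB", "geneC", "geneD", "geneE", "geneF"]
summary = compare_recovery(raw, diginorm)
print(summary)
```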
Please use!
• Would love feedback: what worked? What didn’t work?
• Cannot support khmer protocols on HPC, but can support it in the cloud; iCER may (?) support it on HPC -- all of the software is installed.
(We are working on better default support for HPC.)
Links & more references
• ged.msu.edu/angus/ - NGS course materials
• khmer-protocols.readthedocs.org - khmer protocols
• Cloud computing discussion next Wed, 1/22, 2pm, iCER. Don’t e-mail me at: ctb@msu.edu

 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

2014 khmer protocols

  • 1. Making de novo assembly cheap & easy: standardized protocols for mRNAseq and metagenome assembly and analysis C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University Jan 2014 ctb@msu.edu
  • 2. My lab’s focus: de novo assembly and efficient/effective use of NGS, especially for non-model organisms; open source software engineering; training and education in NGS.
  • 3. There is quite a bit of life left to sequence & assem http://pacelab.colorado.edu/
  • 4. Three problems: 1. Assembly memory & compute requirements? 2. It’s a complex process; what are good defaults? 3. Training is limited in opportunity, difficult for students, not always effective.
  • 6. So, we want to go from raw data (each record: name, sequence, “+”, quality score): @SRR606249.17/1 GAGTATGTTCTCATAGAGGTTGGTANNNNT + B@BDDFFFHHHHHJIJJJJGHIJHJ####1 @SRR606249.17/2 CGAANNNNNNNNNNNNNNNNNCCTGGCTCA + CCCF#################22@GHIJJJ
  • 7. …to “assembled” original sequence. UMD assembly primer (cbcb.umd.edu)
  • 8. Practical memory measurements Velvet measurements (Adina Howe)
  • 9. Shotgun sequencing & de novo assembly. Reads: “It was the Gest of times, it was the wor” / “, it was the worst of timZs, it was the” / “isdom, it was the age of foolisXness” / “, it was the worVt of times, it was the” / “mes, it was Ahe age of wisdom, it was th” / “It was the best of times, it Gas the wor” / “mes, it was the age of witdom, it was th” / “isdom, it was tIe age of foolishness”. Assembled: “It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness”
  • 10. Why are big data sets difficult? Need to resolve errors: the more coverage there is, the more errors there are. Memory usage ~ “real” variation + number of errors Number of errors ~ size of data set
  • 11. The scaling problem: We can cheaply gather DNA data in quantities sufficient to swamp straightforward assembly algorithms running on commodity hardware. Since ~2008, the field has engaged in lots of engineering optimization… but the data generation rate has consistently outstripped Moore’s Law.
  • 18. Contig assembly now scales a lot better. Most samples can be assembled in < 50 GB of memory.
  • 19. Diginorm is widely useful, becoming widely used: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem. (Goffredi et al., 2013; pmid
  • 20. Second problem: too many choices! What programs and options do you use? Read trimming and filtering (x100); assembly (x10); annotation (x20); quantification (x20); science! (x10,000).
  • 21. Third problem: training. I teach: Summer NGS course (two weeks, KBS), heavily oversubscribed; many ad hoc workshops; Fall BEACON course (intro computational science). Others teach: Summer/fall workshops (Robin Buell); various genomics/bioinformatics courses (Shin-han Shiu, Rob Britton, ???).
  • 22. Overall training results: We can fairly easily get people over the initial “technical” hump (here are some programs, here’s how to use them). We can begin to teach people the way to think about the problem. People have a really tough time connecting generic instruction to their own research, however! (And people need to learn how to analyze their own
  • 23. Three problems: 1. Assembly memory & compute requirements? 2. It’s a complex process; what are good defaults? 3. Training is limited in opportunity, difficult for students, not always effective.
  • 24. Solution? khmer-protocols. Pipeline: read cleaning; diginorm; assembly; annotation; RSEM differential expression. Effort to provide standard “cheap” assembly protocols for Illumina mRNAseq & metagenomes in the cloud. Entirely copy/paste; ~2-6 days from raw reads to assembly, annotations, and differential expression analysis. ~$150 on Amazon per data set. Open, versioned, forkable, citable.
  • 26. “Eel Pond” mRNAseq protocol: adapter trim & quality filter; diginorm to C=20; trim high-coverage reads at low-abundance k-mers; assemble with Trinity; group transcripts; annotate x database; RSEM (map QC reads to count); EBSeq (differential expression analysis); extracting differentially expressed genes & graphing.
  • 27. “Kalamazoo” metagenome protocol: adapter trim & quality filter; diginorm to C=10; trim high-coverage reads at low-abundance k-mers; diginorm to C=5; too big to assemble? Partition graph, split into "groups", reinflate groups (optional); small enough to assemble? Assemble!!!; map reads to assembly; annotate contigs with abundances; Prokka.
  • 33. What khmer-protocols is: a starting point; a defensible initial solution to get initial results (works on ~80% or more of samples, guesstimated); a great (?) way to learn; 100% reproducible (the methods section on computational analysis is more or less written for you); fairly fast and inexpensive, comparatively (~$100/data set).
  • 34. What khmer-protocols is not: The One True Solution; The Best Solution; proprietary; closed; slow and expensive (comparatively).
  • 35. Speed up/efficiency? [Figure: two panels, “Walltime to complete assemblies” (total walltime, hrs) and “RAM needed to complete assemblies” (total memory used, GB), comparing DN vs. RAW reads per sample for occ oases, occ trinity, ocu oases, and ocu trinity. Credit: Elijah Lowe]
  • 36. Diginorm increases sensitivity (very slightly :) Evaluation by homology against a reference gene 37 extra from diginorm, vs 17 lost; 64 extra from diginorm, vs 15 lost. (Credit: Elijah Lowe)
  • 37. Please use! Would love feedback: what worked? What didn’t work? Cannot support khmer protocols on HPC, but can support it in the cloud; iCER may (?) support it on HPC -- all of the software is installed. (We are working on better default support for HPC.)
  • 38. Links & more references: ged.msu.edu/angus/ – NGS course materials; khmer-protocols.readthedocs.org – khmer protocols. Cloud computing discussion next Wed, 1/22, 2pm, iCER. Don’t e-mail me at: ctb@msu.edu
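
The scaling argument on slide 10 (memory usage ~ “real” variation + number of errors; number of errors ~ size of data set) can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the random 2,000 bp "genome", the 1% substitution error rate, the 100 bp read length, and all function names are illustrative choices, not taken from the talk. It counts distinct k-mers, a rough proxy for the memory an assembly graph would need.

```python
import random

def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads uniformly from genome, substituting bases at error_rate."""
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads

def distinct_kmers(reads, k):
    """Count distinct k-mers across all reads."""
    seen = set()
    for read in reads:
        seen.update(kmers(read, k))
    return len(seen)

rng = random.Random(42)           # fixed seed so the run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(2000))
K = 20
true_kmers = len(set(kmers(genome, K)))

results = {}
for n_reads in (200, 400, 800):   # about 10x, 20x, 40x coverage with 100 bp reads
    clean = simulate_reads(genome, n_reads, 100, 0.0, rng)
    noisy = simulate_reads(genome, n_reads, 100, 0.01, rng)
    results[n_reads] = (distinct_kmers(clean, K), distinct_kmers(noisy, K))
    print(n_reads, "reads: clean", results[n_reads][0],
          "noisy", results[n_reads][1], "true", true_kmers)
```

With error-free reads the distinct k-mer count can never exceed the number of k-mers in the genome, whereas with errors it keeps growing as more reads are added, which is the behaviour the slide describes.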

Editor's Notes

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
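
The data-reduction step the slides call “Diginorm to C=20” can be sketched as a single streaming pass over the reads. This is a minimal illustration, assuming an exact in-memory `Counter` in place of khmer's memory-efficient probabilistic counting, and ignoring reverse complements; the function names are hypothetical and are not khmer's API. A read is kept only if the median abundance of its k-mers, among the reads kept so far, is still below the cutoff C.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Yield every k-mer (substring of length k) in seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def digital_normalization(reads, k=20, cutoff=20):
    """Keep a read only while its median k-mer count is below cutoff."""
    counts = Counter()   # exact counts; an assumption made for clarity
    kept = []
    for read in reads:
        read_kmers = list(kmers(read, k))
        if not read_kmers:
            continue     # read is shorter than k, nothing to estimate
        # Counter returns 0 for unseen k-mers without storing them.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept

# Toy input: one sequence seen 50 times, plus one rare sequence.
reads = ["ACGTTGCAAGGCTTAC"] * 50 + ["TTTTGGGGCCCCAAAA"]
kept = digital_normalization(reads, k=8, cutoff=5)
print(len(reads), "reads in,", len(kept), "reads kept")
```

In this toy input the redundant copies of the high-coverage read are discarded once its k-mers reach the cutoff, while the rare read is retained, so coverage is evened out without a separate pass over the data.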