C. Titus Brown
Assistant Professor
CSE, MMG, BEACON
Michigan State University
May 1, 2013
ctb@msu.edu
Streaming approaches to reference-free variant
calling
Open, online science
Much of the software and approaches I’m talking
about today are available:
khmer software:
github.com/ged-lab/khmer/
Blog: http://ivory.idyll.org/blog/
Twitter: @ctitusbrown
Outline & Overview
• Motivation: lots of data; analyzed with “offline” approaches.
• Reference-based vs reference-free approaches.
• Single-pass algorithms for lossy compression; application to resequencing data.
Shotgun sequencing
It was the best of times, it was the wor
, it was the worst of times, it was the
isdom, it was the age of foolishness
mes, it was the age of wisdom, it was th
It was the best of times, it was the worst of times, it was
the age of wisdom, it was the age of foolishness
…but for lots and lots of fragments!
Sequencers produce errors
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
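The analogy above can be sketched in a few lines. This is a minimal illustration, assuming uniformly random fragment positions and independent substitution errors; the function names, the 40-character read length, and the 2% error rate are illustrative choices, not values from the talk.

```python
import random

TEXT = ("It was the best of times, it was the worst of times, it was "
        "the age of wisdom, it was the age of foolishness")

def shotgun(text, n_reads, read_len, rng):
    """Sample n_reads fixed-length fragments at uniformly random offsets."""
    starts = [rng.randrange(len(text) - read_len + 1) for _ in range(n_reads)]
    return [text[s:s + read_len] for s in starts]

def add_errors(read, error_rate, rng, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    """Replace each character by a random capital letter with probability error_rate."""
    return "".join(rng.choice(alphabet) if rng.random() < error_rate else ch
                   for ch in read)

rng = random.Random(42)  # fixed seed so the run is repeatable
reads = [add_errors(r, 0.02, rng) for r in shotgun(TEXT, 8, 40, rng)]
for r in reads:
    print(r)
```

Capital letters are used for the errors only so that they stand out, as on the slide; real sequencing errors are drawn from the same alphabet as the true bases.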
Three basic problems
Resequencing, counting, and assembly.
Three basic problems
Resequencing & counting, and assembly.
Resequencing analysis
We know a reference genome, and want to find
variants (blue) in a background of errors (red)
Counting
We have a reference genome (or gene set) and
want to know how much we have. Think gene
expression/microarrays, copy number variation.
Noisy observations <->
information
It was the Gest of times, it was the wor
, it was the worst of timZs, it was the
isdom, it was the age of foolisXness
, it was the worVt of times, it was the
mes, it was Ahe age of wisdom, it was th
It was the best of times, it Gas the wor
mes, it was the age of witdom, it was th
isdom, it was tIe age of foolishness
It was the best of times, it was the worst of times, it was the
age of wisdom, it was the age of foolishness
“Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than Moore’s
Law.
2. Your data gathering rate matches Moore’s Law.
3. Your data gathering rate exceeds Moore’s Law.
http://www.genome.gov/sequencingcosts/
“Three types of data scientists.”
1. Your data gathering rate is slower than Moore’s
Law.
=> Be lazy, all will work out.
2. Your data gathering rate matches Moore’s Law.
=> You need to write good software, but all will
work out.
3. Your data gathering rate exceeds Moore’s Law.
=> You need serious help.
Random sampling => deep sampling
needed
Typically 10-100x needed for robust recovery (300 Gbp for human)
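The 300 Gbp figure follows from simple arithmetic: total bases = genome size × coverage. The sketch below assumes a human genome of roughly 3 Gbp (an assumption implied by, but not stated on, the slide) and an idealized Poisson model in which reads land uniformly at random, so the expected fraction of positions hit by no read at all is exp(-coverage).

```python
import math

GENOME_SIZE_BP = 3e9  # assumed approximate human genome size

def bases_needed(genome_size_bp, coverage):
    """Total sequenced bases required to reach a given average coverage."""
    return genome_size_bp * coverage

def expected_uncovered_fraction(coverage):
    """Idealized Poisson model: P(position covered by zero reads)."""
    return math.exp(-coverage)

for c in (10, 30, 100):
    gbp = bases_needed(GENOME_SIZE_BP, c) / 1e9
    frac = expected_uncovered_fraction(c)
    print(f"{c}x coverage: {gbp:.0f} Gbp, uncovered fraction ~{frac:.1e}")
```

Real data departs from this model (coverage bias, repeats, errors), which is one reason the practical requirement sits at the high end of the 10-100x range.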
Applications in cancer genomics
• Single-cell cancer genomics will advance:
  • e.g. ~60-300 Gbp data for each of ~1000 tumor cells.
  • Infer phylogeny of tumor => mechanistic insight.
• Current approaches are computationally intensive and data-heavy.
Current variant calling approach.
Map reads to reference => "Pileup" and do variant calling => Downstream diagnostics
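A toy version of the map / pileup / call flow is sketched here. It assumes the reads are already placed on the reference with a known start offset and no gaps, which stands in for the mapping step; the thresholds min_depth and min_fraction are hypothetical, and production callers use statistical models rather than a fixed cutoff.

```python
from collections import Counter

def pileup(reference, aligned_reads):
    """Count observed bases per reference position.

    aligned_reads: iterable of (start_offset, read_sequence), gapless.
    """
    columns = [Counter() for _ in reference]
    for start, seq in aligned_reads:
        for i, base in enumerate(seq):
            if 0 <= start + i < len(reference):
                columns[start + i][base] += 1
    return columns

def call_variants(reference, columns, min_depth=3, min_fraction=0.8):
    """Report positions where one non-reference base dominates the column."""
    calls = []
    for pos, counts in enumerate(columns):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to decide
        base, n = counts.most_common(1)[0]
        if base != reference[pos] and n / depth >= min_fraction:
            calls.append((pos, reference[pos], base))
    return calls

reference = "ACGTACGTAC"
reads = [(0, "ACGTTCGT"), (1, "CGTTCGTA"), (2, "GTTCGTAC"), (3, "TTCGAAC")]
print(call_variants(reference, pileup(reference, reads)))
```

In this example every read supports a T at position 4, so that position is reported as a variant, while the single mismatch at position 7 in the last read is outvoted and treated as an error.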
Drawbacks of reference-based approaches
• Fairly narrowly defined heuristics.
• Allelic mapping bias: mapping biased towards reference allele.
• Ignorant of “unexpected” novelty:
  • Indels, especially large indels, are often ignored.
  • Structural variation is not easily retained or recovered.
  • True novelty discarded.
• Most implementations are multipass on big data.
Challenges
• Considerable amounts of noise in data (0.1-1% error).
• Reference-based approaches have several drawbacks:
  • Dependent on quality/applicability of reference.
  • Detection of true novelty (SNP vs indels; SVs) problematic.
• => The first major data reduction step (variant calling) is extremely lossy in terms of potential information.
[Figure: data-flow diagram with the labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration.]
A software & algorithms approach: can we develop
lossy compression approaches that
1. Reduce data size & remove errors => efficient
processing?
2. Retain all “information”? (think JPEG)
If so, then we can store only the compressed data for
later reanalysis.
Short answer is: yes, we can.
[Figure: data-flow diagram repeated, with the labels: Raw data (~10-100 GB); Compression (~2 GB); Analysis; "Information" (~1 GB); Database & integration; and the added annotations "Save in cold storage" and "Save for reanalysis, investigation."]
My lab at MSU:
Theoretical => applied solutions.
Theoretical advances
in data structures and
algorithms
Practically useful & usable
implementations, at scale.
Demonstrated
effectiveness on real data.
1. Time- and space-efficient k-mer
counting
To add element: increment associated counter at all hash locales
To get count: retrieve minimum counter across all hash locales
http://highlyscalable.wordpress.com/2012/05/01/probabilistic-structures-web-analytics-data-mining/
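The add and get-count rules above describe a Count-Min Sketch. The following is a minimal generic sketch of that structure applied to k-mers, not khmer's actual implementation: the class name, the table width and depth, and the use of salted BLAKE2b hashes are all assumptions made for illustration.

```python
import hashlib

class CountMinSketch:
    """Approximate counter: counts may be overestimated, never underestimated."""

    def __init__(self, width=10007, depth=4):
        self.width = width
        self.tables = [[0] * width for _ in range(depth)]

    def _locations(self, item):
        # One hash location per table; the row index is used as the hash salt.
        for row in range(len(self.tables)):
            digest = hashlib.blake2b(item.encode(), salt=bytes([row])).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        # To add an element: increment the counter at every hash location.
        for row, col in self._locations(item):
            self.tables[row][col] += 1

    def count(self, item):
        # To get a count: take the minimum counter across all hash locations.
        return min(self.tables[row][col] for row, col in self._locations(item))

def kmers(seq, k):
    """Yield every overlapping substring of length k."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))

cms = CountMinSketch()
for kmer in kmers("ACGTACGTGA", 4):
    cms.add(kmer)
print(cms.count("ACGT"), cms.count("TTTT"))
```

Taking the minimum limits the damage from collisions: a count is inflated only if the item collides with something else in every table. Memory is fixed up front by width × depth, independent of how many distinct k-mers are seen.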
2. Compressible assembly graphs
(NOVEL)
[Figure: panels labeled 1%, 5%, 10%, 15%.]
Pell et al., PNAS, 2012
3. Online, streaming, lossy
compression.
(NOVEL)
• Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results.
• Core algorithm is single pass, “low” memory.
Brown et al., arXiv, 2012
Digital normalization
Digital normalization approach
A digital analog to cDNA library normalization, diginorm:
• Reference free.
• Is single pass: looks at each read only once.
• Does not “collect” the majority of errors.
• Keeps all low-coverage reads & retains all information.
• Smooths out coverage of regions.
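The properties listed above follow from a short loop: estimate each read's coverage from the k-mers seen so far, keep the read only if that estimate is below a cutoff C, and count k-mers only for reads that are kept. The sketch below is a minimal illustration, assuming the median k-mer count as the coverage estimate (as described in Brown et al., arXiv 1203.4802) and an exact Counter in place of a memory-efficient probabilistic counter; the function name and default parameters are illustrative, and reads shorter than k are simply skipped here.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """Return every overlapping substring of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalization(reads, k=20, cutoff=20):
    """Single pass over reads; yield those whose estimated coverage is < cutoff."""
    counts = Counter()  # exact counts; a probabilistic counter could be swapped in
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k (skipped in this sketch)
        if median(counts[km] for km in read_kmers) < cutoff:
            for km in read_kmers:
                counts[km] += 1  # only retained reads contribute counts
            yield read

reads = ["ACGTTGCAAGCT"] * 10 + ["TTTTGGGGCCCCAAAA"]
kept = list(digital_normalization(reads, k=4, cutoff=3))
print(len(kept), "of", len(reads), "reads kept")
```

With these toy inputs, 4 of the 11 reads are kept: the first three copies of the high-coverage read, plus the single low-coverage read, which is retained even though it arrives last. Discarded reads never add to the counts, which is why most error k-mers from redundant reads are not collected.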
Can we apply this algorithmically efficient
technique to variants? Yes.
Single pass, reference free, tunable, streaming online varian
Align reads to assembly graph
Dr. Jason Pell
Reference-free variant calling.
Align read to graph
Novelty? Retain.
Saturated? Count & discard.
Output variant at saturation (online).
Downstream diagnostics
Coverage is adjusted to retain signal
Reference-free variant calling
• Streaming & online algorithm; single pass.
  • For real-time diagnostics, can be applied as bases are emitted from sequencer.
• Reference free: independent of reference bias.
• Coverage of variants is adaptively adjusted to retain all signal.
• Parameters are easily tuned, although theory needs to be developed.
  • High sensitivity (e.g. C=50 in 100x coverage) => poor compression.
  • Low sensitivity (C=20) => good compression.
• Can “subtract” reference => novel structural variants.
  • (See: Cortex, Zam Iqbal.)
Concluding thoughts
• This approach could provide substantial practical and theoretical leverage on a challenging problem.
• These approaches provide a path to the future:
  • Many-core implementation; distributable?
  • Decreased memory footprint => cloud/rental computing can be used for many analyses.
• Still early days, but funded…
• Our other techniques are in use, ~dozens of labs using digital normalization.
References & reading list
• Iqbal et al., De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat. Genet. 2012. (PubMed 22231483)
• Nordstrom et al., Mutation identification by direct comparison of whole-genome sequencing data from mutant and wild-type individuals using k-mers. Nat. Biotechnol. 2013. (PubMed 23475072)
• Brown et al., Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv 1203.4802
Note: this talk is online at slideshare.net, c.titus.brown.
Acknowledgements
Lab members involved:
• Adina Howe (w/Tiedje)
• Jason Pell
• Arend Hintze
• Rosangela Canino-Koning
• Qingpeng Zhang
• Elijah Lowe
• Likit Preeyanon
• Jiarong Guo
• Tim Brom
• Kanchan Pavangadkar
• Eric McDonald
• Chris Welcher
Collaborators:
• Jim Tiedje, MSU
• Billie Swalla, UW
• Janet Jansson, LBNL
• Susannah Tringe, JGI
Funding
USDA NIFA; NSF IOS;
BEACON.
Thank you for the invitation!

Editor's Notes

  1. Bad habit…
  2. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.