SlideShare a Scribd company logo
1 of 50
STREAMING VARIANT 
CALLING? 
C. Titus Brown 
Michigan State University 
Sep 2014, NCI EDRN / Bethesda, MD
Mapping: locate reads in reference 
http://en.wikipedia.org/wiki/File:Mapping_Reads.png
Variant detection after mapping 
http://www.kenkraaijeveld.nl/genomics/bioinformatics/
Problem 1: 
Analysis is done after sequencing. 
Sequencing Analysis
Problem 2: 
Much of your data is unnecessary. 
Shotgun data is randomly sampled; 
So, you need high coverage for high sensitivity.
Problem 3: 
Current variant calling approaches are multipass 
Data 
Mapping 
Sorting 
Calling Answer
Problem 4: 
Allelic mapping bias favors reference genome. 
Number of nbh differentiating polymorphisms. 
Stevenson et al., 2013 (BMC Genomics)
Problem 5: 
Current approaches are often insensitive to indels 
Iqbal et al., Nat Gen 2012
Why are we concerned at all!? 
Looking forward 5 years… 
Navin et al., 2011
Some basic math: 
• 1000 single cells from a tumor… 
• …sequenced to 40x haploid coverage with Illumina… 
• …yields 120 Gbp each cell… 
• …or 120 Tbp of data. 
• HiSeq X10 can do the sequencing in ~3 weeks. 
• The variant calling will require 2,000 CPU weeks… 
• …so, given ~2,000 computers, can do this all in one 
month.
Similar math applies: 
• Pathogen detection in blood; 
• Environmental sequencing; 
• Sequencing rare DNA from circulating blood. 
• Two issues: 
•Volume of data & compute 
infrastructure; 
• Latency for clinical applications.
Can we improve this situation? 
• Tie directly into machine as it generates sequence 
(Illumina, PacBio, and Nanopore can all do streaming, in theory) 
• Analyze data as it comes off; for some (many?) 
applications, can stop run early if signal detected. 
• Avoid using a reference genome for primary variant 
calling. 
• Easier indel detection, less allelic mapping bias 
• Can use reference for interpretation. 
Does such a magical approach exist!?
~Digression: Digital normalization 
(a computational version of library normalization) 
Suppose you have 
a dilution factor of 
A (10) to B(1). To 
get 10x of B you 
need to get 100x 
of A! Overkill!! 
The high-coverage 
reads in sample A 
are unnecessary 
for assembly, and, 
in fact, distract.
Digital normalization
Digital normalization
Digital normalization
Digital normalization
Digital normalization is streaming
Digital normalization
Some key points -- 
• Digital normalization is streaming. 
• Digital normalizing is computationally efficient (lower 
memory than other approaches; parallelizable/multicore; 
single-pass) 
• Currently, primarily used for prefiltering for assembly, but 
relies on underlying abstraction (De Bruijn graph) that is 
also used in variant calling.
Assembly now scales with richness, not diversity. 
• 10-100 fold decrease in memory requirements 
• 10-100 fold speed up in analysis
Diginorm is widely useful: 
1. Assembly of the H. contortus parasitic nematode 
genome, a “high polymorphism/variable coverage” 
problem. 
(Schwarz et al., 2013; pmid 23985341) 
2. Reference-free assembly of the lamprey (P. marinus) 
transcriptome, a “big assembly” problem. (in prep) 
3. Osedax symbiont metagenome, a “contaminated 
metagenome” problem (Goffredi et al, 2013; pmid 
24225886)
Anecdata: diginorm is used in Illumina 
long-read sequencing (?)
Diginorm is “lossy compression” 
• Nearly perfect from an information theoretic perspective: 
• Discards 95% more of data for genomes. 
• Loses < 00.02% of information.
Digital normalization => graph alignment 
What we are actually doing this stage 
is building a graph of all the reads, 
and aligning new reads to that graph.
Error correction via graph alignment 
Jason Pell and Jordan Fish
Error correction on simulated E. coli data 
TP FP TN FN 
ideal 3,469,834 99.1% 8,186 460,655,449 31,731 0.9% 
1-pass 2,827,839 80.8% 30,254 460,633,381 673,726 19.2% 
1.2-pass 3,403,171 97.2% 8,764 460,654,871 98,394 2.8% 
(corrected) (mistakes) (OK) (missed) 
1% error rate, 100x coverage. 
Jordan Fish and Jason Pell
Error correction  variant calling 
Single pass, reference free, tunable, streaming 
online variant calling.
Coverage is adjusted to retain signal
Graph alignment can detect read saturation
Streaming with reads… 
Sequence... 
Graph 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
Sequence... 
.... 
Variants
Analysis is done after sequencing. 
Sequencing Analysis
Streaming with bases 
k bases... 
Graph 
k+1 
k bases... k+1 
k+2 
k bases... k+1 
k bases... k+1 
k bases... k+1 
... 
k bases... k+1 
Variants
Integrate sequencing and analysis 
Sequencing 
Analysis 
Are we done yet?
Streaming approach also supports more 
compute-intensive interludes – 
remapping, etc. 
Rimmer et al., 2014
Streaming algorithms can be very efficient 
Data 
1-pass 
Answer 
See also eXpress, Roberts et al., 2013.
So: reference-free variant calling 
• Streaming & online algorithm; single pass. 
• For real-time diagnostics, can be applied as bases are emitted from 
sequencer. 
• Reference free: independent of reference bias. 
• Coverage of variants is adaptively adjusted to retain all 
signal. 
• Parameters are easily tuned, although theory needs to be 
developed. 
• High sensitivity (e.g. C=50 in 100x coverage) => poor compression 
• Low sensitivity (C=20) => good compression. 
• Can “subtract” reference => novel structural variants. 
• (See: Cortex, Zam Iqbal.)
Two other features -- 
• More single-computer scalable approach than current: low 
disk access, high parallelizability. 
• Openness – our software is free to use, reuse, remix; no 
intellectual property restrictions. (Hence “We hear Illumina 
is using it…”)
Prospectus for streaming variant detection 
• Underlying concept is sound and offers many advantages over 
current approaches; 
• We have proofs of concept implemented; 
• We know that underlying approach works well in amplification 
situations, as well; 
• Tuning and math/theory needed! 
• …grad students keep on getting poached by Amazon and 
Google. (This is becoming a serious problem.)
Raw data 
(~10-100 GB) Analysis 
"Information" 
~1 GB 
"Information" 
"Information" 
"Information" 
"Information" 
Database & 
integration 
Compression 
(~2 GB) 
Lossy compression can substantially 
reduce data size while retaining 
information needed for later (re)analysis.
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
http://en.wikipedia.org/wiki/JPEG 
Lossy compression
Raw data 
(~10-100 GB) Analysis 
"Information" 
~1 GB 
"Information" 
"Information" 
"Information" 
"Information" 
Database & 
integration 
Compression 
(~2 GB) 
Save in cold storage 
Save for reanalysis, 
investigation.
Data integration? 
Once you have all the data, what do you do? 
"Business as usual simply cannot work." 
Looking at millions to billions of genomes. 
(David Haussler, 2014)
Data recipes 
Standardized (versioned, open, remixable, cloud) 
pipelines and protocols for sequence data analysis. 
See: khmer-recipes, khmer-protocols. 
Increases buy-in :)
Training! 
Lots of training planned at Davis – 
open workshops. 
ivory.idyll.org/blog/2014-davis-and-training.html 
Increases buy-in x 2!
Acknowledgements 
Lab members involved Collaborators 
• Adina Howe (w/Tiedje) 
• Jason Pell 
• Qingpeng Zhang 
• Tim Brom 
• Jordan Fish 
• Michael Crusoe 
• Jim Tiedje, MSU 
• Billie Swalla, UW 
• Janet Jansson, LBNL 
• Susannah Tringe, JGI 
• Eran Andrechek, MSU 
Funding 
USDA NIFA; NSF IOS; 
NIH NHGRI; NSF 
BEACON.

More Related Content

What's hot

Nearest neighbor, defect prediction
Nearest neighbor, defect predictionNearest neighbor, defect prediction
Nearest neighbor, defect prediction
CS, NcState
 

What's hot (8)

Edge-based Discovery of Training Data for Machine Learning
Edge-based Discovery of Training Data for Machine LearningEdge-based Discovery of Training Data for Machine Learning
Edge-based Discovery of Training Data for Machine Learning
 
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
The Face of Nanomaterials: Insightful Classification Using Deep Learning - An...
 
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...Transfer learning for low frequency extrapolation from shot gathers for FWI a...
Transfer learning for low frequency extrapolation from shot gathers for FWI a...
 
Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...Feasibility of moment tensor inversion for a single-well microseismic data us...
Feasibility of moment tensor inversion for a single-well microseismic data us...
 
Deep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and HypeDeep learning: Cutting through the Myths and Hype
Deep learning: Cutting through the Myths and Hype
 
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
The Incorporation of Machine Learning into Scientific Simulations at Lawrence...
 
Nearest neighbor, defect prediction
Nearest neighbor, defect predictionNearest neighbor, defect prediction
Nearest neighbor, defect prediction
 
Going Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCFGoing Smart and Deep on Materials at ALCF
Going Smart and Deep on Materials at ALCF
 

Viewers also liked

Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009
guest5bbe75
 
Enlightenment
EnlightenmentEnlightenment
Enlightenment
Gregorio
 
Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009
Eyeblaster Spain
 
Contiguity Principle
Contiguity PrincipleContiguity Principle
Contiguity Principle
jnpletcher
 
Prithvi Cg Work Presentation
Prithvi Cg Work PresentationPrithvi Cg Work Presentation
Prithvi Cg Work Presentation
prithvionline
 
用寧靜心擁抱世界
用寧靜心擁抱世界用寧靜心擁抱世界
用寧靜心擁抱世界
tina59520
 
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsBusiness in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Kegler Brown Hill + Ritter
 

Viewers also liked (20)

Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012Talk at Bioinformatics Open Source Conference, 2012
Talk at Bioinformatics Open Source Conference, 2012
 
The Nuts + Bolts of Construction Financial Management
The Nuts + Bolts of Construction Financial ManagementThe Nuts + Bolts of Construction Financial Management
The Nuts + Bolts of Construction Financial Management
 
Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009Trabajos Verano 2º Eso 2009
Trabajos Verano 2º Eso 2009
 
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
Circles of San Antonio Community Coalition Bexar County Needs Assessment Sept...
 
11i Logs
11i Logs11i Logs
11i Logs
 
Enlightenment
EnlightenmentEnlightenment
Enlightenment
 
TPSI by Competitive Analytics
TPSI by Competitive AnalyticsTPSI by Competitive Analytics
TPSI by Competitive Analytics
 
Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009Eyeblaster Global BenchMark Report 2009
Eyeblaster Global BenchMark Report 2009
 
Analizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en BisonAnalizador sintáctico de Pascal escrito en Bison
Analizador sintáctico de Pascal escrito en Bison
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
Contiguity Principle
Contiguity PrincipleContiguity Principle
Contiguity Principle
 
RealTimeSchool
RealTimeSchoolRealTimeSchool
RealTimeSchool
 
Tips And Tricks For Photos
Tips And Tricks For PhotosTips And Tricks For Photos
Tips And Tricks For Photos
 
Whitepaper De Menskant Van Sourcing
Whitepaper De Menskant Van SourcingWhitepaper De Menskant Van Sourcing
Whitepaper De Menskant Van Sourcing
 
Prithvi Cg Work Presentation
Prithvi Cg Work PresentationPrithvi Cg Work Presentation
Prithvi Cg Work Presentation
 
CG borodino
CG borodinoCG borodino
CG borodino
 
Osss (Page Revisi)
Osss (Page Revisi)Osss (Page Revisi)
Osss (Page Revisi)
 
13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility13th Annual Seminar on Professional Responsibility
13th Annual Seminar on Professional Responsibility
 
用寧靜心擁抱世界
用寧靜心擁抱世界用寧靜心擁抱世界
用寧靜心擁抱世界
 
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal ConsiderationsBusiness in Brazil: An Insider's View, Regulatory and Legal Considerations
Business in Brazil: An Insider's View, Regulatory and Legal Considerations
 

Similar to 2014 nci-edrn

2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
c.titus.brown
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
c.titus.brown
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
c.titus.brown
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
c.titus.brown
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
c.titus.brown
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
c.titus.brown
 

Similar to 2014 nci-edrn (20)

2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
CT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloudCT Brown - Doing next-gen sequencing analysis in the cloud
CT Brown - Doing next-gen sequencing analysis in the cloud
 
2013 caltech-edrn-talk
2013 caltech-edrn-talk2013 caltech-edrn-talk
2013 caltech-edrn-talk
 
2016 bergen-sars
2016 bergen-sars2016 bergen-sars
2016 bergen-sars
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
2013 hmp-assembly-webinar
2013 hmp-assembly-webinar2013 hmp-assembly-webinar
2013 hmp-assembly-webinar
 
2012 hpcuserforum talk
2012 hpcuserforum talk2012 hpcuserforum talk
2012 hpcuserforum talk
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
 
Scaling metagenome assembly
Scaling metagenome assemblyScaling metagenome assembly
Scaling metagenome assembly
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Alternative Computing
Alternative ComputingAlternative Computing
Alternative Computing
 
2012 oslo-talk
2012 oslo-talk2012 oslo-talk
2012 oslo-talk
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
2013 py con awesome big data algorithms
2013 py con awesome big data algorithms2013 py con awesome big data algorithms
2013 py con awesome big data algorithms
 
2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona2012 talk to CSE department at U. Arizona
2012 talk to CSE department at U. Arizona
 
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of BioinformaticsManaging & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
Managing & Processing Big Data for Cancer Genomics, an insight of Bioinformatics
 
Deeplearning in finance
Deeplearning in financeDeeplearning in finance
Deeplearning in finance
 
2013 siam-cse-big-data
2013 siam-cse-big-data2013 siam-cse-big-data
2013 siam-cse-big-data
 
Improving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN ApplicationsImproving Hardware Efficiency for DNN Applications
Improving Hardware Efficiency for DNN Applications
 

More from c.titus.brown

2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
c.titus.brown
 

More from c.titus.brown (20)

2016 davis-plantbio
2016 davis-plantbio2016 davis-plantbio
2016 davis-plantbio
 
2016 davis-biotech
2016 davis-biotech2016 davis-biotech
2016 davis-biotech
 
2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 aem-grs-keynote
2015 aem-grs-keynote2015 aem-grs-keynote
2015 aem-grs-keynote
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 ohsu-metagenome
2015 ohsu-metagenome2015 ohsu-metagenome
2015 ohsu-metagenome
 
2015 balti-and-bioinformatics
2015 balti-and-bioinformatics2015 balti-and-bioinformatics
2015 balti-and-bioinformatics
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 bangkok-talk
2014 bangkok-talk2014 bangkok-talk
2014 bangkok-talk
 
2014 abic-talk
2014 abic-talk2014 abic-talk
2014 abic-talk
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 

Recently uploaded

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
PirithiRaju
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
 

Recently uploaded (20)

Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
Locating and isolating a gene, FISH, GISH, Chromosome walking and jumping, te...
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
dkNET Webinar "Texera: A Scalable Cloud Computing Platform for Sharing Data a...
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
❤Jammu Kashmir Call Girls 8617697112 Personal Whatsapp Number 💦✅.
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
Vip profile Call Girls In Lonavala 9748763073 For Genuine Sex Service At Just...
 
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRLKochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
Kochi ❤CALL GIRL 84099*07087 ❤CALL GIRLS IN Kochi ESCORT SERVICE❤CALL GIRL
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATIONSTS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
STS-UNIT 4 CLIMATE CHANGE POWERPOINT PRESENTATION
 

2014 nci-edrn

  • 1. STREAMING VARIANT CALLING? C. Titus Brown Michigan State University Sep 2014, NCI EDRN / Bethesda, MD
  • 2. Mapping: locate reads in reference http://en.wikipedia.org/wiki/File:Mapping_Reads.png
  • 3. Variant detection after mapping http://www.kenkraaijeveld.nl/genomics/bioinformatics/
  • 4. Problem 1: Analysis is done after sequencing. Sequencing Analysis
  • 5. Problem 2: Much of your data is unnecessary. Shotgun data is randomly sampled; So, you need high coverage for high sensitivity.
  • 6. Problem 3: Current variant calling approaches are multipass Data Mapping Sorting Calling Answer
  • 7. Problem 4: Allelic mapping bias favors reference genome. Number of nbh differentiating polymorphisms. Stevenson et al., 2013 (BMC Genomics)
  • 8. Problem 5: Current approaches are often insensitive to indels Iqbal et al., Nat Gen 2012
  • 9. Why are we concerned at all!? Looking forward 5 years… Navin et al., 2011
  • 10. Some basic math: • 1000 single cells from a tumor… • …sequenced to 40x haploid coverage with Illumina… • …yields 120 Gbp each cell… • …or 120 Tbp of data. • HiSeq X10 can do the sequencing in ~3 weeks. • The variant calling will require 2,000 CPU weeks… • …so, given ~2,000 computers, can do this all in one month.
  • 11. Similar math applies: • Pathogen detection in blood; • Environmental sequencing; • Sequencing rare DNA from circulating blood. • Two issues: •Volume of data & compute infrastructure; • Latency for clinical applications.
  • 12. Can we improve this situation? • Tie directly into machine as it generates sequence (Illumina, PacBio, and Nanopore can all do streaming, in theory) • Analyze data as it comes off; for some (many?) applications, can stop run early if signal detected. • Avoid using a reference genome for primary variant calling. • Easier indel detection, less allelic mapping bias • Can use reference for interpretation. Does such a magical approach exist!?
  • 13. ~Digression: Digital normalization (a computational version of library normalization) Suppose you have a dilution factor of A (10) to B(1). To get 10x of B you need to get 100x of A! Overkill!! The high-coverage reads in sample A are unnecessary for assembly, and, in fact, distract.
  • 20. Some key points -- • Digital normalization is streaming. • Digital normalizing is computationally efficient (lower memory than other approaches; parallelizable/multicore; single-pass) • Currently, primarily used for prefiltering for assembly, but relies on underlying abstraction (De Bruijn graph) that is also used in variant calling.
  • 21. Assembly now scales with richness, not diversity. • 10-100 fold decrease in memory requirements • 10-100 fold speed up in analysis
  • 22. Diginorm is widely useful: 1. Assembly of the H. contortus parasitic nematode genome, a “high polymorphism/variable coverage” problem. (Schwarz et al., 2013; pmid 23985341) 2. Reference-free assembly of the lamprey (P. marinus) transcriptome, a “big assembly” problem. (in prep) 3. Osedax symbiont metagenome, a “contaminated metagenome” problem (Goffredi et al, 2013; pmid 24225886)
  • 23. Anecdata: diginorm is used in Illumina long-read sequencing (?)
  • 24. Diginorm is “lossy compression” • Nearly perfect from an information theoretic perspective: • Discards 95% more of data for genomes. • Loses < 00.02% of information.
  • 25. Digital normalization => graph alignment What we are actually doing this stage is building a graph of all the reads, and aligning new reads to that graph.
  • 26. Error correction via graph alignment Jason Pell and Jordan Fish
  • 27. Error correction on simulated E. coli data TP FP TN FN ideal 3,469,834 99.1% 8,186 460,655,449 31,731 0.9% 1-pass 2,827,839 80.8% 30,254 460,633,381 673,726 19.2% 1.2-pass 3,403,171 97.2% 8,764 460,654,871 98,394 2.8% (corrected) (mistakes) (OK) (missed) 1% error rate, 100x coverage. Jordan Fish and Jason Pell
  • 28. Error correction  variant calling Single pass, reference free, tunable, streaming online variant calling.
  • 29. Coverage is adjusted to retain signal
  • 30. Graph alignment can detect read saturation
  • 31. Streaming with reads… Sequence... Graph Sequence... Sequence... Sequence... Sequence... Sequence... Sequence... Sequence... .... Variants
  • 32. Analysis is done after sequencing. Sequencing Analysis
  • 33. Streaming with bases k bases... Graph k+1 k bases... k+1 k+2 k bases... k+1 k bases... k+1 k bases... k+1 ... k bases... k+1 Variants
  • 34. Integrate sequencing and analysis Sequencing Analysis Are we done yet?
  • 35. Streaming approach also supports more compute-intensive interludes – remapping, etc. Rimmer et al., 2014
  • 36. Streaming algorithms can be very efficient Data 1-pass Answer See also eXpress, Roberts et al., 2013.
  • 37. So: reference-free variant calling • Streaming & online algorithm; single pass. • For real-time diagnostics, can be applied as bases are emitted from sequencer. • Reference free: independent of reference bias. • Coverage of variants is adaptively adjusted to retain all signal. • Parameters are easily tuned, although theory needs to be developed. • High sensitivity (e.g. C=50 in 100x coverage) => poor compression • Low sensitivity (C=20) => good compression. • Can “subtract” reference => novel structural variants. • (See: Cortex, Zam Iqbal.)
  • 38. Two other features -- • More single-computer scalable approach than current: low disk access, high parallelizability. • Openness – our software is free to use, reuse, remix; no intellectual property restrictions. (Hence “We hear Illumina is using it…”)
  • 39. Prospectus for streaming variant detection • Underlying concept is sound and offers many advantages over current approaches; • We have proofs of concept implemented; • We know that underlying approach works well in amplification situations, as well; • Tuning and math/theory needed! • …grad students keep on getting poached by Amazon and Google. (This is becoming a serious problem.)
  • 40. Raw data (~10-100 GB) Analysis "Information" ~1 GB "Information" "Information" "Information" "Information" Database & integration Compression (~2 GB) Lossy compression can substantially reduce data size while retaining information needed for later (re)analysis.
  • 46. Raw data (~10-100 GB) Analysis "Information" ~1 GB "Information" "Information" "Information" "Information" Database & integration Compression (~2 GB) Save in cold storage Save for reanalysis, investigation.
  • 47. Data integration? Once you have all the data, what do you do? "Business as usual simply cannot work." Looking at millions to billions of genomes. (David Haussler, 2014)
  • 48. Data recipes Standardized (versioned, open, remixable, cloud) pipelines and protocols for sequence data analysis. See: khmer-recipes, khmer-protocols. Increases buy-in :)
  • 49. Training! Lots of training planned at Davis – open workshops. ivory.idyll.org/blog/2014-davis-and-training.html Increases buy-in x 2!
  • 50. Acknowledgements Lab members involved Collaborators • Adina Howe (w/Tiedje) • Jason Pell • Qingpeng Zhang • Tim Brom • Jordan Fish • Michael Crusoe • Jim Tiedje, MSU • Billie Swalla, UW • Janet Jansson, LBNL • Susannah Tringe, JGI • Eran Andrechek, MSU Funding USDA NIFA; NSF IOS; NIH NHGRI; NSF BEACON.

Editor's Notes

  1. Goal is to do first stage data reduction/analysis in less time than it takes to generate the data. Compression => OLC assembly.
  2. Update from Jordan