Next-generation sequencing
  and structural variation
               Jan Aerts
    Wellcome Trust Sanger Institute
         jan.aerts@gmail.com
principles & pitfalls
           vs
  list of commands
What is structural variation?
• “variation that changes the structure of
  a chromosome”
• Mechanisms: NAHR, NHEJ, FoSTeS
• This presentation: focus on discovery
  (not: genotyping)

(cf. “experiment 4” on Thomas's last slide)
Types of structural variation
Approaches for discovery
Combination of:
• Read pairs
• Read depth
• Split reads
• Fine-mapping breakpoints: local assembly

=> Identify signatures
A. Read Pairs
RP - General principle
• Paired-end library => insert size
• Orientation/distance
RP - Signatures




           Medvedev et al, 2009
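
The read-pair signatures can be sketched in code. This is a minimal illustration, assuming an inward-facing (forward/reverse) paired-end library; the function name and the lower/upper distance bounds are hypothetical and would come from the insert size distribution of the library.

```python
def classify_pair(left_strand, right_strand, distance, lower, upper):
    """Guess the SV signature suggested by one mapped read pair.

    left_strand / right_strand: '+' or '-' for the leftmost / rightmost read
    on the reference. lower / upper: assumed bounds on a normal mapped
    distance (hypothetical parameters, not taken from the slides).
    """
    if left_strand == right_strand:
        return "inversion"            # both reads on the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem duplication"   # everted (outward-facing) pair
    if distance > upper:
        return "deletion"             # reads map further apart than expected
    if distance < lower:
        return "insertion"            # reads map closer together than expected
    return "concordant"

print(classify_pair("+", "-", 5000, 180, 220))
print(classify_pair("+", "+", 200, 180, 220))
```

A single pair is only a hint; a signature becomes a call only after clustering and filtering.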
RP - Real world
RP - Workflow overview
Mapping
 Identify discordant readpairs
 Cluster on location
 Filter on nr RPs/cluster
 Filter on RD
 Filter: mappingQ x #readpairs
 Identify signatures
 Alternative reference
 Validate
RP - Mapping
• Provides raw data => crucial
• MAQ/bwa
  – report only one hit for a read with several equally
    good placements (mappingQ = 0)
  – MAQ might prefer mismatches to an aberrant
    distance!
• Insert size = distribution instead of exact
RP - Discordant readpairs
• Orientation
• Distance
  – Plot insert size distribution for chromosome
  – Very long tail! => difficult to set cutoff:
     • 4 MAD or 0.01%?
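
A distance cutoff based on the median absolute deviation (MAD) could look like the sketch below. The function names and the example numbers are made up for illustration; the factor 4 is the "4 MAD" option from the slide, not a recommendation.

```python
from statistics import median

def mad_cutoffs(insert_sizes, n_mad=4):
    """Return (lower, upper) bounds as median -/+ n_mad * MAD."""
    med = median(insert_sizes)
    mad = median(abs(x - med) for x in insert_sizes)
    return med - n_mad * mad, med + n_mad * mad

def discordant_by_distance(insert_sizes, n_mad=4):
    """Return the insert sizes that fall outside the MAD-based bounds."""
    lower, upper = mad_cutoffs(insert_sizes, n_mad)
    return [x for x in insert_sizes if x < lower or x > upper]

# Toy data: a tight library around 200 bp plus two long-tail pairs.
sizes = [190, 195, 200, 200, 205, 210, 198, 202, 5000, 12000]
print(mad_cutoffs(sizes))
print(discordant_by_distance(sizes))
```

The median and MAD are barely moved by the long tail, which is why they are preferred here over mean and standard deviation.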
RP - Clustering
“standard clustering strategy”
  – Only consider mate pairs that do not have
    concordant mappings
  – Ignore read pairs that have more than one
    good mapping


Clustering: use insert size distribution
  (e.g. 2 x 4 MAD)
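
Clustering on location can be illustrated with a greedy pass over sorted pairs. This is a simplified sketch, not the strategy of any particular tool; the tuple layout and the max_dist parameter are assumptions, with max_dist meant to be derived from the insert size spread of the library.

```python
def cluster_pairs(pairs, max_dist):
    """Greedy clustering of discordant read pairs on location.

    pairs: list of (chrom, left_pos, right_pos) tuples (assumed layout).
    max_dist: largest allowed offset, at both ends, between a pair and the
              most recently added member of a cluster (hypothetical value).
    """
    clusters = []
    for pair in sorted(pairs):
        chrom, left, right = pair
        for cluster in clusters:
            c_chrom, c_left, c_right = cluster[-1]
            if (chrom == c_chrom
                    and abs(left - c_left) <= max_dist
                    and abs(right - c_right) <= max_dist):
                cluster.append(pair)
                break
        else:
            # No existing cluster was close enough: start a new one.
            clusters.append([pair])
    return clusters

pairs = [
    ("chr1", 1000, 6000), ("chr1", 1030, 6050), ("chr1", 1010, 6020),
    ("chr1", 50000, 90000), ("chr2", 1000, 6000),
]
clusters = cluster_pairs(pairs, max_dist=40)
print([len(c) for c in clusters])
```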
RP - Clustering: issues
• Ignores pairs that have >1 good mapping =>
  no detection within repetitive regions
  (segmental duplications)
• Which cutoff defines an abnormal distance?
  (4 MAD? 0.01%? 2 SD?)
• Low library quality or mix of libraries =>
  multiple peaks in size distribution
RP - Filtering
• On nr RPs/cluster
  – Normally: n=2
  – For high coverage (e.g. pilot 2: 80X): n=5
• On drop in RD & SR
• On (mappingQ x nrRP)
  – If published data available: ROC for
    different cutoffs mQxnrRP
  – If not: very difficult
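
The two cluster filters can be sketched as follows. The dictionary layout is hypothetical, and reading "mappingQ x nrRP" as mean mapping quality times number of read pairs is an assumption; the slides do not define the score further.

```python
def filter_clusters(clusters, min_pairs=2, min_score=None):
    """Keep clusters with enough read pairs and, optionally, a high enough score.

    clusters: list of dicts with a 'mapqs' list holding one mapping quality
              per supporting read pair (assumed layout).
    min_score: cutoff on mean(mapQ) * number of pairs; None disables it.
    """
    kept = []
    for cluster in clusters:
        n = len(cluster["mapqs"])
        if n == 0 or n < min_pairs:
            continue
        score = (sum(cluster["mapqs"]) / n) * n  # mean mapQ x nr of RPs
        if min_score is not None and score < min_score:
            continue
        kept.append(cluster)
    return kept

candidates = [
    {"id": "a", "mapqs": [60, 55, 40]},
    {"id": "b", "mapqs": [10]},
    {"id": "c", "mapqs": [5, 3]},
]
print([c["id"] for c in filter_clusters(candidates)])
print([c["id"] for c in filter_clusters(candidates, min_score=100)])
```

With published calls at hand, min_score would be chosen from a ROC curve over candidate cutoffs; without them the value stays a guess.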
RP - Issues
• Difficult => different groups = different results
  (cf. “consensus set”)
   – RP & SR: many call sets agree
   – RD: totally different
• CEU (80X): sometimes drop in RD in all 3,
  but RP spanning only in 2 => why??
• Mapper = critical; MAQ/bwa: only 1 mapping
  (=> many false negatives); MOSAIK, mrFAST:
  return more results
RP - Issues (2)
• Large insert size: low resolution for detecting
  breakpoints
• Small insert size: low resolution for detecting
  complex regions
B. Read Depth
RD - General principle
• Similar to aCGH: using reference RD
  file (e.g. based on 1kG)
• In theory: higher resolution, but noisier
  than aCGH
  – Algorithms not mature yet
  – More complex steps
=> Data binned
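
Binning and the comparison against a reference depth profile might look like this. It is a sketch under assumptions: fixed-width bins, read start positions as input, and a pseudocount to avoid dividing by zero are all choices made here, not taken from the slides.

```python
import math
from collections import Counter

def bin_counts(read_starts, bin_size, n_bins):
    """Count read start positions per fixed-width bin."""
    counts = Counter(pos // bin_size for pos in read_starts)
    return [counts.get(i, 0) for i in range(n_bins)]

def log2_ratios(sample, reference, pseudocount=1.0):
    """Per-bin log2(sample / reference) after scaling to equal total depth."""
    sample_total = sum(sample)
    if sample_total == 0:
        raise ValueError("sample has no reads")
    scale = sum(reference) / sample_total
    return [
        math.log2((s * scale + pseudocount) / (r + pseudocount))
        for s, r in zip(sample, reference)
    ]

print(bin_counts([5, 15, 17, 250, 99], bin_size=100, n_bins=3))
ratios = log2_ratios([20, 20, 5, 20], [20, 20, 20, 20])
print([round(r, 2) for r in ratios])
```

A bin with a clearly negative ratio is a candidate loss, a clearly positive one a candidate gain.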
RD - Exome


here: using exome data
RD - Example
RD - Workflow overview
•   Mapping
•   Read filtering
•   GC correction
•   Spike identification
•   Validation
RD - mapping


  Critical…
   (see RP)
RD - Filtering
• mapQ
  – mapQ >= 0 (noisy; few FN, many FP)
  – mapQ >= 10
  – mapQ >= 30 (many FN, few FP)
• Mean depth exon (often: e.g. +/- 0.01)
  – Mean depth > 1
  – Mean depth > 5
RD - Filtering: what’s left

                   mapQ >= 0   mapQ >= 10   mapQ >= 30
all                207,000     207,000      207,000
mean DP exon > 1   169,000     163,000      162,000
mean DP exon > 5   160,000     153,000      152,000
RD - correction
• Mainly: GC
  – Other: repeat-rich regions, mapping Q, …
• Fit linear model GC-content exon and
  RD of exon
  => noise decreases
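
The linear GC correction can be sketched with an ordinary least-squares fit. The function names are made up, and re-centring the residuals on the mean depth is a choice made here to keep corrected values on the original scale.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def gc_correct(gc, depth):
    """Remove the linear GC trend from per-exon read depth."""
    slope, intercept = fit_line(gc, depth)
    mean_depth = sum(depth) / len(depth)
    # Residual from the fitted line, shifted back to the mean depth.
    return [d - (slope * g + intercept) + mean_depth for g, d in zip(gc, depth)]

gc = [0.30, 0.40, 0.50, 0.60]      # toy GC fractions per exon
depth = [30.0, 40.0, 50.0, 60.0]   # toy read depth, fully explained by GC
print([round(c, 6) for c in gc_correct(gc, depth)])
```

In this toy case all corrected depths collapse onto the mean, because GC explains the depth entirely; in real data the residual is what carries the copy-number signal.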
RD - segmentation
• Identify spikes
• Many segmentation algorithms, e.g.
  GADA
• Issues: setting parameters: when to cut
  off peaks?
  – Combine outputs from different runs with
    different parameters
  – Compare to known CNVs
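
The parameter issue shows up even in a toy segmenter. The sketch below is a plain threshold-and-run-length caller, deliberately much simpler than GADA and not a substitute for it; threshold and min_len are hypothetical parameters.

```python
def call_segments(ratios, threshold, min_len=2):
    """Return (start, end, kind) runs of bins beyond +/- threshold.

    ratios: per-bin (or per-exon) log2 ratios; end is exclusive.
    threshold must be > 0; runs shorter than min_len bins are dropped.
    """
    segments = []
    start, kind = 0, None
    for i, r in enumerate(ratios):
        if r >= threshold:
            current = "gain"
        elif r <= -threshold:
            current = "loss"
        else:
            current = None
        if current != kind:
            # The previous run ends here; keep it if it is long enough.
            if kind is not None and i - start >= min_len:
                segments.append((start, i, kind))
            start, kind = i, current
    if kind is not None and len(ratios) - start >= min_len:
        segments.append((start, len(ratios), kind))
    return segments

ratios = [0.0, 0.1, -0.9, -1.1, -1.0, 0.05, 0.6, 0.7, 0.0]
print(call_segments(ratios, threshold=0.5))
print(call_segments(ratios, threshold=0.8))
```

The gain at bins 6-7 is called at threshold 0.5 and lost at 0.8, which is the reason for combining runs with different parameters and checking against known CNVs.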
RD - Combine algorithms
RD - Issues
• How to assess TP/FP/FN? => compare
  with known CNVs
• Breakpoints: unknown
  – 1 datapoint/exon
  – Can be outside of exon
• Different parameters for rare vs
  common CNVs => which?
C. Split Reads
SR - Principle
SR - Mapping
Short subsequences => many possible
 mappings
Solution: “anchored split mapping” (e.g.
 Pindel)
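
The idea of anchoring can be illustrated with exact string matching: the mapped mate fixes a search window, and the unmapped mate is split into a prefix and a suffix that are both looked up inside that window. This is a simplified sketch for deletions only, with made-up names and parameters; Pindel's actual algorithm is more involved.

```python
def anchored_split_map(reference, read, anchor_pos, window, min_part=5):
    """Look for a split alignment of `read` near an anchored mate.

    Returns (read_start, deletion_start, deletion_end) in reference
    coordinates (ends exclusive), or None if no split placement is found.
    Exact matches only; anchor_pos, window and min_part are assumptions.
    """
    lo = max(0, anchor_pos)
    hi = min(len(reference), anchor_pos + window)
    region = reference[lo:hi]
    # Try the longest prefix first, then shorten it step by step.
    for split in range(len(read) - min_part, min_part - 1, -1):
        prefix, suffix = read[:split], read[split:]
        p = region.find(prefix)
        if p == -1:
            continue
        prefix_end = p + split
        s = region.find(suffix, prefix_end)
        if s > prefix_end:  # a gap between the two parts suggests a deletion
            return lo + p, lo + prefix_end, lo + s
    return None

left, deleted, right = "GATTACAGGC", "T" * 12, "CCATGGTCAA"
reference = "AAAAA" + left + deleted + right + "GGGGG"
read = left[2:] + right[:8]   # read from a sample lacking `deleted`
print(anchored_split_map(reference, read, anchor_pos=0, window=len(reference)))
```

Restricting the search to the window is what keeps the short prefix and suffix from matching all over the genome.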
D. Local reassembly
Aim: to determine breakpoints

Which reads?
   – for deletions: local reads
   – for insertions: hanging reads for read pairs with
     only one read mapped

   – (rather not: unmapped reads)


For large region: split up
Assemblers

Velvet
ABySS
TIGRA
…
Conclusions
• Available algorithms: more to
  demonstrate technique rather than
  complete solution
• Different algorithms => different results
Chris Yoon
Genotyping
•   Create alternative reference => remap reads
     – All reads vs reads covering variant loci
     – Whole-genome vs concatenation of variant loci
•   Homozygous insertions/deletions: should disappear
•   Heterozygous insertions/deletions: should have different
    signatures
•   Bayesian approach: see what’s the most likely: do the reads
    support wild-type/het/homnonref?
•   Not exact mapping => local reassembly
     – Microhomologies & non-template sequence => “breakpoint”
        = region of 2-10 bp
         • Convention: left-most position reported (but not always)
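
The Bayesian step can be sketched with a binomial model on the reads that support the reference versus the alternative allele after remapping. The error rate, the uniform priors and the function name are assumptions for illustration, not values from the talk.

```python
from math import comb

def genotype_posteriors(n_ref, n_alt, error=0.01, priors=None):
    """Posterior probability of wild-type / het / hom-nonref.

    n_ref / n_alt: reads supporting the reference / alternative allele.
    error: assumed chance that a read supports the wrong allele.
    priors: genotype priors; uniform if not given (an assumption).
    """
    if priors is None:
        priors = {"wild-type": 1 / 3, "het": 1 / 3, "hom-nonref": 1 / 3}
    # Expected fraction of alt-supporting reads under each genotype.
    alt_fraction = {"wild-type": error, "het": 0.5, "hom-nonref": 1 - error}
    n = n_ref + n_alt
    joint = {}
    for genotype, p in alt_fraction.items():
        likelihood = comb(n, n_alt) * p ** n_alt * (1 - p) ** n_ref
        joint[genotype] = priors[genotype] * likelihood
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

post = genotype_posteriors(n_ref=10, n_alt=9)
print(max(post, key=post.get))
```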
References and software
•   Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
•   Lee S et al. Bioinformatics 24:i59-i67 (2008)
•   Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
•   Campbell P et al. Nat Genet 40:722-729 (2008)
•   Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
•   Chen K et al. Genome Res 19:1527-1541 (2009)
•   Yoon S et al. Genome Res 19:1586-1592 (2009)
•   Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
•   Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences
    (2009)
Questions?
