Next-generation sequencing reveals structural variation detection challenges

Next-generation sequencing
and structural variation
Jan Aerts
Wellcome Trust Sanger Institute
jan.aerts@gmail.com

principles & pitfalls
vs
list of commands

What is structural variation?
• “variation that changes the structure of
a chromosome”
• Mechanisms: NAHR, NHEJ, FoSTeS
• This presentation: focus on discovery
(not: genotyping)

“experiment 4” from Thomas’s last slide

Approaches for discovery
Combination of:
• Read pairs
• Read depth
• Split reads
• Fine-mapping breakpoints: local assembly

=> Identify signatures

RP - General principle
• Paired-end library => insert size
• Orientation/distance
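
The orientation/distance principle can be sketched as a small classifier. This is a minimal illustration, assuming a standard paired-end library in which concordant pairs map as forward read on the left and reverse read on the right; the function name, the labels and the `min_insert`/`max_insert` parameters are hypothetical and not taken from any published tool.

```python
def classify_pair(pos1, strand1, pos2, strand2, min_insert, max_insert):
    """Label a read pair mapped to one chromosome by orientation and distance."""
    # Order the two reads by leftmost mapped position.
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if left_strand == right_strand:
        return "inversion"            # both reads map to the same strand
    if left_strand == "-" and right_strand == "+":
        return "tandem_duplication"   # everted pair: reads point away from each other
    span = right_pos - left_pos       # simplified stand-in for the insert size
    if span > max_insert:
        return "deletion"             # mates map further apart than the library allows
    if span < min_insert:
        return "insertion"            # mates map closer together than expected
    return "concordant"


print(classify_pair(1000, "+", 6000, "-", min_insert=200, max_insert=400))
```

Pairs mapping to different chromosomes (translocation signatures) are left out of this sketch.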

RP - Signatures

Medvedev et al, 2009

RP - Workflow overview
• Mapping
• Identify discordant readpairs
• Cluster on location
• Filter on nr RPs/cluster
• Filter on RD
• Filter: mappingQ x #readpairs
• Identify signatures
• Alternative reference
• Validate

RP - Mapping
• Provides raw data => crucial
• MAQ/bwa
– only report one hit for reads with multiple mappings (mappingQ = 0)
– MAQ might prefer mismatches to aberrant
distance!
• Insert size = distribution instead of exact

RP - Discordant readpairs
• Orientation
• Distance
– Plot insert size distribution for chromosome
– Very long tail! => difficult to set cutoff:
• 4mad or 0.01%?
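
The two candidate cutoffs can be compared on the same data. A minimal sketch, assuming the insert sizes of all mapped pairs on a chromosome are available as a list of integers; both function names are hypothetical.

```python
import statistics


def mad_cutoffs(insert_sizes, n_mad=4):
    """Return (lower, upper) cutoffs at median +/- n_mad median absolute deviations."""
    median = statistics.median(insert_sizes)
    mad = statistics.median([abs(size - median) for size in insert_sizes])
    return median - n_mad * mad, median + n_mad * mad


def upper_tail_cutoff(insert_sizes, tail_fraction=0.0001):
    """Return the insert size above which roughly the top tail_fraction (0.01%) lies."""
    ordered = sorted(insert_sizes)
    index = min(len(ordered) - 1, int(len(ordered) * (1 - tail_fraction)))
    return ordered[index]


sizes = [290, 295, 300, 305, 310, 300, 5000]
lower, upper = mad_cutoffs(sizes)
print(lower, upper, upper_tail_cutoff(sizes))
```

The MAD is used rather than the standard deviation because a single pair in the long tail (5000 here) barely moves the median-based cutoff.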

RP - Clustering
“standard clustering strategy”
– Only consider mate pairs that do not have
concordant mappings
– Ignore read pairs that have more than one
good mapping

Clustering: use insert size distribution
(e.g. 2x4 mad)
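
Clustering on location can be sketched as a greedy pass over position-sorted discordant pairs. This is an illustration only, not the algorithm of any particular tool: it assumes all input pairs are on one chromosome, share the same signature and have a single good mapping, and that `max_gap` (a hypothetical parameter) is derived from the insert size distribution, e.g. 2 x 4 MAD.

```python
def cluster_pairs(pairs, max_gap):
    """Group discordant pairs whose left and right ends both lie close together.

    pairs: list of (left_pos, right_pos) tuples.
    """
    clusters = []
    for left, right in sorted(pairs):
        if clusters:
            prev_left, prev_right = clusters[-1][-1]   # last pair added to the open cluster
            if left - prev_left <= max_gap and abs(right - prev_right) <= max_gap:
                clusters[-1].append((left, right))
                continue
        clusters.append([(left, right)])               # start a new cluster
    return clusters


pairs = [(1000, 6000), (1020, 6030), (1050, 5990), (9000, 15000)]
for cluster in cluster_pairs(pairs, max_gap=100):
    print(len(cluster), cluster)
```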

RP - Clustering: issues
• Ignores pairs that have >1 good mapping =>
no detection within repetitive regions
(segmental duplications)
• What cutoff for what is considered abnormal
distance? (4 mad? 0.01%? 2stdev?)
• Low library quality or mix of libraries =>
multiple peaks in size distribution

RP - Filtering
• On nr RPs/cluster
– Normally: n=2
– For high coverage (e.g. pilot 2: 80X): n=5
• On drop in RD & SR
• On (mappingQ x nrRP)
– If published data available: ROC for
different cutoffs mQxnrRP
– If not: very difficult
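
The two cluster filters can be sketched together. This assumes each cluster is represented simply as a list of per-pair mapping qualities, and it interprets "mappingQ x nrRP" as mean mapping quality times number of read pairs, which is an assumption; `min_score` is a hypothetical cutoff that would have to be tuned, e.g. from a ROC curve against published data.

```python
def filter_clusters(clusters, min_pairs=2, min_score=0.0):
    """Keep clusters with enough read pairs and a high enough mappingQ x nrRP score."""
    kept = []
    for mapqs in clusters:
        n_pairs = len(mapqs)
        if n_pairs < min_pairs:
            continue                          # too few supporting read pairs
        mean_mapq = sum(mapqs) / n_pairs
        if mean_mapq * n_pairs >= min_score:  # assumed reading of mappingQ x nrRP
            kept.append(mapqs)
    return kept


clusters = [[60, 60, 60], [20], [5, 5]]
print(filter_clusters(clusters, min_pairs=2, min_score=50))
```

For high-coverage data, `min_pairs=5` would correspond to the n=5 setting above.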

RP - Issues
• Difficult => different groups = different results
“consensus set”
– RP & SP: many sets agree
– RD: totally different
• CEU (80X): sometimes drop in RD in all 3,
but RP spanning only in 2 => why??
• Mapper = critical; maq/bwa: only 1 mapping
(=> many false negatives); mosaik, mrFAST:
return more results

RP - Issues (2)
• Large insert size: low resolution for detecting
breakpoints
• Small insert size: low resolution for detecting
complex regions

RD - General principle
• Similar to aCGH: using reference RD
file (e.g. based on 1kG)
• In theory: higher resolution, but noisier
than aCGH
– Algorithms not mature yet
– More complex steps
=> Data binned
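
Binning and comparison against a reference RD profile can be sketched as follows. This is a minimal illustration, assuming 0-based read start positions smaller than `chrom_length`, non-empty samples and a reference profile binned the same way; the function names and the `pseudocount` parameter are hypothetical.

```python
import math


def binned_depth(read_starts, chrom_length, bin_size):
    """Count reads per fixed-size bin along one chromosome."""
    n_bins = (chrom_length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for start in read_starts:
        counts[start // bin_size] += 1
    return counts


def log2_ratios(sample_counts, reference_counts, pseudocount=0.5):
    """Per-bin log2 ratio of sample to reference depth, each scaled by its total."""
    sample_total = sum(sample_counts)
    reference_total = sum(reference_counts)
    ratios = []
    for sample, reference in zip(sample_counts, reference_counts):
        sample_norm = (sample + pseudocount) / sample_total
        reference_norm = (reference + pseudocount) / reference_total
        ratios.append(math.log2(sample_norm / reference_norm))
    return ratios


counts = binned_depth([0, 10, 150, 250, 260, 299], chrom_length=300, bin_size=100)
print(counts)
print(log2_ratios(counts, [2, 2, 2]))
```

Larger bins reduce noise at the cost of resolution, which is the trade-off behind binning.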

RD - Exome

here: using exome data

RD - Workflow overview
• Mapping
• Read filtering
• GC correction
• Spike identification
• Validation

RD - mapping

Critical…
(see RP)

RD - Filtering
• mapQ
– mapQ >= 0 (noisy; few FN, many FP)
– mapQ >= 10
– mapQ >= 30 (many FN, few FP)
• Mean depth exon (often: e.g. +/- 0.01)
– Mean depth > 1
– Mean depth > 5

RD - Filtering: what’s left

                    mapQ >= 0   mapQ >= 10   mapQ >= 30
all                 207,000     207,000      207,000
mean DP exon > 1    169,000     163,000      162,000
mean DP exon > 5    160,000     153,000      152,000

RD - correction
• Mainly: GC
– Other: repeat-rich regions, mapping Q, …
• Fit linear model GC-content exon and
RD of exon
=> noise decreases
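
The GC correction can be sketched as an ordinary least-squares fit of exon RD on exon GC content, keeping the residuals. A minimal sketch, assuming one GC fraction and one mean depth per exon and at least two distinct GC values; re-centring the residuals on the overall mean depth is a choice made here for illustration, not something the slides specify.

```python
def gc_correct(gc_fractions, depths):
    """Remove the linear GC trend from per-exon read depth."""
    n = len(depths)
    mean_gc = sum(gc_fractions) / n
    mean_depth = sum(depths) / n
    sxx = sum((gc - mean_gc) ** 2 for gc in gc_fractions)
    sxy = sum((gc - mean_gc) * (d - mean_depth) for gc, d in zip(gc_fractions, depths))
    slope = sxy / sxx
    intercept = mean_depth - slope * mean_gc
    # Residual from the fitted line, shifted back to the overall mean depth.
    return [d - (intercept + slope * gc) + mean_depth
            for gc, d in zip(gc_fractions, depths)]


print(gc_correct([0.3, 0.4, 0.5, 0.6], [30, 42, 50, 58]))
```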

RD - segmentation
• Identify spikes
• Many segmentation algorithms, e.g.
GADA
• Issues: setting parameters: when to cut
off peaks?
– Combine outputs from different runs with
different parameters
– Compare to known CNVs
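
The parameter problem is visible even in a very simple caller. The sketch below is not GADA; it is a plain threshold-and-run-length caller, assuming per-bin (or per-exon) log2 ratios as input and a positive `threshold`. Both `threshold` and `min_bins` are hypothetical parameters of exactly the kind that is hard to set.

```python
def call_segments(log2_ratios, threshold, min_bins=2):
    """Return (start, end, kind) runs of consecutive bins beyond +/- threshold.

    Intervals are half-open bin indices; kind is "gain" or "loss".
    """
    segments = []
    start, state = 0, 0
    # A trailing 0.0 acts as a sentinel that closes a run reaching the last bin.
    for i, value in enumerate(list(log2_ratios) + [0.0]):
        current = 1 if value >= threshold else -1 if value <= -threshold else 0
        if current != state:
            if state != 0 and i - start >= min_bins:
                segments.append((start, i, "gain" if state > 0 else "loss"))
            start, state = i, current
    return segments


ratios = [0, 0, -1, -1, -1, 0, 0.9, 0.8]
print(call_segments(ratios, threshold=0.5))
```

Running this with several thresholds and keeping calls that recur across runs is one way to combine outputs from different parameter settings.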

RD - Issues
• How to assess TP/FP/FN? => compare
with known CNVs
• Breakpoints: unknown
– 1 datapoint/exon
– Can be outside of exon
• Different parameters for rare vs
common CNVs => which?

SR - Mapping
Short subsequences => many possible
mappings
Solution: “anchored split mapping” (e.g.
Pindel)
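
The idea of anchoring can be illustrated with a toy example. This is not Pindel's algorithm; it is a brute-force sketch, assuming the mate of the read is already mapped (the anchor) so that the search can be restricted to a small reference window near it, and assuming a simple deletion with exact matches. All names and the `min_part` parameter are hypothetical.

```python
def split_map(reference, read, window_start, window_end, min_part=5):
    """Place a read as prefix + suffix flanking a deletion inside a window.

    Returns (deletion_start, deletion_end) as a half-open reference interval,
    or None if no split placement is found.
    """
    window = reference[window_start:window_end]
    # Try the longest prefix first; both parts must keep at least min_part bases.
    for cut in range(len(read) - min_part, min_part - 1, -1):
        prefix, suffix = read[:cut], read[cut:]
        p = window.find(prefix)
        if p == -1:
            continue
        # The suffix must map downstream of the prefix, leaving a gap.
        s = window.find(suffix, p + cut + 1)
        if s != -1:
            return window_start + p + cut, window_start + s
    return None


left, deleted, right = "ACGTACGGTC", "TTTTTTTT", "GATCCAGTGA"
reference = left + deleted + right
read = left[-6:] + right[:6]          # read spanning the deletion
print(split_map(reference, read, 0, len(reference)))
```

Restricting the search to the window is what keeps the short read parts from matching all over the genome.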

D. Local reassembly
Aim: to determine breakpoints

Which reads?
– for deletions: local reads
– for insertions: hanging reads for read pairs with
only one read mapped

– (rather not: unmapped reads)

For large region: split up
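
Selecting hanging reads can be sketched with SAM flag bits, assuming the mappings are stored in SAM/BAM format (an assumption; the slides do not name a format). The bit values 0x1, 0x4 and 0x8 are those defined by the SAM specification; the function name is hypothetical.

```python
PAIRED = 0x1          # read is part of a pair
READ_UNMAPPED = 0x4   # this read is unmapped
MATE_UNMAPPED = 0x8   # the mate is unmapped


def is_hanging_read(flag):
    """True for an unmapped read whose mate is mapped (candidate for insertion assembly)."""
    return (bool(flag & PAIRED)
            and bool(flag & READ_UNMAPPED)
            and not flag & MATE_UNMAPPED)


for flag in (69, 77, 99):
    print(flag, is_hanging_read(flag))
```

Reads with both ends unmapped (flag 77 here) are excluded, in line with the preference above for not using fully unmapped reads.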

Assemblers

Velvet
ABySS
TIGRA
…

Conclusions
• Available algorithms: demonstrate a
technique rather than offer a
complete solution
• Different algorithms => different results

Genotyping
• Create alternative reference => remap reads
– All reads vs reads covering variant loci
– Whole-genome vs concatenation of variant loci
• Homozygous insertions/deletions: should disappear
• Heterozygous insertions/deletions: should have different
signatures
• Bayesian approach: see what’s the most likely: do the reads
support wild-type/het/homnonref?
• Not exact mapping => local reassembly
– Microhomologies & non-template sequence => “breakpoint”
= region of 2-10 bp
• Convention: left-most position reported (but not always)
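
The Bayesian step can be sketched with a binomial model. This is a minimal illustration under stated assumptions, not the method of any specific genotyper: reads are counted as supporting either the reference or the alternative (variant) allele after remapping, a heterozygote yields variant reads half the time, and `error_rate` and the flat `priors` are hypothetical values.

```python
import math


def genotype_posteriors(ref_reads, alt_reads, error_rate=0.01,
                        priors=(1 / 3, 1 / 3, 1 / 3)):
    """Posterior probability of wild-type / het / hom non-ref given read counts."""
    # Expected fraction of variant-supporting reads under each genotype.
    alt_fraction = {"hom_ref": error_rate, "het": 0.5, "hom_nonref": 1 - error_rate}
    n = ref_reads + alt_reads
    weighted = {}
    for (genotype, p), prior in zip(alt_fraction.items(), priors):
        likelihood = math.comb(n, alt_reads) * p ** alt_reads * (1 - p) ** ref_reads
        weighted[genotype] = prior * likelihood
    total = sum(weighted.values())
    return {genotype: w / total for genotype, w in weighted.items()}


posteriors = genotype_posteriors(ref_reads=10, alt_reads=10)
print(max(posteriors, key=posteriors.get), posteriors)
```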

References and software
• Medvedev P et al. Nat Methods 6(11):S13-S20 (2009)
• Lee S et al. Bioinformatics 24:i59-i67 (2008)
• Hormozdiari F et al. Genome Res 19:1270-1278 (2009)
• Campbell P et al. Nat Genet 40:722-729 (2008)
• Ye K et al. Bioinformatics 25(21):2865-2871 (2009)
• Chen K et al. Genome Res 19:1527-1541 (2009)
• Yoon S et al. Genome Res 19:1586-1592 (2009)
• Du J et al. PLoS Comp Biol 5(7):e1000432 (2009)
• Aerts J & Tyler-Smith C. In: Encyclopedia of Life Sciences
(2009)
