AMP Lab presentation -- Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation
1. CloudBreak: A MapReduce Algorithm for Genomic Structural Variation Detection
Chris Whelan & Kemal Sönmez
Oregon Health & Science University
March 5, 2013
Whelan & Sönmez (OHSU) CloudBreak March 5, 2013 1 / 34
2. Overview
Background
Current Approaches
MapReduce Framework for SV Detection
Cloudbreak Algorithm
Results
Ongoing Work
3. Background - High Throughput Sequencing
High Throughput (Illumina) Sequencing produces millions of paired
short (∼100bp) reads of DNA from an input sample
The challenge: use these reads to find characteristics of DNA sample
relevant to disease or phenotype
The approach: In resequencing experiments, align short reads to a
reference genome for the species and find the differences
Sequencing error, diploid genomes, hard to map repetitive sequences
can make this difficult
Need high coverage (e.g. 30X) to detect all single nucleotide
polymorphisms (SNPs); this results in large data sets (100GB of
compressed raw data for a human genome)
4. Structural Variations
Harder to detect than SNPs are structural variations: deletions,
insertions, inversions, duplications, etc.
Generally events that affect more than 40 or 50 bases
The majority of variant bases in a normal individual genome are due
to structural variations (primarily insertions and deletions)
Variants are associated with cancer, neurological disease
5. SV Detection Approaches
Four main algorithmic approaches
Read pair (RP): Look for paired reads that map to the reference at a
distance or orientation that disagrees with the expected characteristics
of the library
Read depth (RD): Infer deletions and duplications from the number of
reads mapped to each locus
Split read mapping (SR): Split individual reads into two parts, see if
you can map them to either side of a breakpoint
De novo assembly (AS): assemble the reads into their original sequence
and compare to the reference
Hybrid approaches
6. SV Detection from Sequencing Data
Mills et al. Nature 2011
7. SV Detection is Hard
Sensitivity and FDR of deletion detection methods used on 1,000 Genomes
Project.
Mills et al. Nature 2011
8. Read-pair (RP) SV Detection
Building the sample library involves selecting the size of DNA
fragments
Only the ends of each fragment are sequenced, from the outside in
Therefore the distance between the two sequenced reads (the insert
size) is known - typically modeled as a normal distribution
9. Discordant read pairs
When reads map to the reference farther apart than expected, this
indicates a deletion in the sample between the mapping locations of
the two ends
Reads that map closer than expected imply an insertion
Reads in the wrong orientation imply an inversion
Medvedev et al. 2009
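The discordance rules above can be sketched in a few lines. This is an illustrative classifier, not Cloudbreak's code; the library parameters and the 3-SD cutoff are assumptions chosen for the example.

```python
# Sketch of read-pair discordance classification. Library mean/SD and
# the n_sd cutoff are illustrative assumptions, not tuned values.

def classify_pair(insert_size, same_strand, lib_mean=300, lib_sd=30, n_sd=3):
    """Classify a mapped read pair by comparing its implied insert size
    to the expected library distribution."""
    if same_strand:                        # wrong relative orientation
        return "inversion-candidate"
    if insert_size > lib_mean + n_sd * lib_sd:
        return "deletion-candidate"        # mapped farther apart than expected
    if insert_size < lib_mean - n_sd * lib_sd:
        return "insertion-candidate"       # mapped closer than expected
    return "concordant"
```

A pair mapping 500bp apart against a 300 ± 30bp library would fall outside three standard deviations and be flagged as a deletion candidate.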
10. Read Pair Algorithms
Identify all read pairs with discordant mappings
Attempt to cluster discordant pairs supporting the same variant
Typically ignore concordant mappings
Some algorithms consider reads with multiple mappings by choosing
the mappings that minimize the number of predicted variants; this has
been shown to increase sensitivity in repetitive regions of the
genome
Mapping results for a high coverage human genome are very large
(100GB of compressed alignment data storing only the best mappings
for a 30X genome)
11. MapReduce and Hadoop
Provides a distributed filesystem across a cluster with redundant
storage
Divides computation into Map and Reduce phases: Mappers emit
key-value pairs for a block of data, Reducers process all of the values
for each key
Good at handling data sets of the size seen in sequencing
experiments, and much larger
Able to harness a cluster of commodity machines rather than single
high-powered servers
Some algorithms translate easily to MapReduce model; others are
much harder
A natural abstraction in resequencing experiments: use a key for each
location in the genome. Examples: SNP calling in GATK or Crossbow
12. SV Detection in MapReduce
Clustering of read pairs as in traditional RP algorithms typically
involves global computations or graph structures
MapReduce, on the other hand, forces local, parallel computations
Our approach: use MapReduce to compute features for each location
in the genome from alignments relevant to that location
Locations can be small tiled windows to make the problem more
tractable
Make SV calls from features computed along the genome in a
post-processing step
13. An Algorithmic Framework for SV Detection in MapReduce
1: job Alignment
2:   function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
3:     for all Alignments a ∈ Align(<s, q>) do
4:       Emit(ReadPairId rpid, Alignment a)
5:   function Reduce(ReadPairId rpid, Alignments a_1, a_2, ...)
6:     AlignmentPairList ap ← ValidAlignmentPairs(a_1, a_2, ...)
7:     Emit(ReadPairId rpid, AlignmentPairList ap)
8: job Compute SV Features
9:   function Map(ReadPairId rpid, AlignmentPairList ap)
10:    for all AlignmentPairs <a_1, a_2> ∈ ap do
11:      for all GenomicLocations l ∈ Loci(a_1, a_2) do
12:        ReadPairInfo rpi ← <InsertSize(a_1, a_2), AlignmentScore(a_1, a_2)>
13:        Emit(GenomicLocation l, ReadPairInfo rpi)
14:   function Reduce(GenomicLocation l, ReadPairInfos rpi_1, rpi_2, ...)
15:     SVFeatures φ_l ← Φ(InsertSizes i_1, i_2, ..., AlignmentScores q_1, q_2, ...)
16:     Emit(GenomicLocation l, SVFeatures φ_l)
17: StructuralVariationCalls svs ← PostProcess(φ_1, φ_2, ...)
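The second job above can be simulated in memory to make the data flow concrete. This is a toy sketch, not the Hadoop implementation: the window size and the stand-ins for Loci (windows spanned by the pair) and Φ (median insert size per window) are illustrative assumptions.

```python
# Minimal in-memory sketch of the "Compute SV Features" job: the Map
# phase emits (window, insert_size) pairs, the Reduce phase applies a
# toy Φ (median) per window. WINDOW and the helpers are assumptions.
from collections import defaultdict
from statistics import median

WINDOW = 100  # tiled genomic windows (bp)

def loci(a1_pos, a2_pos):
    """Toy Loci(a1, a2): all windows spanned by the pair's mappings."""
    lo, hi = sorted((a1_pos, a2_pos))
    return range(lo // WINDOW, hi // WINDOW + 1)

def compute_sv_features(alignment_pairs):
    by_window = defaultdict(list)
    for a1_pos, a2_pos in alignment_pairs:           # Map phase
        insert_size = abs(a2_pos - a1_pos)
        for w in loci(a1_pos, a2_pos):
            by_window[w].append(insert_size)
    # Reduce phase: Φ collapses each window's insert sizes to a feature
    return {w: median(sizes) for w, sizes in by_window.items()}

features = compute_sv_features([(1000, 1300), (1050, 1350), (1100, 1700)])
```

On a real cluster each window's reduce call runs independently, which is exactly what makes the per-locus formulation parallelize.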
14. Three user-defined functions
This framework leaves three functions to be defined
May be many different approaches to take within this framework,
depending on the application
Loci: (a_1, a_2) → L_m ⊆ L
Φ: {ReadPairInfo rpi_{m,i,j}} → R^N
PostProcess: {φ_1, φ_2, ..., φ_N} → {SVType s, l_start, l_end}
15. Cloudbreak implementation
We focus on detecting deletions and small insertions
Implemented as a native Hadoop application
Use features computed from fitting a mixture model to the observed
distribution of insert sizes at each locus
Process as many mappings as possible for ambiguously mapped reads
16. Local distributions of insert sizes
Estimate distribution of insert sizes observed at each window as a
Gaussian mixture model (GMM)
Similar to idea in MoDIL (Lee et al. 2009)
Use a constrained expectation-maximization algorithm: constrain one
component to the library mean insert size and both components to
share the same variance, then estimate the mean and weight of the
second component.
Features include the log-likelihood ratio of the fitted two-component
model to the likelihood of the insert sizes under a no-variant model
(a normal distribution with the library parameters).
Other features: weight of the second component, estimated mean of
the second component.
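A hedged sketch of the constrained EM fit described above: component 1 is pinned at the library mean, both components share the library variance, and only the second component's mean and weight are updated. The initialization and iteration count are illustrative choices, not Cloudbreak's.

```python
# Constrained two-component GMM fit: only mu2 (second-component mean)
# and w (its weight) are free; means/variances otherwise fixed to the
# library parameters. Initialization and iters are assumptions.
import math

def normal_pdf(x, mu, sd):
    return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def fit_constrained_gmm(inserts, lib_mean, lib_sd, iters=50):
    mu2 = max(inserts)   # start the free mean at the largest insert size
    w = 0.5              # initial weight of the second component
    for _ in range(iters):
        # E-step: responsibility of component 2 for each insert size
        resp = []
        for x in inserts:
            p1 = (1 - w) * normal_pdf(x, lib_mean, lib_sd)
            p2 = w * normal_pdf(x, mu2, lib_sd)
            resp.append(p2 / (p1 + p2))
        # M-step: update only the free parameters mu2 and w
        total = sum(resp)
        if total > 0:
            mu2 = sum(r * x for r, x in zip(resp, inserts)) / total
        w = total / len(inserts)
    return mu2, w
```

For a heterozygous deletion, roughly half the pairs come from the variant allele, so the fit should recover a second component near the enlarged insert size with weight near 0.5.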
17. Local distributions of insert sizes
[Figure: histograms of insert sizes (0-500bp) at loci with No Variant,
Homozygous Deletion, Heterozygous Deletion, and Heterozygous Insertion]
19. Handling ambiguous mappings
Incorrect mappings of read pairs are unlikely to form clusters of insert
sizes at a given window
Before fitting the GMM, remove outliers using a nearest-neighbor
method: if the kth nearest neighbor of a mapped pair's insert size is
more than c * (library fragment size SD) away, remove that mapping
Control number of mappings based on an adaptive cutoff for
alignment score: Discard mapping m if the ratio of the best alignment
score for that window to the score of m is larger than some cutoff.
This allows visibility into regions where no reads are mapped
unambiguously.
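The two filters above can be sketched as follows. The values of k, c, and the score-ratio cutoff here are illustrative parameters, not Cloudbreak's tuned settings.

```python
# Sketches of the two ambiguous-mapping filters: a k-nearest-neighbor
# outlier filter on insert sizes, and an adaptive alignment-score
# ratio cutoff. k, c, and max_ratio are illustrative assumptions.

def knn_filter(inserts, lib_sd, k=3, c=5.0):
    """Drop mappings whose k-th nearest neighbor insert size is farther
    than c * lib_sd away: isolated inserts are likely mismappings."""
    kept = []
    for i, x in enumerate(inserts):
        dists = sorted(abs(x - y) for j, y in enumerate(inserts) if j != i)
        if len(dists) >= k and dists[k - 1] <= c * lib_sd:
            kept.append(x)
    return kept

def score_filter(mappings, max_ratio=2.0):
    """Keep mapping m only if best_score / score(m) <= max_ratio,
    where best_score is the best alignment score in the window."""
    best = max(score for _, score in mappings)
    return [(m, s) for m, s in mappings if best / s <= max_ratio]
```

Because the score cutoff is relative to the best score in each window, repetitive regions with no unambiguous mappings still contribute evidence rather than being discarded outright.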
20. Postprocessing
First extract contiguous genomic loci where the log-likelihood ratio of
the two models is greater than a given threshold.
To eliminate noise we apply a median filter with window size 5.
Let µ₂ be the estimated mean of the second component and µ the
library insert size. We end regions when µ₂ changes by more than
60bp (2σ), and discard regions whose length differs from the event
size implied by µ₂ by more than µ.
Cloudbreak loses some breakpoint resolution due to the genome
windows and filters.
21. Results Comparison
We compare Cloudbreak to a selection of widely used algorithms
taking different approaches:
Breakdancer (Chen et al. 2009): Traditional RP based approach
DELLY (Rausch et al. 2012): RP based approach with SR refinement
of calls
GASVPro (Sindi et al. 2012): RP based approach, uses ambiguous
mappings of discordant read pairs which it resolves through MCMC
algorithm; looks for RD signals at predicted breakpoint locations by
examining concordant pairs
Pindel (Ye et al. 2009): SR approach; looks for clusters of read pairs
where only one read could be mapped and searches for split read
mappings for the other read
MoDIL (Lee et al. 2009): Mixture of distributions; only on simulated
data due to runtime requirements.
22. Simulated Data
Very little publicly available NGS data from a genome with fully
characterized structural variations
Can match algorithm output to validated SVs, but don't know whether
novel predictions are wrong or simply undiscovered.
To get a simulated data set with known ground truth and realistic
events: take a (somewhat) fully characterized genome, apply its
variants to the reference sequence, and simulate reads from the
modified reference.
Use Venter genome (Levy et al, 2007), chromosome 2.
To simulate heterozygosity, randomly assign half of the variants to be
homozygous and half heterozygous, and create two modified
references.
Simulated 100bp paired reads with a 100bp insert size to 30X
coverage.
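The diploid-simulation recipe above can be sketched as a toy function: apply each deletion to the reference, assigning half of the variants as homozygous (on both haplotypes) and half as heterozygous (on one). The variant representation and helper are our assumptions, not the actual simulation pipeline.

```python
# Toy diploid simulation: build two modified references from a sequence
# and a list of deletions, half homozygous and half heterozygous.
# The (start, length) variant format is an illustrative assumption.
import random

def build_haplotypes(reference, deletions, seed=0):
    """deletions: list of (start, length) in reference coordinates."""
    rng = random.Random(seed)
    shuffled = deletions[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    hom = set(shuffled[:half])     # applied to both haplotypes
    het = set(shuffled[half:])     # applied to one haplotype only

    def apply(variants):
        seq = reference
        # apply right-to-left so earlier coordinates stay valid
        for start, length in sorted(variants, reverse=True):
            seq = seq[:start] + seq[start + length:]
        return seq

    return apply(hom | het), apply(hom)

hap1, hap2 = build_haplotypes("ACGTACGTACGT", [(2, 2), (8, 2)])
```

Reads would then be simulated from both haplotypes at equal depth, so heterozygous deletions are supported by roughly half the pairs covering them.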
23. ROC curve for Chromosome 2 Deletion Simulation
[Figure: ROC curve of true positives (0-350) vs. false positives (0-400)
for deletions in the Venter diploid chr2 simulation; methods: Cloudbreak,
Breakdancer, Pindel, GASVPro, DELLY]
Caveat: Methods perform better on simulated data than on real
whole genome datasets.
24. Ability to find simulated deletions by size at 10% FDR
Number of deletions found in each size class (number of exclusive
predictions for algorithm in that class)
Cloudbreak competitive for a range of size classes
              40-100bp  101-250bp  251-500bp  501-1000bp  >1000bp
Total Number    224        84         82         31         26
Cloudbreak     47 (7)    50 (2)     55 (4)     12 (4)     15 (0)
Breakdancer    52 (10)   49 (2)     49 (0)      7 (0)     14 (0)
GASVPro        31 (4)    25 (0)     23 (0)      2 (0)      6 (0)
DELLY          22 (2)    56 (3)     40 (0)      8 (0)     12 (0)
Pindel         60 (35)   16 (0)     41 (2)      1 (0)     12 (0)
25. Insertions in Simulated Data
[Figure: ROC curve of true positives (0-80) vs. false positives (0-80)
for insertions in the Venter diploid chr2 simulation; methods:
Cloudbreak, Breakdancer, Pindel]
26. NA18507 Data Set
Well studied sample from a Yoruban male individual
High quality sequence to 37X coverage, 100bp reads with a 100bp
insert size
We created a gold standard set of deletions from three different
studies with low false discovery rates: Mills et al. 2011, Human
Genome Structural Variation Project (Kidd et al. 2008), and the 1000
Genomes Project (Mills et al. 2011)
27. ROC Curve for NA18507 Deletions
All algorithms look much worse on real data (possibly due to an
incomplete truth set)
[Figure: true positives (0-2000) vs. novel predictions (0-15,000) for
deletions on NA18507; methods: Cloudbreak, Breakdancer, Pindel,
GASVPro, DELLY]
28. Ability to find NA18507 deletions by size
Using the same cutoffs that yielded a 10% FDR on the simulated
chromosome 2 data set, adjusted for the difference in coverage from
30X to 37X.
Cloudbreak identifies more small deletions
Cloudbreak contributes more exclusive predictions
             Prec.   Recall  40-100bp   101-250bp  251-500bp  501-1000bp  >1000bp
Total Number                 7,466      235        218        110         375
Cloudbreak   0.0978  0.115   423 (179)  128 (9)    158 (8)     70 (3)     186 (12)
Breakdancer  0.122   0.112   261 (41)   132 (8)    167 (1)     92 (0)     288 (10)
GASVPro      0.134   0.0401  104 (17)    37 (2)     77 (0)     26 (0)      93 (0)
DELLY        0.0824  0.091   143 (9)    125 (7)    158 (1)     83 (1)     256 (3)
Pindel       0.16    0.0685  149 (12)    57 (0)    140 (1)     58 (0)     172 (2)
29. Ability to detect deletions in repetitive regions
Detected deletions on the simulated and NA18507 data sets identified
by each tool, broken down by whether the deletion overlaps with a
RepeatMasker-annotated element.
              Simulated Data         NA18507
              Non-repeat  Repeat     Non-repeat  Repeat
Total Number    120        327         553        7,851
Cloudbreak     28 (4)     151 (13)   204 (46)    761 (165)
Breakdancer    29 (5)     142 (7)    186 (21)    754 (39)
GASVPro        15 (2)      72 (2)     71 (6)     266 (13)
DELLY          21 (2)     117 (3)    147 (11)    618 (10)
Pindel         18 (9)     112 (28)   103 (4)     473 (11)
30. Genotyping Deletions
We can use the mixing parameter that controls the weight of the two
components in the GMM to accurately predict deletion genotypes.
By setting a simple cutoff of .2 on the average value of the weight in
each prediction, we were able to achieve 86.7% and 94.9% accuracy
in predicting the genotype of the true positive deletions we detected
in the simulated and real data sets, respectively.
                                    Actual Genotypes
                         Simulated Data            NA18507
                         Homozygous  Heterozygous  Homozygous  Heterozygous
Predicted  Homozygous        88           3            70          11
Genotypes  Heterozygous      18          70             4         209
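The genotyping rule above is simple enough to state as code. Exactly which weight the 0.2 cutoff is applied to is our assumption: here we threshold the average weight of the library-mean (no-variant) component across a prediction, which should sit near 0.5 for a heterozygous deletion and near 0 for a homozygous one.

```python
# Genotyping sketch: threshold the average per-window weight of the
# no-variant GMM component at 0.2 (cutoff value from the text; which
# component it applies to is our assumption).

def genotype(novariant_weights, cutoff=0.2):
    """novariant_weights: per-window weights of the library-mean
    component within one predicted deletion. Near 0.5 suggests one
    unaffected allele; near 0 suggests both alleles are deleted."""
    avg = sum(novariant_weights) / len(novariant_weights)
    return "heterozygous" if avg > cutoff else "homozygous"
```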
31. Running Times
Running times (wall) on both data sets
Cloudbreak ran on approx. 150 workers for the simulated data and 650
workers for NA18507 (42m of which in MapReduce)
Breakdancer and DELLY were run on a single CPU but can be set to
process each chromosome independently (~10X speedup)
Pindel was run in single-threaded mode
MoDIL was run on 200 cores
             Simulated Chromosome 2 Data  NA18507
Cloudbreak         835s                    106m
Breakdancer        653s                     36h
GASVPro           3339s                     33h
DELLY             1964s                    208m
Pindel            1336s                     38h
MoDIL               48h                      **
32. Ongoing work: Generate additional features, improve postprocessing
Goals: increase accuracy and breakpoint resolution
Features involving split read mappings or pairs in which only one end
is mapped
Features involving sequence and sequence variants
Annotations of sequence features and previously identified variants
Apply machine learning techniques: conditional random fields, Deep
Learning
Potential future work: add local assembly of breakpoints
33. Ongoing work: automate deployment and execution on cloud providers
Many researchers don't have access to Hadoop clusters, or to servers
powerful enough to process these data sets
On-demand creation of clusters with cloud providers can be
cost-effective, especially with spot pricing
Developing scripts to automate on-demand construction of Hadoop
clusters in cloud (Amazon EC2, Rackspace) using Apache Whirr
project
Bottleneck: transferring data into and out of the cloud
34. Conclusions
A novel approach to applying MapReduce to the structural variation
detection problem
Makes insert-size-distribution clustering approaches run in feasible
time
Improved accuracy over existing algorithms, especially in repetitive
regions
Ability to accurately genotype calls
Costs: additional CPU hours and somewhat lower breakpoint resolution