CloudBreak: A MapReduce Algorithm for Genomic
           Structural Variation Detection

                         Chris Whelan & Kemal Sönmez

                          Oregon Health & Science University


                                  March 5, 2013




Whelan & Sönmez (OHSU)                 CloudBreak               March 5, 2013   1 / 34
Overview




    Background
    Current Approaches
    MapReduce Framework for SV Detection
    Cloudbreak Algorithm
    Results
    Ongoing Work




Background - High Throughput Sequencing


    High Throughput (Illumina) Sequencing produces millions of paired
    short (∼100bp) reads of DNA from an input sample
    The challenge: use these reads to find characteristics of the DNA
    sample relevant to disease or phenotype
    The approach: In resequencing experiments, align short reads to a
    reference genome for the species and find the differences
    Sequencing error, diploid genomes, and hard-to-map repetitive
    sequences can make this difficult
    Need high coverage (e.g. 30X) to detect all single nucleotide
    polymorphisms (SNPs); results in large data sets (100GB of compressed
    raw data for a human genome)



Structural Variations




    Harder to detect than SNPs are structural variations: deletions,
    insertions, inversions, duplications, etc.
    Generally events that affect more than 40 or 50 bases
    The majority of variant bases in a normal individual genome are due
    to structural variations (primarily insertions and deletions)
    Variants are associated with cancer, neurological disease




SV Detection Approaches



    Four main algorithmic approaches
           Read pair (RP): Look for paired reads that map to the reference at a
           distance or orientation that disagrees with the expected characteristics
           of the library
           Read depth (RD): Infer deletions and duplications from the number of
           reads mapped to each locus
           Split read mapping (SR): Split individual reads into two parts, see if
           you can map them to either side of a breakpoint
           De novo assembly (AS): assemble the reads into their original sequence
           and compare to the reference
    Hybrid approaches




SV Detection from Sequencing Data




Mills et al. Nature 2011
SV Detection is Hard

Sensitivity and FDR of deletion detection methods used on 1,000 Genomes
Project.




Mills et al. Nature 2011
Read-pair (RP) SV Detection




    Building the sample library involves selecting the size of DNA
    fragments
    Only the ends of each fragment are sequenced, from the outside in
    Therefore the distance between the two sequenced reads (the insert
    size) is known - typically modeled as a normal distribution
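The insert-size check above can be made concrete with a toy sketch (an illustration, not from the slides); the library mean, SD, and 3-SD threshold are all assumed values:

```python
# Toy illustration (parameter values assumed): flag a read pair as
# discordant when its mapped insert size deviates from the library
# mean by more than n_sd standard deviations.
def is_discordant(insert_size, lib_mean=300, lib_sd=30, n_sd=3):
    return abs(insert_size - lib_mean) > n_sd * lib_sd

print(is_discordant(310))  # False: close to the library mean
print(is_discordant(800))  # True: much larger, suggests a deletion
```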




Discordant read pairs


    When reads map to the reference a greater-than-expected distance
    apart, this indicates a deletion in the sample between the mapping
    locations of the two ends
    Reads that map closer than expected imply an insertion
    Reads in the wrong orientation imply an inversion




Medvedev et al. 2009


Read Pair Algorithms


    Identify all read pairs with discordant mappings
    Attempt to cluster discordant pairs supporting the same variant
    Typically ignore concordant mappings
    Some algorithms consider reads with multiple mappings by choosing
    the mappings that minimize the number of predicted variants; this
    has been shown to increase sensitivity in repetitive regions of the
    genome
    Mapping results for a high coverage human genome are very large
    (100GB of compressed alignment data storing only the best mappings
    for a 30X genome)




MapReduce and Hadoop

   Provides a distributed filesystem across a cluster with redundant
   storage
   Divides computation into Map and Reduce phases: Mappers emit
   key-value pairs for a block of data, Reducers process all of the values
   for each key
   Good at handling data sets of the size seen in sequencing
   experiments, and much larger
   Able to harness a cluster of commodity machines rather than single
   high-powered servers
   Some algorithms translate easily to MapReduce model; others are
   much harder
   A natural abstraction in resequencing experiments: use a key for each
   location in the genome. Examples: SNP calling in GATK or Crossbow
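The key-per-genomic-location abstraction can be simulated in a few lines of plain Python (a toy, not Hadoop code); the 100bp window size and the per-window statistic are assumptions for illustration:

```python
from collections import defaultdict

# Toy simulation of the MapReduce contract: mappers emit (key, value)
# pairs keyed by genomic window; the reducer sees all values for one key.
def map_read(read):                       # read = (position, insert_size)
    pos, isize = read
    yield (pos // 100, isize)             # key = 100bp window index

def reduce_window(window, insert_sizes):  # one call per distinct key
    return (window, sum(insert_sizes) / len(insert_sizes))

reads = [(120, 310), (150, 290), (480, 700)]
groups = defaultdict(list)                # the "shuffle" phase
for r in reads:
    for k, v in map_read(r):
        groups[k].append(v)
results = sorted(reduce_window(k, vs) for k, vs in groups.items())
print(results)  # [(1, 300.0), (4, 700.0)]
```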

SV Detection in MapReduce



    Clustering of read pairs as in traditional RP algorithms typically
    involves global computations or graph structures
    MapReduce, on the other hand, forces local, parallel computations
    Our approach: use MapReduce to compute features for each location
    in the genome from alignments relevant to that location
    Locations can be small tiled windows to make the problem more
    tractable
    Make SV calls from features computed along the genome in a
    post-processing step




An Algorithmic Framework for SV Detection in MapReduce

1:  job Alignment
2:    function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
3:      for all Alignments a ∈ Align(<s, q>) do
4:        Emit(ReadPairId rpid, Alignment a)
5:    function Reduce(ReadPairId rpid, Alignments a_1,2,...)
6:      AlignmentPairList ap ← ValidAlignmentPairs(a_1,2,...)
7:      Emit(ReadPairId rpid, AlignmentPairList ap)

8:  job Compute SV Features
9:    function Map(ReadPairId rpid, AlignmentPairList ap)
10:     for all AlignmentPairs <a_1, a_2> ∈ ap do
11:       for all GenomicLocations l ∈ Loci(a_1, a_2) do
12:         ReadPairInfo rpi ← <InsertSize(a_1, a_2), AlignmentScore(a_1, a_2)>
13:         Emit(GenomicLocation l, ReadPairInfo rpi)
14:   function Reduce(GenomicLocation l, ReadPairInfos rpi_1,2,...)
15:     SVFeatures φ_l ← Φ(InsertSizes i_1,2,..., AlignmentScores q_1,2,...)
16:     Emit(GenomicLocation l, SVFeatures φ_l)

17: StructuralVariationCalls svs ← PostProcess(φ_1,2,...)




Three user-defined functions



    This framework leaves three functions to be defined
    May be many different approaches to take within this framework,
    depending on the application

          Loci : <a_1, a_2> → L_m ⊆ L
             Φ : {ReadPairInfo rpi_{m,i,j}} → R^N
   PostProcess : {φ_1, φ_2, ..., φ_N} → {<SVType s, l_start, l_end>}
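To make the contract concrete, here is a minimal Python sketch of plausible stand-ins for the three functions; the window size, the feature definition, and the deletion threshold are all assumptions for illustration, not Cloudbreak's actual choices:

```python
# Hypothetical stand-ins for the framework's three user-defined functions.
WINDOW = 100  # tiling window size in bp; an assumed parameter

def loci(a1, a2):
    """Loci: map an alignment pair to the genomic windows it spans."""
    start, end = min(a1, a2), max(a1, a2)
    return range(start // WINDOW, end // WINDOW + 1)

def phi(insert_sizes, scores):
    """Phi: reduce the read-pair info at one location to a feature;
    here just a score-weighted mean insert size."""
    total = sum(scores)
    return sum(i * s for i, s in zip(insert_sizes, scores)) / total

def post_process(features, lib_mean=300, threshold=150):
    """PostProcess: call a deletion at windows whose feature exceeds
    the library mean by more than a threshold."""
    return [loc for loc, f in features.items() if f - lib_mean > threshold]

print(list(loci(120, 480)))                # [1, 2, 3, 4]
print(phi([700, 680], [1.0, 1.0]))         # 690.0
print(post_process({3: 690.0, 7: 305.0}))  # [3]
```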




Cloudbreak implementation




    We focus on detecting deletions and small insertions
    Implemented as a native Hadoop application
    Use features computed from fitting a mixture model to the observed
    distribution of insert sizes at each locus
    Process as many mappings as possible for ambiguously mapped reads




Local distributions of insert sizes


    Estimate distribution of insert sizes observed at each window as a
    Gaussian mixture model (GMM)
    Similar to idea in MoDIL (Lee et al. 2009)
    Use a constrained expectation-maximization algorithm: constrain one
    component to the library mean insert size, constrain both components
    to share the same variance, and estimate the mean and weight of the
    second component.
    Features computed include the log-likelihood ratio of the fitted
    two-component model to a no-variant model: a normal distribution
    with the library parameters.
    Other features: weight of the second component, estimated mean of
    the second component.
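The constrained EM described above can be sketched as follows (an illustration, not the Cloudbreak implementation; the initialisation and iteration count are assumptions): component 1 is pinned to the library mean, both components share the library variance, and EM updates only the mean and weight of component 2.

```python
import math

def norm_pdf(x, mu, sd):
    return math.exp(-((x - mu) ** 2) / (2 * sd * sd)) / (sd * math.sqrt(2 * math.pi))

def fit_constrained_gmm(xs, lib_mean=300.0, lib_sd=30.0, iters=50):
    mu2, w2 = max(xs), 0.5          # crude initialisation (assumption)
    for _ in range(iters):
        # E-step: responsibility of component 2 for each insert size
        resp = []
        for x in xs:
            p1 = (1 - w2) * norm_pdf(x, lib_mean, lib_sd)
            p2 = w2 * norm_pdf(x, mu2, lib_sd)
            resp.append(p2 / (p1 + p2))
        # M-step: update only the free parameters mu2 and w2
        total = sum(resp)
        mu2 = sum(r * x for r, x in zip(resp, xs)) / total
        w2 = total / len(xs)
    return mu2, w2

# Heterozygous-deletion-like data: half near 300, half near 700
xs = [290, 305, 310, 295, 690, 700, 710, 705]
mu2, w2 = fit_constrained_gmm(xs)
print(round(mu2), round(w2, 2))   # roughly 701 and 0.5
```

A second component near 700 with weight near 0.5 is exactly the heterozygous-deletion signature the slide describes.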


Local distributions of insert sizes
[Figure: per-window insert-size distributions (x-axis 0-500) for four
cases: No Variant, Homozygous Deletion, Heterozygous Deletion,
Heterozygous Insertion]
Cloudbreak output example

[Figure: example of Cloudbreak output]
Handling ambiguous mappings



    Incorrect mappings of read pairs are unlikely to form clusters of insert
    sizes at a given window
    Before fitting the GMM, remove outliers using a nearest-neighbor
    method: if the kth nearest neighbor of a mapped pair is more than
    c × (library fragment size SD) away, remove that mapping
    Control the number of mappings with an adaptive cutoff on alignment
    score: discard mapping m if the ratio of the best alignment score in
    that window to the score of m exceeds some cutoff.
    This allows visibility into regions where no reads are mapped
    unambiguously.
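The nearest-neighbor filter can be sketched as follows (the values of k, c, and the library SD are assumptions): a mapping is dropped when its k-th nearest neighbor among the insert sizes at the window is more than c times the library SD away.

```python
# Sketch of a k-nearest-neighbor outlier filter over the insert sizes
# observed at one window; parameter values are assumptions.
def knn_filter(insert_sizes, k=2, c=3.0, lib_sd=30.0):
    kept = []
    for i, x in enumerate(insert_sizes):
        dists = sorted(abs(x - y) for j, y in enumerate(insert_sizes) if j != i)
        if len(dists) >= k and dists[k - 1] <= c * lib_sd:
            kept.append(x)
    return kept

sizes = [295, 300, 310, 305, 1500]   # 1500 is a stray mis-mapping
print(knn_filter(sizes))             # [295, 300, 310, 305]
```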




Postprocessing



    First extract contiguous genomic loci where the log-likelihood ratio
    of the two models is greater than a given threshold.
    To eliminate noise we apply a median filter with window size 5.
    Let µ̂ be the estimated mean of the second component and µ the
    library insert size. We end regions when µ̂ changes by more than
    60bp (2σ), and discard regions whose length differs from the
    estimated event size by more than µ.
    Cloudbreak loses some breakpoint resolution due to the genome
    windows and filters.
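The thresholding and median-filter steps above can be sketched as follows (the threshold value and toy data are assumptions; the window size 5 is from the slide):

```python
# Median-filter the per-window log-likelihood ratios, then extract runs
# of contiguous windows above a threshold. Edge windows use truncated
# neighborhoods.
def median_filter(values, width=5):
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sorted(window)[len(window) // 2])
    return out

def extract_regions(lrs, threshold=2.0):
    regions, start = [], None
    for i, v in enumerate(lrs):
        if v > threshold and start is None:
            start = i
        elif v <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(lrs) - 1))
    return regions

lrs = [0.1, 0.3, 5.0, 6.1, 5.2, 5.8, 5.5, 0.4, 0.1, 0.2]
print(extract_regions(median_filter(lrs)))  # [(1, 6)]
```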




Results Comparison

    We compare Cloudbreak to a selection of widely used algorithms
    taking different approaches:
    Breakdancer (Chen et al. 2009): Traditional RP based approach
    DELLY (Rausch et al. 2012): RP based approach with SR refinement
    of calls
    GASVPro (Sindi et al. 2012): RP-based approach; uses ambiguous
    mappings of discordant read pairs, which it resolves through an
    MCMC algorithm, and looks for RD signals at predicted breakpoint
    locations by examining concordant pairs
    Pindel (Ye et al. 2009): SR approach; looks for clusters of read pairs
    where only one read could be mapped and searches for split read
    mappings for the other read
    MoDIL (Lee et al. 2009): mixture-of-distributions approach; run
    only on the simulated data due to its runtime requirements.

Simulated Data

    Very little publicly available NGS data from a genome with fully
    characterized structural variations
    Can match algorithm output to validated SVs, but we don't know
    whether novel predictions are false positives or undiscovered
    variants.
    Way to get a simulated data set with ground truth known and realistic
    events: take a (somewhat) fully characterized genome, apply variants
    to reference sequence, simulate reads from modified reference.
    Use Venter genome (Levy et al, 2007), chromosome 2.
    To simulate heterozygosity, randomly assign half of the variants to be
    homozygous and half heterozygous, and create two modified
    references.
    Simulated 100bp paired reads with a 100bp insert size to 30X
    coverage.

ROC curve for Chromosome 2 Deletion Simulation

[Figure: ROC curve (true positives vs. false positives) for deletions in
the Venter diploid chr2 simulation; methods: Cloudbreak, Breakdancer,
Pindel, GASVPro, DELLY]
    Caveat: Methods perform better on simulated data than on real
    whole genome datasets.
Ability to find simulated deletions by size at 10% FDR



    Number of deletions found in each size class (number of exclusive
    predictions for algorithm in that class)
    Cloudbreak competitive for a range of size classes

                           40-100bp    101-250bp    251-500bp    501-1000bp    > 1000bp
         Total Number            224           84           82            31          26
           Cloudbreak        47 ( 7)      50 ( 2)      55 ( 4)       12 ( 4)      15 (0)
          Breakdancer       52 ( 10)      49 ( 2)       49 (0)         7 (0)      14 (0)
             GASVPro         31 ( 4)       25 (0)       23 (0)         2 (0)       6 (0)
               DELLY         22 ( 2)      56 ( 3)       40 (0)         8 (0)      12 (0)
                Pindel      60 ( 35)       16 (0)      41 ( 2)         1 (0)      12 (0)




Insertions in Simulated Data


[Figure: ROC curve (true positives vs. false positives) for insertions
in the Venter diploid chr2 simulation; methods: Cloudbreak, Breakdancer,
Pindel]
NA18507 Data Set




    Well studied sample from a Yoruban male individual
    High quality sequence to 37X coverage, 100bp reads with a 100bp
    insert size
    We created a gold standard set of deletions from three different
    studies with low false discovery rates: Mills et al. 2011, Human
    Genome Structural Variation Project (Kidd et al. 2008), and the 1000
    Genomes Project (Mills et al. 2011)




ROC Curve for NA18507 Deletions
    All algorithms look much worse on real data (possibly due to an
    incomplete truth set)

[Figure: true positives vs. novel predictions for deletion calls on
NA18507; methods: Cloudbreak, Breakdancer, Pindel, GASVPro, DELLY]
Ability to find NA18507 deletions by size


    Using the same cutoffs that yielded a 10% FDR on the simulated
    chromosome 2 data set, adjusted for the difference in coverage from
    30X to 37X.
    Cloudbreak identifies more small deletions
    Cloudbreak contributes more exclusive predictions

                    Prec.   Recall    40-100bp    101-250bp     251-500bp   501-1000bp    > 1000bp
    Total Number                          7,466           235        218          110          375
      Cloudbreak   0.0978   0.115    423 ( 179)     128 ( 9)     158 ( 8)      70 ( 3)    186 ( 12)
     Breakdancer    0.122    0.112    261 ( 41)     132 ( 8)     167 ( 1)       92 (0)    288 ( 10)
       GASVPro      0.134   0.0401    104 ( 17)      37 ( 2)       77 (0)       26 (0)       93 (0)
         DELLY     0.0824    0.091     143 ( 9)     125 ( 7)     158 ( 1)      83 ( 1)     256 ( 3)
          Pindel     0.16   0.0685    149 ( 12)       57 (0)     140 ( 1)       58 (0)     172 ( 2)




Ability to detect deletions in repetitive regions



    Detected deletions on the simulated and NA18507 data sets identified
    by each tool, broken down by whether the deletion overlaps with a
    RepeatMasker-annotated element.

                                  Simulated Data                NA18507
                              Non-repeat      Repeat    Non-repeat      Repeat
              Total Number           120          327          553         7851
                Cloudbreak       28 ( 4)    151 ( 13)    204 ( 46)   761 ( 165)
               Breakdancer       29 ( 5)     142 ( 7)    186 ( 21)    754 ( 39)
                  GASVPro        15 ( 2)      72 ( 2)      71 ( 6)    266 ( 13)
                    DELLY        21 ( 2)     117 ( 3)    147 ( 11)    618 ( 10)
                     Pindel      18 ( 9)    112 ( 28)     103 ( 4)    473 ( 11)




Genotyping Deletions


    We can use the mixing parameter that controls the weight of the two
    components in the GMM to accurately predict deletion genotypes.
    By setting a simple cutoff of .2 on the average value of the weight in
    each prediction, we were able to achieve 86.7% and 94.9% accuracy
    in predicting the genotype of the true positive deletions we detected
    in the simulated and real data sets, respectively.

                                                     Actual Genotypes
                                      Simulated Data                   NA18507
                                 Homozygous    Heterozygous    Homozygous  Heterozygous
     Predicted    Homozygous             88               3            70            11
     Genotypes    Heterozygous           18              70             4           209
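A cutoff-based genotyper along these lines can be sketched as follows. Note the mapping is our assumption, not stated on the slide: we take the variant component's weight α to be near 1.0 for homozygous and near 0.5 for heterozygous deletions, and apply the 0.2 cutoff as a boundary at 1.0 - 0.2 = 0.8.

```python
# Hypothetical sketch of genotyping from the GMM mixing weight alpha.
# ASSUMPTION: alpha near 1.0 indicates homozygous, near 0.5 heterozygous;
# the 0.2 cutoff is applied as a decision boundary at 1.0 - 0.2 = 0.8.
def genotype(alphas, cutoff=0.2):
    avg = sum(alphas) / len(alphas)   # average weight over the prediction
    return "homozygous" if avg > 1.0 - cutoff else "heterozygous"

print(genotype([0.95, 0.9, 1.0]))   # homozygous
print(genotype([0.55, 0.45, 0.5]))  # heterozygous
```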




Running Times

    Running times (wall clock) on both data sets
    Cloudbreak used approx. 150 workers for the simulated data and 650
    workers for NA18507 (42m spent in MapReduce)
    Breakdancer and DELLY were run on a single CPU but can be set to
    process each chromosome independently (~10X speedup)
    Pindel was run in single-threaded mode
    MoDIL was run on 200 cores

                           Simulated Chromosome 2 Data   NA18507
          Cloudbreak                              835s     106m
         Breakdancer                              653s       36h
           GASVPro                               3339s       33h
              DELLY                              1964s     208m
               Pindel                            1336s       38h
              MoDIL                                48h        **

Ongoing work: Generate additional features, improve
postprocessing



    Goals: increase accuracy and breakpoint resolution
    Features involving split read mappings or pairs in which only one end
    is mapped
    Features involving sequence and sequence variants
    Annotations of sequence features and previously identified variants
    Apply machine learning techniques: conditional random fields, Deep
    Learning
    Potential future work: add local assembly of breakpoints




Ongoing work: automate deployment and execution on
cloud providers



    Many researchers don't have access to Hadoop clusters, or to
    servers powerful enough to process these data sets
    On-demand creation of clusters with cloud providers can be
    cost-effective, especially with spot pricing
    Developing scripts to automate on-demand construction of Hadoop
    clusters in cloud (Amazon EC2, Rackspace) using Apache Whirr
    project
    Bottleneck: transferring data into and out of the cloud




Conclusions



    Novel approach to applying MapReduce algorithm to structural
    variation problem
    Makes insert-size distribution clustering approaches feasible in
    run time
    Improved accuracy over existing algorithms, especially in repetitive
    regions
    Ability to accurately genotype calls
    Costs: additional CPU hours, somewhat lower breakpoint resolution





Contenu connexe

Similaire à AMP Lab presentation -- Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation

Health-e-Child CaseReasoner
Health-e-Child CaseReasonerHealth-e-Child CaseReasoner
Health-e-Child CaseReasonerGaborRendes
 
Statistical Clustering
Statistical ClusteringStatistical Clustering
Statistical Clusteringtim_hare
 
Juha vesanto esa alhoniemi 2000:clustering of the som
Juha vesanto esa alhoniemi 2000:clustering of the somJuha vesanto esa alhoniemi 2000:clustering of the som
Juha vesanto esa alhoniemi 2000:clustering of the somArchiLab 7
 
Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersperfj
 
Visualization of 3D Genome Data
Visualization of 3D Genome DataVisualization of 3D Genome Data
Visualization of 3D Genome DataNils Gehlenborg
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSijdkp
 
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big DataPradeeban Kathiravelu, Ph.D.
 
Report-de Bruijn Graph
Report-de Bruijn GraphReport-de Bruijn Graph
Report-de Bruijn GraphAshwani kumar
 
Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models
Data Science Meetup: DGLARS and Homotopy LASSO for Regression ModelsData Science Meetup: DGLARS and Homotopy LASSO for Regression Models
Data Science Meetup: DGLARS and Homotopy LASSO for Regression ModelsColleen Farrelly
 
Clustering Algorithm Based On Correlation Preserving Indexing
Clustering Algorithm Based On Correlation Preserving IndexingClustering Algorithm Based On Correlation Preserving Indexing
Clustering Algorithm Based On Correlation Preserving IndexingIOSR Journals
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERkevig
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERijnlc
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysismikaelhuss
 
Phylogenetic Tree evolution
Phylogenetic Tree evolutionPhylogenetic Tree evolution
Phylogenetic Tree evolutionMd Omama Jawaid
 
Mining Features from the Object-Oriented Source Code of Software Variants by ...
Mining Features from the Object-Oriented Source Code of Software Variants by ...Mining Features from the Object-Oriented Source Code of Software Variants by ...
Mining Features from the Object-Oriented Source Code of Software Variants by ...Ra'Fat Al-Msie'deen
 

Similaire à AMP Lab presentation -- Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation (20)

Colombo14a
Colombo14aColombo14a
Colombo14a
 
Data preprocessing
Data preprocessingData preprocessing
Data preprocessing
 
Health-e-Child CaseReasoner
Health-e-Child CaseReasonerHealth-e-Child CaseReasoner
Health-e-Child CaseReasoner
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
EiB Seminar from Antoni Miñarro, Ph.D
EiB Seminar from Antoni Miñarro, Ph.DEiB Seminar from Antoni Miñarro, Ph.D
EiB Seminar from Antoni Miñarro, Ph.D
 
Statistical Clustering
Statistical ClusteringStatistical Clustering
Statistical Clustering
 
Juha vesanto esa alhoniemi 2000:clustering of the som
Juha vesanto esa alhoniemi 2000:clustering of the somJuha vesanto esa alhoniemi 2000:clustering of the som
Juha vesanto esa alhoniemi 2000:clustering of the som
 
Assessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clustersAssessing the compactness and isolation of individual clusters
Assessing the compactness and isolation of individual clusters
 
Visualization of 3D Genome Data
Visualization of 3D Genome DataVisualization of 3D Genome Data
Visualization of 3D Genome Data
 
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMSSCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
SCAF – AN EFFECTIVE APPROACH TO CLASSIFY SUBSPACE CLUSTERING ALGORITHMS
 
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
∂u∂u Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
 
Report-de Bruijn Graph
Report-de Bruijn GraphReport-de Bruijn Graph
Report-de Bruijn Graph
 
Az36311316
Az36311316Az36311316
Az36311316
 
Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models
Data Science Meetup: DGLARS and Homotopy LASSO for Regression ModelsData Science Meetup: DGLARS and Homotopy LASSO for Regression Models
Data Science Meetup: DGLARS and Homotopy LASSO for Regression Models
 
Clustering Algorithm Based On Correlation Preserving Indexing
Clustering Algorithm Based On Correlation Preserving IndexingClustering Algorithm Based On Correlation Preserving Indexing
Clustering Algorithm Based On Correlation Preserving Indexing
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
 
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSERWITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
WITH SEMANTICS AND HIDDEN MARKOV MODELS TO AN ADAPTIVE LOG FILE PARSER
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
Phylogenetic Tree evolution
Phylogenetic Tree evolutionPhylogenetic Tree evolution
Phylogenetic Tree evolution
 
Mining Features from the Object-Oriented Source Code of Software Variants by ...
Mining Features from the Object-Oriented Source Code of Software Variants by ...Mining Features from the Object-Oriented Source Code of Software Variants by ...
Mining Features from the Object-Oriented Source Code of Software Variants by ...
 

AMP Lab presentation -- Cloudbreak: A MapReduce Algorithm for Detecting Genomic Structural Variation

SV Detection Approaches

    Four main algorithmic approaches:
    Read pair (RP): Look for paired reads that map to the reference at a distance or orientation that disagrees with the expected characteristics of the library
    Read depth (RD): Infer deletions and duplications from the number of reads mapped to each locus
    Split read mapping (SR): Split individual reads into two parts and see if the parts map to either side of a breakpoint
    De novo assembly (AS): Assemble the reads into their original sequence and compare it to the reference
    Hybrid approaches combine these signals
SV Detection from Sequencing Data

    [Figure: overview of SV detection signals from sequencing data, Mills et al. Nature 2011]
SV Detection is Hard

    [Figure: sensitivity and FDR of deletion detection methods used in the 1,000 Genomes Project, Mills et al. Nature 2011]
Read-pair (RP) SV Detection

    Building the sample library involves selecting the size of the DNA fragments
    Only the ends of each fragment are sequenced, from the outside in
    Therefore the distance between the two sequenced reads (the insert size) is known - typically modeled as a normal distribution
Discordant Read Pairs

    Read pairs that map to the reference farther apart than expected indicate a deletion in the sample between the two mapping locations
    Pairs that map closer together than expected imply an insertion
    Pairs in the wrong orientation imply an inversion
    (Figure from Medvedev et al. 2009)
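The classification above can be sketched as a small function. This is an illustrative sketch, not Cloudbreak's code; the library parameters (mean 300bp, SD 30bp) and the 3-sigma concordance cutoff are assumed values.

```python
def classify_pair(insert_size, wrong_orientation, mu=300.0, sigma=30.0, k=3.0):
    """Classify a mapped read pair against expected library parameters.

    insert_size: distance between the mapped ends of the pair
    wrong_orientation: True if the reads map in an unexpected orientation
    k: number of standard deviations still considered concordant
    """
    if wrong_orientation:
        return "inversion_candidate"
    if insert_size > mu + k * sigma:
        return "deletion_candidate"      # reads map too far apart
    if insert_size < mu - k * sigma:
        return "insertion_candidate"     # reads map too close together
    return "concordant"
```

For example, with these parameters a 500bp insert size falls well outside mu + 3*sigma = 390bp and is flagged as a deletion candidate.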
Read Pair Algorithms

    Identify all read pairs with discordant mappings
    Attempt to cluster discordant pairs supporting the same variant
    Typically ignore concordant mappings
    Some algorithms consider reads with multiple mappings by choosing the mappings that minimize the number of predicted variants; this has been shown to increase sensitivity in repetitive regions of the genome
    Mapping results for a high-coverage human genome are very large (100GB of compressed alignment data storing only the best mappings for a 30X genome)
MapReduce and Hadoop

    Provides a distributed filesystem across a cluster with redundant storage
    Divides computation into Map and Reduce phases: Mappers emit key-value pairs for a block of data; Reducers process all of the values for each key
    Good at handling data sets of the size seen in sequencing experiments, and much larger
    Able to harness a cluster of commodity machines rather than a single high-powered server
    Some algorithms translate easily to the MapReduce model; others are much harder
    A natural abstraction in resequencing experiments: use a key for each location in the genome. Examples: SNP calling in GATK or Crossbow
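The execution model above can be simulated in a few lines. This in-memory sketch is illustrative only (Hadoop distributes these stages across a cluster); the 100bp window size in the toy example is an assumption.

```python
# Minimal in-memory MapReduce: mappers emit (key, value) pairs, the
# framework groups values by key, and a reducer processes each key's values.
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # "shuffle": group values by key
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy example in the spirit of per-locus keying: count reads starting in
# each 100bp genome window.
def window_mapper(read_start):
    yield (read_start // 100, 1)

counts = map_reduce([5, 50, 150, 160, 980], window_mapper,
                    lambda key, values: sum(values))
# counts == {0: 2, 1: 2, 9: 1}
```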
SV Detection in MapReduce

    Clustering of read pairs as in traditional RP algorithms typically involves global computations or graph structures
    MapReduce, on the other hand, forces local, parallel computations
    Our approach: use MapReduce to compute features for each location in the genome from the alignments relevant to that location
    Locations can be small tiled windows to make the problem more tractable
    Make SV calls from the features computed along the genome in a post-processing step
An Algorithmic Framework for SV Detection in MapReduce

    1:  job Alignment
    2:    function Map(ReadPairId rpid, ReadId r, ReadSequence s, ReadQuality q)
    3:      for all Alignments a ∈ Align(<s, q>) do
    4:        Emit(ReadPairId rpid, Alignment a)
    5:    function Reduce(ReadPairId rpid, Alignments a_{1,2,...})
    6:      AlignmentPairList ap ← ValidAlignmentPairs(a_{1,2,...})
    7:      Emit(ReadPairId rpid, AlignmentPairList ap)
    8:  job Compute SV Features
    9:    function Map(ReadPairId rpid, AlignmentPairList ap)
    10:     for all AlignmentPairs <a1, a2> ∈ ap do
    11:       for all GenomicLocations l ∈ Loci(a1, a2) do
    12:         ReadPairInfo rpi ← <InsertSize(a1, a2), AlignmentScore(a1, a2)>
    13:         Emit(GenomicLocation l, ReadPairInfo rpi)
    14:   function Reduce(GenomicLocation l, ReadPairInfos rpi_{1,2,...})
    15:     SVFeatures φ_l ← Φ(InsertSizes i_{1,2,...}, AlignmentScores q_{1,2,...})
    16:     Emit(GenomicLocation l, SVFeatures φ_l)
    17: StructuralVariationCalls svs ← PostProcess(φ_{1,2,...})
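The "Compute SV Features" job above can be sketched in Python under some assumed simplifications: windows are fixed 100bp tiles, Loci(a1, a2) is every window spanned by the pair's mapped footprint, and Phi is simply the median insert size at each window (Cloudbreak's actual Phi is a GMM fit, described later). The tuple layout of the alignment pairs here is hypothetical.

```python
from collections import defaultdict
from statistics import median

WINDOW = 100  # assumed tile size

def sv_feature_map(alignment_pair):
    """Emit (window, insert_size) for every window the pair spans."""
    start, end, insert_size = alignment_pair
    for w in range(start // WINDOW, end // WINDOW + 1):
        yield (w, insert_size)

def sv_feature_reduce(insert_sizes):
    """A stand-in for Phi: summarize the insert sizes at one window."""
    return median(insert_sizes)

def run_job(alignment_pairs):
    grouped = defaultdict(list)
    for pair in alignment_pairs:
        for window, rpi in sv_feature_map(pair):
            grouped[window].append(rpi)
    return {w: sv_feature_reduce(v) for w, v in grouped.items()}

# (start, end, insert_size) triples for three mapped pairs:
features = run_job([(0, 250, 310), (120, 400, 520), (90, 260, 500)])
```

Each genomic window now carries a feature value computed only from the read pairs that touch it, which is what makes the computation local and parallel.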
Three User-Defined Functions

    This framework leaves three functions to be defined
    There may be many different approaches within this framework, depending on the application

    Loci : <a1, a2> → Lm ⊆ L
    Φ : {ReadPairInfo rpi_{m,i,j}} → R^N
    PostProcess : {φ1, φ2, ..., φN} → {<SVType s, l_start, l_end>}
Cloudbreak Implementation

    We focus on detecting deletions and small insertions
    Implemented as a native Hadoop application
    Use features computed from fitting a mixture model to the observed distribution of insert sizes at each locus
    Process as many mappings as possible for ambiguously mapped reads
Local Distributions of Insert Sizes

    Estimate the distribution of insert sizes observed at each window as a Gaussian mixture model (GMM)
    Similar to the idea in MoDIL (Lee et al. 2009)
    Use a constrained expectation-maximization algorithm: constrain one component to the library mean insert size and both components to a shared variance, then estimate the mean and weight of the second component
    Main feature: the log likelihood ratio of the fitted two-component model against a no-variant model (a single normal distribution under the library parameters)
    Other features: the weight and the estimated mean of the second component
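The constrained EM fit above can be sketched as follows. This is an illustrative sketch, not Cloudbreak's implementation: component 1 is pinned at the library mean mu0, both components share the library sigma, and EM updates only the second component's mean mu2 and weight alpha; the initialization and iteration count are assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def constrained_em(insert_sizes, mu0, sigma, iters=50):
    mu2 = max(insert_sizes)   # crude initialization at the largest observation
    alpha = 0.5
    for _ in range(iters):
        # E-step: responsibility of the variant component for each observation
        resp = []
        for x in insert_sizes:
            p1 = (1 - alpha) * normal_pdf(x, mu0, sigma)
            p2 = alpha * normal_pdf(x, mu2, sigma)
            resp.append(p2 / (p1 + p2))
        # M-step: update only alpha and mu2 (mu0 and sigma stay fixed)
        total = sum(resp)
        alpha = total / len(insert_sizes)
        if total > 0:
            mu2 = sum(r * x for r, x in zip(resp, insert_sizes)) / total
    return alpha, mu2

# Half the pairs come from the reference allele (~300bp inserts), half span
# a ~200bp heterozygous deletion (~500bp inserts):
alpha, mu2 = constrained_em([290, 300, 310, 490, 500, 510], mu0=300, sigma=30)
```

On this toy input the fit recovers a second component near 500bp with weight near 0.5, which is exactly the heterozygous-deletion signature the features are built from.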
Local Distributions of Insert Sizes

    [Figure: example insert-size histograms (0-500bp) for four cases: No Variant, Homozygous Deletion, Heterozygous Deletion, Heterozygous Insertion]
Cloudbreak Output Example

    [Figure: example of Cloudbreak output]
Handling Ambiguous Mappings

    Incorrect mappings of read pairs are unlikely to form clusters of insert sizes at a given window
    Before fitting the GMM, remove outliers with a nearest-neighbor method: if the kth nearest neighbor of a mapped pair is more than c × (library fragment size SD) away, remove that mapping
    Control the number of mappings with an adaptive cutoff on alignment score: discard mapping m if the ratio of the best alignment score in the window to the score of m is larger than some cutoff
    This allows visibility into regions where no reads map unambiguously
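The nearest-neighbor filter above can be sketched directly; the values of k and c here are assumptions for illustration, not Cloudbreak's defaults.

```python
def filter_outliers(insert_sizes, sigma, k=2, c=5.0):
    """Keep only insert sizes whose k-th nearest neighbour (among the other
    insert sizes at this window) is within c * sigma."""
    kept = []
    for i, x in enumerate(insert_sizes):
        dists = sorted(abs(x - y) for j, y in enumerate(insert_sizes) if j != i)
        if len(dists) >= k and dists[k - 1] <= c * sigma:
            kept.append(x)
    return kept

# Four mappings cluster near 300bp; a lone stray mapping implies a 900bp
# insert and has no neighbours, so it is dropped:
filtered = filter_outliers([300, 305, 310, 295, 900], sigma=30)
# filtered == [300, 305, 310, 295]
```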
Postprocessing

    First extract contiguous genomic loci where the log-likelihood ratio of the two models is greater than a given threshold
    To eliminate noise, apply a median filter with window size 5
    Let µ2 be the estimated mean of the second component and µ the library insert size: end a region when µ2 changes by more than 60bp (2σ), and discard regions whose length differs too much from the event size implied by µ2 − µ
    Cloudbreak loses some breakpoint resolution due to the genome windows and filters
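The threshold-and-smooth step above can be sketched as follows; the input values and threshold are made up for illustration, and the region-ending and length checks from the slide are omitted for brevity.

```python
def median_filter(values, width=5):
    """Median-smooth a sequence, shrinking the window at the edges."""
    half = width // 2
    out = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        out.append(sorted(values[lo:hi])[(hi - lo) // 2])
    return out

def extract_regions(lrs, threshold):
    """Merge runs of windows whose smoothed log-likelihood ratio exceeds
    the threshold into candidate (start, end) regions."""
    regions, start = [], None
    for i, lr in enumerate(median_filter(lrs)):
        if lr > threshold and start is None:
            start = i
        elif lr <= threshold and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(lrs) - 1))
    return regions

# An isolated spike (index 2) is smoothed away; the sustained run survives:
lrs = [0, 0, 9, 0, 0, 5, 6, 7, 6, 5, 0, 0]
regions = extract_regions(lrs, threshold=2)
```

Note that the smoothing can widen a surviving run by a couple of windows at each edge, one source of the lost breakpoint resolution mentioned above.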
Results Comparison

    We compare Cloudbreak to a selection of widely used algorithms taking different approaches:
    Breakdancer (Chen et al. 2009): traditional RP-based approach
    DELLY (Rausch et al. 2012): RP-based approach with SR refinement of calls
    GASVPro (Sindi et al. 2012): RP-based approach; uses ambiguous mappings of discordant read pairs, resolved with an MCMC algorithm; looks for RD signals at predicted breakpoint locations by examining concordant pairs
    Pindel (Ye et al. 2009): SR approach; looks for clusters of read pairs where only one read could be mapped and searches for split-read mappings of the other read
    MoDIL (Lee et al. 2009): mixture of distributions; run only on simulated data due to its runtime requirements
Simulated Data

    Very little publicly available NGS data exists for a genome with fully characterized structural variations
    We can match algorithm output to validated SVs, but we don't know whether novel predictions are wrong or simply undiscovered
    To get a simulated data set with known ground truth and realistic events: take a (somewhat) fully characterized genome, apply its variants to the reference sequence, and simulate reads from the modified reference
    We use the Venter genome (Levy et al. 2007), chromosome 2
    To simulate heterozygosity, randomly assign half of the variants to be homozygous and half heterozygous, and create two modified references
    Simulated 100bp paired reads with a 100bp insert size to 30X coverage
ROC Curve for Chromosome 2 Deletion Simulation

    [ROC curve: true positives vs. false positives for deletions in the Venter diploid chr2 simulation, comparing Cloudbreak, Breakdancer, Pindel, GASVPro, and DELLY]
    Caveat: methods perform better on simulated data than on real whole-genome datasets
Ability to Find Simulated Deletions by Size at 10% FDR

    Number of deletions found in each size class (exclusive predictions for that algorithm in parentheses)
    Cloudbreak is competitive across a range of size classes

                  40-100bp   101-250bp  251-500bp  501-1000bp  >1000bp
    Total Number    224         84         82          31         26
    Cloudbreak      47 (7)      50 (2)     55 (4)      12 (4)     15 (0)
    Breakdancer     52 (10)     49 (2)     49 (0)       7 (0)     14 (0)
    GASVPro         31 (4)      25 (0)     23 (0)       2 (0)      6 (0)
    DELLY           22 (2)      56 (3)     40 (0)       8 (0)     12 (0)
    Pindel          60 (35)     16 (0)     41 (2)       1 (0)     12 (0)
Insertions in Simulated Data

    [ROC curve: true positives vs. false positives for insertions in the Venter diploid chr2 simulation, comparing Cloudbreak, Breakdancer, and Pindel]
NA18507 Data Set

    Well-studied sample from a Yoruban male individual
    High-quality sequence to 37X coverage; 100bp reads with a 100bp insert size
    We created a gold standard set of deletions from three studies with low false discovery rates: Mills et al. 2011, the Human Genome Structural Variation Project (Kidd et al. 2008), and the 1000 Genomes Project (Mills et al. 2011)
ROC Curve for NA18507 Deletions

    All algorithms look much worse on real data (possibly due to the lack of a complete truth set)
    [ROC curve: true positives vs. novel predictions on NA18507 for Cloudbreak, Breakdancer, Pindel, GASVPro, and DELLY]
Ability to Find NA18507 Deletions by Size

    Using the same cutoffs that yielded a 10% FDR on the simulated chromosome 2 data set, adjusted for the difference in coverage from 30X to 37X
    Cloudbreak identifies more small deletions and contributes more exclusive predictions

                  Prec.    Recall   40-100bp    101-250bp  251-500bp  501-1000bp  >1000bp
    Total Number                    7,466       235        218        110         375
    Cloudbreak    0.0978   0.115    423 (179)   128 (9)    158 (8)    70 (3)      186 (12)
    Breakdancer   0.122    0.112    261 (41)    132 (8)    167 (1)    92 (0)      288 (10)
    GASVPro       0.134    0.0401   104 (17)     37 (2)     77 (0)    26 (0)       93 (0)
    DELLY         0.0824   0.091    143 (9)     125 (7)    158 (1)    83 (1)      256 (3)
    Pindel        0.16     0.0685   149 (12)     57 (0)    140 (1)    58 (0)      172 (2)
Ability to Detect Deletions in Repetitive Regions

    Detected deletions on the simulated and NA18507 data sets identified by each tool, broken down by whether the deletion overlaps a RepeatMasker-annotated element

                  Simulated Data          NA18507
                  Non-repeat  Repeat      Non-repeat  Repeat
    Total Number  120         327         553         7,851
    Cloudbreak    28 (4)      151 (13)    204 (46)    761 (165)
    Breakdancer   29 (5)      142 (7)     186 (21)    754 (39)
    GASVPro       15 (2)       72 (2)      71 (6)     266 (13)
    DELLY         21 (2)      117 (3)     147 (11)    618 (10)
    Pindel        18 (9)      112 (28)    103 (4)     473 (11)
Genotyping Deletions

    We can use the mixing parameter that controls the weight of the two components in the GMM to accurately predict deletion genotypes
    By setting a simple cutoff of .2 on the average value of the weight in each prediction, we achieved 86.7% and 94.9% accuracy in predicting the genotype of the true positive deletions we detected in the simulated and real data sets, respectively

                                       Actual Genotypes
                              Simulated Data             NA18507
                              Homozygous  Heterozygous   Homozygous  Heterozygous
    Predicted   Homozygous    88           3             70           11
    Genotypes   Heterozygous  18          70              4          209
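The intuition behind the rule above: a heterozygous deletion draws roughly half of the fragments from the variant allele (weight near 0.5), while a homozygous deletion draws nearly all of them (weight near 1.0). A minimal sketch, with an illustrative cutoff of 0.8 on this scale (the slide's cutoff of .2 is applied on its own scale, which the slide does not fully specify):

```python
def genotype_deletion(window_weights, cutoff=0.8):
    """Call a detected deletion's genotype from the average GMM mixing
    weight of its windows. Cutoff is an illustrative assumption."""
    mean_weight = sum(window_weights) / len(window_weights)
    return "homozygous" if mean_weight > cutoff else "heterozygous"

# Weights near 1.0 across the prediction's windows -> homozygous;
# weights near 0.5 -> heterozygous:
assert genotype_deletion([0.95, 0.90, 1.0]) == "homozygous"
assert genotype_deletion([0.45, 0.50, 0.55]) == "heterozygous"
```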
Running Times

    Running times (wall clock) on both data sets
    Cloudbreak used approx. 150 workers for the simulated data and 650 workers for NA18507 (42m in MapReduce)
    Breakdancer and DELLY were run on a single CPU but can be set to process each chromosome independently (10X speedup)
    Pindel was run in single-threaded mode
    MoDIL was run on 200 cores

                  Simulated Chromosome 2 Data   NA18507
    Cloudbreak     835s                         106m
    Breakdancer    653s                         36h
    GASVPro       3339s                         33h
    DELLY         1964s                         208m
    Pindel        1336s                         38h
    MoDIL           48h                         **
Ongoing Work: Generate Additional Features, Improve Postprocessing

    Goals: increase accuracy and breakpoint resolution
    Features involving split-read mappings or pairs in which only one end is mapped
    Features involving sequence and sequence variants
    Annotations of sequence features and previously identified variants
    Apply machine learning techniques: conditional random fields, deep learning
    Potential future work: add local assembly of breakpoints
Ongoing Work: Automate Deployment and Execution on Cloud Providers

    Many researchers don't have access to Hadoop clusters, or to servers powerful enough to process these data sets
    On-demand creation of clusters with cloud providers can be cost-effective, especially with spot pricing
    Developing scripts to automate on-demand construction of Hadoop clusters in the cloud (Amazon EC2, Rackspace) using the Apache Whirr project
    Bottleneck: transferring data into and out of the cloud
Conclusions

    A novel approach to applying MapReduce to the structural variation detection problem
    Makes insert-size distribution clustering approaches feasible in running time
    Improved accuracy over existing algorithms, especially in repetitive regions
    Ability to accurately genotype calls
    Costs: additional CPU hours and somewhat lower breakpoint resolution