Fast algorithms for large scale
   genome alignment and
   comparison

                                                             Davide Eynard
                                                       eynard@elet.polimi.it

                         Dipartimento di Elettronica e Informazione
                                               Politecnico di Milano

                                          2007/05/28

Algorithms for Computational Molecular Biology
The article(s)

        A.L. Delcher, S. Kasif, R.D. Fleischmann, J.
         Peterson, O. White, S.L. Salzberg: “Alignment of
         whole genomes”, 1999
        A.L. Delcher, A. Philippy, J. Carlton, S.L.
         Salzberg: “Fast algorithms for large-scale
         genome alignment and comparison”, 2002
        S. Kurtz, A. Philippy, A.L. Delcher, M. Smoot, M.
         Shumway, C. Antonescu, S.L. Salzberg:
         “Versatile and open software for comparing large
         genomes”, 2004



p. 2    2007/05/28          ACMB
The problem

        When the genome sequence of two closely
         related organisms becomes available, one of the
         first questions researchers want to ask is how the
         two genomes align
        Aligning (very) long sequences
          • Single gene sequences may be as long as tens of
            thousands of nucleotides
          • Whole genomes are usually millions of nucleotides
            or larger!




The challenge

        Naïve
          • O(n^2) space and time
        Hashing
          • faster, but still partly O(n^2)
        Dynamic Programming
          • O(n) space, takes more time
        MUMmer
          • Suffix trees: O(n) space and time
          • LIS: O(k log k) where k is the number of MUMs




The algorithm

       1) Perform a Maximal Unique Match (MUM)
         decomposition of the two genomes
       2) Sort the matches found in the MUM alignment,
         and extract the LIS (Longest Increasing
         Subsequence) of matches that occur in the same
         order in both genomes
       3) Close the gaps in the alignment, performing
         local identification of large inserts, repeats, small
         mutated regions, tandem repeats and SNPs
       4) Output the alignment



MUM: the suffix tree
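
A minimal sketch of what a MUM is, assuming small inputs. It does NOT use a suffix tree: it is a brute-force stand-in that checks every pair of positions, so it is far slower than the O(n) approach on the slide. The names `find_mums`, `count_occurrences` and `min_len` are hypothetical and chosen for illustration only.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count, start = 0, text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def find_mums(a, b, min_len=3):
    """Return sorted (pos_in_a, pos_in_b, length) for each maximal unique match.

    Brute force for illustration only; a suffix tree finds the same
    matches in linear time.
    """
    mums = []
    for i in range(len(a)):
        for j in range(len(b)):
            if a[i] != b[j]:
                continue
            # Skip pairs that could be extended to the left (not left-maximal).
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend to the right as far as the two sequences agree.
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length < min_len:
                continue
            match = a[i:i + length]
            # Keep the match only if it is unique in both sequences.
            if count_occurrences(a, match) == 1 and count_occurrences(b, match) == 1:
                mums.append((i, j, length))
    return sorted(mums)


if __name__ == "__main__":
    print(find_mums("GATTACAGG", "CCATTACAT"))
```

In the toy input, "ATTACA" is the only match of length 3 or more that is unique in both strings and cannot be extended in either direction.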




Longest Increasing Subsequence
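
A sketch of the plain O(k log k) LIS computation, assuming the matches are already sorted by their position in genome A and that each match is a hypothetical `(pos_in_a, pos_in_b, length)` tuple. The variant used in MUMmer may treat match lengths and overlaps differently; this is only the textbook version.

```python
from bisect import bisect_left


def lis_indices(values):
    """Return indices of one longest strictly increasing subsequence."""
    tails = []       # tails[k]: smallest tail of an increasing run of length k + 1
    tails_idx = []   # index in values of that tail
    prev = [-1] * len(values)  # back-pointers used to rebuild the subsequence
    for i, v in enumerate(values):
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
            tails_idx.append(i)
        else:
            tails[k] = v
            tails_idx[k] = i
        prev[i] = tails_idx[k - 1] if k > 0 else -1
    # Walk the back-pointers from the tail of the longest run.
    out = []
    i = tails_idx[-1] if tails_idx else -1
    while i != -1:
        out.append(i)
        i = prev[i]
    return out[::-1]


def consistent_matches(mums):
    """Keep the matches whose genome-B positions increase along genome A."""
    ordered = sorted(mums)                      # sort by position in genome A
    keep = lis_indices([m[1] for m in ordered])  # LIS over positions in genome B
    return [ordered[i] for i in keep]


if __name__ == "__main__":
    mums = [(0, 0, 5), (8, 40, 4), (15, 12, 6), (25, 30, 5)]
    print(consistent_matches(mums))
```

In the example, the match at `(8, 40, 4)` is dropped because it is out of order with respect to the matches that follow it.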




Closing the gaps




MUMmer v2.0

        Relaxes the uniqueness constraint
        Faster, takes less space
        Algorithmic improvements
          • memory
          • streaming query
          • new module to cluster matches
        Able to align not only simple DNA sequences, but
         also human chromosomes
        Able to align incomplete genomes and protein
         sequences



Time-space improvements

         The amount of memory used in the suffix tree
          has been reduced
           • from at most 37 bytes/bp to at most 20 bytes/bp
         Speed has increased
           • E. coli vs. V. cholerae: from 74 s, 293 MB to 27 s,
               100 MB
         Suffix tree is used to store only one sequence,
          while the second one (query) is streamed against
          the suffix tree
           • once the suffix tree has been built, multiple queries
             can be streamed
           • quick way to find the next match
           • matches are maximal on the right hand side
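
The build-once, stream-many idea can be sketched as follows. This is an assumption-laden stand-in: it indexes the reference with a naive sorted list of suffixes instead of a suffix tree, and jumps past each reported match instead of following suffix links. All function names and the `min_len` parameter are hypothetical.

```python
from bisect import bisect_left


def build_index(reference):
    """Index the reference once: suffix start positions and the sorted suffixes."""
    positions = sorted(range(len(reference)), key=lambda i: reference[i:])
    suffixes = [reference[i:] for i in positions]  # quadratic memory, toy sizes only
    return positions, suffixes


def common_prefix_len(x, y):
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def longest_match_at(index, query, q_pos):
    """Longest reference match starting at query[q_pos]: (length, ref_pos)."""
    positions, suffixes = index
    tail = query[q_pos:]
    k = bisect_left(suffixes, tail)
    best = (0, -1)
    # The longest common prefix is always with a lexicographic neighbour.
    for n in range(max(k - 1, 0), min(k + 1, len(suffixes))):
        length = common_prefix_len(suffixes[n], tail)
        if length > best[0]:
            best = (length, positions[n])
    return best


def stream_query(index, query, min_len=3):
    """Report right-maximal matches as (ref_pos, query_pos, length)."""
    matches, q_pos = [], 0
    while q_pos < len(query):
        length, r_pos = longest_match_at(index, query, q_pos)
        if length >= min_len:
            matches.append((r_pos, q_pos, length))
            q_pos += length   # simplification: skip past the reported match
        else:
            q_pos += 1
    return matches


if __name__ == "__main__":
    index = build_index("ACGTACGGA")      # built once
    for query in ("TACGG", "GGACG"):      # several queries streamed against it
        print(query, stream_query(index, query))
```

Only the reference is indexed; each query is read left to right and never stored in the index, which is the property the slide highlights.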
Streaming queries




Clustering of matches

         Old version computed a single longest alignment
          between the sequences
         New version works as follows:
           • first, the system outputs a series of separate,
             independent alignment regions
           • clustering is performed by finding pairs of matches
             that are sufficiently close
           • finally, a LIS computation is done within each
             component to yield the most consistent sequence
             of matches in the cluster
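
One way to sketch the clustering step is with a union-find over pairs of matches. The closeness test used here (gap between two matches at most `max_gap` on both sequences) and the all-pairs loop are assumptions for illustration; the slides do not specify the exact criterion. A LIS computation would then be run inside each returned component.

```python
def cluster_matches(matches, max_gap):
    """Group (pos_a, pos_b, length) matches into connected components.

    Two matches are linked when the gap between them is at most max_gap
    on both sequences (an assumed criterion).
    """
    parent = list(range(len(matches)))

    def find(x):
        # Find the component root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def close(m, n):
        (a1, b1, l1), (a2, b2, l2) = m, n
        # Gap between two intervals: later start minus earlier end
        # (negative when the intervals overlap).
        gap_a = max(a1, a2) - min(a1 + l1, a2 + l2)
        gap_b = max(b1, b2) - min(b1 + l1, b2 + l2)
        return max(gap_a, gap_b) <= max_gap

    for i in range(len(matches)):
        for j in range(i + 1, len(matches)):
            if close(matches[i], matches[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, m in enumerate(matches):
        clusters.setdefault(find(i), []).append(m)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    matches = [(0, 0, 10), (15, 14, 8), (30, 28, 5), (500, 900, 12), (520, 915, 7)]
    for cluster in cluster_matches(matches, max_gap=10):
        print(cluster)
```

The example yields two independent regions: the first and third matches are not directly close, but both are linked through the second one.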




Alignment of incomplete genomes

         In a typical whole-genome shotgun sequencing project,
          the genome is broken up into millions of pieces
           • If the reads are generated at random, then >99%
             of a genome will be covered by sequencing
             enough reads to cover the genome eight times
           • The result of assembly is usually a collection of
             large, unordered DNA sequences called contigs
         NUCmer (nucleotide MUMmer) is a multiple-contig
          alignment program that uses MUMmer 2
          as its core alignment engine
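
The ">99% at eightfold coverage" figure can be checked with a short calculation, assuming the idealized Lander-Waterman model (reads start uniformly at random, so the number of reads covering a base is roughly Poisson). Under that assumption the expected covered fraction at coverage c is 1 - e^(-c). The function name is hypothetical.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of the genome covered by at least one read,
    under the idealized assumption of uniformly random reads."""
    return 1.0 - math.exp(-coverage)


if __name__ == "__main__":
    for c in (1, 4, 8):
        print(f"{c}x coverage: {expected_fraction_covered(c):.5f}")
```

At 8x the model gives about 0.9997, consistent with the claim on the slide; real projects fall short of this because reads are not perfectly random.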




Alignment of incomplete genomes

        1)NUCmer input: two multi-fasta files representing
          partial or complete assemblies
        2)Create a map of all contig positions within each
          file
        3)Concatenate files separately and run MUMmer to
          find exact matches
        4)Map matches to separate contigs
        5)MUMs are clustered together if they are
          separated by no more than a user-specified
          distance
        6)Dynamic programming is used to align
          sequences between the MUMs
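
Steps 2 to 4 can be sketched as below: record where each contig starts, concatenate, then map a position in the concatenated sequence back to its contig. The `#` separator is an assumption made here so that no exact match can span two contigs; what NUCmer actually inserts between contigs is not stated on the slide. Function names are hypothetical.

```python
from bisect import bisect_right


def build_contig_map(contigs, separator="#"):
    """Concatenate (name, sequence) pairs and record each contig's start offset."""
    names, starts, parts = [], [], []
    offset = 0
    for name, seq in contigs:
        names.append(name)
        starts.append(offset)
        parts.append(seq)
        offset += len(seq) + len(separator)
    return separator.join(parts), names, starts


def locate(position, names, starts):
    """Map a position in the concatenated sequence to (contig name, offset).

    Positions that fall on a separator are not handled specially.
    """
    k = bisect_right(starts, position) - 1
    return names[k], position - starts[k]


if __name__ == "__main__":
    contigs = [("c1", "ACGT"), ("c2", "GG"), ("c3", "TTTAA")]
    concatenated, names, starts = build_contig_map(contigs)
    print(concatenated)              # ACGT#GG#TTTAA
    print(locate(6, names, starts))  # ('c2', 1)
```

Keeping the start offsets sorted makes each lookup a binary search, so mapping many matches back to contigs stays cheap.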

NUCmer




PROmer

        1)Given two multi-fasta files, PROmer translates the
          DNA to amino acids
        2)An index is created that maps all protein
          sequences and lengths to the source DNA
        3)Pseudo-proteomes (amino acid sequences) are
          passed to MUMmer
        4)The index is used to translate the matches back
          to the original DNA input
        5)Clustering step
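
A sketch of steps 1, 2 and 4, assuming the standard genetic code and, for brevity, only the three forward reading frames (the reverse strand is omitted here). The index is reduced to arithmetic on the frame and the amino-acid offset. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate_frame(dna, frame):
    """Translate one forward reading frame (0, 1 or 2); unknown codons become X."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def protein_to_dna(frame, aa_start, aa_len):
    """Map an amino-acid match back to [start, end) coordinates in the DNA."""
    start = frame + 3 * aa_start
    return start, start + 3 * aa_len


if __name__ == "__main__":
    dna = "ATGGCCTAA"
    for frame in range(3):
        print(frame, translate_frame(dna, frame))
    start, end = protein_to_dna(frame=1, aa_start=1, aa_len=1)
    print(dna[start:end])
```

Matching in amino-acid space tolerates silent DNA changes, and the frame plus offset is enough to report every match in the coordinates of the original input.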




MUMmer v3.0

         New improvements in code
           • slightly faster than 2.0, 25% less memory
         More modular and configurable
           • possibility to build hybrid systems
         Ability to run a multi-contig query against a multi-
          contig reference
         Non-unique maximal matches
         Speed-up of NUCmer and PROmer modules
          (approx. 10-fold)
         Graphical viewers



Graphical interfaces




Graphical interfaces




Graphical interfaces




That's All, Folks



                          Thank you!
                     Questions are welcome





