Overview of methods for variant calling from next-generation sequence data


   Thomas Keane,
   Vertebrate Resequencing Informatics,
   Wellcome Trust Sanger Institute
   Email: tk2@sanger.ac.uk




Vertebrate Resequencing Informatics   22nd July, 2010
SAM/BAM Format

 Proliferation of alignment formats over the years: Cigar, psl, gff, xml etc.
 SAM (Sequence Alignment/Map) format
     Single unified format for storing read alignments to a reference genome
 BAM (Binary Alignment/Map) format
     Binary equivalent of SAM
     Developed for fast processing/indexing
 Advantages
     Can store alignments from most aligners
     Supports multiple sequencing technologies
     Supports indexing for quick retrieval/viewing
     Compact size (e.g. 112Gbp of Illumina data = 116Gbytes of disk space)
     Reads can be grouped into logical groups e.g. lanes, libraries, individuals/genotypes
     Supports second best base call/quality for hard to call bases
 Possibility of storing raw sequencing data in BAM as a replacement for SRF & fastq




Read Entries in SAM


  No.     Name                   Description
  1       QNAME                  Query NAME of the read or the read pair
  2       FLAG                   Bitwise FLAG (pairing, strand, mate strand, etc.)
  3       RNAME                  Reference sequence NAME
  4       POS                    1-Based leftmost POSition of clipped alignment
  5       MAPQ                   MAPping Quality (Phred-scaled)
  6       CIGAR                  Extended CIGAR string (operations: MIDNSHP)
  7       MRNM                   Mate Reference NaMe (‘=’ if same as RNAME)
  8       MPOS                   1-Based leftmost Mate POSition
  9       ISIZE                  Inferred Insert SIZE
  10      SEQ                    Query SEQuence on the same strand as the reference
  11      QUAL                   Query QUALity (ASCII-33=Phred base quality)



         Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis,
         Richard Durbin, and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map
         format and SAMtools, Bioinformatics, 25:2078-2079
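
The eleven mandatory fields above can be read with a few lines of Python. This is a minimal sketch, not part of the original slides: the record type `SamRecord`, the helper names and the example line are all hypothetical, while the flag bit values are those defined in the SAM specification.

```python
from typing import List, NamedTuple

# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}


class SamRecord(NamedTuple):
    """Hypothetical container for the 11 mandatory SAM fields."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Split one tab-separated alignment line into its mandatory fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> List[str]:
    """List the names of the bits that are set in a bitwise FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def phred_qualities(qual: str) -> List[int]:
    """Convert a QUAL string to Phred scores (ASCII code minus 33)."""
    return [ord(c) - 33 for c in qual]


# Made-up example line, for illustration only.
EXAMPLE_LINE = ("read_001\t99\tchr1\t1000\t60\t5M1D2M2I2M7S\t=\t1200\t276\t"
                "ACGCATGCAGTTAGACGT\tIIIIIIIIIIIIIIIIII")

record = parse_sam_line(EXAMPLE_LINE)
print(record.qname, record.pos, decode_flag(record.flag))
```

Optional tags after column 11 are ignored here; a real parser would also keep them.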


Extended Cigar Format

 Cigar has been traditionally used as a compact way to represent a
 sequence alignment
 Operations include
      M - match or mismatch
      I - insertion
      D - deletion
 SAM extends these to include
      S - soft clip
      H - hard clip
      N - skipped bases
      P - padding
 E.g.   Read:  ACGCA-TGCAGTtagacgt
        Ref:   ACTCAGTG--GT
        Cigar: 5M1D2M2I2M7S
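
As a rough illustration of how an extended CIGAR string is interpreted, the sketch below splits it into operations and works out how many read bases and reference bases it covers. The function names are hypothetical. It relies on the SAM convention that M, I and S consume read bases while M, D and N consume reference bases.

```python
import re
from typing import List, Tuple

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP])")
CONSUMES_READ = set("MIS")   # operations that use up bases of the read
CONSUMES_REF = set("MDN")    # operations that use up bases of the reference


def parse_cigar(cigar: str) -> List[Tuple[int, str]]:
    """Split e.g. '5M1D2M' into [(5, 'M'), (1, 'D'), (2, 'M')]."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Rebuild the string to reject anything the pattern skipped over.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"not a valid extended CIGAR: {cigar!r}")
    return ops


def read_length(cigar: str) -> int:
    """Number of read bases described by the CIGAR (soft clips included)."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_READ)


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by the alignment."""
    return sum(n for n, op in parse_cigar(cigar) if op in CONSUMES_REF)


print(parse_cigar("5M1D2M2I2M7S"))
print(read_length("5M1D2M2I2M7S"), reference_span("5M1D2M2I2M7S"))
```

For the example above this gives 18 read bases (including the 7 soft-clipped ones) over 10 reference bases.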

What is the cigar line?

 E.g.   Read:  tgtcgtcACGCATG---CAGTtagacgt
        Ref:          ACGCATGCGGCAGT
        Cigar:
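
One way to check an answer to this exercise is to derive the CIGAR mechanically from the gapped alignment. The sketch below is an illustration only: the function name is hypothetical, and it assumes the notation used on these slides, where lower-case read bases are soft clipped and `-` marks a gap.

```python
from itertools import groupby


def cigar_from_alignment(read: str, ref: str) -> str:
    """Derive a CIGAR string from a gapped read/reference pair.

    Assumes the slide notation: lower-case bases at the ends of the read
    are soft clipped, '-' is a gap, and `ref` is aligned column by column
    with the upper-case (non-clipped) part of the read.
    """
    lead = len(read) - len(read.lstrip("acgtn"))
    trail = len(read) - len(read.rstrip("acgtn"))
    core = read[lead:len(read) - trail]
    if len(core) != len(ref):
        raise ValueError("aligned read and reference must have equal length")

    ops = []
    for r, g in zip(core, ref):
        if r == "-" and g == "-":
            raise ValueError("column with a gap in both sequences")
        # Gap in the read = deletion; gap in the reference = insertion.
        ops.append("D" if r == "-" else "I" if g == "-" else "M")

    body = "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))
    return (f"{lead}S" if lead else "") + body + (f"{trail}S" if trail else "")


# The worked example from the Extended Cigar Format slide.
print(cigar_from_alignment("ACGCA-TGCAGTtagacgt", "ACTCAGTG--GT"))
```

Calling the same function with the read and reference strings of the exercise gives a string to compare an answer against.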




Read Group Tag

 Each lane has a unique RG tag
 1000 Genomes
      Meta information derived from DCC
 RG tags
      ID: SRR/ERR number
      PL: Sequencing platform
      PU: Run name
      LB: Library name
      PI: Insert fragment size
      SM: Individual
      CN: Sequencing center
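
In a SAM header, these tags appear on a tab-separated `@RG` line as `TAG:value` pairs. The sketch below shows one way to read such a line into a dictionary. The function name and every value in the example line are made up for illustration.

```python
from typing import Dict


def parse_rg_line(line: str) -> Dict[str, str]:
    """Turn one @RG header line into a {tag: value} dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    tags = {}
    for field in fields[1:]:
        # Split on the first ':' only, so values may themselves contain ':'.
        key, _, value = field.partition(":")
        tags[key] = value
    return tags


# Hypothetical read group; none of these values come from a real run.
EXAMPLE_RG = ("@RG\tID:ERR000001\tPL:ILLUMINA\tPU:run_0001_lane_1\t"
              "LB:lib_1\tPI:300\tSM:sample_1\tCN:SC")

read_group = parse_rg_line(EXAMPLE_RG)
print(read_group["ID"], read_group["SM"], read_group["LB"])
```

Each alignment record then points back to its read group through an optional `RG:Z:` tag holding the same ID.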




1000 Genomes BAM File




 samtools view -h mybam.bam

SAM/BAM Tools

 Well defined specification for SAM/BAM
 Several tools and programming APIs for interacting with SAM/BAM files
      Samtools - Sanger/C (http://samtools.sourceforge.net)
              Convert SAM <-> BAM
              Sort and index BAM files
              Flagstat - summary of the mapping flags
              Merge multiple BAM files
              Rmdup - remove PCR duplicates from the library preparation
      Picard - Broad Institute/Java (http://picard.sourceforge.net)
              MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary,
               SamToFastq, MeanQualityByCycle, FixMateInformation, etc.
      Bio-SamTools - Perl (http://search.cpan.org/~lds/Bio-SamTools/)
      Pysam - Python (http://code.google.com/p/pysam/)
 BAM Visualisation
      BamView, LookSeq, Gap5: http://www.sanger.ac.uk/Software
      IGV: http://www.broadinstitute.org/igv/v1.3
      Tablet: http://bioinf.scri.ac.uk/tablet/



BAM Visualisation




 http://www.sanger.ac.uk/mousegenomes

BAM Visualisation




 http://www.sanger.ac.uk/mousegenomes

SNP Calling

 SNP - single nucleotide polymorphism
 View the bases over a reference position and look for differences
 Homozygous vs heterozygous SNPs
 Factors to consider when calling SNPs
      Base call qualities of each supporting base
      Proximity to
             Small indel
             Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
      Mapping qualities of the reads supporting the SNP
             Low mapping qualities can indicate repetitive sequence
      Read length
             Longer reads can be aligned with high confidence to a larger portion of the genome
      Paired reads
      Sequencing depth
             Few individuals/strains at high coverage vs. many individuals/strains at low coverage
                      1000 Genomes is low coverage sequencing across many individuals
                      Population based SNP calling methods
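
To make the idea of "viewing the bases over a reference position" concrete, here is a deliberately naive sketch that filters the pileup at one site by base and mapping quality and then applies fixed allele-fraction cut-offs. It is not the method used by samtools or any population-based caller. The function name and all thresholds are assumptions chosen for illustration.

```python
from collections import Counter
from typing import List, Optional, Tuple

# One pileup entry: (base, base quality, mapping quality of its read).
PileupBase = Tuple[str, int, int]


def call_site(ref_base: str,
              pileup: List[PileupBase],
              min_baseq: int = 20,     # assumed cut-off
              min_mapq: int = 30,      # assumed cut-off
              min_depth: int = 4,      # assumed cut-off
              het_low: float = 0.2,    # assumed allele-fraction bounds
              het_high: float = 0.8) -> Tuple[str, Optional[str]]:
    """Return (genotype, alt_base) for a single reference position."""
    ref_base = ref_base.upper()
    # Keep only confidently called bases from confidently mapped reads.
    counts = Counter(base.upper() for base, bq, mq in pileup
                     if bq >= min_baseq and mq >= min_mapq)
    depth = sum(counts.values())
    if depth < min_depth:
        return ("no_call", None)

    alt_counts = {b: n for b, n in counts.items() if b != ref_base}
    if not alt_counts:
        return ("hom_ref", None)

    alt, n_alt = max(alt_counts.items(), key=lambda kv: kv[1])
    fraction = n_alt / depth
    if fraction < het_low:
        return ("hom_ref", None)
    if fraction <= het_high:
        return ("het", alt)
    return ("hom_alt", alt)


pileup = [("A", 30, 60)] * 5 + [("G", 30, 60)] * 5
print(call_site("A", pileup))
```

Fixed cut-offs like these ignore nearby indels and homopolymer runs, which is one reason the factors listed above matter in practice.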


Read Length




Mouse SNP




Is this a SNP?




Short indel Calling

 Small insertions and deletions observed in the alignment of the read
 relative to the reference genome
 BAM format
      An I or D operation in the CIGAR string denotes an indel in the read
 Simple method
      Call indels based on the I or D events in the BAM file
             Samtools varFilter
 Factors to consider when calling indels
      Misalignment of the read
             Alignment scoring - often cheaper to introduce multiple SNPs than an indel
             Sufficient flanking sequence on either side of the indel within the read
      Homopolymer runs on either side of the indel
      Length of the reads
      Homozygous or heterozygous
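
The "simple method" of reading indels straight off the I and D operations might look like the sketch below. It is an illustration, not the samtools implementation. The function names and the coordinate convention are assumptions: an insertion is reported at the reference base just before it, and a deletion at its first deleted base.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

IndelEvent = Tuple[str, int, int]  # (kind, 1-based reference position, length)


def indels_from_cigar(pos: int, cigar: str) -> List[IndelEvent]:
    """List the indel events of one read, given its POS and CIGAR."""
    events = []
    ref_pos = pos  # reference position of the next aligned base
    for length, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
        n = int(length)
        if op == "I":
            # Inserted bases sit after the previous reference base.
            events.append(("I", ref_pos - 1, n))
        elif op == "D":
            events.append(("D", ref_pos, n))
            ref_pos += n
        elif op in ("M", "N"):
            ref_pos += n
        # S, H and P do not move along the reference.
    return events


def count_indels(reads: Iterable[Tuple[int, str]]) -> Counter:
    """Count how many reads support each indel event."""
    support = Counter()
    for pos, cigar in reads:
        support.update(indels_from_cigar(pos, cigar))
    return support


print(indels_from_cigar(1000, "5M1D2M2I2M7S"))
```

Counting support this way inherits every misalignment in the input, which is the weakness the factors above describe.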


Example Indel




Is this an indel?




Is this an indel?




Local Realignment

 Simple models for calling indels based on the initial alignments (e.g. samtools)
 show high false positive and false negative rates
 More sophisticated algorithms currently being developed
      E.g. Dindel, GATK
 Example algorithm overview
      Scan for all I or D operations across the input BAM file
      For each I or D operation
             Create a new haplotype based on the indel event
             Realign the reads onto the alternative reference
             Count the number of reads that support the indel in the alternative reference
             Make the indel call
 Issues
      Very computationally intensive if testing every possible indel
             Alternatively test a subset of known indels (i.e. genotyping mode)
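
The algorithm overview above can be sketched on a toy scale as follows. This is not how Dindel or GATK work internally. It swaps in the simplest possible "realignment", an ungapped placement with the fewest mismatches, and all function names, the 0-based coordinates and the `min_support` threshold are assumptions.

```python
from typing import List, Union


def apply_indel(ref: str, kind: str, pos: int, value: Union[int, str]) -> str:
    """Build the alternative haplotype for one candidate indel.

    pos is 0-based. For 'D', value is the number of deleted bases starting
    at pos. For 'I', value is the sequence inserted before ref[pos].
    """
    if kind == "D":
        return ref[:pos] + ref[pos + int(value):]
    if kind == "I":
        return ref[:pos] + str(value) + ref[pos:]
    raise ValueError("kind must be 'I' or 'D'")


def best_mismatches(read: str, haplotype: str) -> int:
    """Fewest mismatches over all ungapped placements of read on haplotype."""
    if len(read) > len(haplotype):
        return len(read)
    return min(
        sum(a != b for a, b in zip(read, haplotype[i:i + len(read)]))
        for i in range(len(haplotype) - len(read) + 1)
    )


def count_support(ref: str, reads: List[str], kind: str, pos: int,
                  value: Union[int, str]) -> int:
    """Count reads that fit the alternative haplotype better than the reference."""
    alt = apply_indel(ref, kind, pos, value)
    return sum(best_mismatches(r, alt) < best_mismatches(r, ref) for r in reads)


def call_indel(ref: str, reads: List[str], kind: str, pos: int,
               value: Union[int, str], min_support: int = 2) -> bool:
    """Make the call with a fixed read-count threshold (assumed value)."""
    return count_support(ref, reads, kind, pos, value) >= min_support


ref = "GATTACAGGCTTAACCGTGA"
reads = ["TACAGTAACC", "CAGTAACCGT", "ACAGGCTTAA", "GATTACA"]
print(count_support(ref, reads, "D", 8, 3), call_indel(ref, reads, "D", 8, 3))
```

Scanning every placement of every read against every candidate haplotype is what makes the full problem so expensive, hence the genotyping-mode shortcut noted above.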



Structural Variation

 Several types of structural variations (SVs)
      Large insertions/deletions
      Inversions
      Translocations
      Copy number variations
 Read pair information used to detect these events
      Paired end sequencing of either end of a DNA fragment
             Observe deviations from the expected fragment size
      Presence/absence of mate pairs
      Read depth to detect copy number variations
      Several SV callers published recently
 Run several callers and produce a large set of partially overlapping calls
 [Figure: read pair diagram labelled 76bp, 76bp and 300bp]
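
A rough sketch of how read pair information can flag candidate events is given below. It is not taken from any published SV caller. The `ReadPair` type and the category names are hypothetical, the expected fragment size of 300bp and the tolerance are assumed values, and it assumes a library in which concordant mates map to opposite strands.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Hypothetical summary of one aligned read pair."""
    chrom1: str
    reverse1: bool   # True if the first mate maps to the reverse strand
    chrom2: str
    reverse2: bool
    insert_size: int  # inferred fragment size on the reference


def classify_pair(pair: ReadPair, expected: int = 300, tolerance: int = 60) -> str:
    """Label a read pair by the kind of event it might support."""
    if pair.chrom1 != pair.chrom2:
        return "translocation_candidate"
    if pair.reverse1 == pair.reverse2:
        # Mates in the same orientation are commonly read as an inversion signal.
        return "inversion_candidate"
    if pair.insert_size > expected + tolerance:
        # Mates further apart on the reference than in the sample.
        return "deletion_candidate"
    if pair.insert_size < expected - tolerance:
        # Mates closer on the reference than in the sample.
        return "insertion_candidate"
    return "concordant"


pairs = [
    ReadPair("chr1", False, "chr1", True, 310),
    ReadPair("chr1", False, "chr1", True, 1200),
    ReadPair("chr1", False, "chr1", False, 320),
    ReadPair("chr1", False, "chr7", True, 0),
]
for p in pairs:
    print(classify_pair(p))
```

A single discordant pair proves little; callers typically look for clusters of pairs that agree on the same event, and read depth is needed for copy number changes.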


SV Types




Structural Variations




                                                        Medvedev et al, Nat Meth, 6(11), 2009



What is this?




What is this?




What is this?



                                Mate pairs align in the same orientation




Tomorrow’s Lab 11-12

 BAM Files
      Using samtools to manipulate BAM files
      Visualising reads in a BAM file
 SNP Calling
      Calling SNPs from a BAM file
 Variant Call Format (VCF)
      Introduction to VCF for storing SNPs and meta information
 VCFTools
      Manipulating/comparing/intersecting lists of SNPs in VCF format




Vertebrate Resequencing Informatics   22nd July, 2010

Contenu connexe

Tendances

ApplicationNote-Brian-D-Gregory_1008V1
ApplicationNote-Brian-D-Gregory_1008V1ApplicationNote-Brian-D-Gregory_1008V1
ApplicationNote-Brian-D-Gregory_1008V1Jason Holzman
 
Lessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell linesLessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell linesChris Thorne
 
Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Sri Ambati
 
CRISPR Technology
CRISPR TechnologyCRISPR Technology
CRISPR TechnologyRomilMistry
 
Genetic Engineering and Genomics Notes - MH-CET 2015
Genetic Engineering and Genomics Notes - MH-CET 2015 Genetic Engineering and Genomics Notes - MH-CET 2015
Genetic Engineering and Genomics Notes - MH-CET 2015 Ednexa
 
Genome editing with CRISPR/Cas9
Genome editing with CRISPR/Cas9Genome editing with CRISPR/Cas9
Genome editing with CRISPR/Cas9Saravanan KA
 
CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]
CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]
CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]Barun Kumar Sahu
 
CRISPR- Cas technology: a new antiviral weapon for plants
CRISPR- Cas technology: a new antiviral weapon for plantsCRISPR- Cas technology: a new antiviral weapon for plants
CRISPR- Cas technology: a new antiviral weapon for plantsSachin Bhor
 
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and CharacterizationDr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and CharacterizationJohn Blue
 
Illumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodIllumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodFekaduKorsa
 

Tendances (19)

Crispr/Cas9
Crispr/Cas9Crispr/Cas9
Crispr/Cas9
 
ApplicationNote-Brian-D-Gregory_1008V1
ApplicationNote-Brian-D-Gregory_1008V1ApplicationNote-Brian-D-Gregory_1008V1
ApplicationNote-Brian-D-Gregory_1008V1
 
Lessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell linesLessons learned from high throughput CRISPR targeting in human cell lines
Lessons learned from high throughput CRISPR targeting in human cell lines
 
Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...Next Generation Sequencing and its Applications in Medical Research - Frances...
Next Generation Sequencing and its Applications in Medical Research - Frances...
 
CRISPR Technology
CRISPR TechnologyCRISPR Technology
CRISPR Technology
 
CRISPR
CRISPRCRISPR
CRISPR
 
Crispr cas9
Crispr cas9Crispr cas9
Crispr cas9
 
Genetic Engineering and Genomics Notes - MH-CET 2015
Genetic Engineering and Genomics Notes - MH-CET 2015 Genetic Engineering and Genomics Notes - MH-CET 2015
Genetic Engineering and Genomics Notes - MH-CET 2015
 
p21 mechanism slide
p21 mechanism slidep21 mechanism slide
p21 mechanism slide
 
Crispr cas9
Crispr cas9Crispr cas9
Crispr cas9
 
Eukar transcription
Eukar transcriptionEukar transcription
Eukar transcription
 
Crispr cas9
Crispr cas9Crispr cas9
Crispr cas9
 
Crispr cas
Crispr casCrispr cas
Crispr cas
 
Genome editing with CRISPR/Cas9
Genome editing with CRISPR/Cas9Genome editing with CRISPR/Cas9
Genome editing with CRISPR/Cas9
 
CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]
CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]
CRISPR Based Diagnosis For SARS-CoV-2[FELUDA]
 
CRISPR- Cas technology: a new antiviral weapon for plants
CRISPR- Cas technology: a new antiviral weapon for plantsCRISPR- Cas technology: a new antiviral weapon for plants
CRISPR- Cas technology: a new antiviral weapon for plants
 
Snyder, Evan
Snyder, EvanSnyder, Evan
Snyder, Evan
 
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and CharacterizationDr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
Dr. Ben Hause - Metagenomic Sequencing for Virus Discovery and Characterization
 
Illumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) methodIllumina (sequencing by synthesis) method
Illumina (sequencing by synthesis) method
 

En vedette

Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010Thomas Keane
 
Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraThomas Keane
 
Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...Thomas Keane
 
Cigar cutting and smoking
Cigar cutting and smokingCigar cutting and smoking
Cigar cutting and smokingdilraj singh
 
Patentability Searching
Patentability SearchingPatentability Searching
Patentability SearchingSimple Patents
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Thomas Keane
 
Ch13 Business Continuity Planning and Procedures
Ch13 Business Continuity Planning and ProceduresCh13 Business Continuity Planning and Procedures
Ch13 Business Continuity Planning and ProceduresInformation Technology
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencingDayananda Salam
 
NGS technologies - platforms and applications
NGS technologies - platforms and applicationsNGS technologies - platforms and applications
NGS technologies - platforms and applicationsAGRF_Ltd
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.mkim8
 
Free Download Powerpoint Slides
Free Download Powerpoint SlidesFree Download Powerpoint Slides
Free Download Powerpoint SlidesGeorge
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentationelliehood
 
Best topics for seminar
Best topics for seminarBest topics for seminar
Best topics for seminarshilpi nagpal
 
How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)Steven Hoober
 
What 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From FailureWhat 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From FailureReferralCandy
 

En vedette (20)

Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010Mouse Genomes Project Summary June 2010
Mouse Genomes Project Summary June 2010
 
Next generation sequencing in cloud computing era
Next generation sequencing in cloud computing eraNext generation sequencing in cloud computing era
Next generation sequencing in cloud computing era
 
Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...Assessing the impact of transposable element variation on mouse phenotypes an...
Assessing the impact of transposable element variation on mouse phenotypes an...
 
Cigar cutting and smoking
Cigar cutting and smokingCigar cutting and smoking
Cigar cutting and smoking
 
Patentability Searching
Patentability SearchingPatentability Searching
Patentability Searching
 
Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1Wellcome Trust Advances Course: NGS Course - Lecture1
Wellcome Trust Advances Course: NGS Course - Lecture1
 
Ch13 Business Continuity Planning and Procedures
Ch13 Business Continuity Planning and ProceduresCh13 Business Continuity Planning and Procedures
Ch13 Business Continuity Planning and Procedures
 
Patent
PatentPatent
Patent
 
Next generation sequencing
Next generation sequencingNext generation sequencing
Next generation sequencing
 
NGS technologies - platforms and applications
NGS technologies - platforms and applicationsNGS technologies - platforms and applications
NGS technologies - platforms and applications
 
Introduction to next generation sequencing
Introduction to next generation sequencingIntroduction to next generation sequencing
Introduction to next generation sequencing
 
Kitchen order Ticket
Kitchen order TicketKitchen order Ticket
Kitchen order Ticket
 
A Comparison of NGS Platforms.
A Comparison of NGS Platforms.A Comparison of NGS Platforms.
A Comparison of NGS Platforms.
 
Free Download Powerpoint Slides
Free Download Powerpoint SlidesFree Download Powerpoint Slides
Free Download Powerpoint Slides
 
Slideshare Powerpoint presentation
Slideshare Powerpoint presentationSlideshare Powerpoint presentation
Slideshare Powerpoint presentation
 
Best topics for seminar
Best topics for seminarBest topics for seminar
Best topics for seminar
 
The Minimum Loveable Product
The Minimum Loveable ProductThe Minimum Loveable Product
The Minimum Loveable Product
 
Displaying Data
Displaying DataDisplaying Data
Displaying Data
 
How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)How People Really Hold and Touch (their Phones)
How People Really Hold and Touch (their Phones)
 
What 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From FailureWhat 33 Successful Entrepreneurs Learned From Failure
What 33 Successful Entrepreneurs Learned From Failure
 

Similaire à Overview of methods for variant calling from next-generation sequence data

Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataThomas Keane
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012Dan Gaston
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08Computer Science Club
 
RNA-Seq_Presentation
RNA-Seq_PresentationRNA-Seq_Presentation
RNA-Seq_PresentationToyin23
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingmikaelhuss
 
Knowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsKnowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsGolden Helix Inc
 
Under the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS ResearchersUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers Golden Helix Inc
 
Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Leighton Pritchard
 
Forsharing cshl2011 sequencing
Forsharing cshl2011 sequencingForsharing cshl2011 sequencing
Forsharing cshl2011 sequencingSean Davis
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysismikaelhuss
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceJustin Johnson
 

Similaire à Overview of methods for variant calling from next-generation sequence data (20)

Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
Overview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence dataOverview of methods for variant calling from next-generation sequence data
Overview of methods for variant calling from next-generation sequence data
 
Dgaston dec-06-2012
Dgaston dec-06-2012Dgaston dec-06-2012
Dgaston dec-06-2012
 
20100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture0820100516 bioinformatics kapushesky_lecture08
20100516 bioinformatics kapushesky_lecture08
 
RNA-Seq_Presentation
RNA-Seq_PresentationRNA-Seq_Presentation
RNA-Seq_Presentation
 
Rnaseq forgenefinding
Rnaseq forgenefindingRnaseq forgenefinding
Rnaseq forgenefinding
 
RNA-seq quality control and pre-processing
RNA-seq quality control and pre-processingRNA-seq quality control and pre-processing
RNA-seq quality control and pre-processing
 
Knowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and VariantsKnowing Your NGS Upstream: Alignment and Variants
Knowing Your NGS Upstream: Alignment and Variants
 
Under the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS ResearchersUnder the Hood of Alignment Algorithms for NGS Researchers
Under the Hood of Alignment Algorithms for NGS Researchers
 
Bioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmmBioinformatica 08-12-2011-t8-go-hmm
Bioinformatica 08-12-2011-t8-go-hmm
 
NCBI
NCBINCBI
NCBI
 
Biological databases
Biological databasesBiological databases
Biological databases
 
Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)Microbial Genomics and Bioinformatics: BM405 (2015)
Microbial Genomics and Bioinformatics: BM405 (2015)
 
Forsharing cshl2011 sequencing
Forsharing cshl2011 sequencingForsharing cshl2011 sequencing
Forsharing cshl2011 sequencing
 
RNA-seq differential expression analysis
RNA-seq differential expression analysisRNA-seq differential expression analysis
RNA-seq differential expression analysis
 
Rna seq pipeline
Rna seq pipelineRna seq pipeline
Rna seq pipeline
 
Gen bank
Gen bankGen bank
Gen bank
 
Understanding Genome
Understanding Genome Understanding Genome
Understanding Genome
 
Closing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real ScienceClosing the Gap in Time: From Raw Data to Real Science
Closing the Gap in Time: From Raw Data to Real Science
 
Hong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptxHong_Celine_ES_workshop.pptx
Hong_Celine_ES_workshop.pptx
 

Plus de Thomas Keane

2014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture22014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture2Thomas Keane
 
Mouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingMouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingThomas Keane
 
Large Scale Resequencing: Approaches and Challenges
Large Scale Resequencing: Approaches and ChallengesLarge Scale Resequencing: Approaches and Challenges
Large Scale Resequencing: Approaches and ChallengesThomas Keane
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Thomas Keane
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...Thomas Keane
 
Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010Thomas Keane
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialThomas Keane
 

Plus de Thomas Keane (7)

2014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture22014 Wellcome Trust Advances Course: NGS Course - Lecture2
2014 Wellcome Trust Advances Course: NGS Course - Lecture2
 
Mouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-EditingMouse Genomes Project + RNA-Editing
Mouse Genomes Project + RNA-Editing
 
Large Scale Resequencing: Approaches and Challenges
Large Scale Resequencing: Approaches and ChallengesLarge Scale Resequencing: Approaches and Challenges
Large Scale Resequencing: Approaches and Challenges
 
Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...Enhanced structural variant and breakpoint detection using SVMerge by integra...
Enhanced structural variant and breakpoint detection using SVMerge by integra...
 
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
1000G/UK10K: Bioinformatics, storage, and compute challenges of large scale r...
 
Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010Mouse Genomes Poster - Genetics 2010
Mouse Genomes Poster - Genetics 2010
 
ECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing TutorialECCB 2010 Next-gen sequencing Tutorial
ECCB 2010 Next-gen sequencing Tutorial
 

Overview of methods for variant calling from next-generation sequence data

  • 1. Overview of methods for variant calling from next- generation sequence data Thomas Keane, Vertebrate Resequencing Informatics, Wellcome Trust Sanger Institute Email: tk2@sanger.ac.uk Vertebrate Resequencing Informatics 22nd July, 2010
  • 2. SAM/BAM Format Proliferation of alignment formats over the years: Cigar, psl, gff, xml etc. SAM (Sequence Alignment/Map) format   Single unified format for storing read alignments to a reference genome BAM (Binary Alignment/Map) format   Binary equivalent of SAM   Developed for fast processing/indexing Advantages   Can store alignments from most aligners   Supports multiple sequencing technologies   Supports indexing for quick retrieval/viewing   Compact size (e.g. 112Gbp Illumina = 116Gbytes disk space)   Reads can be grouped into logical groups e.g. lanes, libraries, individuals/ genotypes   Supports second best base call/quality for hard to call bases Possibility of storing raw sequencing data in BAM as replacement to SRF & fastq Vertebrate Resequencing Informatics 22nd July, 2010
  • 3. Read Entries in SAM No. Name Description 1 QNAME Query NAME of the read or the read pair 2 FLAG Bitwise FLAG (pairing, strand, mate strand, etc.) 3 RNAME Reference sequence NAME 4 POS 1-Based leftmost POSition of clipped alignment 5 MAPQ MAPping Quality (Phred-scaled) 6 CIGAR Extended CIGAR string (operations: MIDNSHP) 7 MRNM Mate Reference NaMe (‘=’ if same as RNAME) 8 MPOS 1-Based leftmost Mate POSition 9 ISIZE Inferred Insert SIZE 10 SEQ Query SEQuence on the same strand as the reference 11 QUAL Query QUALity (ASCII-33=Phred base quality) Heng Li , Bob Handsaker , Alec Wysoker , Tim Fennell , Jue Ruan , Nils Homer , Gabor Marth , Goncalo Abecasis , Richard Durbin , and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map format and SAMtools, Bioinformatics, 25:2078-2079 Vertebrate Resequencing Informatics 22nd July, 2010
  • 4. Extended Cigar Format Cigar has been traditionally used as a compact way to represent a sequence alignment Operations include   M - match or mismatch   I - insertion   D - deletion SAM extends these to include   S - soft clip   H - hard clip   N - skipped bases   P – padding E.g. Read: ACGCA-TGCAGTtagacgt Ref: ACTCAGTG—-GT Cigar: 5M1D2M2I2M7S Vertebrate Resequencing Informatics 22nd July, 2010
  • 5. What is the cigar line? E.g. Read: tgtcgtcACGCATG---CAGTtagacgt Ref: ACGCATGCGGCAGT Cigar: Vertebrate Resequencing Informatics 22nd July, 2010
  • 6. Read Group Tag Each lane has a unique RG tag 1000 Genomes   Meta information derived from DCC RG tags   ID: SRR/ERR number   PL: Sequencing platform   PU: Run name   LB: Library name   PI: Insert fragment size   SM: Individual   CN: Sequencing center Vertebrate Resequencing Informatics 22nd July, 2010
  • 7. 1000 Genomes BAM File samtools view –h mybam.bam Vertebrate Resequencing Informatics 22nd July, 2010
  • 8. SAM/BAM Tools Well defined specification for SAM/BAM Several tools and programming APIs for interacting with SAM/BAM files   Samtools - Sanger/C (http://samtools.sourceforge.net)   Convert SAM <-> BAM   Sort, index, BAM files   Flagstat – summary of the mapping flags   Merge multiple BAM files   Rmdup – remove PCR duplicates from the library preparation   Picard - Broad Institute/Java (http://picard.sourceforge.net)   MarkDuplicates, CollectAlignmentSummaryMetrics, CreateSequenceDictionary, SamToFastq, MeanQualityByCycle, FixMateInformation…….   Bio-SamTool – Perl (http://search.cpan.org/~lds/Bio-SamTools/)   Pysam – Python (http://code.google.com/p/pysam/) BAM Visualisation   BamView, LookSeq, Gap5: http://www.sanger.ac.uk/Software   IGV: http://www.broadinstitute.org/igv/v1.3   Tablet: http://bioinf.scri.ac.uk/tablet/ Vertebrate Resequencing Informatics 22nd July, 2010
  • 9. BAM Visualisation http://www.sanger.ac.uk/mousegenomes
  • 10. BAM Visualisation http://www.sanger.ac.uk/mousegenomes
  • 11. SNP Calling
      SNP: single nucleotide polymorphism
      View the bases over a reference position and look for differences
      Homozygous vs heterozygous SNPs
      Factors to consider when calling SNPs
          Base call qualities of each supporting base
          Proximity to
              Small indel
              Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
          Mapping qualities of the reads supporting the SNP
              Low mapping qualities indicate repetitive sequence
          Read length
              Possible to align reads with high confidence to a larger portion of the genome with longer reads
          Paired reads
          Sequencing depth
              Few individuals/strains at high coverage vs. many individuals/strains at low coverage
          1000 Genomes is low coverage sequencing across many individuals
              Population based SNP calling methods
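To make the role of base call qualities and the homozygous/heterozygous distinction concrete, the sketch below scores three diploid genotypes at one position from the observed bases and their Phred qualities. It is a deliberately simplified textbook model, not the method used by samtools or any other caller: it ignores mapping quality, nearby indels and priors, and assumes each error is equally likely to produce any of the other three bases. All names are illustrative.

```python
import math


def base_likelihood(observed, allele, error):
    """P(observed base | true allele) under a uniform error model."""
    return 1.0 - error if observed == allele else error / 3.0


def genotype_log_likelihoods(pileup, ref, alt):
    """log10 likelihood of ref/ref, ref/alt and alt/alt.

    pileup is a list of (base, phred_quality) pairs covering one position.
    """
    scores = {}
    for allele1, allele2 in ((ref, ref), (ref, alt), (alt, alt)):
        total = 0.0
        for base, quality in pileup:
            # Phred scale: Q = -10 * log10(error). Cap at 0.75 so that a
            # quality of 0 means "random base" rather than "always wrong".
            error = min(10 ** (-quality / 10), 0.75)
            # Each read is drawn from either chromosome with probability 1/2.
            probability = (0.5 * base_likelihood(base, allele1, error)
                           + 0.5 * base_likelihood(base, allele2, error))
            total += math.log10(probability)
        scores[allele1 + allele2] = total
    return scores


def call_genotype(pileup, ref, alt):
    """Return the genotype with the highest likelihood."""
    scores = genotype_log_likelihoods(pileup, ref, alt)
    return max(scores, key=scores.get)


homozygous = [("T", 30)] * 10
heterozygous = [("C", 30)] * 5 + [("T", 30)] * 5
print(call_genotype(homozygous, "C", "T"), call_genotype(heterozygous, "C", "T"))
```

With low coverage, as in 1000 Genomes, a handful of reads rarely separates these likelihoods well, which is one motivation for population based calling.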
  • 12. Read Length
  • 13. Mouse SNP
  • 14. Is this a SNP?
  • 15. Short Indel Calling
      Small insertions and deletions observed in the alignment of the read relative to the reference genome
      BAM format
          I or D characters in the Cigar string denote an indel in the read
      Simple method
          Call indels based on the I or D events in the BAM file
              Samtools varFilter
      Factors to consider when calling indels
          Misalignment of the read
              Alignment scoring - often cheaper to introduce multiple SNPs than an indel
              Sufficient flanking sequence on either side of the indel within the read
          Homopolymer runs either side of the indel
          Length of the reads
          Homozygous or heterozygous
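The simple method can be sketched as a walk along each read's Cigar string, recording the reference position of every I or D event and counting how many reads report the same event. This is only an illustration of the idea, not how samtools implements it; the function names and the input positions are made up. Positions are 1-based, and an insertion is reported at the reference base that follows it.

```python
import re
from collections import Counter


def indel_events(pos, cigar):
    """List (operation, reference_position, length) for each I or D in a read."""
    events = []
    ref_pos = pos  # 1-based position of the next reference base
    for length, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
        length = int(length)
        if op in "ID":
            events.append((op, ref_pos, length))
        if op in "MDN":  # only these operations advance along the reference
            ref_pos += length
    return events


def count_indel_support(alignments):
    """Count reads supporting each indel event across (POS, CIGAR) pairs."""
    support = Counter()
    for pos, cigar in alignments:
        support.update(indel_events(pos, cigar))
    return support


# Invented alignments: two reads agree on a 1bp deletion at position 105.
alignments = [(100, "5M1D2M2I2M7S"), (102, "3M1D10M")]
print(count_indel_support(alignments))
```

A real caller would go on to filter these counts by the factors listed above, for example by discarding events too close to the end of a read.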
  • 16. Example Indel
  • 17. Is this an indel?
  • 18. Is this an indel?
  • 19. Local Realignment
      Simple models for calling indels based on the initial alignments (e.g. samtools) show high false positive and false negative rates
      More sophisticated algorithms currently being developed
          E.g. Dindel, GATK
      Example algorithm overview
          Scan for all I or D operations across the input BAM file
          For each I or D operation
              Create a new haplotype based on the indel event
              Realign the reads onto the alternative reference
              Count the number of reads that support the indel in the alternative reference
              Make the indel call
      Issues
          Very computationally intensive if testing every possible indel
              Alternatively test a subset of known indels (i.e. genotyping mode)
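The algorithm overview can be illustrated with a toy version of its inner loop for a single candidate deletion. This is not how Dindel or GATK work: real realignment uses gapped, quality-aware alignment, whereas the sketch below substitutes an ungapped best-mismatch placement of each read against the reference and against the alternative haplotype. The sequences, reads and function names are all invented.

```python
def apply_deletion(reference, start, length):
    """Build the alternative haplotype carrying a deletion (0-based start)."""
    return reference[:start] + reference[start + length:]


def best_mismatches(read, haplotype):
    """Fewest mismatches over all ungapped placements of read on haplotype."""
    if len(read) > len(haplotype):
        return len(read)
    return min(
        sum(a != b for a, b in zip(read, haplotype[offset:offset + len(read)]))
        for offset in range(len(haplotype) - len(read) + 1)
    )


def count_support(reads, reference, alternative):
    """Count reads that fit the alternative haplotype strictly better."""
    return sum(
        best_mismatches(read, alternative) < best_mismatches(read, reference)
        for read in reads
    )


REFERENCE = "AAACCCGGGTTTACGTACGT"
ALTERNATIVE = apply_deletion(REFERENCE, 6, 3)  # candidate deletion of GGG
READS = [
    "ACCCTTTAC",  # spans the deletion, matches the alternative exactly
    "CCCTTTACG",  # spans the deletion, matches the alternative exactly
    "AAACCCGGG",  # matches the reference exactly
    "TTTACGTAC",  # matches both equally, so it is uninformative
]

print(ALTERNATIVE, count_support(READS, REFERENCE, ALTERNATIVE))
```

Repeating this for every candidate indel is what makes the approach expensive, which is why restricting the test to known indels is attractive.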
  • 20. Structural Variation
      Several types of structural variations (SVs)
          Large insertions/deletions
          Inversions
          Translocations
          Copy number variations
      [Figure labels: 76bp, 76bp, 300bp]
      Read pair information used to detect these events
          Paired end sequencing of either end of DNA fragment
              Observe deviations from the expected fragment size
          Presence/absence of mate pairs
          Read depth to detect copy number variations
      Several SV callers published recently
          Run several callers and produce large set of partially overlapping calls
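The read pair signals can be sketched as a per-pair classifier. The thresholds (a 300bp mean fragment size, a 30bp standard deviation, a cutoff of three standard deviations) are assumptions chosen for illustration, and the function is hypothetical. Published SV callers cluster many discordant pairs and combine several signals before making a call, so a single pair is only a hint.

```python
def classify_pair(chrom, mate_chrom, insert_size, same_orientation,
                  mean=300.0, sd=30.0, cutoff=3.0):
    """Label one read pair with the SV type its mapping pattern suggests.

    mean, sd and cutoff are illustrative assumptions about the library.
    """
    if chrom != mate_chrom:
        return "translocation"   # mates map to different chromosomes
    if same_orientation:
        return "inversion"       # mates map in the same orientation
    size = abs(insert_size)
    if size > mean + cutoff * sd:
        return "deletion"        # mates map further apart than expected
    if size < mean - cutoff * sd:
        return "insertion"       # mates map closer together than expected
    return "concordant"


pairs = [
    ("chr1", "chr1", 305, False),
    ("chr1", "chr1", 900, False),
    ("chr1", "chr1", 100, False),
    ("chr1", "chr1", 310, True),
    ("chr1", "chr2", 0, False),
]
for pair in pairs:
    print(pair, classify_pair(*pair))
```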
  • 21. SV Types
  • 22. Structural Variations Medvedev et al, Nat Meth, 6(11), 2009
  • 23. What is this?
  • 24. What is this?
  • 25. What is this? Mate pairs align in the same orientation
  • 26. Tomorrow’s Lab 11-12
      BAM Files
          Using samtools to manipulate BAM files
          Visualising reads in a BAM file
      SNP Calling
          Calling SNPs from a BAM file
      Variant Call Format (VCF)
          Introduction to VCF for storing SNPs and meta information
      VCFTools
          Manipulating/comparing/intersecting lists of SNPs in VCF format