Overview of methods for variant calling from next-generation sequence data
1. Overview of methods for variant calling from next-generation sequence data
Thomas Keane,
Vertebrate Resequencing Informatics,
Wellcome Trust Sanger Institute
Email: tk2@sanger.ac.uk
Vertebrate Resequencing Informatics 22nd July, 2010
2. SAM/BAM Format
Proliferation of alignment formats over the years: Cigar, psl, gff, xml etc.
SAM (Sequence Alignment/Map) format
Single unified format for storing read alignments to a reference genome
BAM (Binary Alignment/Map) format
Binary equivalent of SAM
Developed for fast processing/indexing
Advantages
Can store alignments from most aligners
Supports multiple sequencing technologies
Supports indexing for quick retrieval/viewing
Compact size (e.g. 112Gbp Illumina = 116Gbytes disk space)
Reads can be grouped into logical groups, e.g. lanes, libraries, individuals/genotypes
Supports second best base call/quality for hard to call bases
Possibility of storing raw sequencing data in BAM as a replacement for SRF and fastq
3. Read Entries in SAM
No. Name Description
1 QNAME Query NAME of the read or the read pair
2 FLAG Bitwise FLAG (pairing, strand, mate strand, etc.)
3 RNAME Reference sequence NAME
4 POS 1-Based leftmost POSition of clipped alignment
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR Extended CIGAR string (operations: MIDNSHP)
7 MRNM Mate Reference NaMe (‘=’ if same as RNAME)
8 MPOS 1-Based leftmost Mate POSition
9 ISIZE Inferred Insert SIZE
10 SEQ Query SEQuence on the same strand as the reference
11 QUAL Query QUALity (ASCII-33=Phred base quality)
Heng Li, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis,
Richard Durbin, and 1000 Genome Project Data Processing Subgroup (2009) The Sequence Alignment/Map
format and SAMtools, Bioinformatics, 25:2078-2079
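As an illustration of the eleven mandatory fields, here is a minimal Python sketch that splits one tab-separated SAM line into named fields and decodes the FLAG and QUAL columns. The names `SamRecord`, `parse_sam_line` and `phred_qualities` and the example read are hypothetical; a real pipeline would normally use samtools or a library instead.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields, named as in the table above."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    mrnm: str
    mpos: int
    isize: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> SamRecord:
    """Split one alignment line on tabs; optional tags after field 11 are ignored."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def phred_qualities(qual: str) -> List[int]:
    """QUAL stores Phred base quality as ASCII code minus 33."""
    return [ord(c) - 33 for c in qual]


def is_paired(flag: int) -> bool:
    return bool(flag & 0x1)    # bit 0x1: read is paired in sequencing


def is_reverse_strand(flag: int) -> bool:
    return bool(flag & 0x10)   # bit 0x10: read maps to the reverse strand


# Made-up example record, for illustration only.
EXAMPLE_LINE = "read1\t83\tchr1\t100\t60\t5M\t=\t50\t-55\tACGTA\tIIIII"
record = parse_sam_line(EXAMPLE_LINE)
```

Because FLAG is bitwise, each property is tested with a mask rather than by comparing the whole integer.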
4. Extended Cigar Format
Cigar has traditionally been used as a compact way to represent a sequence alignment
Operations include
M - match or mismatch
I - insertion
D - deletion
SAM extends these to include
S - soft clip
H - hard clip
N - skipped bases
P - padding
E.g. Read:  ACGCA-TGCAGTtagacgt
     Ref:   ACTCAGTG--GT
     Cigar: 5M1D2M2I2M7S
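The example above can be reproduced with a short Python sketch. It assumes the conventions used on the slide: lower-case read bases are soft-clipped, '-' in the read row is a deletion, and '-' in the reference row is an insertion. The function names are hypothetical, and only the M, I, D and S operations are produced.

```python
import re
from itertools import groupby


def cigar_from_alignment(read_row: str, ref_row: str) -> str:
    """Build a CIGAR string from two rows of a gapped alignment.

    The reference row is padded on the right with spaces so that
    trailing soft-clipped read bases still have a column to pair with.
    """
    ref_row = ref_row.ljust(len(read_row))
    ops = []
    for read_base, ref_base in zip(read_row, ref_row):
        if read_base.islower():
            ops.append("S")      # soft clip: base present in read, not aligned
        elif read_base == "-":
            ops.append("D")      # gap in the read: deletion from the reference
        elif ref_base == "-":
            ops.append("I")      # gap in the reference: insertion in the read
        else:
            ops.append("M")      # match or mismatch
    # Run-length encode consecutive identical operations.
    return "".join(f"{len(list(run))}{op}" for op, run in groupby(ops))


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: only M, D and N consume reference."""
    pairs = re.findall(r"(\d+)([MIDNSHP])", cigar)
    return sum(int(n) for n, op in pairs if op in "MDN")


cigar = cigar_from_alignment("ACGCA-TGCAGTtagacgt", "ACTCAGTG--GT")
```

For the slide's alignment this yields the Cigar shown, and `reference_span` gives the 10 reference bases in the Ref row once its gap characters are removed.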
5. What is the cigar line?
E.g. Read:  tgtcgtcACGCATG---CAGTtagacgt
     Ref:          ACGCATGCGGCAGT
     Cigar:
6. Read Group Tag
Each lane has a unique RG tag
1000 Genomes
Meta information derived from DCC
RG tags
ID: SRR/ERR number
PL: Sequencing platform
PU: Run name
LB: Library name
PI: Insert fragment size
SM: Individual
CN: Sequencing center
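In a SAM/BAM header, these tags sit on one tab-separated `@RG` line, and each read then refers to its group by ID. The Python sketch below formats and parses such a line; the helper names and all example values (the ERR number, library and sample names) are made up for illustration.

```python
from typing import Dict

# Tags in the order listed above; only ID is treated as required here.
RG_FIELD_ORDER = ["ID", "PL", "PU", "LB", "PI", "SM", "CN"]


def format_read_group(tags: Dict[str, str]) -> str:
    """Return an @RG header line: '@RG' followed by TAG:value pairs."""
    if "ID" not in tags:
        raise ValueError("a read group needs an ID")
    parts = ["@RG"] + [f"{k}:{tags[k]}" for k in RG_FIELD_ORDER if k in tags]
    return "\t".join(parts)


def parse_read_group(line: str) -> Dict[str, str]:
    """Parse an @RG header line back into a tag -> value dictionary."""
    fields = line.rstrip("\n").split("\t")
    if fields[0] != "@RG":
        raise ValueError("not an @RG header line")
    # Split on the first ':' only, since values may themselves contain ':'.
    return dict(f.split(":", 1) for f in fields[1:])


# Hypothetical lane-level read group.
example_rg = format_read_group({
    "ID": "ERR000001",
    "PL": "ILLUMINA",
    "LB": "lib1",
    "PI": "300",
    "SM": "sample1",
    "CN": "SC",
})
```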
11. SNP Calling
SNP - single nucleotide polymorphism
View the bases over a reference position and look for differences
Homozygous vs heterozygous SNPs
Factors to consider when calling SNPs
Base call qualities of each supporting base
Proximity to
Small indel
Homopolymer run (>4-5bp for 454 and >10bp for Illumina)
Mapping qualities of the reads supporting the SNP
Low mapping quality indicates repetitive sequence
Read length
Possible to align reads with high confidence to a larger portion of the genome with longer reads
Paired reads
Sequencing depth
Few individuals/strains at high coverage vs. many individuals/strains at low coverage
1000 Genomes is low coverage sequencing across many individuals
Population based SNP calling methods
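To make the quality factors concrete, here is a deliberately naive Python sketch of calling a single site from the bases piled up over one reference position. It is not how samtools or any published caller works: the thresholds, the allele-fraction cut-offs and the function name `call_site` are all assumptions chosen for illustration, and real callers use probabilistic models.

```python
from collections import Counter
from typing import List, Tuple

# One observation per read covering the site: (base, base quality, mapping quality)
Observation = Tuple[str, int, int]


def call_site(ref_base: str,
              observations: List[Observation],
              min_base_q: int = 20,
              min_map_q: int = 30,
              min_depth: int = 4) -> str:
    """Classify one position as no_call, hom_ref, het:<base> or hom_alt:<base>."""
    # Drop bases with low base-call quality or from poorly mapped reads.
    usable = [b.upper() for b, bq, mq in observations
              if bq >= min_base_q and mq >= min_map_q]
    if len(usable) < min_depth:
        return "no_call"                      # too little evidence either way
    counts = Counter(usable)
    alt = {b: n for b, n in counts.items() if b != ref_base.upper()}
    if not alt:
        return "hom_ref"
    alt_base, alt_n = max(alt.items(), key=lambda kv: kv[1])
    fraction = alt_n / len(usable)
    if fraction >= 0.8:                       # nearly all reads show the alt base
        return f"hom_alt:{alt_base}"
    if fraction >= 0.2:                       # roughly half: heterozygous
        return f"het:{alt_base}"
    return "hom_ref"                          # a few alt bases: treated as error
```

With low depth the allele fraction is very noisy, which is one reason low coverage projects pool evidence across many individuals instead of calling each sample alone.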
14. Is this a SNP?
15. Short indel Calling
Small insertions and deletions observed in the alignment of the read
relative to the reference genome
BAM format
I or D characters in the CIGAR string denote an indel in the read
Simple method
Call indels based on the I or D events in the BAM file
Samtools varFilter
Factors to consider when calling indels
Misalignment of the read
Alignment scoring - often cheaper to introduce multiple SNPs than an indel
Sufficient flanking sequence either side of the indel within the read
Homopolymer runs either side of the indel
Length of the reads
Homozygous or heterozygous
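The simple method can be sketched in Python as a walk along each read's CIGAR string that records every I or D event with its reference coordinate, then counts how many reads report the same event. This is a toy version working on (POS, CIGAR) pairs rather than a real BAM file; the function names and the coordinate convention (insertions reported at the reference base before them, deletions at the first deleted base) are assumptions.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

IndelEvent = Tuple[str, int, int]   # (type 'I' or 'D', reference position, length)


def indel_events(pos: int, cigar: str) -> List[IndelEvent]:
    """List the indels in one read, given its 1-based POS and CIGAR."""
    events = []
    ref_pos = pos                               # next reference base to be consumed
    for length, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
        n = int(length)
        if op == "I":
            events.append(("I", ref_pos - 1, n))
        elif op == "D":
            events.append(("D", ref_pos, n))
        if op in "MDN":                         # only these advance along the reference
            ref_pos += n
    return events


def count_indel_support(reads: Iterable[Tuple[int, str]]) -> Counter:
    """Count how many reads carry each distinct indel event."""
    support = Counter()
    for pos, cigar in reads:
        support.update(indel_events(pos, cigar))
    return support
```

A candidate seen in only one read, or only near read ends, is a common sign of the misalignment problems listed above.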
17. Is this an indel?
18. Is this an indel?
19. Local Realignment
Simple models for calling indels based on the initial alignments show high false positive and false negative rates, e.g. samtools
More sophisticated algorithms currently being developed
E.g. Dindel, GATK
Example Algorithm overview
Scan for all I or D operations across the input BAM file
For each I or D operation
Create new haplotype based on the indel event
Realign the reads onto the alternative reference
Count the number of reads that support the indel in the alternative reference
Make the indel call
Issues
Very computationally intensive if testing every possible indel
Alternatively test a subset of known indels (i.e. genotyping mode)
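The algorithm overview can be illustrated with a toy Python sketch. It is not the Dindel or GATK method: it builds the alternative haplotype for one candidate indel, scores each read against both haplotypes by its best ungapped placement (fewest mismatches), and counts the reads that fit the alternative strictly better. All function names and the example sequences are hypothetical.

```python
from typing import List, Tuple


def apply_deletion(ref: str, offset: int, length: int) -> str:
    """Alternative haplotype with `length` bases removed at 0-based `offset`."""
    return ref[:offset] + ref[offset + length:]


def apply_insertion(ref: str, offset: int, inserted: str) -> str:
    """Alternative haplotype with `inserted` placed before 0-based `offset`."""
    return ref[:offset] + inserted + ref[offset:]


def best_ungapped_mismatches(read: str, haplotype: str) -> int:
    """Fewest mismatches over every gap-free placement of the read."""
    if len(read) > len(haplotype):
        return len(read)
    return min(
        sum(a != b for a, b in zip(read, haplotype[i:i + len(read)]))
        for i in range(len(haplotype) - len(read) + 1)
    )


def count_support(ref: str, alt: str, reads: List[str]) -> Tuple[int, int]:
    """Return (reads fitting alt better, reads fitting ref better); ties count for neither."""
    alt_support = ref_support = 0
    for read in reads:
        to_ref = best_ungapped_mismatches(read, ref)
        to_alt = best_ungapped_mismatches(read, alt)
        if to_alt < to_ref:
            alt_support += 1
        elif to_ref < to_alt:
            ref_support += 1
    return alt_support, ref_support


# Made-up window and candidate: delete the 3 bases "GGC" at offset 7.
ref_window = "GATTACAGGCTTAACCGT"
alt_window = apply_deletion(ref_window, 7, 3)
reads = ["TACATTAAC", "ACATTAACC", "GATTACAGG", "TTAACCGT"]
alt_n, ref_n = count_support(ref_window, alt_window, reads)
```

Each candidate needs every nearby read rescored against a new haplotype, which is why testing all possible indels is so expensive and why a genotyping mode restricted to known indels is attractive.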
20. Structural Variation
Several types of structural variations (SVs)
Large insertions/deletions
Inversions
Translocations
Copy number variations
[Figure: read pair diagram, labels 76bp, 76bp, 300bp]
Read pair information used to detect these events
Paired-end sequencing of either end of a DNA fragment
Observe deviations from the expected fragment size
Presence/absence of mate pairs
Read depth to detect copy number variations
Several SV callers published recently
Run several callers and produce a large set of partially overlapping calls
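The read pair signals can be sketched as a simple per-pair classifier in Python. This is only a rough illustration, not any published SV caller: the function name, the expected fragment size (mean 300bp, standard deviation 30bp) and the 3 standard deviation cut-off are assumptions, and it presumes a library where normal pairs map to opposite strands.

```python
def classify_pair(chrom1: str, strand1: str,
                  chrom2: str, strand2: str,
                  insert_size: int,
                  mean: float = 300.0,
                  sd: float = 30.0,
                  k: float = 3.0) -> str:
    """Label one read pair by the kind of structural variant it might support."""
    if chrom1 != chrom2:
        return "translocation_candidate"   # mates map to different chromosomes
    if strand1 == strand2:
        return "inversion_candidate"       # mates map in the same orientation
    size = abs(insert_size)
    if size > mean + k * sd:
        return "deletion_candidate"        # mates further apart than expected
    if size < mean - k * sd:
        return "insertion_candidate"       # mates closer together than expected
    return "concordant"
```

A single discordant pair proves little; callers typically require clusters of pairs with the same signature, which is also why different callers produce only partially overlapping call sets.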
25. What is this?
Mate pairs align in the same orientation
26. Tomorrow’s Lab 11-12
BAM Files
Using samtools to manipulate BAM files
Visualising reads in a BAM file
SNP Calling
Calling SNPs from a BAM file
Variant Call Format (VCF)
Introduction to VCF for storing SNPs and meta information
VCFTools
Manipulating/comparing/intersecting lists of SNPs in VCF format