Hong_Celine_ES_workshop.pptx

Exome Sequencing &
Variant Analysis

Sequencers
Ion Torrent Proton
Illumina MiSeq
Illumina HiSeq 2500
Illumina X Ten
PacBio

Sequencers
Machine           Cost      Methods                         Throughput per run  Read length  Error rate
Illumina MiSeq    $128K     Small genomes, targeted gene    1.5-2 Gb            2x300        0.8%
Ion Torrent       $80K      Small genomes, targeted gene    1 Gb                400          1.71%
Illumina NextSeq  $250K     Exomes/transcriptome            120 Gb              2x150        0.8%
Illumina HiSeq    $654K     Genomes/exomes/transcriptomics  600 Gb              2x150        0.76%
Illumina X Ten    $10 Mil.  Genomes                         1.6 Tb              2x150        0.5%
PacBio            $695K     Genomes                         100 Mb              15K          12.86%

Next-generation sequencing overview
                               Exomes (ES/WES)             Genomes (GS/WGS)
Cost                           ~$1000                      ~$2000 (~$1000 with HiSeq X Ten)
Size of BAM files              ~10 Gb                      ~200 Gb
DNA                            Targeted and captured       Sheared DNA
What you can get               Most coding regions (+UTR)  Coding and non-coding
Variants that can be examined  SNVs, indels (CNVs)         SNVs, indels, CNVs, structural variations

How exomes and genomes look!
• Genome has even coverage
• Even a deletion can be seen by eye
[Figure: WT vs. deletion, shown for ES and GS]

All about exomes
• What is an exome?
– Sequencing of targeted exonic regions
– ~2% of genome
• Important to know
– You will NOT get a whole exome!
– Not all exons in all genes are captured!
– Important to distinguish negative results from no data
• Coverage will vary across targets

What to know about your data
• What is the sequence depth?
– The depth of your sequencing
– 10x, 30x, 50x, 100x
• Read length
– How long are your reads?
• What software was used to align?
• What variant caller was used to call variants?
• Which reference was used?
• Which capture kit was used? What is covered?
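As a rough illustration of the depth question above, expected mean depth can be estimated as (number of reads x read length) / target size. The sketch below is a back-of-the-envelope calculation only: the function name mean_depth and the example numbers are hypothetical, and it ignores duplicates, off-target reads, and uneven capture.

```shell
# mean_depth READS READ_LENGTH TARGET_BASES
# Rough expected depth = (reads * read length) / target size.
# Assumption: every read maps on target and no read is a duplicate.
mean_depth() {
    awk -v n="$1" -v len="$2" -v target="$3" \
        'BEGIN { printf "%.1f\n", n * len / target }'
}

# Example: 40 million 150 bp reads over a 60 Mb capture target.
mean_depth 40000000 150 60000000
```

Real depth will be lower and will vary from target to target, which is why the capture kit and its covered regions matter.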

Partial list of capture enrichment kits
Manufacturer  Kit                           Regions targeted            Bases covered
Illumina      Nextera Rapid Capture         Exons + UTRs + miRNA        62 Mb
NimbleGen     SeqCap EZ Exome               Exons + UTR                 96 Mb
NimbleGen     SeqCap EZ MedExome            Disease-associated regions  47 Mb
Agilent       SureSelect Human All Exon V6  Exons + UTRs                60 Mb
Agilent       Clinical Research Exome       Disease-relevant targets    51 Mb

Overview of next-generation sequencing processing
Sequence → Align reads/mapping → Variant calling → Annotate → Downstream analysis

Popular tools
Task                 Popular tools
Align reads/mapping  BWA-mem, Novoalign, Isaac
Variant calling      GATK (Broad), Platypus (Wellcome Trust), Starling (Illumina)
Annotating variants  ANNOVAR, VEP, SnpEff

Popular sources
Task                          Popular sources
Control population frequency  ExAC, 1000GP, ESP
Annotation                    RefSeq, Ensembl, UCSC genes, GENCODE
Visualization                 UCSC genome browser, IGV
Clinical relevance            HGMD, OMIM, CGD, ClinVar

Exome sequencing overview
https://en.wikipedia.org/wiki/Exome_sequencing#/media/File:E

GATK Best Practices Website
https://www.broadinstitute.org/gatk/guide/best-practices

Introduction to exome
sequencing analysis workflow

ES pipeline overview
Sequence → Alignment → Quality control → Variant discovery → Quality control → Annotate variants → Analyze
Stages: Pre-processing, Variant discovery, Analysis

Files and tools used
File type  Description
FASTQ      Raw reads from sequencer
SAM        Sequence Alignment/Map
BAM        Binary version of SAM
gVCF/VCF   Variant Call Format
Tool                    Purpose
BWA mem                 Read alignment to reference
Picard                  Mark duplicates
GATK (HaplotypeCaller)  QC/variant calling
Samtools                Sort SAM/BAM, convert SAM<->BAM

Reference genome
• There are different versions of the human reference genome
GRCh37 (Genome Reference Consortium)
– Chr notation: 1, 2 … X, Y, MT
– Mitochondrial sequence: yes
– Additional sequences included: unlocalized, unplaced, alternate loci
hg19 (UCSC genome browser)
– Chr notation: chr1, chr2 … chrX, chrY, chrM
– Mitochondrial sequence: copied from previous release
– Additional sequences included: unlocalized, unplaced, alternate loci
b37/b37+decoy/hs37d5 (1000GP)
– Chr notation: 1, 2 … X, Y, MT
– Mitochondrial sequence: yes
– Additional sequences included: unlocalized, unplaced, “decoy” sequence, human herpesvirus 4 type 1
• Unlocalized: chromosome known, exact location unknown
• Unplaced: known to originate from the human genome, chromosome unknown
• Alternate loci: alternate representations of specific human regions

Index files
• Large files in next-generation sequencing analysis need index files
• An index lets a program access the data it needs efficiently, rather than having to read the whole file
File type  Index
FASTA      *.fai
BAM        *.bai
VCF        *.vcf.idx

Why BWA + GATK HaplotypeCaller?
• Widely accepted as the “conventional” way of
processing next-gen data
• Well assessed
• Well documented
• Software is supported
• Community support for troubleshooting or
information

Data processing
• Align FASTQ to the reference genome (BWA-mem) → SAM/BAM file
• Sort mapped reads
• Mark duplicates (Picard)
• Realign indels (GATK IndelRealigner)
• Recalibrate base quality scores (GATK BaseRecalibrator)
• Generate genotype likelihoods for EVERY base, gVCFs (GATK HaplotypeCaller)
• Joint genotype, VCF (GATK GenotypeGVCFs)
• Recalibrate variants (GATK VariantRecalibrator) OR apply hard filter (<30 samples)
• Variant classification annotation (ANNOVAR)
• Variant effect annotation (ANNOVAR)
• Filter variants to identify candidates of interest

FASTQ and BAM/SAM
• FASTQ record: sequence ID, sequence, sequence quality
• BAM/SAM record (after alignment): sequence ID, bitwise flags, chromosome, position, MappingQ, CIGAR, paired-end chr, paired-end position, observed template length
http://samtools.github.io/hts-specs/SAMv1.pdf
Pre-processing
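The bitwise flags field of a SAM record packs several yes/no properties into one integer. As a minimal sketch, the hypothetical helper below (describe_flag is not part of samtools) decodes a handful of bits whose values come from the SAM specification linked above; the remaining bits are omitted for brevity.

```shell
# describe_flag FLAG: print the names of a few SAM flag bits that are set.
# Bit values follow the SAM specification; only five bits are decoded here.
describe_flag() {
    flag=$1
    out=""
    [ $((flag & 1)) -ne 0 ]    && out="$out paired"
    [ $((flag & 4)) -ne 0 ]    && out="$out unmapped"
    [ $((flag & 16)) -ne 0 ]   && out="$out reverse_strand"
    [ $((flag & 256)) -ne 0 ]  && out="$out secondary"
    [ $((flag & 1024)) -ne 0 ] && out="$out duplicate"
    echo "${out# }"
}

# Example: a reverse-strand read that has been marked as a duplicate.
describe_flag 1040
```

The duplicate bit (1024) is the one set by the duplicate-marking step, and the secondary bit (256) is the one set when bwa mem is run with -M.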

Marking duplicates: why is it necessary?
• It is assumed that each read corresponds to an independent DNA fragment from randomly sheared DNA
• However, PCR amplification can create duplicates
– Identify them based on the start + stop of reads
– Choose the best and ignore the rest
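The idea of "same start + stop, keep the best" can be sketched on a toy table. This is a simplified illustration and not Picard's actual algorithm: the input format (name, chromosome, start, end, score) and the function name mark_dups are assumptions made for the example.

```shell
# mark_dups: read "name chrom start end score" lines on stdin and label
# each read "keep" or "dup". Reads sharing chrom/start/end form one group;
# the highest-scoring read in each group is kept.
mark_dups() {
    sort -k2,2 -k3,3n -k4,4n -k5,5nr |
    awk '{ key = $2 ":" $3 ":" $4
           print $1, (key in seen ? "dup" : "keep")
           seen[key] = 1 }'
}

# Toy input: r1, r2 and r4 share the same start and stop on chromosome 16.
mark_dups <<'EOF'
r1 16 100 200 900
r2 16 100 200 950
r3 16 150 250 870
r4 16 100 200 910
EOF
```

Here r2 has the best score in its group and is kept, r1 and r4 are labelled dup, and r3 is kept because nothing else shares its coordinates.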
Broad Institute
Pre-processing QC

Indel realignment
Broad Institute
Pre-processing QC
• Misalignment around indels causes a high number of false SNP calls
• These regions are identified and locally realigned to minimize mismatches

Recalibrate Base Quality Scores
Broad Institute
Pre-processing QC
• Base quality scores are produced by sequencers
• Quality scores are inaccurate and biased
– Prone to various technical errors
– QS are often over- or under-estimated
• Recalibration identifies and corrects non-random technical error
– Physics or chemistry of the sequencing reactions
– Manufacturing flaws in the equipment
• Error covariates, e.g.
– Reported quality score
– Position within the read (machine cycle)
– Preceding and current nucleotide (sequencing chemistry)
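A base quality score is Phred-scaled: Q = -10 * log10(P), where P is the probability that the base call is wrong. The sketch below shows the conversion in both directions; the function names are hypothetical, and the "empirical" calculation is a simplification of what a recalibration tool does (no covariates, no smoothing).

```shell
# phred_to_prob Q: error probability implied by Phred score Q, P = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.4f\n", 10 ^ (-q / 10) }'
}

# empirical_q ERRORS BASES: Phred score implied by an observed mismatch rate.
empirical_q() {
    awk -v e="$1" -v n="$2" \
        'BEGIN { printf "%.1f\n", -10 * log(e / n) / log(10) }'
}

phred_to_prob 30      # reported Q30 claims 1 error in 1000 bases
empirical_q 10 1000   # 10 mismatches in 1000 bases
```

If bases reported as Q30 show an empirical quality of 20, the sequencer over-estimated their quality, which is the kind of bias recalibration is meant to correct.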

Recalibrate Base Quality Scores
Broad Institute
Pre-processing QC
[Figure: over-estimation and under-estimation of quality scores]
• GATK BQSR builds a model based on a set of known variants
• Adjusts the base quality scores in the data based on the model

Now we are ready to call
variants!

Data processing
[Flowchart repeated from above, with the pre-processing steps checked off ✔]

Generate genotype likelihoods in each sample (gVCF)
• For a single sample, calculates normalized Phred-scaled likelihoods (PL) for each genotype
• Here the Phred-scaled “likelihood of the genotype” reflects “the probability that the genotype is not correct”: the higher the PL, the less likely the genotype
• Normalized so that the PL of the most likely genotype is 0
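The normalization can be made concrete with a small calculation. This sketch assumes three made-up genotype likelihoods (hom-ref, het, hom-alt), Phred-scales each one, and subtracts the smallest value so the most likely genotype ends up at 0. The function name is hypothetical and this is not HaplotypeCaller's internal code.

```shell
# pl_from_likelihoods L_REF L_HET L_ALT
# Phred-scale each likelihood (-10 * log10), then subtract the minimum
# so that the most likely genotype has PL = 0.
pl_from_likelihoods() {
    awk -v a="$1" -v b="$2" -v c="$3" 'BEGIN {
        raw[1] = -10 * log(a) / log(10)
        raw[2] = -10 * log(b) / log(10)
        raw[3] = -10 * log(c) / log(10)
        min = raw[1]
        for (i = 2; i <= 3; i++) if (raw[i] < min) min = raw[i]
        printf "%.0f,%.0f,%.0f\n", raw[1] - min, raw[2] - min, raw[3] - min
    }'
}

# Example: the heterozygous genotype is by far the most likely.
pl_from_likelihoods 0.000001 0.1 0.001
```

The output has the same shape as the PL field in a VCF: three comma-separated integers, with 0 marking the called genotype.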
Broad Institute
Variant Discovery

What is joint genotyping?
• If we analyze Sample 1 or Sample N alone, we are not confident that the variant is real
• If we see the variant in both samples, we are more confident that there is real variation at this site in this cohort
Broad Institute
Variant Discovery

Generate new variant quality score using VQSR
Broad Institute
Variant Discovery-QC
• What is Variant Quality Score Recalibration?
– NOT adjusting existing scores!!
– Generates a new score, VQSLOD (variant quality score log-odds)
• Approach
– Machine learning to profile good variants vs bad variants
– Using multiple dimensions (5-8, typically)
– Uses INFO annotations for each variant (e.g., allele count, allele frequency, etc.)

Generate new variant quality score using VQSR
Train a model using a “truth” set of known variants → Apply the model to your samples → Keep or toss each variant
http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr

Alternatively! IF you have few samples
http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr
• Apply hard filter!
• Define how to filter your variants, or use the default filter parameters (a variant fails if any condition is met)
– QualByDepth (QD) < 2.0
– FisherStrand (FS) > 60.0
– RMSMappingQuality (MQ) < 40.0
– MappingQualityRankSumTest (MQRankSum) < -12.5
– ReadPosRankSumTest (ReadPosRankSum) < -8.0 (only het calls)
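To show how these thresholds combine, the sketch below applies the same SNP cutoffs to a toy table instead of a real VCF. The input layout (ID, QD, FS, MQ, MQRankSum, ReadPosRankSum) and the function name are assumptions for illustration, and treating a missing annotation (".") as "does not fail" is also an assumption. In the workshop itself this filtering is done by GATK VariantFiltration.

```shell
# hard_filter_snps: read "ID QD FS MQ MQRankSum ReadPosRankSum" lines on
# stdin and print "ID PASS" or "ID my_snp_filter".
# Assumption: a missing annotation (".") never triggers the filter.
hard_filter_snps() {
    awk '
    function lt(v, t) { return v != "." && v + 0 < t }
    function gt(v, t) { return v != "." && v + 0 > t }
    {
        fail = lt($2, 2.0) || gt($3, 60.0) || lt($4, 40.0) ||
               lt($5, -12.5) || lt($6, -8.0)
        print $1, (fail ? "my_snp_filter" : "PASS")
    }'
}

hard_filter_snps <<'EOF'
snp1 15.2 1.3 60.0 0.5 0.1
snp2 1.1 2.0 60.0 0.0 0.0
snp3 20.0 75.4 59.0 . .
EOF
```

In this toy input snp1 passes, snp2 fails on QD, and snp3 fails on FS even though its rank-sum annotations are missing.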

Final VCF
Variant Discovery-Final VCF
Header
Body

Annotating Variants using ANNOVAR
Variant Annotation
Gene annotation: RefSeq Gene, mitochondrial variants, UCSC/Ensembl, GENCODE/CCDS
Region-based annotation: conserved genomic elements, transcription factor binding sites, cytogenetic bands, segmental duplications, GWAS...
Filter-based annotation: 1000GP, dbSNP, ESP, ExAC, non-synonymous variant annotation (SIFT/PolyPhen2/MutationTaster/LRT/FATHMM/CADD...), ClinVar...
For comprehensive list, see http://annovar.openbioinformatics.org/en/latest/

Question
• We have fastq files
from a “sample”
• Are there any
deleterious variants in
this person?
• We’ll only be looking
at chromosome 16

Go to the class folder and make your folder
Copy the commands to your folder and open the file for your convenience
TRICK FOR COPY/PASTE!
Open 2 terminal windows:
One to view commands
One to RUN commands (will log in interactively)
Terminal 1 (to view commands)
Create your folder, copy the commands.sh file to your folder, and open it
cd /hpcdata/scratch/2016_Exome_Training
mkdir directory_name
cd directory_name
cp ../commands.sh .
vi commands.sh
Terminal 2 (to run commands)
Log in interactively and go to your folder
qrsh -l h_vmem=10G,mem_free=5G   # log in to an interactive node
cd /hpcdata/scratch/2016_Exome_Training/directory_name

Load modules
module load GATK
module load FastQC
module load BWA
module load SAMtools
module load VCFtools
module load IGV
module load annovar/1.0
module load BEDTools
module load picard
module load BCFtools

QC on fastq/bams (bad quality)

QC on fastq/bams (good quality)

Step 1: QC and align a fastq file
1) run fastqc on fastq files
– generates .html file
with qc statistics
– Generates compressed
folders with images
fastqc ../sample.fastq -o ./
Pre-processing

Step 2: align fastq files
• Align with BWA-MEM, using -M to mark shorter split hits as secondary alignments and -R to annotate read groups (e.g., different samples)
– ID, LB, SM, PU, and PL tags are required
bwa mem -R "@RG\tID:dad\tLB:dad\tSM:dad\tPU:FCC189PACXX\tPL:ILLUMINA" -M \
    ../human_g1k_v37.fasta ../sample.fastq | gzip > ./sample.sam.gz
Pre-processing

Step 3: sort sam, convert to bam
• Aligned reads need to be sorted
java -jar ${EBROOTPICARD}/picard.jar SortSam I=sample.sam.gz O=sample.bam \
    SO=coordinate CREATE_INDEX=true
samtools index sample.bam
Now we have mapped and sorted reads
Pre-processing

Step 4: Mark duplicates
• Use Picard to mark duplicates
java -jar ${EBROOTPICARD}/picard.jar MarkDuplicates INPUT=sample.bam \
    OUTPUT=sample.dedup.bam AS=true CREATE_INDEX=true M=sample.metrics.txt
How many reads were marked as duplicates?
Pre-processing

Step 5: Realign indels
• Create a target list of potential indel sites
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T RealignerTargetCreator \
    -R ../human_g1k_v37.fasta -I sample.dedup.bam -o sample.intervals \
    -L ../S03723314_Regions_chr16.fix.bed
Let’s visualize a potential indel region!
Pre-processing

Viewing potential indel site!
• Use samtools tview to view the BAM file
samtools tview sample.dedup.bam ../human_g1k_v37.fasta
• Press "g" to open the "Go to" prompt, type 16:46744672, and press Enter
Pre-processing

• Realign using the target list
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T IndelRealigner \
    -R ../human_g1k_v37.fasta -I sample.dedup.bam \
    -targetIntervals sample.intervals -o sample.realigned.bam
Pre-processing

Step 6: Recalibrate base QS
• Build a model
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T BaseRecalibrator \
    -R ../human_g1k_v37.fasta \
    -knownSites ../dbsnp_138.b37.vcf \
    -knownSites ../Mills_and_1000G_gold_standard.indels.b37.vcf \
    -I sample.realigned.bam -L ../S03723314_Regions_chr16.fix.bed \
    -o sample.recal_report.grp
• Recalibrate scores
-T PrintReads -R ../human_g1k_v37.fasta -I sample.realigned.bam -o sample.recal.bam -BQSR sample.recal_report.grp -L
Pre-processing

Step 1: Generate gVCFs
Broad Institute
Variant Discovery
java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T
HaplotypeCaller -R ../human_g1k_v37.fasta -I
sample.recal.bam -stand_call_conf 30.0 -stand_emit_conf
10.0 -o sample.g.vcf -ERC BP_RESOLUTION -L

Step 2: genotype gVCFs
Broad Institute
Variant Discovery

-T GenotypeGVCFs
-R ../human_g1k_v37.fasta
--max_alternate_alleles 2
-stand_call_conf 30
-stand_emit_conf 10
--variant sample.g.vcf
-o sample.vcf

Step 3: Recalibrate variants VQSR (>30 samples)
Broad Institute
Variant Discovery
Build model
Apply model
Take the output and do indel recalibration (see commandline)

Step 3: Apply hard filter (<30 samples)
Broad Institute
Variant Discovery
• Extract SNPs
-T SelectVariants
-L ../S03723314_Regions_chr16.fix.bed
-V sample.vcf
-selectType SNP
-o sample_SNPs.vcf
• Apply filters on SNPs
-T VariantFiltration
-V sample_SNPs.vcf
--filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"
--filterName "my_snp_filter"
-o sample.filtered_SNPs.vcf

Broad Institute
Variant Discovery
• Extract indels
-T SelectVariants
-V sample.vcf
-selectType INDEL
-o sample.indels.vcf
• Apply filters on indels
-T VariantFiltration
-V sample.indels.vcf
--filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0"
--filterName "my_indel_filter"
-o sample.filtered_indels.vcf

Broad Institute
Variant Discovery
• Merge two vcf files

-T CombineVariants
--variant:indel sample.filtered_indels.vcf
--variant:snps sample.filtered_SNPs.vcf
-o sample.filtered_SNP_indels.vcf
-genotypeMergeOptions PRIORITIZE
-priority snps,indel

Data processing
[Flowchart repeated from above, with the pre-processing and variant discovery steps checked off ✔ ✔]

Step1: Annotate variants
Variant analysis
• We will use ANNOVAR to annotate the VCF file
• First split multiallelic sites, then left-align and normalize the VCF file using bcftools
bcftools norm -m-both -o sample.filtered_SNP_indels_step1.vcf sample.filtered_SNP_indels.vcf
bcftools norm -f ../human_g1k_v37.fasta -o sample.filtered_SNP_indels_step2.vcf sample.filtered_SNP_indels_step1.vcf

Variant analysis
• Gene annotation
• AA change, classification of mutation
• Population frequency
• ESP 6500
• 1000GP
• SNP annotation (snp138)
• Deleteriousness of non-synonymous variants annotation
• SIFT
• PolyPhen
• LRT
• MutationTaster
• FATHMM
• PROVEAN
• VEST3
• CADD
• DANN
• FATHMM-MKL
• etc.
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#overview

Variant analysis
table_annovar.pl sample.filtered_SNP_indels_step2.vcf -buildver hg19 \
    -out sample.filtered_snps.annotated.vcf -remove \
    -protocol refGene,esp6500siv2_all,ljb26_all -operation g,f,f \
    -nastring . -vcfinput /sysapps/cluster/software/annovar/1.0/humandb/
http://annovar.openbioinformatics.org/en/latest/user-guide/download/

Now we have annotated VCF file!!
Variant analysis

Now what?
What do you do with all these
variants?

Variant Analysis…
like finding a needle in a ‘deep’
haystack

Look for evidence of variants of
interest

Further filters needed
• High number of variants
• The goal is to narrow
down your list of variants
• Eliminate variants that
are not interesting
[Figure: 17,687 SNPs filtered against in-house exome, dbSNPs, and 1000 genomes; 2 novel in chr10]
PLoS One. 2012;7(1):e29708

Things to consider
Filter based on:
• Population frequency/novel variants
• Synonymous vs non-synonymous
• Exonic vs intronic
• Genes of interest/understand the gene
• Clinically-relevant genes (OMIM, HGMD)
• Predictions (quality of variants, deleteriousness)
• Study phenotype (literature search)

Step2: Filter variants
Variant analysis
• Filter for nonsynonymous mutations (keeping the header line)
awk 'NR == 1 || /nonsynonymous/' \
    sample.filtered_snps.annotated.vcf.hg19_multianno.txt \
    > sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.txt
• Filter by population frequency < 0.01 in ESP (column 11)
awk -F'\t' 'NR == 1 || $11 < 0.01' \
    sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.txt \
    > sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.0.01.txt
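The frequency filter can be written as a small reusable function that makes the handling of missing values explicit. This is a sketch under stated assumptions: the function name filter_rare is hypothetical, the input is a tab-delimited table with one header line, and "." (the -nastring used for annotation) is taken to mean the variant is absent from the database and should be kept.

```shell
# filter_rare COLUMN THRESHOLD: read a tab-delimited table on stdin and keep
# the header plus rows whose COLUMN is below THRESHOLD or is "." (absent).
filter_rare() {
    awk -F'\t' -v col="$1" -v max="$2" \
        'NR == 1 || $col == "." || $col + 0 < max + 0'
}

# Toy table: only GeneB (rare) and GeneC (not in the database) survive.
printf 'Gene\tFreq\nGeneA\t0.2\nGeneB\t0.001\nGeneC\t.\n' | filter_rare 2 0.01
```

With the annotated table from this workshop the call would use column 11 for the ESP frequency, as in the command above.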

Variant analysis
• Download the ANNOVAR text file to your local computer and open it with Excel
• Open a terminal on your computer
cd Desktop
sftp username@ai-submit1.niaid.nih.gov
(input password when prompted)
• Go to your working directory and fetch the file
cd /hpcdata/scratch/2016_Exome_Training/XXXX
get sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.0.01.txt
• The file should now be on your desktop; open it with Excel

Variant analysis
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/

Open the file in excel
• Examine variants that can be potentially disease causing
• Examine for rarity of the variants
• Look at the prediction scores
• Candidate:
– chr16:23456431 G>C, exonic, in gene COG7
– Good quality variant
– SIFT score: 0.04 (considered deleterious)
– Polyphen2_HDIV: 0.999 (probably damaging)
– CADD: 27.8 (top 1% of damaging variants)
• Is this gene known to be disease-related? (OMIM, ClinVar, HGMD)
• Check frequency in ExAC database
• Probably a good candidate to follow-up!
Variant analysis

Follow-up
• Do literature search!!!
– THIS IS A MUST!
– Time consuming, but much needed to interpret
variants accordingly

Additional Analysis
• Trio/Family Based Analysis
– PhaseByTransmission (GATK)
– GEMINI https://gemini.readthedocs.org/en/latest/
– pVAAST
• Somatic Variation
– Tumor/normal
– MuTect
http://archive.broadinstitute.org/cancer/cga/mutect
• Association Studies
• Copy number variation analysis

GWAS Association studies
• A typical analysis:
– Identify SNPs where one allele is significantly more common in cases than controls
– Hardy-Weinberg Chi Square
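The "more common in cases than controls" comparison can be illustrated with a 2x2 allele-count table and a Pearson chi-square statistic (1 degree of freedom). This is a minimal sketch with made-up counts and a hypothetical function name. It shows the allelic association test only, not the Hardy-Weinberg check, and real studies use dedicated tools with multiple-testing correction.

```shell
# allelic_chisq CASE_ALT CASE_REF CTRL_ALT CTRL_REF
# Pearson chi-square for a 2x2 table of allele counts:
#   chi2 = N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
allelic_chisq() {
    awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
        n = a + b + c + d
        printf "%.2f\n", n * (a * d - b * c) ^ 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    }'
}

# Example: the alternate allele is seen 60 times in cases and 40 in controls,
# out of 100 alleles in each group.
allelic_chisq 60 40 40 60
```

A statistic above 3.84 corresponds to p < 0.05 for a single test, but a genome-wide scan needs a far stricter threshold because so many SNPs are tested.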
http://en.wikipedia.org/wiki/Genome-wide_association_study http://www.ebi.ac.uk/gwas/

GWAS manhattan plot
• Only applicable for large cohort

Analysis Tools for genotype-phenotype
association from sequencing data
• Plink-seq -
http://atgu.mgh.harvard.edu/plinkseq/
• EPACTS -
http://genome.sph.umich.edu/wiki/EPACTS
• SNPTestv2 / GRANVIL -
http://www.well.ox.ac.uk/GRANVIL/

CNVs from exome
• High variability of read-depth in exomes
• CNV prediction is a very challenging problem
– High false-positive rate
– Breakpoint resolution is limited by the targets
– Long range CNVs have higher positive predictive value
• Useful when other alternatives (SNPs or aCGH) are not
available
• Lots of prediction tools!! (some examples below)
Population Caller Somatic Caller
XHMM
CoNIFER
EXCAVATOR
ExomeDepth
CONTRA
ADTEx
ExomeCNV
Varscan2
Control-FREEC

Popular utility of exome sequencing
eXome Hidden Markov Model
http://atgu.mgh.harvard.edu/xhmm/tutorial.shtml

Suggestions
• Request a Helix/Biowulf account
– Programs are already installed
– You can request programs to be installed

Credit: http://omogemura.com/thank-you/

Hands-on tutorials
• https://github.com/niaid/ACE/tree/master/DNASeq
• Gemini: https://gist.github.com/oleraj/cd33616a29bf56c62c63e68c788f3d72
• Old: /hpcdata/scratch/2016_Exome_Training/commands.sh
