Exome Sequencing &
Variant Analysis
Exome Sequencing
Sequencers
Ion Torrent Proton
Illumina MiSeq
Illumina HiSeq 2500
Illumina X Ten
PacBio
Sequencers
Machine | Cost | Methods | Throughput per run | Read length | Error rate
Illumina MiSeq | $128K | Small genomes, targeted gene | 1.5-2 Gb | 2x300 | 0.8%
Ion Torrent | $80K | Small genomes, targeted gene | 1 Gb | 400 | 1.71%
Illumina NextSeq | $250K | Exomes/transcriptome | 120 Gb | 2x150 | 0.8%
Illumina HiSeq | $654K | Genomes/exomes/transcriptomics | 600 Gb | 2x150 | 0.76%
Illumina X Ten | $10 million | Genomes | 1.6 Tb | 2x150 | 0.5%
PacBio | $695K | Genomes | 100 Mb | 15K | 12.86%
Next-generation sequencing overview
 | Exomes (ES/WES) | Genomes (GS/WGS)
Cost | ~$1000 | ~$2000 (~$1000 with HiSeq X Ten)
Size of bam files | ~10 Gb | ~200 Gb
DNA | Targeted and captured | Sheared DNA
What you can get | Most coding regions (+UTR) | Coding and non-coding
Variants that can be examined | SNVs, indels (CNVs) | SNVs, indels, CNVs, structural variations
How exomes and genomes look!
• Genome has even coverage
• Even a deletion can be observed by eye
(Figure: read coverage of WT and deletion samples, shown for ES and GS)
All about exomes
• What is an exome?
– Sequencing of targeted exonic regions
– ~2% of genome
• Important to know
– You will NOT get a whole exome!
– Not all exons in all genes are captured!
– Important to distinguish negative results from no data
• Coverage will vary in targets
What to know about your data
• What is the sequence depth?
– How deeply your sample was sequenced (reads covering each base)
– 10x, 30x, 50x, 100x
• Read length
– How long are your reads?
• What software was used to align?
• What variant caller was used to call variants?
• Which reference was used?
• Which capture kit was used? What is covered?
Partial list of capture enrichment kits
Manufacturer | Kit | Regions targeted | Bases covered
Illumina | Nextera Rapid Capture | Exons + UTRs + miRNA | 62 Mb
Nimblegen | SeqCap EZ Exome | Exons + UTR | 96 Mb
Nimblegen | SeqCap EZ MedExome | Disease-associated regions | 47 Mb
Agilent | SureSelect Human All Exon V6 | Exons + UTRs | 60 Mb
Agilent | Clinical Research Exome | Disease-relevant targets | 51 Mb
Overview of next-generation sequencing processing
Sequence -> Align reads/mapping -> Variant calling -> Annotate -> Downstream analysis
Popular tools
Task | Popular tools
Align reads/mapping | BWA-mem, Novoalign, Isaac
Variant calling | GATK (Broad), Platypus (Wellcome Trust), Starling (Illumina)
Annotating variants | ANNOVAR, VEP, snpEff
Popular source
Task | Popular source
Control population frequency | ExAC, 1000GP, ESP
Annotation | RefSeq, Ensembl, UCSC genes, GENCODE
Visualization | UCSC genome browser, IGV
Clinical relevance | HGMD, OMIM, CGD, ClinVar
Exome sequencing overview
https://en.wikipedia.org/wiki/Exome_sequencing#/media/File:E
GATK Best Practices Website
https://www.broadinstitute.org/gatk/guide/best-practices
Introduction to exome
sequencing analysis workflow
ES pipeline overview
• Pre-processing: Sequence -> Alignment -> Quality control
• Variant discovery: Variant discovery -> Quality control
• Analysis: Annotate variants -> Analyze
Files and tools used
File type | Origin
FASTQ | Raw reads from sequencer
SAM | Sequence Alignment/Map
BAM | Binary version of SAM
gVCF/VCF | Variant call format

Tool | Purpose
BWA mem | Read alignment to reference
Picard | Mark duplicates
GATK (HaplotypeCaller) | QC/variant calling
Samtools | Sort sam/bam, convert sam<->bam
Reference genome
• There are different versions of the human reference genome
Reference name | Chr notation | Mitochondrial sequence | Additional sequences included
GRCh37 (Genome Reference Consortium) | 1, 2 … X, Y, MT | Yes | Unlocalized, unplaced, alternate loci
hg19 (UCSC genome browser) | chr1, chr2 … chrX, chrY, chrM | Copied from previous release | Unlocalized, unplaced, alternate loci
b37/b37+decoy/hs37d5 (1000GP) | 1, 2 … X, Y, MT | Yes | Unlocalized, unplaced, “decoy” sequence, human herpesvirus 4 type 1
• Unlocalized: chromosome known, exact location unknown
• Unplaced: known to originate from human genome, chromosome unknown
• Alternate loci: alternate representation of specific human regions
Index files
• Index files are needed for files in next-generation analysis, as
file sizes are big!
• Enables program to efficiently access the data, rather than
having to read the whole file
File type | Index
FASTA | *.fai
BAM | *.bai
VCF | *.vcf.idx
Reference FASTA index file
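To make the point about efficient access concrete, here is a minimal sketch of how a FASTA index lets a program jump straight to a base without reading the whole file. It relies on three of the columns stored in a .fai file (byte offset of the first base, bases per line, bytes per line). The tiny FASTA string and the function name are made up for illustration.

```python
def fai_byte_offset(seq_offset, line_bases, line_width, pos):
    """Byte offset of 1-based base `pos`, using .fai-style columns.

    seq_offset: byte offset of the first base of the sequence
    line_bases: number of bases on each full line
    line_width: number of bytes on each full line (bases + newline)
    """
    zero_based = pos - 1
    full_lines = zero_based // line_bases
    return seq_offset + full_lines * line_width + zero_based % line_bases


# Toy FASTA held in memory (hypothetical content, 10 bases per line).
fasta = ">chr16\nACGTACGTAC\nGGGTTTAAAC\nCC\n"
seq_offset = len(">chr16\n")  # the header line occupies the first 7 bytes

for pos in (4, 11, 21):
    offset = fai_byte_offset(seq_offset, 10, 11, pos)
    print(pos, fasta[offset])
```

With a real reference the same arithmetic is followed by a file seek, which is why tools ask for the .fai file next to the FASTA.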
Why BWA+ GATK Haplotype caller?
• Widely accepted as the “conventional” way of
processing next-gen data
• Well assessed
• Well documented
• Software is supported
• Community support for troubleshooting or
information
Data processing
Pre-processing
• Align FASTQ to the reference genome (BWA-mem), producing a SAM/BAM file
• Sort mapped reads
• Mark duplicates (Picard)
• Realign indels (GATK IndelRealigner)
• Recalibrate base quality scores (GATK BaseRecalibrator)
Variant discovery
• Generate genotype likelihoods for EVERY base, gVCFs (GATK HaplotypeCaller)
• Joint genotype, VCF (GATK GenotypeGVCFs)
• Recalibrate variants (GATK VariantRecalibrator) OR apply hard filter (<30 samples)
Analysis
• Variant classification annotation (ANNOVAR)
• Variant effect annotation (ANNOVAR)
• Filter variants to identify candidates of interest
FASTQ
• Fields: sequence ID, sequence, sequence quality
BAM/SAM
• Fields: sequence ID, bitwise flags, chromosome, position, mapping quality (MAPQ), CIGAR, paired-end (mate) chromosome, paired-end (mate) position, observed template length
http://samtools.github.io/hts-specs/SAMv1.pdf
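The bitwise flags field packs several yes/no properties of a read into one integer. The sketch below decodes it using the bit values defined in the SAM specification linked above. The short property names and the function name are chosen here for illustration only.

```python
# Bit values from the SAM specification; the labels are informal shorthand.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


print(decode_flag(99))
print(decode_flag(1024))
```

The 0x400 bit is the one set by the duplicate-marking step, and 0x100 is the bit that bwa mem -M sets on secondary hits.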
Marking duplicates and why is it necessary?
• It is assumed that each read corresponds to an
independent DNA fragment from randomly
sheared DNA
• However, PCR amplification can cause duplicates
– Identify based on start + stop of reads
– Choose the best and ignore the rest
Broad Institute
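The "identify based on start + stop, choose the best, ignore the rest" idea can be sketched in a few lines. This is a toy illustration, not Picard's actual algorithm. The read dictionaries, the qual_sum field used as the "best" criterion, and the function name are all assumptions made for the example.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads that would be flagged as duplicates.

    Reads sharing chromosome, start, end and strand are treated as copies
    of one fragment; the copy with the highest qual_sum is kept.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        best = max(group, key=lambda r: r["qual_sum"])
        for read in group:
            if read is not best:
                duplicates.add(read["name"])
    return duplicates


# Hypothetical reads: r1 and r2 have identical coordinates, r3 does not.
reads = [
    {"name": "r1", "chrom": "16", "start": 100, "end": 200, "strand": "+", "qual_sum": 3000},
    {"name": "r2", "chrom": "16", "start": 100, "end": 200, "strand": "+", "qual_sum": 3500},
    {"name": "r3", "chrom": "16", "start": 150, "end": 250, "strand": "+", "qual_sum": 2800},
]
print(mark_duplicates(reads))
```

Note that duplicates are only marked, not deleted; downstream tools skip reads carrying the duplicate flag.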
Indel realignment
• Misalignment around indels causes a high number of spurious SNP calls
• These regions are identified and locally realigned to minimize mismatches
Recalibrate Base Quality Scores
• Base scores are produced by sequencers
• Quality scores are inaccurate and biased
– Prone to various technical errors
– QS are often over- or under estimated
• To identify and correct non-random technical error
– Physics or the chemistry of sequencing reactions
– Manufacturing flaws in the equipment
• Error covariates e.g.
– Reported quality score
– Position within the read (machine cycle)
– Preceding and current nucleotide (sequencing chemistry)
(Figure: over-estimation and under-estimation of base quality scores)
• GATK BQSR builds a model based on the known variants set
• Adjusts the base quality scores in the data based on the model
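A small numeric sketch may help show what over-estimation means. A Phred quality Q corresponds to an error probability of 10^(-Q/10), and recalibration compares the reported Q with the error rate actually observed at sites not in the known variants set. The +1/+2 smoothing and the function names below are assumptions for illustration; GATK's real model also conditions on the covariates listed earlier.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a Phred-scaled quality score."""
    return 10 ** (-q / 10)


def empirical_quality(mismatches, observations):
    """Phred-scaled quality from observed mismatches at non-variant sites.

    The +1 / +2 smoothing is an illustrative choice that avoids log10(0).
    """
    error_rate = (mismatches + 1) / (observations + 2)
    return -10 * math.log10(error_rate)


reported_q = 30
observed_q = empirical_quality(mismatches=9, observations=998)
print(phred_to_error_prob(reported_q), observed_q)
```

In this made-up bin the sequencer reported Q30 but the data support only about Q20, so the reported scores are over-estimated and would be adjusted downward.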
Now we are ready to call
variants!
Generate genotype likelihoods in each sample (gVCF)
• For a single sample, HaplotypeCaller calculates normalized Phred-scaled likelihoods (PL) for the possible genotypes
• The PL of a genotype is Phred-scaled, PL = -10 × log10 P(data | genotype), so a higher PL means the genotype is less likely to be correct
• Normalized so that the PL of the most likely genotype is 0
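The normalization can be sketched as follows: Phred-scale each genotype likelihood, then subtract the smallest value so the most likely genotype gets 0. The input likelihoods and the function name are hypothetical; HaplotypeCaller computes the likelihoods itself from the reads.

```python
import math


def normalized_pl(genotype_likelihoods):
    """Phred-scale likelihoods and shift them so the best genotype is 0."""
    raw = [-10 * math.log10(p) for p in genotype_likelihoods]
    best = min(raw)
    return [round(value - best) for value in raw]


# Hypothetical P(data | genotype) for genotypes 0/0, 0/1 and 1/1.
likelihoods = [1e-10, 0.5, 1e-5]
print(normalized_pl(likelihoods))
```

Here the heterozygous genotype has PL 0 and is the call; the gap to the next-smallest PL reflects how confident that call is.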
What is joint genotyping?
• If we analyze Sample 1 or Sample N alone, we are not confident that the variant is real
• If we analyze both samples together, we are more confident that there is real variation at this site in this cohort
Generate new variant quality score using VQSR
• What is Variant Quality Score Recalibration?
– NOT adjusting scores!!
– Generates a new score, VQSLOD (variant quality score log-odds)
• Approach
– Machine learning to profile good variants vs bad variants
– Uses multiple dimensions (5-8, typically)
– Uses INFO annotations for each variant (e.g., allele count, allele frequency, etc.)
• Train a model using a “truth” set of known variants
• Apply the model to your samples to decide which variants to keep and which to toss
http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr
Alternatively! IF you have few samples
• Apply hard filter!
• Define how to filter your variants or use the default filter parameters; a SNP is filtered out if any of these holds:
– QualByDepth (QD) < 2.0
– FisherStrand (FS) > 60.0
– RMSMappingQuality (MQ) < 40.0
– MappingQualityRankSumTest (MQRankSum) < -12.5
– ReadPosRankSumTest (ReadPosRankSum) < -8.0 (only het calls)
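The thresholds above amount to a simple rule: a SNP fails if any one annotation crosses its cutoff. The sketch below expresses that rule over a dictionary of INFO values. The function name is made up, and skipping annotations that are absent is an assumption of this sketch rather than a statement about how GATK treats them.

```python
# (annotation, test that returns True when the value is bad)
SNP_HARD_FILTERS = [
    ("QD", lambda v: v < 2.0),
    ("FS", lambda v: v > 60.0),
    ("MQ", lambda v: v < 40.0),
    ("MQRankSum", lambda v: v < -12.5),
    ("ReadPosRankSum", lambda v: v < -8.0),
]


def failed_snp_filters(info):
    """Return the annotations that cause a SNP to fail the hard filter.

    Annotations missing from `info` are skipped (an assumption of this sketch).
    """
    return [name for name, is_bad in SNP_HARD_FILTERS
            if name in info and is_bad(info[name])]


# Hypothetical INFO values for two SNPs.
print(failed_snp_filters({"QD": 15.2, "FS": 3.1, "MQ": 60.0}))
print(failed_snp_filters({"QD": 1.5, "FS": 75.0, "MQ": 60.0}))
```

An empty list means the variant passes; anything else names the annotations responsible for filtering it.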
Final VCF
(Figure: a final VCF file, made up of a header and a body)
Annotating Variants using ANNOVAR
• Gene annotation: RefSeq Gene, mitochondrial variants, UCSC/Ensembl, GENCODE/CCDS
• Region-based annotation: conserved genomic elements, transcription factor binding sites, cytogenetic bands, segmental duplications, GWAS...
• Filter-based annotation: 1000GP, dbSNP, ESP, ExAC, non-synonymous variant annotation (SIFT/PolyPhen2/MutationTaster/LRT/FATHMM/CADD...), ClinVar...
For comprehensive list, see http://annovar.openbioinformatics.org/en/latest/
Shall we start!?!
Hands-on exercises
Question
• We have fastq files
from a “sample”
• Are there any
deleterious variants in
this person?
• We’ll only be looking
at chromosome 16
Go to the class folder and make your folder
Copy the commands to your folder and open it for your convenience
Trick for copy/paste!
Open 2 terminal windows:
• One to view commands
• One to RUN commands (will log in interactively)
Terminal 1 (to view commands)
Create your folder, copy the commands.sh file to your folder, and open it
cd /hpcdata/scratch/2016_Exome_Training
mkdir directory_name
cd directory_name
cp ../commands.sh .
vi commands.sh
Terminal 2 (to run commands)
Log in interactively and go to your folder
qrsh -l h_vmem=10G,mem_free=5G   # log in to an interactive node
cd /hpcdata/scratch/2016_Exome_Training/directory_name
Load modules
module load GATK
module load FastQC
module load BWA
module load SAMtools
module load VCFtools
module load IGV
module load annovar/1.0
module load BEDTools
module load picard
module load BCFtools
QC on fastq/bams (bad quality)
QC on fastq/bams (good quality)
Step 1: QC and align a fastq file
1) Run FastQC on the fastq files
– Generates an .html file with QC statistics
– Generates compressed folders with images
fastqc ../sample.fastq -o ./
Step 2: align fastq files
• Align with BWA MEM, using -M to mark shorter split hits as secondary alignments and -R to annotate read groups (e.g., different samples)
– ID, LB, SM, PU, and PL tags are required
bwa mem -M \
  -R "@RG\tID:dad\tLB:dad\tSM:dad\tPU:FCC189PACXX\tPL:ILLUMINA" \
  ../human_g1k_v37.fasta ../sample.fastq | gzip > ./sample.sam.gz
Step 3: sort sam, convert to bam
• Aligned reads need to be sorted
java -jar ${EBROOTPICARD}/picard.jar SortSam \
  I=sample.sam.gz O=sample.bam \
  SO=coordinate CREATE_INDEX=true
samtools index sample.bam
Now we have mapped and sorted reads
Step 4: Mark duplicates
• Use Picard to mark duplicates
java -jar ${EBROOTPICARD}/picard.jar MarkDuplicates \
  INPUT=sample.bam OUTPUT=sample.dedup.bam \
  AS=true CREATE_INDEX=true M=sample.metrics.txt
How many reads were marked as duplicates? (check sample.metrics.txt)
Step 5: Realign indels
• Create a target list of potential indel sites
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R ../human_g1k_v37.fasta \
  -I sample.dedup.bam \
  -L ../S03723314_Regions_chr16.fix.bed \
  -o sample.intervals
Let’s visualize a potential indel region!
Step 5: Realign indels
Viewing a potential indel site!
• Use samtools tview to view the bam file
samtools tview sample.dedup.bam ../human_g1k_v37.fasta
• Press "g" to prompt "Go to", type 16:46744672, and press Enter
Step 5: Realign indels
• Realign reads at the sites in the target list
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T IndelRealigner \
  -R ../human_g1k_v37.fasta \
  -I sample.dedup.bam \
  -targetIntervals sample.intervals \
  -L ../S03723314_Regions_chr16.fix.bed \
  -o sample.realigned.bam
Step 6: Recalibrate base QS
• Build a model
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T BaseRecalibrator \
  -R ../human_g1k_v37.fasta \
  -knownSites ../dbsnp_138.b37.vcf \
  -knownSites ../Mills_and_1000G_gold_standard.indels.b37.vcf \
  -I sample.realigned.bam \
  -L ../S03723314_Regions_chr16.fix.bed \
  -o sample.recal_report.grp
• Recalibrate scores
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T PrintReads \
  -R ../human_g1k_v37.fasta \
  -I sample.realigned.bam \
  -BQSR sample.recal_report.grp \
  -L ../S03723314_Regions_chr16.fix.bed \
  -o sample.recal.bam
Step 1: Generate gVCFs
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T HaplotypeCaller \
  -R ../human_g1k_v37.fasta \
  -I sample.recal.bam \
  -stand_call_conf 30.0 \
  -stand_emit_conf 10.0 \
  -ERC BP_RESOLUTION \
  -L ../S03723314_Regions_chr16.fix.bed \
  -o sample.g.vcf
Step 2: genotype gVCFs
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T GenotypeGVCFs \
  -R ../human_g1k_v37.fasta \
  --max_alternate_alleles 2 \
  -stand_call_conf 30 \
  -stand_emit_conf 10 \
  --variant sample.g.vcf \
  -o sample.vcf
Step 3: Recalibrate variants VQSR (>30 samples)
Build model
Apply model
Take the output and do indel recalibration (see commandline)
Step 3: Apply hard filter (<30 samples)
• Extract SNPs
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T SelectVariants \
  -R ../human_g1k_v37.fasta \
  -L ../S03723314_Regions_chr16.fix.bed \
  -V sample.vcf \
  -selectType SNP \
  -o sample_SNPs.vcf
• Apply filters on SNPs
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R ../human_g1k_v37.fasta \
  -L ../S03723314_Regions_chr16.fix.bed \
  -V sample_SNPs.vcf \
  --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
  --filterName "my_snp_filter" \
  -o sample.filtered_SNPs.vcf
• Extract indels
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T SelectVariants \
  -R ../human_g1k_v37.fasta \
  -V sample.vcf \
  -selectType INDEL \
  -o sample.indels.vcf
• Apply filters on indels
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R ../human_g1k_v37.fasta \
  -V sample.indels.vcf \
  --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" \
  --filterName "my_indel_filter" \
  -o sample.filtered_indels.vcf
• Merge the two VCF files
java -jar $EBROOTGATK/GenomeAnalysisTK.jar \
  -T CombineVariants \
  -R ../human_g1k_v37.fasta \
  -L ../S03723314_Regions_chr16.fix.bed \
  --variant:indel sample.filtered_indels.vcf \
  --variant:snps sample.filtered_SNPs.vcf \
  -genotypeMergeOptions PRIORITIZE \
  -priority snps,indel \
  -o sample.filtered_SNP_indels.vcf
Step1: Annotate variants
• We will use ANNOVAR to annotate the VCF file
• First split multiallelic sites, then left-align the VCF file using bcftools
bcftools norm -m-both -o sample.filtered_SNP_indels_step1.vcf sample.filtered_SNP_indels.vcf
bcftools norm -f ../human_g1k_v37.fasta -o sample.filtered_SNP_indels_step2.vcf sample.filtered_SNP_indels_step1.vcf
• Gene annotation
– AA change, classification of mutation
• Population frequency
– ESP 6500
– 1000GP
• SNP annotation (SNP138)
• Deleteriousness annotation of non-synonymous variants
– SIFT
– PolyPhen
– LRT
– MutationTaster
– FATHMM
– PROVEAN
– VEST3
– CADD
– DANN
– FATHMM-MKL
– etc.
http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#overview
table_annovar.pl sample.filtered_SNP_indels_step2.vcf \
  /sysapps/cluster/software/annovar/1.0/humandb/ \
  -buildver hg19 \
  -out sample.filtered_snps.annotated.vcf \
  -remove \
  -protocol refGene,esp6500siv2_all,ljb26_all \
  -operation g,f,f \
  -nastring . \
  -vcfinput
http://annovar.openbioinformatics.org/en/latest/user-guide/download/
Now we have annotated VCF file!!
Variant analysis
Now what?
What do you do with all these
variants?
Variant Analysis…
like finding a needle in a ‘deep’
haystack
Look for evidence of variants of
interest
Further filters needed
• High number of variants
• The goal is to narrow
down your list of variants
• Eliminate variants that
are not interesting
(Figure: 17,687 SNPs filtered against dbSNP, 1000 Genomes, and in-house exomes; 2 novel in chr10. PLoS One. 2012;7(1):e29708)
Things to consider
Filter based on:
• Population frequency/novel variants
• Synonymous vs non-synonymous
• Exonic vs intronic
• Genes of interest/understand the gene
• Clinically relevant genes (OMIM, HGMD)
• Predictions (quality of variants, deleteriousness)
• Study phenotype (literature search)
Step2: Filter variants
• Filter by nonsynonymous mutations (keep the header line)
awk 'NR == 1 || /nonsynonymous/' \
  sample.filtered_snps.annotated.vcf.hg19_multianno.txt \
  > sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.txt
• Filter by population frequency < 0.01 in ESP (column 11)
awk -F'\t' 'NR == 1 || $11 < 0.01' \
  sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.txt \
  > sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.0.01.txt
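The same two filters can also be written against column names instead of column positions, which is less fragile if the protocol list changes. This sketch assumes the ANNOVAR output columns are named ExonicFunc.refGene and esp6500siv2_all and that missing frequencies are written as "." (as set by -nastring above). The rows, gene names and function name are hypothetical.

```python
import csv
import io


def filter_rare_nonsynonymous(handle, max_freq=0.01, freq_column="esp6500siv2_all"):
    """Keep nonsynonymous variants that are rare or absent in the population."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        if "nonsynonymous" not in row["ExonicFunc.refGene"]:
            continue
        freq = row[freq_column]
        # "." means the variant was not seen in the database at all.
        if freq == "." or float(freq) < max_freq:
            kept.append(row)
    return kept


# Made-up table with only the columns this sketch needs.
rows = [
    ["Chr", "Start", "Gene.refGene", "ExonicFunc.refGene", "esp6500siv2_all"],
    ["16", "1000", "GENE_A", "nonsynonymous SNV", "0.002"],
    ["16", "2000", "GENE_B", "synonymous SNV", "0.001"],
    ["16", "3000", "GENE_C", "nonsynonymous SNV", "0.25"],
    ["16", "4000", "GENE_D", "nonsynonymous SNV", "."],
]
table = "\n".join("\t".join(r) for r in rows) + "\n"

for row in filter_rare_nonsynonymous(io.StringIO(table)):
    print(row["Gene.refGene"], row["esp6500siv2_all"])
```

For the real file, pass an open file handle for the multianno .txt output instead of the in-memory table.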
• Download the ANNOVAR text file to your local computer and open it in Excel
• Open a terminal on your local computer
cd Desktop
sftp username@ai-submit1.niaid.nih.gov
• Input your password when prompted
• Go to your working directory and get the file
cd /hpcdata/scratch/2016_Exome_Training/XXXX
get sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.0.01.txt
• The file should now be on your desktop; open it with Excel
Open the file in excel
• Examine variants that can be potentially disease causing
• Examine for rarity of the variants
• Look at the prediction scores
• Candidate:
– 16 23456431 23456431 G->C, exonic, in gene COG7
– Good quality variant
– SIFT score: 0.04 (considered deleterious)
– Polyphen2_HDIV: 0.999 (probably damaging)
– CADD: 27.8 (top 1% of damaging variants)
• Is this gene known to be disease-related? (OMIM, ClinVar, HGMD)
• Check frequency in ExAC database
• Probably a good candidate to follow-up!
Variant analysis
Follow-up
• Do literature search!!!
– THIS IS A MUST!
– Time consuming, but much needed to interpret
variants accordingly
Beyond SNVs
Additional Analysis
• Trio/Family Based Analysis
– PhaseByTransmission (GATK)
– GEMINI https://gemini.readthedocs.org/en/latest/
– pVAAST
• Somatic Variation
– Tumor/normal
– MuTect http://archive.broadinstitute.org/cancer/cga/mutect
• Association Studies
• Copy number variation analysis
GWAS Association studies
• A typical analysis:
– Identify SNPs where one allele is significantly more common in cases than controls
– Hardy-Weinberg Chi Square
http://en.wikipedia.org/wiki/Genome-wide_association_study
http://www.ebi.ac.uk/gwas/
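As a rough sketch of "one allele significantly more common in cases than controls", the allele counts for a single SNP can be put in a 2x2 table and tested with a chi-square statistic (1 degree of freedom). This is the basic allelic association test, not the Hardy-Weinberg check, and the counts and function name are invented for illustration. Real analyses use dedicated tools and correct for testing many SNPs.

```python
import math


def allelic_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Chi-square statistic and p-value (1 df) for a 2x2 allele count table."""
    a, b, c, d = case_alt, case_ref, control_alt, control_ref
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom the upper tail is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical allele counts: 100 cases and 100 controls (200 alleles each).
chi2, p_value = allelic_chi_square(case_alt=60, case_ref=140,
                                   control_alt=30, control_ref=170)
print(chi2, p_value)
```

Repeating this for every SNP and plotting -log10(p) by genomic position is what produces a Manhattan plot.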
GWAS manhattan plot
• Only applicable for large cohort
Analysis Tools for genotype-phenotype
association from sequencing data
• Plink-seq - http://atgu.mgh.harvard.edu/plinkseq/
• EPACTS - http://genome.sph.umich.edu/wiki/EPACTS
• SNPTestv2 / GRANVIL - http://www.well.ox.ac.uk/GRANVIL/
CNVs from exome
• High variability of read-depth in exomes
• CNV prediction is a very challenging problem
– High false positives
– Break-points limited due to targets
– Long range CNVs have higher positive predictive value
• Useful when other alternatives (SNPs or aCGH) are not
available
• Lots of prediction tools!! (some examples below)
Population Caller Somatic Caller
XHMM
CoNIFER
EXCAVATOR
ExomeDepth
CONTRA
ADTEx
ExomeCNV
Varscan2
Control-FREEC
Popular utility of exome sequencing
eXome Hidden Markov Model
http://atgu.mgh.harvard.edu/xhmm/tutorial.shtml
Suggestions
• Request a Helix/Biowulf account
– Programs are already installed
– You can request programs to be installed
Credit: http://omogemura.com/thank-you/
Hands-on tutorials
• https://github.com/niaid/ACE/tree/master/DNASeq
• Gemini: https://gist.github.com/oleraj/cd33616a29bf56c62c63e68c788f3d72
• Old: /hpcdata/scratch/2016_Exome_Training/commands.sh

 
Protein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modelingProtein fold recognition and ab_initio modeling
Protein fold recognition and ab_initio modeling
 
Homology modeling: Modeller
Homology modeling: ModellerHomology modeling: Modeller
Homology modeling: Modeller
 
Protein docking
Protein dockingProtein docking
Protein docking
 
Protein function prediction
Protein function predictionProtein function prediction
Protein function prediction
 
Protein structure prediction with a focus on Rosetta
Protein structure prediction with a focus on RosettaProtein structure prediction with a focus on Rosetta
Protein structure prediction with a focus on Rosetta
 
Biological networks
Biological networksBiological networks
Biological networks
 
UNIX Basics and Cluster Computing
UNIX Basics and Cluster ComputingUNIX Basics and Cluster Computing
UNIX Basics and Cluster Computing
 
Statistical applications in GraphPad Prism
Statistical applications in GraphPad PrismStatistical applications in GraphPad Prism
Statistical applications in GraphPad Prism
 
Intro to JMP for statistics
Intro to JMP for statisticsIntro to JMP for statistics
Intro to JMP for statistics
 
Categorical models
Categorical modelsCategorical models
Categorical models
 
Better graphics in R
Better graphics in RBetter graphics in R
Better graphics in R
 
Automating biostatistics workflows using R-based webtools
Automating biostatistics workflows using R-based webtoolsAutomating biostatistics workflows using R-based webtools
Automating biostatistics workflows using R-based webtools
 
Overview of statistical tests: Data handling and data quality (Part II)
Overview of statistical tests: Data handling and data quality (Part II)Overview of statistical tests: Data handling and data quality (Part II)
Overview of statistical tests: Data handling and data quality (Part II)
 
Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)Overview of statistics: Statistical testing (Part I)
Overview of statistics: Statistical testing (Part I)
 
GraphPad Prism: Curve fitting
GraphPad Prism: Curve fittingGraphPad Prism: Curve fitting
GraphPad Prism: Curve fitting
 
Appendix: Crash course in R and BioConductor
Appendix: Crash course in R and BioConductorAppendix: Crash course in R and BioConductor
Appendix: Crash course in R and BioConductor
 

Dernier

Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Silpa
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Bookingroncy bisnoi
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Servicemonikaservice1
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxSuji236384
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicinesherlingomez2
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPirithiRaju
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICEayushi9330
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)Areesha Ahmad
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Joonhun Lee
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and ClassificationsAreesha Ahmad
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Servicenishacall1
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Mohammad Khajehpour
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfSumit Kumar yadav
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsSérgio Sacani
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...ssuser79fe74
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticssakshisoni2385
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learninglevieagacer
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oManavSingh202607
 

Dernier (20)

Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
Site Acceptance Test .
Site Acceptance Test                    .Site Acceptance Test                    .
Site Acceptance Test .
 
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Alandi Call Me 7737669865 Budget Friendly No Advance Booking
 
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts ServiceJustdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
Justdial Call Girls In Indirapuram, Ghaziabad, 8800357707 Escorts Service
 
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptxPSYCHOSOCIAL NEEDS. in nursing II sem pptx
PSYCHOSOCIAL NEEDS. in nursing II sem pptx
 
IDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicineIDENTIFICATION OF THE LIVING- forensic medicine
IDENTIFICATION OF THE LIVING- forensic medicine
 
Pests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdfPests of mustard_Identification_Management_Dr.UPR.pdf
Pests of mustard_Identification_Management_Dr.UPR.pdf
 
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICESAMASTIPUR CALL GIRL 7857803690  LOW PRICE  ESCORT SERVICE
SAMASTIPUR CALL GIRL 7857803690 LOW PRICE ESCORT SERVICE
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
 
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
 
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
Dopamine neurotransmitter determination using graphite sheet- graphene nano-s...
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
module for grade 9 for distance learning
module for grade 9 for distance learningmodule for grade 9 for distance learning
module for grade 9 for distance learning
 
Unit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 oUnit5-Cloud.pptx for lpu course cse121 o
Unit5-Cloud.pptx for lpu course cse121 o
 

Hong_Celine_ES_workshop.pptx

  • 3. Sequencers
      – Ion Torrent Proton
      – Illumina MiSeq
      – Illumina HiSeq 2500
      – Illumina X Ten
      – PacBio
  • 4. Sequencers
      Machine | Cost | Methods | Throughput per run | Read length (bp) | Error rate
      Illumina MiSeq | $128K | Small genomes, targeted genes | 1.5-2 Gb | 2x300 | 0.8%
      Ion Torrent | $80K | Small genomes, targeted genes | 1 Gb | 400 | 1.71%
      Illumina NextSeq | $250K | Exomes/transcriptomes | 120 Gb | 2x150 | 0.8%
      Illumina HiSeq | $654K | Genomes/exomes/transcriptomics | 600 Gb | 2x150 | 0.76%
      Illumina X Ten | $10 million | Genomes | 1.6 Tb | 2x150 | 0.5%
      PacBio | $695K | Genomes | 100 Mb | 15K | 12.86%
  • 5. Next-generation sequencing overview
      Feature | Exomes (ES/WES) | Genomes (GS/WGS)
      Cost | ~$1000 | ~$2000 (~$1000 with HiSeq X Ten)
      Size of BAM files | ~10 GB | ~200 GB
      DNA | Targeted and captured | Sheared DNA
      What you can get | Most coding regions (+UTR) | Coding and non-coding
      Variants that can be examined | SNVs, indels (CNVs) | SNVs, indels, CNVs, structural variations
  • 6. How exomes and genomes look!
      • Genome has even coverage
      • Even a deletion is observed by eye
      [Figure: exome (ES) and genome (GS) coverage, wild type (WT) vs. deletion]
  • 7. All about exomes
      • What is exome sequencing?
          – Sequencing of targeted exonic regions
          – ~2% of the genome
      • Important to know
          – You will NOT get a whole exome!
          – Not all exons in all genes are captured!
          – Distinguish a negative result from no data
      • Coverage will vary across targets
  • 8. What to know about your data
      • What is the sequence depth?
          – The average number of reads covering each targeted base
          – 10x, 30x, 50x, 100x
      • What is the read length?
      • What software was used to align the reads?
      • What variant caller was used to call variants?
      • Which reference was used?
      • Which capture kit was used? What is covered?
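As a rough illustration of what "30x" or "100x" means, mean depth can be estimated from the number of reads, the read length, and the size of the capture target. The sketch below is a back-of-the-envelope calculation only; the read count, on-target fraction, and target size are made-up example values, not figures from the workshop data.

```python
def mean_depth(num_reads, read_length, target_bases, on_target_fraction=1.0):
    """Approximate mean depth over a capture target.

    depth = (reads * read length * fraction of bases on target) / target size
    """
    if target_bases <= 0:
        raise ValueError("target_bases must be positive")
    if not 0.0 <= on_target_fraction <= 1.0:
        raise ValueError("on_target_fraction must be between 0 and 1")
    return num_reads * read_length * on_target_fraction / target_bases


# Hypothetical run: 40 million 150 bp reads, 60 Mb target, 70% of bases on target.
depth = mean_depth(40_000_000, 150, 60_000_000, on_target_fraction=0.7)
print(f"approximate mean depth: {depth:.0f}x")
```

A mean says nothing about evenness: because coverage varies across targets, per-target depth should still be checked before treating an absent variant as a negative result.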
  • 9. Partial list of capture enrichment kits
      Manufacturer | Kit | Regions targeted | Bases covered
      Illumina | Nextera Rapid Capture | Exons + UTRs + miRNA | 62 Mb
      NimbleGen | SeqCap EZ Exome | Exons + UTRs | 96 Mb
      NimbleGen | SeqCap EZ MedExome | Disease-associated regions | 47 Mb
      Agilent | SureSelect Human All Exon V6 | Exons + UTRs | 60 Mb
      Agilent | Clinical Research Exome | Disease-relevant targets | 51 Mb
  • 10. Overview of next-generation sequencing processing
      Sequence → Align reads/mapping → Variant calling → Annotate → Downstream analysis
  • 11. Popular tools
      Task | Popular tools
      Align reads/mapping | BWA-MEM, Novoalign, Isaac
      Variant calling | GATK (Broad), Platypus (Wellcome Trust), Starling (Illumina)
      Annotating variants | ANNOVAR, VEP, SnpEff
  • 12. Popular sources
      Task | Popular sources
      Control population frequency | ExAC, 1000GP, ESP
      Annotation | RefSeq, Ensembl, UCSC genes, GENCODE
      Visualization | UCSC Genome Browser, IGV
      Clinical relevance | HGMD, OMIM, CGD, ClinVar
  • 14. GATK Best Practices Website https://www.broadinstitute.org/gatk/guide/best-practices
  • 15. Introduction to exome sequencing analysis workflow
  • 16. ES pipeline overview
      • Pre-processing: Sequence → Alignment → Quality control
      • Variant discovery: Variant discovery → Quality control
      • Analysis: Annotate variants → Analyze
  • 17. Files and tools used
      File type | Origin
      FASTQ | Raw reads from sequencer
      SAM | Sequence Alignment/Map
      BAM | Binary version of SAM
      gVCF/VCF | Variant Call Format

      Tool | Purpose
      BWA-MEM | Read alignment to reference
      Picard | Mark duplicates
      GATK (HaplotypeCaller) | QC/variant calling
      SAMtools | Sort SAM/BAM, convert SAM<->BAM
  • 18. Reference genome
      • There are different versions of the human reference genome
      Reference name | Chr notation | Mitochondrial sequence | Additional sequences included
      GRCh37 (Genome Reference Consortium) | 1, 2, … X, Y, MT | Yes | Unlocalized, unplaced, alternate loci
      hg19 (UCSC Genome Browser) | chr1, chr2, … chrX, chrY, chrM | Copied from previous release | Unlocalized, unplaced, alternate loci
      b37/b37+decoy/hs37d5 (1000GP) | 1, 2, … X, Y, MT | Yes | Unlocalized, unplaced, “decoy” sequence, human herpesvirus 4 type 1
      • Unlocalized: chromosome known, exact location unknown
      • Unplaced: known to originate from the human genome, chromosome unknown
      • Alternate loci: alternate representation of specific human regions
  • 19. Index files
      • Index files are needed for files in next-generation sequencing analysis, because the files are big!
      • An index lets a program access the data efficiently, rather than having to read the whole file
      File type | Index
      FASTA | *.fai
      BAM | *.bai
      VCF | *.vcf.idx
  • 21. Why BWA + GATK HaplotypeCaller?
      • Widely accepted as the “conventional” way of processing next-gen data
      • Well assessed
      • Well documented
      • Software is supported
      • Community support for troubleshooting or information
  • 22. Data processing
      • Pre-processing
          – Align FASTQ to the reference genome (BWA-MEM), producing a SAM/BAM file
          – Sort mapped reads
          – Mark duplicates (Picard)
          – Realign indels (GATK IndelRealigner)
          – Recalibrate base quality scores (GATK BaseRecalibrator)
      • Variant discovery
          – Generate genotype likelihoods for EVERY base, gVCFs (GATK HaplotypeCaller)
          – Joint genotype, VCF (GATK GenotypeGVCFs)
          – Recalibrate variants (GATK VariantRecalibrator) OR apply hard filter (<30 samples)
      • Analysis
          – Variant classification annotation (ANNOVAR)
          – Variant effect annotation (ANNOVAR)
          – Filter variants to identify candidates of interest
  • 23. Pre-processing: Align
      • FASTQ record: sequence ID, sequence, sequence quality
      • BAM/SAM record: sequence ID, bitwise flags, chromosome, position, mapping quality (MAPQ), CIGAR, paired-end (mate) chromosome, paired-end (mate) position, observed template length
      • SAM specification: http://samtools.github.io/hts-specs/SAMv1.pdf
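To make the SAM fields listed on this slide concrete, here is a minimal sketch that splits one alignment line into its eleven mandatory columns and decodes the bitwise flag. The field names and flag bits follow the SAM specification linked above; the example read is invented for illustration, and real pipelines would normally use an established library rather than hand-parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # sequence ID
    flag: int    # bitwise flags
    rname: str   # chromosome
    pos: int     # 1-based leftmost position
    mapq: int    # mapping quality
    cigar: str
    rnext: str   # mate chromosome ("=" means same as rname)
    pnext: int   # mate position
    tlen: int    # observed template length
    seq: str
    qual: str


# Flag bits as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}


def parse_sam_line(line):
    """Parse the 11 mandatory columns of one SAM alignment line."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag):
    """Return the names of all flag bits that are set."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Invented example read on chromosome 16.
line = "read1\t99\t16\t46744672\t60\t10M\t=\t46744800\t138\tACGTACGTAC\tIIIIIIIIII"
record = parse_sam_line(line)
print(record.rname, record.pos, decode_flag(record.flag))
```

The `duplicate` bit (0x400) is the one set later in the pipeline when duplicates are marked.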
  • 24. Marking duplicates and why is it necessary? (Pre-processing QC; source: Broad Institute)
      • It is assumed that each read corresponds to an independent DNA fragment from randomly sheared DNA
      • However, PCR amplification can cause duplicates
          – Identify duplicates based on the start and stop positions of reads
          – Choose the best read and ignore the rest
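The "same start and stop, keep the best" idea can be sketched in a few lines. This is a simplified illustration, not Picard's implementation: the read dictionaries and their keys are hypothetical, and "best" is assumed here to mean the highest sum of base qualities.

```python
from collections import defaultdict


def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    reads: list of dicts with keys name, chrom, start, end, strand, quals
    (hypothetical structure for illustration). Reads sharing chromosome,
    start, end, and strand are treated as copies of one fragment.
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        groups[key].append(read)

    duplicates = set()
    for group in groups.values():
        # Keep the read with the highest total base quality, flag the rest.
        best = max(group, key=lambda r: sum(r["quals"]))
        duplicates.update(r["name"] for r in group if r is not best)
    return duplicates


reads = [
    {"name": "r1", "chrom": "16", "start": 100, "end": 250, "strand": "+", "quals": [30, 30, 30]},
    {"name": "r2", "chrom": "16", "start": 100, "end": 250, "strand": "+", "quals": [35, 36, 37]},
    {"name": "r3", "chrom": "16", "start": 100, "end": 250, "strand": "+", "quals": [20, 20, 20]},
    {"name": "r4", "chrom": "16", "start": 120, "end": 270, "strand": "+", "quals": [30, 30, 30]},
]
print(sorted(mark_duplicates(reads)))
```

Duplicates are only flagged, not deleted, so downstream tools can decide whether to skip them.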
  • 25. Indel realignment (Pre-processing QC; source: Broad Institute)
      • Misalignment around indels causes a high number of false SNP calls
      • These regions are identified and locally realigned to minimize mismatches
  • 26. Recalibrate Base Quality Scores (Pre-processing QC; source: Broad Institute)
      • Base quality scores are produced by sequencers
      • Quality scores are inaccurate and biased
          – Prone to various technical errors
          – Quality scores are often over- or under-estimated
      • Goal: identify and correct non-random technical error
          – Physics or chemistry of the sequencing reactions
          – Manufacturing flaws in the equipment
      • Error covariates, e.g.
          – Reported quality score
          – Position within the read (machine cycle)
          – Preceding and current nucleotide (sequencing chemistry)
  • 27. Recalibrate Base Quality Scores (Pre-processing QC; source: Broad Institute)
      • GATK BQSR builds a model based on a set of known variants
      • It then adjusts the base quality scores in the data based on the model
      [Figure: quality scores showing over-estimation and under-estimation]
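The idea behind recalibration can be illustrated with Phred arithmetic: a quality score Q corresponds to an error probability of 10^(-Q/10), so observed mismatch rates can be converted back into an empirical quality per covariate bin. The sketch below is a deliberately simplified stand-in for BQSR, not GATK's model: it uses only two covariates (reported quality and machine cycle), assumes mismatches at known variant sites were already excluded, and uses a simple +1/+2 smoothing chosen for illustration.

```python
import math
from collections import defaultdict


def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10 * math.log10(p)


def empirical_quality(observations):
    """Empirical quality per (reported quality, cycle) bin.

    observations: iterable of (reported_q, cycle, is_mismatch) tuples,
    assumed to exclude known variant sites.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [bases seen, mismatches]
    for reported_q, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle)]
        bin_counts[0] += 1
        bin_counts[1] += int(is_mismatch)
    # Smoothed mismatch rate (illustrative choice) converted back to Phred scale.
    return {key: error_to_phred((mm + 1) / (n + 2)) for key, (n, mm) in counts.items()}


# 998 bases reported as Q30 at cycle 1, 9 of which mismatch the reference.
obs = [(30, 1, i < 9) for i in range(998)]
table = empirical_quality(obs)
print(f"reported Q30 -> empirical Q{table[(30, 1)]:.1f}")
```

In this toy bin the sequencer over-estimated its accuracy: bases reported as Q30 (1 error in 1000) behave like Q20 (1 error in 100).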
  • 28. Now we are ready to call variants!
  • 29. Data processing (pipeline overview repeated from slide 22; Pre-processing ✔)
  • 30. Generate genotype likelihoods in each sample (gVCF) (Variant discovery; source: Broad Institute)
      • For a single sample, HaplotypeCaller calculates normalized Phred-scaled likelihoods (PL) for each genotype
      • The PL of a genotype is -10 × log10 of the likelihood of the data given that genotype, so a higher PL means a less likely genotype
      • PLs are normalized so that the most likely genotype has PL = 0
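The normalization described on this slide can be worked through numerically. The sketch below converts raw genotype likelihoods to the Phred scale and shifts them so the best genotype scores 0; the input likelihoods are invented, and the genotype quality (GQ) line assumes the usual convention that GQ is the second-smallest PL capped at 99.

```python
import math


def phred_scaled_likelihoods(likelihoods):
    """Normalized Phred-scaled likelihoods (PL).

    likelihoods: P(data | genotype) for each genotype, e.g. [0/0, 0/1, 1/1].
    """
    raw = [-10 * math.log10(p) for p in likelihoods]
    best = min(raw)
    return [round(value - best) for value in raw]


# Invented likelihoods for genotypes 0/0, 0/1, 1/1 at one site.
pl = phred_scaled_likelihoods([1e-5, 1e-1, 1e-8])
gq = min(sorted(pl)[1], 99)  # assumed convention: second-best PL, capped at 99
print("PL =", pl, "GQ =", gq)
```

Here the heterozygous genotype (0/1) has PL 0 and is therefore the call; the gap to the next-best genotype indicates how confident that call is.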
  • 31. What is joint genotyping? (Variant discovery; source: Broad Institute)
      • If we analyze Sample 1 or Sample N alone, we are not confident that the variant is real
      • If we see the variant in both samples, we are more confident that there is real variation at this site in this cohort
  • 32. Generate new variant quality score using VQSR (Variant discovery QC; source: Broad Institute)
      • What is Variant Quality Score Recalibration?
          – It does NOT adjust the existing scores!
          – It generates a new score, VQSLOD (variant quality score log-odds)
      • Approach
          – Machine learning to profile good variants vs. bad variants
          – Uses multiple dimensions (typically 5-8)
          – Uses the INFO annotations of each variant (e.g., allele count, allele frequency)
  • 33. Generate new variant quality score using VQSR (Variant discovery QC)
      • Train a model using a “truth” set of known variants
      • Apply the model to your samples
      • Keep the variants that pass; toss the rest
      • http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr
  • 34. Alternatively, IF you have few samples (Variant discovery QC)
      • Apply a hard filter!
      • Define how to filter your variants, or use the default filter parameters
          – QualByDepth (QD): 2.0
          – FisherStrand (FS): 60.0
          – RMSMappingQuality (MQ): 40.0
          – MappingQualityRankSumTest (MQRankSum): -12.5
          – ReadPosRankSumTest (ReadPosRankSum): -8.0 (only het calls)
      • http://gatkforums.broadinstitute.org/gatk/discussion/39/variant-quality-score-recalibration-vqsr
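A hard filter is simply a set of threshold tests applied to each variant's INFO annotations. The sketch below applies the SNP thresholds listed on this slide, with the comparison directions taken from the VariantFiltration expression used later in the hands-on section. The function name and the dictionary-based variant representation are illustrative assumptions, and skipping a missing annotation is a design choice made here, not a statement about how GATK treats missing values.

```python
# Threshold and direction per annotation: a variant FAILS when the test is true.
SNP_FILTERS = {
    "QD": ("<", 2.0),
    "FS": (">", 60.0),
    "MQ": ("<", 40.0),
    "MQRankSum": ("<", -12.5),
    "ReadPosRankSum": ("<", -8.0),
}


def failed_filters(info, filters=SNP_FILTERS):
    """Return the annotation names whose hard-filter test the variant fails.

    info: dict of INFO annotations for one variant (hypothetical structure).
    Annotations missing from `info` are skipped in this sketch.
    """
    failed = []
    for name, (op, threshold) in filters.items():
        value = info.get(name)
        if value is None:
            continue
        if (op == "<" and value < threshold) or (op == ">" and value > threshold):
            failed.append(name)
    return failed


good = {"QD": 12.0, "FS": 1.0, "MQ": 59.0, "MQRankSum": 0.1, "ReadPosRankSum": -0.4}
poor = {"QD": 1.5, "FS": 75.2, "MQ": 60.0}
print(failed_filters(good), failed_filters(poor))
```

MQRankSum and ReadPosRankSum are only computed for heterozygous calls, which is why a filter has to cope with their absence.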
  • 36. Annotating Variants using ANNOVAR (Variant annotation)
      • Gene annotation: RefSeq genes, mitochondrial variants, UCSC/Ensembl, GENCODE/CCDS
      • Region-based annotation: conserved genomic elements, transcription factor binding sites, cytogenetic bands, segmental duplications, GWAS, ...
      • Filter-based annotation: 1000GP, dbSNP, ESP, ExAC, non-synonymous variant annotation (SIFT/PolyPhen-2/MutationTaster/LRT/FATHMM/CADD, ...), ClinVar, ...
      • For a comprehensive list, see http://annovar.openbioinformatics.org/en/latest/
  • 39. Question
      • We have FASTQ files from a “sample”
      • Are there any deleterious variants in this person?
      • We’ll only be looking at chromosome 16
  • 40. Go to the class folder and make your folder
      • Copy the commands to your folder and open them for your convenience
      • TRICK FOR COPY/PASTE: open 2 terminal windows, one to view commands and one to RUN commands (will log in interactively)
      • Terminal 1 (to view commands): create your folder, copy the commands.sh file to your folder, and open it
          cd /hpcdata/scratch/2016_Exome_Training
          mkdir directory_name
          cd directory_name
          cp ../commands.sh .
          vi commands.sh
      • Terminal 2 (to run commands): log in interactively and go to your folder
          qrsh -l h_vmem=10G,mem_free=5G   # log in to an interactive node
          cd /hpcdata/scratch/2016_Exome_Training/directory_name
  • 41. Load modules
          module load GATK
          module load FastQC
          module load BWA
          module load SAMtools
          module load VCFtools
          module load IGV
          module load annovar/1.0
          module load BEDTools
          module load picard
          module load BCFtools
  • 42. QC on fastq/bams (bad quality)
  • 43. QC on fastq/bams (good quality)
  • 44. Step 1: QC and align a fastq file (Pre-processing)
      • Run FastQC on the FASTQ files
          – Generates an .html file with QC statistics
          – Generates compressed folders with images
          fastqc ../sample.fastq -o ./
  • 45. Step 2: align fastq files (Pre-processing)
      • Align with BWA-MEM, using -M to mark secondary alignments and -R to annotate read groups (e.g., different samples)
          – ID, LB, SM, PU, and PL tags are required
          bwa mem -R "@RG\tID:dad\tLB:dad\tSM:dad\tPU:FCC189PACXX\tPL:ILLUMINA" -M ../human_g1k_v37.fasta ../sample.fastq | gzip > ./sample.sam.gz
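The read group string is easy to get wrong because its fields must be separated by the literal two-character sequence `\t`, which bwa converts to tabs in the SAM header. A small helper like the one below can assemble it consistently across samples. The function is a hypothetical convenience, not part of bwa; the example values mirror the command on this slide, and rejecting whitespace inside values is a conservative assumption of this sketch.

```python
def read_group_header(rg_id, library, sample, platform_unit, platform="ILLUMINA"):
    """Build the string passed to `bwa mem -R`.

    Fields are joined with a backslash followed by "t" (two characters),
    which bwa itself turns into a real tab when writing the @RG header line.
    """
    fields = [("ID", rg_id), ("LB", library), ("SM", sample),
              ("PU", platform_unit), ("PL", platform)]
    for tag, value in fields:
        # Conservative check for this sketch: no empty or whitespace-containing values.
        if not value or any(ch.isspace() for ch in value):
            raise ValueError(f"read group tag {tag} must be non-empty and contain no whitespace")
    return "@RG" + "".join(f"\\t{tag}:{value}" for tag, value in fields)


header = read_group_header("dad", "dad", "dad", "FCC189PACXX")
print(header)
```

The printed string can be pasted inside the double quotes after `-R` on the bwa command line.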
  • 46. Step 3: sort sam, convert to bam (Pre-processing)
      • Aligned reads need to be sorted
          java -jar ${EBROOTPICARD}/picard.jar SortSam I=sample.sam.gz O=sample.bam SO=coordinate CREATE_INDEX=true
          samtools index sample.bam
      • Now we have mapped and sorted reads
  • 47. Step 4: Mark duplicates (Pre-processing)
      • Use Picard to mark duplicates
          java -jar ${EBROOTPICARD}/picard.jar MarkDuplicates INPUT=sample.bam OUTPUT=sample.dedup.bam AS=true CREATE_INDEX=true M=sample.metrics.txt
      • How many reads were marked as duplicates?
  • 48. Step 5: Realign indels (Pre-processing)
      • Create a target list of potential indel sites
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T RealignerTargetCreator -R ../human_g1k_v37.fasta -I sample.dedup.bam -o sample.intervals -L ../S03723314_Regions_chr16.fix.bed
      • Let’s visualize a potential indel region!
  • 49. Step 5: Realign indels, viewing a potential indel site (Pre-processing)
      • Use samtools tview to view the BAM file
          samtools tview sample.dedup.bam ../human_g1k_v37.fasta
      • Press "g" to open the "Go to" prompt, type 16:46744672, and press Enter
  • 50. Step 5: Realign indels (Pre-processing)
      • Realign reads at the sites in the target list
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T IndelRealigner -R ../human_g1k_v37.fasta -I sample.dedup.bam -targetIntervals sample.intervals -o sample.realigned.bam -L ../S03723314_Regions_chr16.fix.bed
  • 51. Step 6: Recalibrate base QS (Pre-processing)
      • Build a model
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T BaseRecalibrator -R ../human_g1k_v37.fasta -knownSites ../dbsnp_138.b37.vcf -knownSites ../Mills_and_1000G_gold_standard.indels.b37.vcf -I sample.realigned.bam -L ../S03723314_Regions_chr16.fix.bed -o sample.recal_report.grp
      • Recalibrate scores
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T PrintReads -R ../human_g1k_v37.fasta -I sample.realigned.bam -o sample.recal.bam -BQSR sample.recal_report.grp -L ../S03723314_Regions_chr16.fix.bed
  • 52.
  • 53. Data processing (pipeline overview repeated from slide 22; Pre-processing ✔)
  • 54. Step 1: Generate gVCFs (Variant discovery; source: Broad Institute)
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../human_g1k_v37.fasta -I sample.recal.bam -stand_call_conf 30.0 -stand_emit_conf 10.0 -o sample.g.vcf -ERC BP_RESOLUTION -L ../S03723314_Regions_chr16.fix.bed
  • 55. Step 2: genotype gVCFs (Variant discovery; source: Broad Institute)
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T GenotypeGVCFs -R ../human_g1k_v37.fasta --max_alternate_alleles 2 -stand_call_conf 30 -stand_emit_conf 10 --variant sample.g.vcf -o sample.vcf
  • 56. Step 3: Recalibrate variants with VQSR (>30 samples) (Variant discovery; source: Broad Institute)
      • Build model
      • Apply model
      • Take the output and do indel recalibration (see the command line)
  • 57. Step 3: Apply hard filter (<30 samples) (Variant discovery; source: Broad Institute)
      • Extract SNPs
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T SelectVariants -R ../human_g1k_v37.fasta -L ../S03723314_Regions_chr16.fix.bed -V sample.vcf -selectType SNP -o sample_SNPs.vcf
      • Apply filters to SNPs
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T VariantFiltration -R ../human_g1k_v37.fasta -L ../S03723314_Regions_chr16.fix.bed -V sample_SNPs.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "my_snp_filter" -o sample.filtered_SNPs.vcf
  • 58. Step 3: Apply hard filter (<30 samples) (Variant discovery; source: Broad Institute)
      • Extract indels
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T SelectVariants -R ../human_g1k_v37.fasta -V sample.vcf -selectType INDEL -o sample.indels.vcf
      • Apply filters to indels
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T VariantFiltration -R ../human_g1k_v37.fasta -V sample.indels.vcf --filterExpression "QD < 2.0 || FS > 200.0 || ReadPosRankSum < -20.0" --filterName "my_indel_filter" -o sample.filtered_indels.vcf
  • 59. Step 3: Apply hard filter (<30 samples) (Variant discovery; source: Broad Institute)
      • Merge the two VCF files
          java -jar $EBROOTGATK/GenomeAnalysisTK.jar -T CombineVariants -L ../S03723314_Regions_chr16.fix.bed --variant:indel sample.filtered_indels.vcf --variant:snps sample.filtered_SNPs.vcf -R ../human_g1k_v37.fasta -o sample.filtered_SNP_indels.vcf -genotypeMergeOptions PRIORITIZE -priority snps,indel
  • 60. Data processing (pipeline overview repeated from slide 22; Pre-processing ✔, Variant discovery ✔)
  • 61. Step 1: Annotate variants. Variant analysis: variant classification annotation (ANNOVAR), variant effect annotation (ANNOVAR), then filter variants to identify candidates of interest.
      – We will use ANNOVAR to annotate the VCF file
      – First normalize the VCF with bcftools: split multiallelic sites, then left-align indels against the reference
        bcftools norm -m-both -o trio_filtered_SNP_indels_step1.vcf trio_filtered_SNP_indels.vcf
        bcftools norm -f ../human_g1k_v37.fasta -o trio_filtered_SNP_indels_step2.vcf trio_filtered_SNP_indels_step1.vcf
  • 62. Step 1: Annotate variants
      – Gene annotation: amino acid change, classification of mutation
      – Population frequency: ESP 6500, 1000 Genomes Project
      – SNP annotation (snp138)
      – Deleteriousness annotation for non-synonymous variants: SIFT, PolyPhen, LRT, MutationTaster, FATHMM, PROVEAN, VEST3, CADD, DANN, FATHMM-MKL, etc.
      http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#overview
  • 63. Step 1: Annotate variants
      – Run table_annovar.pl with gene-based (refGene) and filter-based (ESP 6500, ljb26) annotations:
        table_annovar.pl sample.filtered_SNP_indels_step2.vcf -buildver hg19 -out sample.filtered_snps.annotated.vcf -remove -protocol refGene,esp6500siv2_all,ljb26_all -operation g,f,f -nastring . -vcfinput /sysapps/cluster/software/annovar/1.0/humandb/
      – Database downloads: http://annovar.openbioinformatics.org/en/latest/user-guide/download/
  • 64. Now we have an annotated VCF file! (Variant analysis)
  • 65. Now what? What do you do with all these variants?
  • 66. Variant Analysis… like finding a needle in a ‘deep’ haystack
  • 67. Look for evidence of variants of interest
  • 68. Further filters needed • High number of variants • The goal is to narrow down your list of variants • Eliminate variants that are not interesting [Figure: overlap of 17,687 SNPs with in-house exome, dbSNP and 1000 Genomes variants; 2 novel in chr10. Source: PLoS One. 2012;7(1):e29708]
  • 69. Things to consider. Filter based on: • Population frequency / novel variants • Synonymous vs non-synonymous • Exonic vs intronic • Genes of interest (understand the gene) • Clinically relevant genes (OMIM, HGMD) • Predictions (quality of variants, deleteriousness) • Study phenotype (literature search)
  • 70. Step 2: Filter variants
      – Keep the header line plus nonsynonymous mutations:
        awk 'NR == 1 || /nonsynonymous/' sample.filtered_snps.annotated.vcf.hg19_multianno.txt > sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.txt
      – Keep the header line plus variants with ESP population frequency < 0.01 (column 11 in this output):
        awk -F"\t" 'NR == 1 || $11 < 0.01' sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.txt > sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.0.01.txt
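The two filters above depend on a fixed column position, which shifts whenever the annotation protocol changes. The sketch below looks columns up by header name instead and applies both filters in one pass. The toy file, its rows and the cut-down column set are invented for illustration; `ExonicFunc.refGene` and `esp6500siv2_all` are the header names ANNOVAR is expected to write for the protocols used here, so check them against your own file first.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
toy="$workdir/toy.hg19_multianno.txt"   # hypothetical file name

# Invented rows in a cut-down multianno layout (real files have more columns).
{
  printf 'Chr\tStart\tRef\tAlt\tExonicFunc.refGene\tesp6500siv2_all\n'
  printf '16\t1001\tG\tC\tnonsynonymous SNV\t.\n'
  printf '16\t2002\tA\tG\tsynonymous SNV\t0.0005\n'
  printf '16\t3003\tC\tT\tnonsynonymous SNV\t0.2500\n'
  printf '16\t4004\tT\tA\tnonsynonymous SNV\t0.0020\n'
} > "$toy"

rare_nonsyn=$(awk -F'\t' -v max_af=0.01 '
  NR == 1 {
    # Map header names to column numbers, then keep the header line.
    for (i = 1; i <= NF; i++) col[$i] = i
    f = col["ExonicFunc.refGene"]; e = col["esp6500siv2_all"]
    print; next
  }
  # Keep nonsynonymous variants that are absent from ESP (".") or rare.
  $f ~ /^nonsynonymous/ && ($e == "." || $e + 0 < max_af)
' "$toy")

printf '%s\n' "$rare_nonsyn"
rm -r "$workdir"
```

The rows at positions 1001 (absent from ESP) and 4004 (frequency 0.002) survive. The explicit `"."` test makes the treatment of variants with no recorded frequency visible rather than leaving it to awk's string comparison.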
  • 71. Step 2: Filter variants
      – Download the ANNOVAR text file to your local computer and open it in Excel
      – Open a terminal on your computer:
        cd Desktop
        sftp username@ai-submit1.niaid.nih.gov
      – Enter your password when prompted
      – Go to your working directory and fetch the file:
        cd /hpcdata/scratch/2016_Exome_Training/XXXX
        get sample.filtered_snps.annotated.vcf.hg19_multianno.nonsynonymous.0.01.txt
      – The file should now be on your desktop; open it with Excel
  • 72. Step2: Filter variants Variant analysis http://annovar.openbioinformatics.org/en/latest/user-guide/filter/
  • 73. Open the file in Excel (Variant analysis)
      – Examine variants that are potentially disease-causing
      – Check how rare the variants are
      – Look at the prediction scores
      – Candidate: chr16:23456431 G>C, exonic, in gene COG7
          – Good quality variant
          – SIFT score: 0.04 (considered deleterious)
          – Polyphen2_HDIV: 0.999 (probably damaging)
          – CADD: 27.8 (within the top 1% of most damaging variants)
      – Is this gene known to be disease-related? (OMIM, ClinVar, HGMD)
      – Check the frequency in the ExAC database
      – Probably a good candidate to follow up!
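The "top 1%" reading of the CADD score can be made precise. Scaled (PHRED-like) CADD scores are defined as -10·log10 of a variant's rank fraction, so a score of 20 corresponds to the top 1% and each further 10 points to another tenfold step. A small sketch of the conversion, assuming the 27.8 on the slide is a scaled score (`cadd_top_percent` is a hypothetical helper name):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: convert a scaled CADD score to "top X percent".
# Relies on the PHRED-like definition: score = -10 * log10(rank fraction).
cadd_top_percent() {
  awk -v s="$1" 'BEGIN { printf "%.2f\n", 100 * 10 ^ (-s / 10) }'
}

cadd_top_percent 20     # the usual "top 1%" cut-off
cadd_top_percent 27.8   # the COG7 candidate on the slide
```

This prints 1.00 and 0.17, so the candidate sits at roughly the top 0.17% of scored substitutions, comfortably inside the top 1%.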
  • 74. Follow-up • Do a literature search! – THIS IS A MUST! – It is time-consuming, but needed to interpret variants correctly
  • 76. Additional Analysis • Trio/Family Based Analysis – PhaseByTransmission (GATK) – GEMINI https://gemini.readthedocs.org/en/latest/ – pVAAST • Somatic Variation – Tumor/normal – MuTect http://archive.broadinstitute.org/cancer/cga/mutect • Association Studies • Copy number variation analysis
  • 77. GWAS association studies • A typical analysis: – Identify SNPs where one allele is significantly more common in cases than in controls – Hardy-Weinberg equilibrium check – Chi-square test http://en.wikipedia.org/wiki/Genome-wide_association_study http://www.ebi.ac.uk/gwas/
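Both tests named on this slide reduce to short chi-square calculations. The sketch below works them through on invented counts. It is a toy illustration of the arithmetic, not a replacement for a dedicated association tool, and the helper names `allelic_chisq` and `hwe_chisq` are assumptions.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Allelic association: chi-square for a 2x2 table of allele counts.
# Arguments: case_A case_a control_A control_a
allelic_chisq() {
  awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" 'BEGIN {
    n = a + b + c + d
    chi = n * (a * d - b * c) ^ 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    printf "%.2f\n", chi
  }'
}

# Hardy-Weinberg check: chi-square of observed genotype counts against
# the counts expected from the estimated allele frequency.
# Arguments: n_AA n_Aa n_aa
hwe_chisq() {
  awk -v aa="$1" -v ab="$2" -v bb="$3" 'BEGIN {
    n = aa + ab + bb
    p = (2 * aa + ab) / (2 * n); q = 1 - p
    e_aa = p * p * n; e_ab = 2 * p * q * n; e_bb = q * q * n
    chi = (aa - e_aa) ^ 2 / e_aa + (ab - e_ab) ^ 2 / e_ab + (bb - e_bb) ^ 2 / e_bb
    printf "%.3f\n", chi
  }'
}

allelic_chisq 60 40 40 60   # invented allele counts: cases vs controls
hwe_chisq 50 40 10          # invented genotype counts: AA, Aa, aa
```

With these counts the allelic statistic is 8.00, above the 3.84 critical value for one degree of freedom at the 0.05 level, while the Hardy-Weinberg statistic is 0.227, so this toy SNP shows no departure from equilibrium. A real GWAS applies a far stricter threshold because of the number of SNPs tested.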
  • 78. GWAS Manhattan plot • Only applicable to large cohorts
  • 79. Analysis tools for genotype-phenotype association from sequencing data • Plink-seq - http://atgu.mgh.harvard.edu/plinkseq/ • EPACTS - http://genome.sph.umich.edu/wiki/EPACTS • SNPTEST v2 / GRANVIL - http://www.well.ox.ac.uk/GRANVIL/
  • 80. CNVs from exome • Read depth is highly variable in exomes • CNV prediction is a very challenging problem – High false-positive rate – Breakpoint resolution is limited by the targets – Long-range CNVs have higher positive predictive value • Useful when other alternatives (SNP arrays or aCGH) are not available • Lots of prediction tools! Some examples (population and somatic callers): XHMM, CoNIFER, EXCAVATOR, ExomeDepth, CONTRA, ADTEx, ExomeCNV, VarScan2, Control-FREEC
  • 81. Popular utility for exome sequencing: XHMM (eXome Hidden Markov Model) http://atgu.mgh.harvard.edu/xhmm/tutorial.shtml
  • 82. Suggestions • Request a Helix/Biowulf account – Programs are already installed – You can request that additional programs be installed
  • 84. Hands-on tutorials • https://github.com/niaid/ACE/tree/master/DNASeq • Gemini: https://gist.github.com/oleraj/cd33616a29bf56c62c63e68c788f3d72 • Old: /hpcdata/scratch/2016_Exome_Training/commands.sh

Editor's notes

  1. Finding CNVs from exome data provides additional information.
  2. Part 1. Double-stranded genomic DNA is fragmented by sonication. Linkers are then attached to the DNA fragments, which are then hybridized to a capture microarray designed to target only the exons. Part 2. Target exons are enriched, eluted and then amplified by ligation-mediated PCR. Amplified target DNA is then ready for high-throughput sequencing.
  3. Done in 2 steps: 1) determine the (small) suspicious intervals that are likely in need of realignment (see the RealignerTargetCreator tool); 2) run the realigner over those intervals (IndelRealigner).
  4. Empirical quality based on known variants from publicly available databases
  5. This is especially useful for low-quality calls. It identifies false negatives: a borderline-quality call that is 1) a known variant in the population and 2) present in other individuals in the cohort would otherwise have been missed. It identifies false positives: a borderline-quality call that is absent from both the population and the cohort can be removed. (A high-quality call is kept, since it may be a rare variant.)
  6. Hard-filter annotations and thresholds:
      – QualByDepth (QD) 2.0: the variant confidence (from the QUAL field) divided by the unfiltered depth of non-reference samples.
      – FisherStrand (FS) 60.0: Phred-scaled p-value from Fisher’s Exact Test to detect strand bias in the reads (the variation being seen on only the forward or only the reverse strand). More bias is indicative of false-positive calls.
      – RMSMappingQuality (MQ) 40.0: the root mean square of the mapping quality of the reads across all samples.
      – MappingQualityRankSumTest (MQRankSum) -12.5: the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities (reads with reference bases vs. those with the alternate allele). It cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. it is only applied to heterozygous calls.
      – ReadPosRankSumTest (ReadPosRankSum) -8.0: the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele. If the alternate allele is only seen near the ends of reads, this is indicative of error. It cannot be calculated for sites without a mixture of reads showing both the reference and alternate alleles, i.e. it is only applied to heterozygous calls.
  7. “Trick to copy/paste”: in terminals, students can highlight text with the mouse and press the middle button to paste it, with no separate copy step. Very convenient.
  8. If you have family data or case/control data, you can eliminate further variants.
  9. Exclude SNPs in Hardy-Weinberg disequilibrium
  10. Also VarB - http://bioinformatics.oxfordjournals.org/content/early/2012/09/13/bioinformatics.bts557.full.pdf
  11. The top figure shows a clear, good-quality CNV candidate. The bottom figure shows a noisy region, where CNV prediction can be challenging.