ASHG sequencing workshop
1. A unique targeted sequencing service providing meaningful
results, not insurmountable data
Dr. Mike Evans — Chief Executive
2. Outline of presentation
• Delivering a unique next generation sequencing service —
Dr Mike Evans, CEO
• Optimised bait design for targeted sequencing — Dr Volker Brenner,
Head of Computational Biology
• Adding value through analysis — Dr Volker Brenner, Head of
Computational Biology
• Summary
• Q&A
3. OGT - provides advanced clinical genetics solutions
- develops innovative molecular diagnostics
• Founded by Ed Southern in 1995
• 64 people
OGT Begbroke: Corporate offices and high-throughput labs
OGT Southern Centre: Biomarker discovery
4. OGT’s key businesses
Technologies for Molecular Medicine
• IP Licensing: 40 licence relationships
• Diagnostic Biomarkers: genomic- and protein-based diagnostics
• Clinical and Genomic Solutions: cytogenetics products and genomic services
5. Clinical and Genomic Solutions
Addressing the challenges of high-throughput, high-resolution
molecular technologies:
• High equipment and staff training costs
• Short equipment lifespan
• Complex study design and processes (e.g. platform evaluation &
selection)
• Vast amounts of data
• Extensive computing infrastructure
• Data analysis expertise and resource
The solution: Genefficiency Genomic Services
6. Genefficiency™ — World’s leading aCGH service
High-quality data & complete reassurance
• Experimental and array design expertise
• High-throughput processing (>2000 samples / week)
• Applications: aCGH-CNV, methylation, miRNA, gene expression
analysis
• Comprehensive data analysis services
• >40 QC checks on each sample to ensure high-quality data
7. Independent accreditations
• First Agilent High-Throughput Microarray Certified
Service Provider
• ISO 9001:2008 — Quality management systems
FS 561156
• ISO 27001:2005 — Information security
IS 561157
• ISO 17025:2005 — aCGH Laboratory services
4593
8. Customer satisfaction…
“In order to characterise genetic variants,
reproducible performance and reliable processing
of the high resolution microarrays is essential. We
were pleased with OGT’s responsive approach
and attention to producing high quality data to tight
deadlines”
Dr Matt Hurles, Wellcome Trust Sanger Institute.
20,000 samples. 1,000 samples / week
10. A world-class team
Our expert team deliver:
• Excellent project management and customer service
• >600 projects to date
• >50,000 samples
• Unparalleled expertise in study and probe design
• Advanced data analysis through a dedicated team of
bioinformaticians
• Rapid turnaround times
• A wealth of experience of clinical and translational
research projects
12. Delivering discovery
Genefficiency Targeted Sequencing Services — designed to be different:
• Comprehensive — taking you from genomic DNA to filtered, qualified results
• Rigorously designed — project and probe design expertise maximises your
likelihood of discovery
• Expert support — experienced team of biologists and bioinformaticians
• Dedication to quality — from sample to result, delivering reliable results
every time
13. Delivering an integrated, comprehensive service
1. Selection of most appropriate genomic regions for enrichment
2. Capture, sample multiplexing and sequencing
3. Data analysis and advanced filtering of variants
27/10/2011 13
14. Delivering expert project design
Step 1: Selection of most appropriate genomic regions for your project
and budget
Whole exome
• Pre-designed, validated whole exome capture probes
• Coding regions are “most likely” candidates for many disorders
Custom genomic regions
• Expert custom design of capture probes for your regions of interest
• Flexibility to focus on regions of clinical significance or GWAS regions
15. Delivering class-leading technology
We have fully optimised the DNA capture and sequencing
methodologies, so you don’t have to!
Step 2: Performing the capture, sample multiplexing, library
preparation and sequencing
• Options for sample indexing and multiplexing to minimise
sequencing cost
• Depth of sequencing coverage to suit your samples and project
• Paired-end sequencing on the industry-leading Illumina HiSeq 2000
16. OGT delivers discovery, not just data
Step 3: Data analysis and advanced filtering of variants
• OGT’s dedicated analysis pipeline takes you beyond data, to a
filtered list of variants relevant to your study
SEQUENCE → FILTER → DISCOVER
17. Genefficiency Targeted Sequencing Services
The PLATFORM
• Core sequencing platform: Illumina HiSeq 2000
• Core sequence capture technology: Agilent SureSelect
The PEOPLE
• Team of highly skilled molecular biologists and bioinformaticians
• Core expertise in probe design
• Successful development of advanced analysis solutions
20. Definitions and terminologies
• Read length — The number of bases sequenced in a fragment
• Capture efficiency (figure: off-target and on-target reads relative to the region of interest)
• Paired end sequencing (figure: fragment 1 and fragment 2)
• Read depth — How many times has a base been sequenced?
21. Read depth required for mutation detection
Assuming no allelic bias, the theoretical read depth required to detect
heterozygous variation with a given accuracy can be calculated using a
binomial distribution
Calculations based on variation being seen in at least 2 reads
• Should not be just one read as this could be ‘noise’
• Required observations could be a percentage of reads
Depth Required   Het. Call Accuracy   Probability of Error   Quality
11               99%                  1:100                  Q20
14               99.9%                1:1000                 Q30
18               99.99%               1:10000                Q40
25               99.999%              1:100000               Q50
• Minimum capacity required = Region of interest (ROI) x required depth
• Q30 variant detection for 15Kb ROI requires 210Kb sequencing capacity
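The binomial calculation above can be sketched as follows. This is a minimal illustration, assuming a heterozygous allele fraction of 0.5 and the "at least 2 reads" rule from the slide; the function names are hypothetical and not part of any OGT tool. Under these assumptions the sketch reproduces the Q20, Q30 and Q40 depths in the table; for Q50 it gives a smaller depth (22) than the 25 listed, so that row may rest on a slightly different assumption.

```python
from math import comb


def prob_missed(depth, min_reads=2, allele_fraction=0.5):
    """Probability that fewer than min_reads reads carry the variant allele."""
    return sum(
        comb(depth, k) * allele_fraction**k * (1 - allele_fraction) ** (depth - k)
        for k in range(min_reads)
    )


def min_depth(max_error, min_reads=2, allele_fraction=0.5):
    """Smallest read depth whose miss probability is at most max_error."""
    depth = min_reads
    while prob_missed(depth, min_reads, allele_fraction) > max_error:
        depth += 1
    return depth


def required_capacity(roi_bases, depth):
    """Minimum sequencing capacity = region of interest x required depth."""
    return roi_bases * depth


for label, max_error in [("Q20", 1e-2), ("Q30", 1e-3), ("Q40", 1e-4)]:
    print(label, min_depth(max_error))

# 15 Kb region of interest at the Q30 depth of 14 reads
print(required_capacity(15_000, 14))
```

The last line corresponds to the 15Kb example: 15,000 bases x 14 reads = 210,000 bases of capacity.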
23. Why use targeted enrichment?
Flexibility in choice of genomic loci
• Allows capture of specific regions of interest for SNP and Indel detection
Cost Effectiveness
• Ideal for clinical applications
• Specific candidate genes are targeted
• Fine mapping post-GWAS
• Cost Benefits
• Enables multiplexing to fill capacity
Streamlined Data Analysis
• Reduced noise due to targeted specificity
24. Example of design bias — Insufficient coverage
Targeted gene sequencing can lead to some targets without the
required depth of coverage
(Figure: coverage plot marking targets with inadequate coverage against the 14x (Q30) threshold.)
*data kindly provided by C. Mattocks, National Genetics Reference Lab, Salisbury, UK
25. Solution: Intelligent design to improve coverage
Option 1:
• Increase coverage by increasing depth of sequencing
• Coverage of all targets proportionally increased
• Increased cost of sequencing overall
• Some bases still missed (Q30)
Option 2:
• Intelligent design of capture probes increases coverage of under-represented loci
• More even coverage of entire region, no loci missed (more likely to find
mutations present)
• No need to increase sequence depth (more cost effective)
27. Problems facing users
• Design tools not user friendly
• Design tools only good for draft design
• Potential sources of bias
• Regions of interest too short
• Bait thermodynamic behaviour
• GC content
• Melting Temperature
• Risk of Design Errors
• OGT’s extensive experience in designing probes for microarrays
allows us to minimise bias and ensure evenness of coverage giving
the best chance to identify mutations
28. OGT’s design pipeline — what we need from you
• Regions of Interest
• Gene lists
• Chromosomal locations
• Genome build version
• Data file format
• Text, Excel, etc.
• Consistent e.g. chr1: 2247628-2248537
1. Data → 2. Draft Design → 3. Singletons → 4. Thermodynamics → 5. Report
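As an illustration of why a consistent region format matters, the sketch below parses regions written like the example above (chr1: 2247628-2248537). The regular expression, the function names and the 1-based inclusive length convention are assumptions made for this example, not a description of OGT's pipeline.

```python
import re

# Accepts "chr1: 2247628-2248537" with or without the space after the colon.
REGION_RE = re.compile(r"^(chr[0-9A-Za-z]+):\s*(\d+)-(\d+)$")


def parse_region(text):
    """Return (chromosome, start, end) or raise ValueError for bad input."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised region format: {text!r}")
    chrom, start, end = match.group(1), int(match.group(2)), int(match.group(3))
    if start > end:
        raise ValueError(f"start is after end in region: {text!r}")
    return chrom, start, end


def region_length(region):
    """Length in bases, assuming 1-based inclusive coordinates."""
    _, start, end = region
    return end - start + 1


region = parse_region("chr1: 2247628-2248537")
print(region, region_length(region))
```

Rejecting anything that does not match the agreed format early is one way to reduce the risk of design errors later in the pipeline.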
29. Run draft design
• Assess the output:
• Coverage
• Bait distribution
• Repeat masking
(Figure: region of interest and repeat masking.)
30. Custom baits improve coverage at region boundaries
(Figure panels: OGT, 1KG.)
OGT custom bait design gives increased read depth around edges of target regions.
31. Correction for singleton baits
• Review the draft design and identify any regions covered by a
single bait
• These regions span less than 120 bases
• Add additional singleton baits to the design
(Figure: design before and after adding singleton baits.)
• This ensures that small regions are captured as well as large
regions
• Advantage — Improves evenness of capture across the design
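The singleton check described above can be sketched as a simple length filter. The 120-base threshold comes from the slide; the tuple layout, the function name and the 1-based inclusive coordinates are assumptions for illustration only, and how many extra baits to add per region is left to the designer.

```python
BAIT_LENGTH = 120  # from the slide: singleton regions span less than 120 bases


def singleton_regions(regions, bait_length=BAIT_LENGTH):
    """Return regions short enough to be covered by a single bait.

    Each region is (chromosome, start, end) with 1-based inclusive coordinates.
    """
    return [
        (chrom, start, end)
        for chrom, start, end in regions
        if (end - start + 1) < bait_length
    ]


draft_regions = [
    ("chr1", 1000, 1079),   # 80 bases: singleton
    ("chr1", 5000, 5499),   # 500 bases: covered by several baits
    ("chr2", 200, 318),     # 119 bases: singleton
]
for region in singleton_regions(draft_regions):
    print("add singleton baits for", region)
```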
32. Custom approach ensures variant detection
(Figure panels: OGT, 1KG.)
Even at more than 50x coverage, whole exome sequencing does not accurately
identify all SNPs.
OGT custom baits design compared with 1000 Genomes whole exome capture data.
33. Correction for bait thermodynamics
GC content
• Calculate GC content for all baits
• Identify those baits where GC content is extreme (for instance >65% or <40%)
• Add additional copies of these baits
Tm
• Calculate the Tm for all baits
• Identify those baits where Tm is extreme (e.g. > 75 °C)
• Add additional copies of these baits
(Figure: region of interest with GC-extreme and Tm-extreme baits marked.)
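A minimal sketch of the GC check, using the example thresholds from the slide (>65% or <40%). The function names and the dictionary of bait sequences are hypothetical. A Tm check would follow the same pattern with whichever melting-temperature model the designer prefers; none is assumed here.

```python
def gc_fraction(sequence):
    """Fraction of G and C bases in a bait sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


def gc_extreme_baits(baits, low=0.40, high=0.65):
    """Names of baits whose GC content falls outside [low, high].

    baits maps a bait name to its sequence; thresholds follow the slide's example.
    """
    return [
        name
        for name, sequence in baits.items()
        if gc_fraction(sequence) > high or gc_fraction(sequence) < low
    ]


example_baits = {
    "bait_gc_rich": "GC" * 60,    # 120 bases, 100% GC
    "bait_balanced": "ACGT" * 30, # 120 bases, 50% GC
    "bait_at_rich": "AT" * 60,    # 120 bases, 0% GC
}
print(gc_extreme_baits(example_baits))
```

Baits returned by this check are the candidates for additional copies in the design.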
34. OGT custom bait designs help overcome GC issues
(Figure panels: OGT, SureSelect.)
In a region with 70% GC content OGT custom bait design achieved a maximum read
depth of 50x.
The Agilent SureSelect 50Mb capture kit does not capture any reads in this region.
35. OGT custom bait designs help overcome GC issues
(Figure panels: OGT, SureSelect.)
Relative capture of targets within a single gene. Agilent coverage is 20x for the target with no GC
content bias, and minimal for targets with a GC content of 65%.
In contrast OGT custom baits perform excellently in this region.
36. Customer report
• Design Parameters
• Depth of Coverage
• On target / Off target
• Regions not covered – and why not
• Bait Details
• Singletons
• GC distribution
• Tm distribution
• Library Design
• Baits generated
37. Summary
• Custom design of regions for targeted sequencing offers
significant flexibility for many applications
• Expert probe design will ensure:
• Better ‘evenness’ of coverage, which helps ensure no regions are
missed and maximises the likelihood of variant detection
• Improved overall capture efficiency and on-target performance,
which makes downstream sequencing more cost effective
• Increased capture efficiency of SNPs and Indels, which increases
the likelihood of detection
• Reduction of risk and better performance
38. Adding value through analysis
• Introduction
• NGS data analysis
• Primary analysis
• Mapping and assembly
• Q score re-calibration
• NGS sequencing QC
• NGS alignment QC
• Secondary analysis
• SNP and Indel calling
• Annotation and evaluation pipeline
• SIFT and PolyPhen
• Deliverables
• Case study
• Summary
39. The analysis challenge
Sequencer → Hard drive with ~4Gb per exome → Publication
NGS → Raw data → Mapping → Annotation → Filtering → Reporting
40. Raw data: FASTQ
(standard text representation of short reads)
FASTQ uses four lines per sequence.
• Line 1: '@' followed by a sequence identifier
• Line 2: raw sequence letters
• Line 3: '+' (and optional sequence identifier)
• Line 4: quality values for the sequence in Line 2. Must contain the same number of
symbols as letters in the sequence.
(The letters encode Phred Quality Scores from 0 to 93 using ASCII 33 to 126)
Example
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
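The four-line layout and the quality encoding can be checked with a short sketch like the one below. It assumes the ASCII offset of 33 stated above; the function name is hypothetical and the record is the example from the slide.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into (identifier, sequence, scores)."""
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("line 1 must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("line 3 must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("quality string must match sequence length")
    # Phred score = ASCII code of the symbol minus 33
    scores = [ord(symbol) - 33 for symbol in quality]
    return header[1:], sequence, scores


record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
seq_id, sequence, scores = parse_fastq_record(record)
print(seq_id, len(sequence), min(scores), max(scores))
```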
41. Phred quality scores
• Phred is an accurate base-caller used for capillary traces (Ewing et al.,
Genome Research 1998)
• Each called base is given a quality score Q
• Quality based on simple metrics (such as peak spacing) calibrated against a
database of hand-edited data
• QPhred = -10 * log10(estimated probability call is wrong)
Phred Quality Score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
Q30 often used as a threshold for useful sequence data
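The relationship QPhred = -10 * log10(p) and its inverse can be written as two small helper functions. The function names are chosen for this example only.

```python
import math


def phred_to_error(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred score for an estimated error probability p."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    error = phred_to_error(q)
    print(f"Q{q}: error probability {error:g}, accuracy {100 * (1 - error):g} %")
```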
43. Primary analysis — Mapping and alignment
Raw Sequence Files (FASTQ Format) → Mapping (BWA/Bowtie) → Raw Alignment Files (SAM/BAM Format) → Local Realignment around InDels (GATK) → Duplicate marking (Picard) → Quality score re-calibration (Picard) → Analysis-ready Alignment (SAM/BAM Format)
44. Why mark duplicates and realign around indels?
3 incorrect calls within 40bp!
46. NGS variant calling methods
Option 1 - Hard filtering
Example: SNP can only be called if
• read depth >10
• >35% of reads carry SNP
+ Effective filtering
+ Transparent to user
– Simplistic approach
– Will miss high quality calls that don’t pass threshold
Option 2 - Statistical analysis
Based on quality scores of individual base pairs, the alignment and statistical probability models
+ Robust
+ Optimum balance of sensitivity and specificity due to the use of statistical models
+ Fewer false positive and false negative SNP calls
– Requires correctly pre-processed data with reliable quality scores
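Option 1 can be sketched in a few lines, using the example thresholds above (read depth >10 and >35% of reads carrying the SNP). The function and parameter names are hypothetical. The last call shows the limitation noted on the slide: a site at depth 10 fails the filter however strongly the reads support the variant.

```python
def passes_hard_filter(read_depth, variant_reads, min_depth=10, min_fraction=0.35):
    """Hard filter: call a SNP only if depth and variant fraction exceed thresholds."""
    if read_depth <= min_depth:
        return False
    return variant_reads / read_depth > min_fraction


print(passes_hard_filter(read_depth=20, variant_reads=8))   # 40% of 20 reads
print(passes_hard_filter(read_depth=20, variant_reads=5))   # 25% of 20 reads
print(passes_hard_filter(read_depth=10, variant_reads=9))   # depth not above 10
```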
47. Base quality score re-calibration
Before Recalibration After Recalibration
Source: The Broad Institute
http://www.broadinstitute.org/files/shared/mpg/nextgen2010/nextgen_poplin.pdf
48. Primary analysis — Raw data and assembly QC
Raw Sequence Files (FASTQ Format) → Sequence QC check (FastQC) → Raw data QC Report
Raw Sequence Files (FASTQ Format) → Mapping (BWA/Bowtie) → Raw Alignment Files (SAM/BAM Format) → Local Realignment around InDels (GATK) → Duplicate marking (Picard) → Quality score re-calibration (Picard) → Analysis-ready Alignment (SAM/BAM Format)
Alignment QC check (Picard) → Alignment QC Report
49. Secondary analysis
SNP and Indel calling, annotation and filtering
Analysis-ready alignment (SAM/BAM Format) → Unified Genotyper (GATK) → SNPs and InDels (VCF Format) → Variant Evaluation (OGT) → Comprehensive interactive OGT Report
(Also shown: Sequence QC Report, Alignment QC Report.)
Variant evaluation:
• Known variant?
• Impact on gene expression?
• Splicing affected?
• Non-synonymous or frameshift mutation?
• Impact on protein function?
• How confident are we in the call?
• Zygosity?
50. SNP/Indel classification
(standard analysis)
We check and annotate every single detected SNP and Indel against all human
Ensembl genes and transcripts and dbSNP
dbSNP annotation:
• Is the variant known?
• Obtain allele frequency
Does it affect any of the following
• Promoter region
• UTR
• Splice sites or intronic region
• CDS
• Synonymous mutation
• Non-synonymous mutation
• Frameshift mutation
• Stop codon (truncated/elongated protein sequence)
• Overlap with protein domain
• Consequence on protein function predicted (SIFT & PolyPhen)
51. OGT Processing Overview
(Flowchart; the layers of the original slide overlap in the extracted text. Recoverable content:)
Individual Genome Analysis (Standard Level)
• Gather all detected SNP/Indels
• Mapped to promoter regions
• Mapped to exons, splice sites or UTRs
• Not described in dbSNP: non-synonymous coding variations and protein domains; variations with serious consequences to the protein sequence (SIFT)
• Described in dbSNP: rare RS ID variations
Multi Genome Analysis, Data Gathering and Comparison (Advanced Level)
• Perform pairwise genome analysis
• Filter out variants present in “baseline” genome (e.g. somatic tissue, healthy sibling)
• Filter out variants present in any “baseline” exome (e.g. somatic tissue, healthy sibling) AND not all “case” exomes
Tailored analysis based on client’s individual requirements (Expert Level)
• Additional filtering and analysis
• Study-specific additional in-depth filtering and analysis
Data → Information
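The baseline comparison in the advanced level can be sketched with set operations. This is one reading of the slide: keep variants that appear in every “case” exome and in no “baseline” exome. The variant representation (chromosome, position, reference, alternate) and the function name are assumptions for illustration, and the example variants are invented.

```python
def filter_against_baseline(case_exomes, baseline_exomes):
    """Keep variants shared by all case exomes and absent from every baseline exome.

    Each exome is an iterable of (chromosome, position, ref, alt) tuples.
    """
    case_sets = [set(exome) for exome in case_exomes]
    if not case_sets:
        return set()
    shared_by_cases = set.intersection(*case_sets)
    seen_in_baseline = set().union(*(set(exome) for exome in baseline_exomes))
    return shared_by_cases - seen_in_baseline


# Invented example data: two affected individuals and one healthy sibling
patient_1 = [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A"), ("chr3", 300, "A", "G")]
patient_2 = [("chr1", 100, "C", "T"), ("chr2", 200, "G", "A")]
healthy_sibling = [("chr2", 200, "G", "A")]

print(filter_against_baseline([patient_1, patient_2], [healthy_sibling]))
```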
52. NGS data delivery
(Figure: data shipped on hard drive (or FTP); double click the file location to open the comprehensive HTML analysis report and share results.)
68. Customer data: Analysis of consanguineous samples
(Pedigree figure: generations I and II, individuals 1 and 2 in each.)
HACE1, exon 11: c.994C>T, R332X (CGA -> TGA)
Data courtesy of Dr. Bernd Wollnik, Institute of Human Genetics, University Hospital of Cologne
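The notation above can be traced with a short worked example: coding position 994 falls on the first base of codon 332, and changing that base from C to T turns CGA (arginine, R) into TGA (stop, X). The sketch assumes coding positions are numbered from 1 at the start codon; the helper names and the two-entry codon table are for illustration only.

```python
# Only the two codons needed for this example; a full table has 64 entries.
CODON_TABLE = {"CGA": "R", "TGA": "X"}


def codon_position(cds_position):
    """Return (codon number, 0-based offset within the codon) for a 1-based CDS position."""
    return (cds_position - 1) // 3 + 1, (cds_position - 1) % 3


def apply_substitution(codon, offset, new_base):
    """Replace the base at the given offset of a codon."""
    return codon[:offset] + new_base + codon[offset + 1:]


codon_number, offset = codon_position(994)
mutant_codon = apply_substitution("CGA", offset, "T")
label = f"{CODON_TABLE['CGA']}{codon_number}{CODON_TABLE[mutant_codon]}"
print(codon_number, offset, mutant_codon, label)
```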
69. Confirmation by Sanger sequencing
(Figure: Sanger traces for Control, Mother, Father, Patient1 and Patient2 across residues H V F R I G P, with R332X marked as X. Protein domain map: ANK1 69-161, ANK1 168-258, HECT 602-909.)
Data courtesy of Dr. Bernd Wollnik, Institute of Human Genetics, University Hospital of Cologne
70. Customer feedback...
Analysis of Consanguineous Samples
“Just wanted to let you know that we have probably identified the
causative gene and mutation in the patient sample.
The mutation is located in the middle of an 18 Mb homozygous
stretch and is a homozygous nonsense mutation!!!
Wow, its going so nicely with your data!!!”
Dr. Bernd Wollnik, Institute of Human Genetics,
University Hospital of Cologne
71. Summary
OGT offers fast, accurate & powerful NGS analysis
Standard Analysis
• Robust statistical data analysis
• Comprehensive variant annotation
• Interactive filtering and prioritisation of data based on
• chromosomal region
• allele frequency / novelty
• zygosity
• confidence score and read depth
• severity of mutation
Advanced Analysis
• Multi-genome comparison
Bespoke analysis
• Tailored to your specific requirements
73. Speak to one of our team or visit booth 713 to:
• Book a demonstration of our interactive analysis
report (hurry, limited availability)
• Discuss your specific project requirements
• Take part in our short survey and have your
chance to win an Amazon Kindle