Ryan Poplin - Sources of Bias
1. Understanding sources of bias and
error from a prospective Reference
Material (NA12878)
Ryan Poplin, on behalf of the
Genome Sequencing and Analysis Group
Program in Medical and Population Genetics
August 16, 2012
2. NA12878 is a wonderful reference sample!
• Unrestricted cell lines
• Extensive pedigree available
• Extensively sequenced and genotyped at the Broad and elsewhere
– All Broad techs (both production and experimental)
– Fosmids
– Many library designs and sample prep protocols
3. Our framework for variation discovery
Phase 1: NGS data processing (typically by lane)
• Input: raw reads
• Mapping
• Local realignment
• Duplicate marking
• Base quality recalibration
• Output: analysis-ready reads
Phase 2: Variant discovery and genotyping (typically multiple samples simultaneously, but can be a single sample alone)
• Input: analysis-ready reads for Sample 1 through Sample N
• SNPs
• Indels
• Structural variation (SV)
• Output: raw variants
Phase 3: Integrative analysis
• Input: raw indels, raw SNPs, raw SVs
• External data: pedigrees, known variation, population structure, known genotypes
• Variant quality recalibration
• Genotype refinement
• Output: analysis-ready variants
DePristo, M., Banks, E., Poplin, R. et al. (2011) A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet.
4. Lots of work required to turn raw sequencing reads into something that is useful
Phase 1: NGS data processing
Input: raw reads
Desired properties of analysis-ready reads:
• Mapping: unbiased sampling of alleles; calibrated mapping quality scores
• Local realignment: indels have correct and consistent alignment in reads
• Duplicate marking: duplicate molecules shouldn't count as extra evidence for event
• Base quality recalibration: calibrated base quality scores for base substitutions, base insertions, and base deletions
Output: analysis-ready reads
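To make "calibrated base quality scores" concrete, here is a minimal sketch of the idea behind recalibration: compare the quality each base reports with the mismatch rate actually observed at sites not known to be variant. The function names, the input layout, and the add-one smoothing are assumptions for illustration only; this is not the GATK implementation, which also stratifies by further covariates.

```python
import math
from collections import defaultdict


def phred(error_rate):
    """Convert an error probability to a Phred-scaled quality score."""
    return -10.0 * math.log10(error_rate)


def empirical_qualities(observations):
    """Tabulate empirical quality per reported quality.

    observations: iterable of (reported_quality, is_mismatch) pairs,
    assumed to come only from sites not known to be variant, so that
    a mismatch to the reference is treated as a sequencing error.
    Returns {reported_quality: empirical_quality}.
    """
    counts = defaultdict(lambda: [0, 0])  # reported Q -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    # Add-one smoothing (an assumption of this sketch) keeps the
    # estimated error rate away from exactly zero.
    return {q: phred((mm + 1) / (n + 2)) for q, (mm, n) in counts.items()}


# Hypothetical data: bases reported as Q30 that mismatch 9 times in 98.
obs = [(30, True)] * 9 + [(30, False)] * 89
table = empirical_qualities(obs)
print(table)  # reported Q30 behaves like roughly Q10 here
```

In this toy input the reported Q30 bases are badly overconfident, which is the kind of miscalibration the recalibration step is meant to correct.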
5. Indels have correct and consistent alignment in reads through multiple sequence local realignment
Phase 1: NGS data processing (raw reads, mapping, local realignment, duplicate marking, base quality recalibration, analysis-ready reads)
[Figure: Effect of MSA on alignments, NA12878, chr1:1,510,530-1,510,589. Panels: 1,000 Genomes Pilot 2 data, raw MAQ alignments; 1,000 Genomes Pilot 2 data, after MSA; HiSeq data, raw BWA alignments; HiSeq data, after MSA. Labeled variants: rs28782535, rs28783181, rs28788974, rs34877486.]
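The idea behind local realignment can be sketched as follows: reads that span an indel are often placed without a gap by the mapper, leaving a run of mismatches at one end. If a single candidate haplotype carrying the indel explains all the reads with fewer mismatches than the reference, the reads can be realigned consistently against it. The code below is a toy illustration under stated assumptions: one candidate deletion, ungapped read placement, and plain mismatch counts. All function names are hypothetical, and this is not the GATK realigner.

```python
def mismatches(read, haplotype, start):
    """Count mismatches for an ungapped placement of read at start."""
    window = haplotype[start:start + len(read)]
    if len(window) < len(read):
        return len(read)  # read runs off the end: treat as all mismatches
    return sum(1 for a, b in zip(read, window) if a != b)


def apply_deletion(ref, pos, length):
    """Return the haplotype with ref[pos:pos + length] deleted."""
    return ref[:pos] + ref[pos + length:]


def haplotype_scores(ref, reads, pos, length):
    """Total mismatches of all reads against the reference and against
    the candidate deletion haplotype.

    reads: list of (start, sequence), where start is the 0-based
    reference position of the mapper's ungapped placement.
    """
    alt = apply_deletion(ref, pos, length)
    ref_score = alt_score = 0
    for start, seq in reads:
        ref_score += mismatches(seq, ref, start)
        # On the deletion haplotype the read may sit at the same offset
        # (anchored on its left part) or shifted left by the deletion
        # length (anchored on its right part); take the better of the two.
        candidates = [s for s in (start, start - length) if s >= 0]
        alt_score += min(mismatches(seq, alt, s) for s in candidates)
    return ref_score, alt_score


def prefers_deletion(ref, reads, pos, length):
    """True if the deletion haplotype explains the reads better."""
    ref_score, alt_score = haplotype_scores(ref, reads, pos, length)
    return alt_score < ref_score


# Hypothetical example: a 3-base deletion at position 12.
ref = "GATTACAGGCTTAACCGTGCATCGAAGTCC"
reads = [
    (4, "ACAGGCTTCGTG"),   # spans the deletion, left part placed correctly
    (11, "GCTTCGTGCATC"),  # spans the deletion, right part placed correctly
    (18, "GCATCGAAGT"),    # lies entirely to the right of the deletion
]
print(haplotype_scores(ref, reads, 12, 3))
print(prefers_deletion(ref, reads, 12, 3))
```

The two spanning reads disagree with the reference in different places when placed without a gap, yet both fit the same deletion haplotype exactly, which is the sense in which realignment makes indel alignments "correct and consistent" across reads.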