1. Karen Miga presented on reaching complete telomere-to-telomere chromosome assemblies using long-read sequencing.
2. The current human reference genome is incomplete, with gaps and unresolved issues remaining.
3. Miga and collaborators generated a high-quality, telomere-to-telomere assembly of the X chromosome for the CHM13 genome, using Oxford Nanopore ultra-long reads totaling 155 Gb of sequence data.
6. New Era in Genetics and Genomics
We are finally reaching complete, high-quality
telomere-to-telomere chromosome assemblies
Human reference genome is incomplete.
• 368 unresolved issues, 102 gaps
• Segmental duplications, gene families, satellite
arrays, centromeres, rDNAs
• Uncharacterized sequence variation in the human
population
[Figure: chr21. Labels: "Our current understanding of genome biology and function" (30 Mb) and "~20 Mb ?"]
7. Challenge:
Generating assemblies across repetitive regions that
span hundreds of kilobases.
[Figure: repeats (100 kb+) with unique variants marked]
Can high-coverage ultra-long sequencing resolve
complete assemblies of the human genome?
9. It’s time to finish the human genome
The Telomere-to-Telomere (T2T) consortium is an
open, community-based effort to generate the
first complete assembly of a human genome.
10. Our target: CHM13hTERT
Cell line from Urvashi Surti, Pitt; SKY karyotype from Jennifer Gerton and Tamara Potapova, Stowers
N=46; XX
12. Intramural Sequencing Center
CHM13 Sequencing
94 MinION/GridION flow cells
11.1M reads
155 Gb (1.6 Gb / flow cell) (50x)
99 Gb in reads >50 kb (32x)
78 Gb in reads >70 kb (25x)
Max mapped read length 1.04 Mb
From May 1, 2018 to Jan 8, 2019
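The coverage figures in parentheses follow from dividing sequence yield by genome size. The sketch below reproduces that arithmetic; the 3.1 Gb genome size is an assumption (the slide does not state the value used), and the function and variable names are illustrative only.

```python
# Assumed haploid human genome size in Gb; not stated on the slide.
GENOME_SIZE_GB = 3.1


def coverage(total_gb: float, genome_gb: float = GENOME_SIZE_GB) -> float:
    """Mean depth of coverage: sequenced bases divided by genome size."""
    return total_gb / genome_gb


# Yields taken from the slide, in Gb.
yields_gb = {"all reads": 155, "reads >50 kb": 99, "reads >70 kb": 78}
depths = {label: round(coverage(gb)) for label, gb in yields_gb.items()}

# Average yield per flow cell: 155 Gb over 94 flow cells.
per_flow_cell = 155 / 94

for label, depth in depths.items():
    print(f"{label}: ~{depth}x")
print(f"per flow cell: ~{per_flow_cell:.1f} Gb")
```

Under that assumed genome size, the rounded depths agree with the 50x, 32x, and 25x quoted on the slide.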
13. Intramural Sequencing Center
CHM13 Sequencing
50x Nanopore ultra-long: contig building
60x PacBio: polishing
50x 10x Genomics: polishing
BioNano: structural validation
14. Roadmap for completing the genome
Canu assembly:
• 2.94 Gbp assembly, NG50: 75 Mbp
• Exceeds the continuity of the reference
genome GRCh38 (56 Mbp NG50
contig size).
• Subset of chromosome assemblies
break only at centromere.
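For readers unfamiliar with the metric, NG50 is the length of the contig at which the running total of contig lengths, taken longest first, reaches half of the genome size (rather than half of the assembly size, as in N50). The sketch below uses toy numbers, not the CHM13 contigs; the function name is illustrative.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    Contigs are visited longest first; the NG50 is the length of the
    contig at which the cumulative length reaches half the genome size.
    Returns 0 if the assembly covers less than half of the genome.
    """
    half = genome_size / 2
    total = 0
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


# Toy example: four contigs against an assumed genome size of 100.
toy_contigs = [40, 30, 20, 10]
print(ng50(toy_contigs, genome_size=100))
```

Because the denominator is the genome size, two assemblies of the same genome can be compared directly even when their total lengths differ.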
26. Marker-assisted mapping
Create a scaffold of unique, or
single-copy, k-mers genome-wide
Adam Phillippy, Arang Rhie, Sergey Koren
@NanoporeConf | #NanoporeConf
28. Confident mapping of long reads
using a single-copy k-mer strategy
Identify and mark all sites of unique anchors across the chromosome (chrX)
• 21-mers that appear ~c times in Illumina data
• Also found in PacBio/Nanopore reads
• Less frequent in the centromere, but still there
• (Validated with Duplex-Seq)
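The selection criteria above can be sketched as a k-mer counting step followed by a depth filter. This is a minimal illustration, not the consortium's pipeline: it assumes c denotes the expected depth of a single-copy k-mer, the tolerance band is a hypothetical parameter, and reverse-complement canonicalization is omitted for brevity. The toy example uses k=3 instead of 21.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def single_copy_kmers(counts, c, tolerance=0.5):
    """Keep k-mers whose count lies within c * (1 +/- tolerance).

    `tolerance` is a hypothetical parameter chosen for illustration.
    """
    lo, hi = c * (1 - tolerance), c * (1 + tolerance)
    return {kmer for kmer, n in counts.items() if lo <= n <= hi}


# Toy short-read counts: "AAA" looks like an error k-mer (too rare),
# "TTT" looks repetitive (too frequent) at an expected depth of c = 10.
illumina_counts = Counter({"AAA": 2, "ACG": 10, "CGT": 11, "TTT": 40})
markers = single_copy_kmers(illumina_counts, c=10)

# Second criterion from the slide: the k-mer is also seen in long reads.
long_read_counts = count_kmers(["ACGT", "ACGT"], k=3)
confirmed = markers & set(long_read_counts)
print(sorted(confirmed))
```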
29. Confident mapping of long reads
using a single-copy k-mer strategy
Filter long-read alignments: retain those with unique k-mer anchoring
[Figure: chrX (two panels)]
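One simplified way to picture the filtering step: keep an alignment only if its reference interval covers at least one single-copy marker position. The sketch below makes that simplification explicit. The `Alignment` record, the `min_markers` parameter, and the coordinates are hypothetical, and a real filter would also check that the read carries the marker k-mer itself rather than merely overlapping its position.

```python
from bisect import bisect_left
from typing import NamedTuple


class Alignment(NamedTuple):
    """Hypothetical minimal alignment record."""
    read: str
    start: int  # 0-based, inclusive reference coordinate
    end: int    # exclusive reference coordinate


def anchored(alignments, marker_positions, min_markers=1):
    """Retain alignments spanning at least `min_markers` marker positions."""
    positions = sorted(marker_positions)
    kept = []
    for aln in alignments:
        # Number of markers p with aln.start <= p < aln.end.
        n = bisect_left(positions, aln.end) - bisect_left(positions, aln.start)
        if n >= min_markers:
            kept.append(aln)
    return kept


marker_positions = [100, 5_000, 60_000]
alignments = [
    Alignment("a", 0, 200),          # covers the marker at 100
    Alignment("b", 6_000, 50_000),   # falls between markers, no anchor
    Alignment("c", 4_000, 61_000),   # covers the markers at 5,000 and 60,000
]
kept_names = [aln.read for aln in anchored(alignments, marker_positions)]
print(kept_names)
```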
30. Spacing of single-copy k-mers can be irregular in
repeat-dense regions
[Figure: chrX, with the X centromere array highlighted]
CENX: 3.1 Mbp
Number of k-mers: 2,034
Spacing N50: 6,879 bp
Longest distance between k-mers: 53,798 bp
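The spacing statistics can be derived from a sorted list of marker positions. The sketch below assumes "Spacing N50" is computed like a contig N50 but over inter-marker distances, which is an interpretation rather than something the slide states; the positions are toy values.

```python
def marker_gaps(positions):
    """Distances between consecutive marker positions."""
    ordered = sorted(positions)
    return [b - a for a, b in zip(ordered, ordered[1:])]


def n50(lengths):
    """Length at which the running total, longest first, reaches half the sum."""
    half = sum(lengths) / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0


# Toy marker positions along an array (not the CENX data).
positions = [0, 10, 30, 60, 100]
gaps = marker_gaps(positions)
spacing_n50 = n50(gaps)
longest_gap = max(gaps)
print(spacing_n50, longest_gap)
```

The longest gap is the figure that matters most for mapping: a read has to be longer than that distance to be guaranteed at least one anchor inside the array.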
32. GAGE pre-polishing
ChrX GAGE array: 19 tandemly arrayed ~9.4 kb repeats
[Plot: coverage (0 to 250) versus base position, showing the most frequent base and the second most frequent base (error)]
33. GAGE with marker-assisted polishing
ChrX GAGE array: 19 tandemly arrayed ~9.4 kb repeats
[Plot: coverage (0 to 250) versus base position, showing the most frequent base and the second most frequent base (error)]
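The two series in these plots are, at each position, the read count supporting the most frequent base and the count supporting the runner-up. A minimal sketch of that tally is shown below, assuming each pileup column is available as a non-empty string of observed bases; the input representation and function name are illustrative, not taken from the talk.

```python
from collections import Counter


def top_two(column):
    """Return (most frequent base, its count, count of the second base).

    `column` is a non-empty string of the bases that reads show at one
    reference position. The second count is 0 when all reads agree.
    """
    ranked = Counter(column).most_common(2)
    top_base, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0
    return top_base, top_count, second_count


# Toy pileup: three positions, six reads each.
pileup = ["AAAAAA", "CCCCCT", "GGGGTT"]
profile = [top_two(column) for column in pileup]
for position, (base, first, second) in enumerate(profile):
    print(position, base, first, second)
```

A position where the second count is a sizeable fraction of the first is a candidate for a consensus error or for reads from different repeat copies piling up together.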
42. Finished T2T X Chromosome:
High Accuracy and High Continuity
1. Structurally validated assembly from telomere-to-telomere. Including
3.1 Mb tandem repeat at the X centromere and providing a complete
assessment across tandemly repeated gene families.
2. Novel polishing strategy capable of improving the quality of large repeat-
rich regions. Demonstrating dramatic improvements in quality over the
entirety of the X chromosome.
3. Statistics of CHM13 full length BAC alignments to polished assembly:
Assembly   BACs aligned    Median QV   Mean QV
UL-asm     275/341 (81%)   37.4        27.9
HiFi       153/341 (45%)   37.7        27.4
Vollger M, Logsdon G, et al. bioRxiv doi.org/10.1101/635037
46. Triple the coverage, what changed?
• Minimal change in continuity
• 79.5 Mbp (rel2) vs. 71.8 Mbp (rel3) NG50
• Don’t judge assemblies based on continuity
• Tricky regions are fixed
• GAGE and more SegDups automatically resolved
• Improved BAC validation
• 288 (rel2) vs. 310 (rel3) of 341 BACs resolved
• 1 chromosome down, 23 to go…
47. Goal of a complete human genome in the next two
years.
Challenges in front of us:
• Acrocentric p-arms
• Large segmental duplications
• Classical Human satellites 2,3
Establishing new benchmarking standards (XChr)
Pioneering new pipelines: Polishing, repeat assembly, and array
structural validation.
Setting the bar higher for quality and completeness.
Editor's notes
KEY POINT HERE: spacing of unique variants… Some regions are easier than others….
Number of k-mers: 2,034
Spacing N50: 6,879
Longest distance: 53,798 bp
Median BAC QV 37.4 (mean QV 28.0) vs. median QV 37.6 (mean QV 27.4) for the best CHM13 HiFi assembly. The assembly resolves 85% of BACs at >99.8% identity vs. 54% for the prior PacBio assembly.
Total BACs: 341
Compressed: 166 1. Median: 99.9895 (QV: 39.78811). Mean: 99.8706 (QV: 28.88052).
Mitchell HiFi: 153 1. Median: 99.9827 (QV: 37.61954). Mean: 99.81871 (QV: 27.41627).
UL + 10x: 275 1. Median: 99.982 (QV: 37.44727). Mean: 99.84145 (QV: 27.99832).
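The QV values in these notes are consistent with the standard Phred-scale conversion from percent identity, QV = -10 · log10(1 - identity). The short sketch below applies that conversion to the identities listed above; the function name is illustrative, and small differences in the last decimals can arise where the listed identity was rounded.

```python
import math


def identity_to_qv(identity_percent: float) -> float:
    """Convert percent identity to a Phred-scaled quality value.

    Raises ValueError for 100% identity, where the error rate is zero.
    """
    error_rate = 1 - identity_percent / 100
    return -10 * math.log10(error_rate)


# Median and mean percent identities from the notes.
identities = {
    "Compressed median": 99.9895,
    "Mitchell HiFi median": 99.9827,
    "UL + 10x median": 99.982,
    "UL + 10x mean": 99.84145,
}
qvs = {label: identity_to_qv(pct) for label, pct in identities.items()}
for label, qv in qvs.items():
    print(f"{label}: QV {qv:.2f}")
```

The scale is logarithmic, so the gap between a median QV near 37 and a mean QV near 28 reflects roughly an eightfold difference in error rate, driven by a minority of poorly matching BACs.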