1) De novo genome assembly and phasing methods can reconstruct complete haplotype information from sequenced reads without relying on a reference genome.
2) Trio binning uses k-mer profiling of parents' genomes to separate child's reads into maternal and paternal bins before assembly, producing fully phased haplotig assemblies.
3) The human pan-genome project aims to build a collection of diverse, high-quality haplotype-resolved genomes from different populations using trio binning to improve representation of genetic variations.
Haplotype-Resolved Genome Assemblies
1. Arang Rhie
Adam Phillippy’s Group
Genome Informatics Section, Computational and Statistical Genomics Branch, NHGRI
De Novo Assembly of Haplotype-Resolved Genomes
and Building a Human Pan-Genome Reference
@ArangRhie
3. The diploid genome assembly problem
[Diagram: diploid genome, sequencing, sequenced reads, assembling, smashed assembly, phasing, phased (haploid) assembly. Result shown: pseudo-haplotype + alts.]
De novo: from scratch, without looking at the original picture (reference)
5. Asian-specific insertions and their frequencies, found in AK1
Under-Represented Variations in GRCh38
Seo, Rhie, Kim, and Lee et al., De novo assembly and phasing of a Korean human genome, Nature (2016)
6. Identify haplotype differences
A
B
• CYP2D6 is involved in metabolizing >50% of available drugs
• Genetic variation and copy number affect drug efficacy
CYP2D6*10: Intermediate ~ poor metabolizer
CYP2D6*2: Extensive metabolizer
Seo, Rhie, Kim, and Lee et al., De novo assembly and phasing of a Korean human genome, Nature (2016)
Chr. 22
7. Can we phase across the whole chromosomes?
Seo, Rhie, Kim, and Lee et al., De novo assembly and phasing of a Korean human genome, Nature (2016)
9. The diploid genome assembly problem
[Diagram: diploid genome, sequencing, sequenced reads, assembling, smashed assembly, phasing, phased (haploid) assembly. Result shown: complete haplotypes.]
De novo: from scratch, without looking at the original picture (reference)
10. The diploid genome assembly problem
[Diagram: diploid genome, sequencing, two sets of phased reads, each set assembled separately into a paternal assembly and a maternal assembly.]
De novo: from scratch, without looking at the original picture (reference)
11. Trio binning with parental k-mers
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
• K-mer profiling of each parent (Illumina, 60x): paternal k-mers, maternal k-mers
• K-mer profiling of the child (PacBio, 120x)
• Child's read binning and assembling (canu)
Read bins shown on the slide:
Paternal reads: 49.6% (67.3x), 10.9 kb
Maternal reads: 49.3% (66.9x), 11.7 kb
Unassigned: 1.1% (1.4x), avg 1.3 kb
Output: paternal haplotigs, maternal haplotigs
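The binning step above can be illustrated with a minimal Python sketch. It is not the canu implementation: the function names are hypothetical, the reads are toy strings, and the scoring rule (hits normalized by the size of each parent-specific k-mer set) is stated here as an assumption. A real pipeline would also filter low-count k-mers to remove sequencing errors, which is omitted.

```python
# Toy sketch of trio binning; names and scoring rule are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq, k):
    """Yield each k-mer in its canonical form (min of forward and reverse complement)."""
    for i in range(len(seq) - k + 1):
        fwd = seq[i:i + k]
        rev = fwd.translate(COMPLEMENT)[::-1]
        yield min(fwd, rev)

def parent_specific_kmers(paternal_reads, maternal_reads, k):
    """Return (paternal-only, maternal-only) k-mer sets."""
    pat = {km for r in paternal_reads for km in canonical_kmers(r, k)}
    mat = {km for r in maternal_reads for km in canonical_kmers(r, k)}
    return pat - mat, mat - pat

def bin_read(read, pat_only, mat_only, k):
    """Assign a child read to 'paternal', 'maternal', or 'unclassified'."""
    pat_hits = mat_hits = 0
    for km in canonical_kmers(read, k):
        if km in pat_only:
            pat_hits += 1
        elif km in mat_only:
            mat_hits += 1
    # Assumed scoring: normalize by the number of parent-specific k-mers
    pat_score = pat_hits / max(len(pat_only), 1)
    mat_score = mat_hits / max(len(mat_only), 1)
    if pat_score > mat_score:
        return "paternal"
    if mat_score > pat_score:
        return "maternal"
    return "unclassified"
```

Each bin would then be assembled independently, so no read from one parental haplotype can be merged with reads from the other.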
12. Robust for a wide range of heterozygosity
Heterozygosity levels shown on the slide*: 0.8%, 1.2%, 1.6%, 0.9%, 1.5%; 0.12%, 0.20%, 0.29%, 0.17%
*Heterozygosity level estimated with GenomeScope

Sample                Platform              Haplotype   Coverage   NG50 (Mb)
NA12878 (CEU) F       PacBio (WashU)        Maternal    32+9x      1.2
                                            Paternal    31+9x      1.2
HG00733 (PUR) F       PacBio 60kb (20kb)    Maternal    44.6x      19.1
                                            Paternal    43.6x      23.9
NA19240 (YRI) F       PacBio (WashU)        Maternal    37x        9.0
                                            Paternal    31x        3.0
HG002 (Ashkenazi) M   PacBio 15kb CCS       Maternal    11+8x      20.1
                                            Paternal    11+8x      16.8
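For readers unfamiliar with the NG50 metric reported on this slide, the following Python sketch shows the standard definition: the contig length at which the cumulative length of contigs, sorted longest first, reaches half of the estimated genome size. The function name and the toy numbers are illustrative, not taken from the talk.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly, or 0 if it covers less than half the genome."""
    half = genome_size / 2
    total = 0
    # Walk contigs from longest to shortest, accumulating length
    for length in sorted(contig_lengths, reverse=True):
        total += length
        if total >= half:
            return length
    return 0
```

Unlike N50, which uses the total assembly length, NG50 uses a fixed genome size, so values are comparable across assemblies of the same genome.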
14. Human Pan-Genome Project
Population: http://www.internationalgenome.org/
Initiative to collect diverse, high-quality haplotypes with trio binning
• Illumina WGS for the parents, PacBio and Nanopore for the child
• Pilot 10 trios selected to maximize non-ref haplotype AF
2 PUR
1 KHV
3 ACB
1 MSL
1 PJL
1 GWD
1 CLM
5 African
3 American
1 East Asian
1 South Asian
15. What can you see from a phased assembly?
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
16. Phasing the MHC region
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
Maternal
Paternal
17. Summary
• Diploid assembly is solved by trios
Trio binning is current best practice
All levels of assembly quality improved
Complete haplotypes will become the new norm
• A human pan-genome reference
A collection of diverse, high-quality haplotypes
Including complex heterozygous SVs
18. VGP GenomeArk: 1st data release
https://vgp.github.io/genomeark
Jennifer Vashon of Maine Department of Inland Fisheries and
Wildlife, left, and UMass lynx team coordinator, Tanya Lama,
with an adult male lynx from northern Maine whose DNA was
used to create first-ever whole genome for the species. The
lynx has since been released to the wild. (MassWildlife photo
/ Bill Byrne)
19. Acknowledgements
genomeinformatics.github.io
• Adam Phillippy
• Sergey Koren
• Brian Walenz
• Alexander Dilthey
• Brian Ondov
• Jay Ghurye
Korean (AK1)
Jeong-Sun Seo
Changhoon Kim
Junsoo Kim
Sangjin Lee
Tim Smith
John Williams
Cattle/pigs
Pan-Genome
Karen Miga
Benedict Paten
NIH NHGRI NISC
VGP Assembly
Working Group
Erich Jarvis
Richard Durbin
Gene Myers
Kerstin Howe
Harris Lewin
Olivier Fedrigo
Shane McCarthy
Martin Pippel
Will Chow
Joana Damas
PacBio CCS
Michael Hunkapiller
Paul Peluso
David Rank
We are hiring!
Trio binning is available in https://github.com/marbl/canu
21. Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
Pseudo-haplotype + alts
Complete haplotypes
Assembly Graph
Smashed haplotypes
22. Trio-binning outperforms FALCON-Unzip
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
Primary = Longest path in the graph (pseudo-hap)
Alternate haplotigs = Alternate path in the bubble
Haplotigs = Contigs in each assembly
agree with parental haplotypes (Phased)
[Figure: Angus-specific k-mer counts vs. Brahman-specific k-mer counts, one panel each for TrioCanu and FALCON-Unzip]
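The comparison on this slide rests on counting parent-specific k-mers in each assembled contig. A minimal Python sketch of that count follows; the function names are hypothetical and the parent-specific k-mer sets (for example Angus-only and Brahman-only) are assumed to have been computed beforehand. A well-phased haplotig should contain k-mers from only one parent, while a contig with substantial counts from both suggests a haplotype switch.

```python
# Illustrative sketch; not the evaluation code used in the paper.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq, k):
    """Yield each k-mer in its canonical form (min of forward and reverse complement)."""
    for i in range(len(seq) - k + 1):
        fwd = seq[i:i + k]
        yield min(fwd, fwd.translate(COMPLEMENT)[::-1])

def parent_kmer_counts(contig, a_only, b_only, k):
    """Count k-mers in a contig that are specific to parent A and to parent B."""
    a_hits = b_hits = 0
    for km in canonical_kmers(contig, k):
        if km in a_only:
            a_hits += 1
        elif km in b_only:
            b_hits += 1
    return a_hits, b_hits
```

Plotting the two counts against each other for every contig gives a scatter in which correctly phased contigs lie along one axis or the other.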
23. Phasing NA12878
Koren and Rhie et al, De novo assembly of haplotype-resolved genomes with trio binning, Nat. Biotech (2018)
[Figure panels: TrioCanu, FALCON-Unzip, Supernova]
24. Phasing the F1 Cattle
Kronenberg and Kingan et al.,
FALCON-Phase: Integrating PacBio and Hi-C data for phased diploid genomes, bioRxiv (2018)
[Figure: per-contig Brahman vs. Angus k-mer counts (axes 0 to 3,000,000), with point size showing contig size, in panels for TrioCanu, FALCON-Unzip, and FALCON-Phase. Legends: Assembly (Angus, Brahman); Contig (Hap1, Hap2).]
Editor's notes
Before phasing, short reads indicated a copy gain in CYP2D6
After phasing, we identified that the duplicated copy of CYP2D6 was fused with the last exon of CYP2D7 on haplotype B
ref allele = #1
weight by non-ref allele’s global AF
Genes shown in black are typed genes: both haplotypes are called correctly and all are in phase, with 1 indel in DQB1. The assembly confirms the expected absence of DRB3 in the mother and its presence in the father, but also shows that other sequence is present there, so it is not a simple deletion.