Presentation at the 2019 ASHG GRC/GIAB workshop describing the history of the human reference genome, current curation efforts and future plans, and the relationship of all three to efforts to produce a human pan-genome.
2. What's new and what's next for the human reference assembly?
Valerie Schneider, Ph.D.
NCBI
15 October 2019
https://genomereference.org
3. GRC
• Valerie Schneider
• Kerstin Howe
• Tina Graves
• Paul Flicek
• Tayebeh Rezaie
• Nathan Bouk
• Hsiu-Chuan Chen
• Jo Wood
• Joanna Collins
• Sarah Pelan
• Will Chow
• James Torrance
• Derek Albracht
• Milinn Kremitzki
• Laura Clarke
Thanks to many GRC Collaborators
https://www.ncbi.nlm.nih.gov/grc/credits/
Credits
Twitter: @GenomeRef
Announcements: grc-announce@ncbi.nlm.nih.gov
Funding:
• This work was supported in part by the Intramural Research Program of the National Library of Medicine, National Institutes of Health.
• The European Molecular Biology Laboratory.
• The Wellcome Trust, UK.
• The MGI was supported by National Institutes of Health grants 5U54HG003079, 5U41HG007635 and 5U24HG009081.
4. Outline
• What’s the reference?
• What’s new: GRCh38.p13 through today
• What’s next?
5. What’s the reference?
Anonymous samples
Individual 1A Individual 2A Individual 1B
Haploid mosaic assembly
• Highly contiguous
• Contig N50: 57.9 Mb
• Highly accurate
• Per-bp error rate: <10^-5
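For readers less familiar with the contiguity metric quoted above: N50 is the contig length at which contigs of that length or longer account for at least half of the total assembled bases. The following is a minimal sketch of that definition; the function name and the toy lengths are illustrative assumptions, not GRCh38 data.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half of the
    # total assembled length has been covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical toy contig lengths (not real assembly data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

A higher N50 means that fewer, longer contigs cover half of the assembly, which is why it is used as a shorthand for contiguity.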
Today’s reference assembly does not represent:
1. The most common allele/haplotype
2. The longest allele/haplotype
3. The ancestral allele/haplotype
The reference represents the available Human Genome Project sequence
1 library ~
6. What’s the reference? Assembly Model Evolution
Linear model: impacts on assembly building and analysis
[Figure labels: Sample (Gene1, Gene2; sequences from haplotype 1 and sequences from haplotype 2); Reference Assembly (chromosome with Gene1 and a false gap)]
GRCh37/GRCh38 reference assembly model: represent both haplotypes
[Figure labels: Reference Assembly (chromosome with Gene1 and Gene2, plus alt scaffold); many alt loci (chromosome with alt loci scaffolds 1, 2 and 3)]
7. Reference Assembly 101: Assembly Model Evolution
chromosome
Patch release: No change to chromosome coordinates
Assembly nomenclature: GRCh38.p$
Novel patch scaffold: ALLELIC
Fix patch scaffold: PREFERRED
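The nomenclature on this slide can be handled mechanically: an assembly name is a major version (for example GRCh38) optionally followed by a patch number (for example .p13). Because patch releases do not change chromosome coordinates, two names with the same major version share chromosome coordinates. The sketch below illustrates this; the function names and the regular expression are assumptions made for illustration, not a GRC or NCBI API, and treating an unpatched name as patch 0 is a convenience of this sketch.

```python
import re

# Major version (e.g. "GRCh38") with an optional ".p<N>" patch suffix.
_ASSEMBLY_NAME = re.compile(r"^(GRC[a-z]+\d+)(?:\.p(\d+))?$")


def parse_assembly_name(name):
    """Split an assembly name into (major_version, patch_number)."""
    match = _ASSEMBLY_NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized assembly name: {name!r}")
    major, patch = match.group(1), match.group(2)
    # Assumption of this sketch: a name without a patch suffix is patch 0.
    return major, int(patch) if patch is not None else 0


def same_chromosome_coordinates(name_a, name_b):
    """Patch releases keep chromosome coordinates, so only the major
    version needs to match."""
    return parse_assembly_name(name_a)[0] == parse_assembly_name(name_b)[0]


print(parse_assembly_name("GRCh38.p13"))
print(same_chromosome_coordinates("GRCh38.p12", "GRCh38.p13"))
print(same_chromosome_coordinates("GRCh37.p13", "GRCh38.p13"))
```

Note that this check covers chromosome coordinates only; patch scaffolds themselves are separate sequences added alongside the chromosomes.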
9. GRCh38.p13 (cumulative stats)
• 113 Fix patches: Add >3.88 Mb novel sequence
• 43 added in p13
• 72 Novel patches: Add >1.1 Mb novel sequence
• 2 added in p13
• >25 genes affected
What’s new?: GRCh38.p13
Tayebeh Rezaie
Weds, 9 am
Grand Ballroom B
Level 3 Convention Center
10. What’s new?: NOR Distal Junction Regions
Brian McStay Lab
DJ sequences are >99% identical between acrocentrics
11. What’s new?: NOR Distal Junction Regions
Updated chr 21 p-arm
[Figure: updated chr 21 p-arm, oriented centromere to telomere, with GRCh38 chr 21 alignment; tracks: reduced clone path (unordered/unoriented); rDNA + NOR DJ]
13. • GRCh38 gaps to be evaluated (n=196)
• Excludes biological gaps and WGS intra-scaffold gaps
• Evaluation: Alignment of 8 collapsed diploid assemblies
• 26 gaps spanned by all 8 WGS assemblies, with constant insert length
• Spanning sequence included in GRCh38.p13
• 3 gaps spanned by all 8 WGS assemblies, with variable insert length
• 24 gaps spanned by only a subset of the 8 assemblies
• Remainder of gap evaluations still in progress
[Figure: GRCh38 components (Clone, Clone, WGS, WGS, WGS) aligned to a PacBio assembly; label: "Assessed as one gap"]
What’s new?: Gap Closures
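The triage described on this slide can be sketched as a small classification rule plus a tally check. This is an illustration only: the function name, the input layout (one insert length per assembly, None where the assembly does not span the gap), and the use of exact equality for "constant insert length" are all assumptions, since the slide does not state how constancy was defined. The tally also assumes the three reported categories do not overlap.

```python
def classify_gap(insert_lengths, n_assemblies=8):
    """Classify one gap from per-assembly insert lengths.

    insert_lengths holds one entry per WGS assembly; None means that
    assembly does not span the gap (hypothetical input layout).
    """
    if len(insert_lengths) != n_assemblies:
        raise ValueError("expected one entry per assembly")
    spanning = [x for x in insert_lengths if x is not None]
    if not spanning:
        return "not spanned"
    if len(spanning) < n_assemblies:
        return "spanned by subset"
    # Assumption: "constant" means identical length in every assembly.
    if len(set(spanning)) == 1:
        return "spanned by all, constant insert length"
    return "spanned by all, variable insert length"


# Counts reported on the slide.
total_gaps = 196
reported = {
    "spanned by all, constant insert length": 26,
    "spanned by all, variable insert length": 3,
    "spanned by subset": 24,
}
remaining = total_gaps - sum(reported.values())
print(remaining)  # 143 gaps outside the three reported categories

print(classify_gap([500] * 8))
print(classify_gap([500, 510, 500, 500, 500, 500, 500, 500]))
print(classify_gap([500, None, 500, None, 500, 500, 500, 500]))
```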
15. What’s next?
[Chart panels: "Unresolved genome issues"; "Current curation status"; "Resolution likelihoods as determined by the GRC review"; n=234]
Slide: Tayebeh Rezaie
16. What’s next?
Data Source   Origin             Status
NA19836       African American   Assembly Submission Underway
NA20502       Tuscan             Assembly Submission Underway
NA20862       Gujarati Indian    Assembly Submission Underway
HG03125       Esan               Assembly Assessment Underway
HG02970       Esan               Assembly Assessment Underway
NA21309       Maasai             Assembly Assessment Underway
NA20300       African American   Assembly Assessment Underway
NA20129       African American   Assembly Assessment Underway
HG01567       Peruvian           Assembly Assessment Underway
HG03719       Telugu             Assembly Assessment Underway
HG00766       Chinese Dai        Assembly Assessment Underway
NA12395       CEPH               Assembly Underway
NA19030       Luhya              Assembly Underway
NA19734       Mexican Ancestry   Assembly Underway
HG03736       Sri Lankan         Assembly Underway
17. [Chart: Growth of accessioned complete human genome assemblies in NCBI Assembly database; x-axis: 2003 to 2019; y-axis: 0 to 120]
• Engagement with T2T consortium
– Chr X
– Other missing sequences
• Continued engagement with Gold Genomes project
– Gap closures
– New Novel patches
• New clone paths for immune regions (improve existing paths and add diversity)
– MHC
– IgH
• Chr 21 p-arm sequence review and update
– Not possible as patches?
• Community outreach
– Workshops
– Website: Help Desk/FAQs
• Your Data?
What’s next?
(For updated assemblies, only date of initial submission is counted)
Chart annotations: GRCh38 released; n=98
GRCh38.p14 (2020)
18. What’s next?
• Consortium Goals
– Produce 350 human whole genome assemblies
– Fully phased diploid assemblies
– Identify SVs between samples and current reference GRCh38
– Incorporate those SVs into the reference, likely as a graph representation