The human reference genome is incomplete and does not fully represent structural variation. Additional sequences are needed to represent diversity. A hydatidiform mole genome (CHM1) provides an alternate haploid reference with differences from the diploid human reference. The current CHM1 assembly incorporates BAC sequences and Illumina reads. Future work includes improving the assembly using long read technologies and integrating it into the human reference to better represent human variation.
Ashg grc workshop2014_tg
1. ASHG - GRC Workshop
Tina Lindsay
ASHG Oct 18, 2014
2. The Human Reference is Not Complete
• The reference has been found to be suboptimal in some regions
• Structural variation makes it difficult to assemble a truly representative genome when using a diploid sample
• Some regions were recalcitrant to closure with the technology and resources available at the time
• Additional sequences are needed to capture the full range of diversity in humans
4. Allelic Diversity vs. Segmental Duplication
[Figure: Diploid Genome, showing allelic copies and repeat copies (repeat copies noted by color difference); example bases A, A, C, T, C, G, C, C]
With a diploid genome, there is significant ambiguity in sorting allelic copies from repeat copies
[Figure: Haploid Genome, showing repeat copies only (noted by color difference); example bases A, C, C, C]
With a haploid genome, allelic differences are eliminated, and base differences are likely indicative of repeat copies
5. Hydatidiform mole
1. Fertilization of an oocyte without a nucleus
2. Post-zygotic diploidization of triploid zygotes
[Figure: Oocyte and Androgenetic HM; labels: 23X, ?]
6. Initial Use Of CHM1 Source
• CHORI-17 BAC Library
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs
• > 750 have been sequenced
• 590 of them in GenBank as phase 3
7. SRGAP2 Homology between genes
Shows nearly identical segments between SRGAP2A and SRGAP2 paralogs
Shows homology between SRGAP2B and SRGAP2C
SRGAP2A
SRGAP2B
SRGAP2C
Dennis et al. 2012
11. Current status of CHM1 resources
• CHORI-17 BAC Library (created from CHM1 cell line)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1560 fpc contigs)
• CHORI-17 BACs (>750 have been sequenced, with 592 of them in GenBank as phase 3)
• Active cell line
• >100X coverage of Illumina 100 bp reads
• 300 bp, 500 bp, and 3 kb inserts
• Reference assisted assembly CHM1_1.1
• BioNano genome map
• >50X coverage of PacBio long read data
12. CHM1_1.1 Assembly
• Reference-guided assembly – SRPRISM v2.3, R. Agarwala
• Alignment of Illumina reads to GRCh37 primary assembly
• CHORI-17 BAC clone tilepaths were then incorporated
• 428 total clones
• 324 clones in 45 tilepaths
• 104 clones as singletons
• Comparison back to the GRCh37 reference to provide appropriate gap sizes
• Assembly submitted to GenBank
• http://www.ncbi.nlm.nih.gov/assembly/GCF_000306695.2
• Paper to be published soon
• Genome Research (in press)
• bioRxiv (doi: http://dx.doi.org/10.1101/006841)
13. CHM1_1.1 Assembly
Total Sequence Length 3,037,866,619 bp
Total Assembly Gap Length 210,229,812 bp
Number of Scaffolds 163
Scaffold N50 50,362,920 bp
Number of Contigs 40,828
Contig N50 143,936 bp
CHM1_1.1
GRCh37
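The N50 values on this slide summarize contiguity. As a minimal sketch, assuming the common definition (the length L such that sequences of length L or longer hold at least half of the total bases), N50 can be computed as below. The toy contig lengths are hypothetical and are not taken from CHM1_1.1; the helper names are illustrative only. The second function simply subtracts the gap length reported above from the total sequence length.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Assumes the common definition: the length L such that sequences
    of length >= L together cover at least half of the total bases.
    """
    if not lengths:
        raise ValueError("need at least one length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def ungapped_length(total_bp, gap_bp):
    """Bases of actual sequence: total length minus assembly gaps."""
    return total_bp - gap_bp


# Hypothetical toy contigs, not real assembly data.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))

# Totals as reported on the slide.
print(ungapped_length(3_037_866_619, 210_229_812))
```

Sorting longest-first and accumulating until half the total is reached keeps the calculation independent of how many short contigs an assembly contains, which is why N50 is often preferred over mean contig length.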
16. PacBio CHM1 Assembly Shows Data Not in GRCh38
GRCh38
PacBio CHM1
Second Pass Alignment
17. CHM1 BioNano Genome Map Aligned to GRCh38
GRCh38
CHM1 BioNano Map
~15kb additional data
18. BioNano SV Calls Identified Assembly Problems
Collapse
Expansion
in Assembly
CHM1_1.1 Assembly Gap in Sequence
CHM1 BioNano Map
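One way to picture the collapse and expansion calls: compare the distance between two matched label sites in the genome map with the distance between the same sites in the sequence assembly. The sketch below is an illustration only, not BioNano's SV caller; the function name and the 5 kb tolerance are assumptions chosen for the example.

```python
def classify_discordance(assembly_span_bp, map_span_bp, tolerance_bp=5_000):
    """Label an interval by comparing assembly and genome-map spans.

    "collapse"  : the assembly is shorter than the map (sequence missing)
    "expansion" : the assembly is longer than the map
    "concordant": the two agree within tolerance_bp (an assumed threshold)
    """
    difference = map_span_bp - assembly_span_bp
    if abs(difference) <= tolerance_bp:
        return "concordant"
    return "collapse" if difference > 0 else "expansion"


# Hypothetical intervals, for illustration only.
print(classify_discordance(400_000, 500_000))
print(classify_discordance(500_000, 400_000))
```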
19. Collapse in Sequence Data
Thought to be missing ~100kb in sequenced clones
GRCh38
20. Gap Sizing
Chr8 – Stalled Gap
Estimated at ~150kb
GRCh38
Sized using CHM1 Genome Map: >500 kb
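The idea behind sizing a gap from a genome map can be sketched as simple arithmetic: take the map distance between the nearest label sites on either side of the gap and subtract the known reference sequence lying between those labels and the gap edges. The function and all numbers below are hypothetical; they are not the measurements behind the chr8 estimate above.

```python
def estimate_gap_size(map_label_distance_bp, left_flank_bp, right_flank_bp):
    """Estimate a gap size from a genome map (illustrative sketch).

    map_label_distance_bp: map distance between the labels flanking the gap
    left_flank_bp / right_flank_bp: reference sequence between each label
    and the adjacent gap edge
    """
    size = map_label_distance_bp - (left_flank_bp + right_flank_bp)
    # A negative value would mean the flanks overlap; report zero instead.
    return max(size, 0)


# Hypothetical values chosen only to show the arithmetic.
print(estimate_gap_size(620_000, 60_000, 40_000))
```

Because the map measures physical distance between labels directly, the estimate does not depend on the placeholder size assigned to the gap in the reference.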
21. Future of CHM1 Assembly
• Plan to make the assembly as contiguous and accurate as possible
• Incorporate the PacBio assembly where possible
• Additional CH17 clones are being sequenced through segmentally duplicated and structurally variant regions to provide local assembly benefits (isolates the repeats)
23. Future Directions
• Continued improvement of the CHM1 genome
• Integration of the Pacific Biosciences whole genome assembly
• BioNano genome map data
• Continue to add diversity to the reference by sequencing new samples that provide diversity beyond what is currently represented in GRCh38
• Continued sequencing of CH17 single haplotype BAC tilepaths to better represent segmentally duplicated regions
• Additional collaborations with the community to develop tools to more fully utilize the full reference assembly (alternate haplotypes)
24. Acknowledgements
The Genome Institute at Washington
University in St. Louis
Rick Wilson
Bob Fulton
Wes Warren
Karyn Meltz Steinberg
Vince Magrini
Derek Albracht
Milinn Kremitzki
Susan Rock
Debbie Scheer
Aye Wollam
The Finishing and Bioinformatics Teams
at The Genome Institute
University of Washington
Evan Eichler
Megan Dennis
Xander Nuttle
NCBI
Richa Agarwala
Valerie Schneider
University of Pittsburgh
School of Medicine (CHM1 cell line)
Urvashi Surti
Personalis
Deanna Church
BioNano Genomics
Pacific Biosciences
UCSF
Pui-Yan Kwok
Yvonne Lai
Chin Lin
Catherine Chu
CHORI
Pieter de Jong