Centromeric regions contain significant human genetic variation that is not represented in current reference genomes. This document proposes a two-part approach to characterize sequence variation in centromeric regions: (1) construct chromosome-specific reference maps of centromeric DNA, and (2) expand the human variation reference map to include centromeric regions. Key aspects include using long reads to assemble higher-order repeats, short reads to estimate array sizes and variant frequencies, and graph representations to model structural variation while retaining haplotype information. This would provide new insights into centromeric biology and identify centromeric variants associated with disease.
Karen miga centromere sequence characterization and variant detection
1. Centromeric Regions: A source of new, unexplored human sequence variation
Karen H. Miga
University of California, Santa Cruz
Jan 25, 2018
GIAB Workshop
2. Identifying Sequence Variants (CHR 9)
Each variant class is shown as Allele 1 versus Allele 2:
Mobile element insertion (LINE)
Copy Number Variation
Inversion Polymorphism
Single Nucleotide Polymorphisms
Allele 1: …ATACGGATTTCATGACAGGTTA…
Allele 2: …ATACGGATTTGATGACAGGTTA…
4. CENTROMERIC REGIONS
Inability to track variation
[Figure: p-arm, q-arm, Multi-Megabase Assembly Gaps (?)]
Unable to identify using standard genomic data:
Mobile element insertion
Copy Number Variation
Inversion Polymorphism
SNPs
5. Cytogenetics: Identifying Sequence Variants
[Figure: CHR 9, Allele1 and Allele2, chr 9qh and chr 9qh+]
CENTROMERIC REGIONS
Unable to identify using standard genomic data:
Mobile element insertion
Copy Number Variation
Inversion Polymorphism
SNPs
H. E. Wyandt, V. S. Tonk, Human Chromosome Variation: Heteromorphism and Polymorphism, 2011
6. Cytogenetics: Identifying Sequence Variants
Centromeres Play a Role in Cell Division
Regulate Centromere Function
Contribute to Chromosome Cohesion
7. Cytogenetics: Identifying Sequence Variants
• 9qh+ men had significantly increased frequencies of hyperdiploid cells (Ford et al 1978)
• 9qh+ women showed significant differences in rates of aneuploidy (Ford et al 1978)
• 9qh+ is associated with an increased fraction of malformed spermatozoa (Eiben et al 1987)
• Inversions spanning 9qh relate to recurrent miscarriages in Italian populations (Del Porto et al 1993)
8. Uncharted Functional Regions of the
Human Genome
Part I: Constructing a reference map of centromeric DNAs
Part II: Expand the human “variation reference map” to
include centromeric DNAs
9. Part I: Constructing a reference map of centromeric DNAs
ALPHA SATELLITE: ~171bp Tandem Repeat
Wide Range of Percent ID: ~60-100%
[Figure: multi-megabase array of monomers (1 2 3 4 ...) between the p-arm and q-arm]
10. Human Centromeric DNA: Higher Order Repeats
“Higher Order Repeat”: Multi-monomeric Repeat Unit
Narrow Range of Percent ID: 94% - 100%
[Figure: multi-megabase array of repeat units (1 2 3 4, 1 2 3 4, 1 2 3 4 ...) between the p-arm and q-arm]
16. Genome Informatics
Construct a new genomic reference for each centromeric region to broaden research in these areas
[Figure: centromeric region between the p-arm and q-arm, annotated with INVERSION, LINE, SINE, OTHER, single-base variants (-A-, -T-), GENES, NON-ALPHA SATELLITE and Non-satellite DNA]
19. α-Centauri (centromeric automated repeat identification)
>200 ENCODE datasets
PacBio ~10kb read: 5’… A B C D E F …3’, 10x
[Figure: monomers A, B, C, D, E, F joined in a cycle, each link labeled 10]
Prediction of Higher Order Repeats
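As a rough illustration of the idea on this slide, the sketch below finds the repeat period in the order of monomers along one long read. It assumes each ~171bp monomer has already been assigned a label (A to F, as drawn on the slide). The function name `smallest_period` and the labeling step are hypothetical; this is not α-Centauri's method, which the slides do not describe.

```python
def smallest_period(labels, min_copies=2):
    """Return the shortest period p with labels[i] == labels[i + p] for every i.

    Returns None when no period fits at least `min_copies` times in the read.
    `labels` is the ordered list of monomer labels observed in one read
    (an assumed, pre-computed input).
    """
    n = len(labels)
    for p in range(1, n // min_copies + 1):
        if all(labels[i] == labels[i + p] for i in range(n - p)):
            return p
    return None


# Toy read modeled on the slide: monomers A to F repeated 10 times.
read = list("ABCDEF" * 10)
period = smallest_period(read)
candidate_hor = read[:period] if period else None
print(period, candidate_hor)
```

For the toy read the period is 6, which corresponds to a six-monomer candidate repeat unit. Real reads contain sequencing errors and variant units, so an exact-match test like this one would need to be relaxed in practice.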
21. Experimental Evidence:
Chromosome-specific Satellite DNA tools to
Screening Somatic Cell Hybrid Panel
[Figure: predicted HOR with monomers A to F]
D7Z1 6-mer, Waye et al (1987)
98%, GenBank: M16101
Flow Sorted Chromosome
Alignment/Enrichment
Illumina sequencing of isolated human
chromosomes
Long Range Read Support
“Anchor” mapped to the assembled p-arm and/or q-arm
Chromosome specific assignment
23. Read Depth Estimates of Average Satellite Array Size
7q-arm
D7Z1 (6-mer)
7p-arm
D7Z2 (16-mer)
R Wevrick and H F Willard, NAR (1991)
24. Read Depth Estimates of Average Satellite Array Size
D7Z1 (6-mer), array size estimate: ~2.65 Mb
D7Z2 (16-mer), array estimate: ~0.42 Mb
[Figure: arrays between the 7q-arm and 7p-arm; D7Z1 HOR with monomers A to F; D7Z1 (Illumina Read Database)]
Hybrid approach
Long reads inform
sequence structure
Short, high-quality
reads generate
frequency estimates
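The read-depth arithmetic behind an average array size estimate can be sketched as follows, assuming the estimate is the number of sequenced bases assigned to a satellite array divided by the genome-wide depth of coverage. The function name and the input numbers are invented for illustration (the inputs were picked so the arithmetic lands on the slide's ~2.65 Mb figure); they are not data from the talk.

```python
def estimate_array_size_bp(satellite_bases: float, mean_coverage: float) -> float:
    """Estimate satellite array size in bp from read depth.

    satellite_bases: total bases in reads assigned to the array (hypothetical input).
    mean_coverage: average depth over single-copy sequence in the same library.
    """
    if mean_coverage <= 0:
        raise ValueError("mean_coverage must be positive")
    return satellite_bases / mean_coverage


# Invented example: 79.5 Mb of array-assigned read bases at 30x coverage.
size_bp = estimate_array_size_bp(79_500_000, 30)
print(f"{size_bp / 1e6:.2f} Mb")  # prints 2.65 Mb
```

Under this assumption, a diploid sample yields the average of the two homologous arrays, because reads from both copies contribute to the same depth.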
25. Read Depth Estimates of Average Satellite Array Size
[Figure: distribution of array size across individuals for D7Z1 and D7Z2; x-axis: Array Size (Mb), 0.0 to 5.0; y-axis: Individuals, 0 to 200]
26. Predicting HOR Repeat Variants
α-Centauri (centromeric automated repeat identification)
[Figure: HOR monomers A to F between the 7q-arm and 7p-arm, read 5’ to 3’, with (6-mer) and (4-mer) repeat variants]
27. Predicting HOR Repeat Variants
[Figure: HOR monomers A to F between the 7q-arm and 7p-arm; links labeled with frequencies 1.0, 1.0, 1.0, 0.9, 0.9, 0.9 and 0.1]
Hybrid approach
Long reads inform
sequence structure
Short, high-quality
reads generate
frequency estimates
28. Map Single Nucleotide Variants
[Figure: HOR monomers A to F plus variant monomer B’ (-G-, -T-) between the 7q-arm and 7p-arm; links labeled with frequencies of 1.0, 0.9 and 0.1; values 26 and 2565]
Account for SNVs
(frequency and position)
within the array
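One way to picture "frequency and position" is a column-by-column tally over copies of the repeat unit. The sketch below assumes the copies have already been extracted and aligned to equal length; the function name `variant_frequencies`, the `min_freq` cutoff and the toy sequences are all hypothetical and are not the talk's pipeline.

```python
from collections import Counter


def variant_frequencies(copies, min_freq=0.05):
    """Map (position, base) to frequency for non-consensus bases.

    copies: equal-length, pre-aligned sequences of the same repeat unit
            (an assumed input; no alignment is performed here).
    min_freq: minimum frequency for a variant to be reported (arbitrary cutoff).
    """
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("all copies must have the same length")
    result = {}
    for pos in range(length):
        counts = Counter(c[pos] for c in copies)
        consensus, _ = counts.most_common(1)[0]
        for base, n in counts.items():
            freq = n / len(copies)
            if base != consensus and freq >= min_freq:
                result[(pos, base)] = freq
    return result


# Toy input: nine identical copies and one copy with a G>T change at index 2.
copies = ["ACGTAC"] * 9 + ["ACTTAC"]
print(variant_frequencies(copies))
```

With the toy input, the only reported variant is a T at index 2 with frequency 0.1, mirroring the 0.9 / 0.1 split drawn on the slide.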
32. CEN3: 300Kb Segmental Duplication from 6p11.2
Gene: DNA Primase Polypeptide 2
GENES
INVERSION
q-arm p-arm
Non-Satellite DNA
Linking to chromosome arms and non-satellite DNA
34. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
35. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
Improves Unambiguous Short Read Mapping
[Figure: a read of uncertain placement (?) against a linear array REPEAT REPEAT REPEAT, 5’ to 3’, compared with a single REPEAT node]
Centromere Graphs (Benedict Paten, Adam Novak)
Demonstrate unambiguous mapping of the majority (> 98%) of 1000 genome alpha satellite reads
36. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
2. Information describing long-range haplotypes is
retained as defined “paths” in the graph:
37. Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
2. Information describing long-range haplotypes is
retained as defined “paths” in the graph
3. Graph data structure and sequence analysis tools
will be consistent with the rest of the human genome
The major histocompatibility complex (Kiran Garimella & Gil McVean)
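The first two advantages listed above can be illustrated with a toy graph in which identical repeat units collapse into one node and each haplotype is kept as a path of node ids. This is a minimal sketch under stated assumptions: the function `build_graph`, the haplotype names and the segment strings are invented, and the structure is far simpler than the graph tooling referred to in the talk.

```python
def build_graph(haplotypes):
    """Collapse identical segments into shared nodes and record each haplotype as a path.

    haplotypes: dict mapping a haplotype name to its ordered list of segment
                sequences (hypothetical input, already split into repeat units).
    Returns (node_ids, edges, paths): segment -> node id, a set of directed
    (from_id, to_id) edges, and haplotype name -> list of node ids.
    """
    node_ids = {}
    edges = set()
    paths = {}
    for name, segments in haplotypes.items():
        path = [node_ids.setdefault(seg, len(node_ids)) for seg in segments]
        edges.update(zip(path, path[1:]))
        paths[name] = path
    return node_ids, edges, paths


# Toy haplotypes: one uniform array, one carrying a variant unit in the middle.
haplotypes = {
    "hapA": ["REPEAT", "REPEAT", "REPEAT"],
    "hapB": ["REPEAT", "REPEAT_VARIANT", "REPEAT"],
}
node_ids, edges, paths = build_graph(haplotypes)
print(len(node_ids), sorted(edges), paths)
```

Six repeat units across the two haplotypes are stored as two nodes, while the order of units along each haplotype is still recoverable from its path.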
38. Part II: Variation Map
The major histocompatibility complex (Kiran Garimella & Gil McVean)
Expand the human “variation reference map” to include
centromeric DNAs
45. Detection of Sequence Variants
AJ Trio
Han Chinese
(HG00512)
Yoruba
(NG19340)
Puerto Rican
(HG00733)
Expand graph to include 4 reference populations
Collaboration: Ali Bashir and Matthew Pendleton; Icahn Institute
47. Study of Array Size Variation
Miga et al (2014)
Individual A: 8.3 Mb
Individual B: 0.7 Mb
[Figure: distribution of array size across individuals; x-axis: Array Size (Mb), 0.5 to 9; y-axis: Individuals, 0 to 20]
48. Sequence Variation
Identify a new source of human sequence variation
Collection of 19 high coverage genomes (~30-60X): 9 Populations, 3 Trios
1000 Genome Data (1,092) individuals from 26 distinct populations
Expand genome informatics to provide an assessment of common satVARs in population
49. Novel Human Biomarkers:
Does human sequence variation in centromeric regions contribute to disease?
Catalogue of all Common Human Satellite DNA Variants
Satellite DNA Variants Associated with Cancer (Germline)?
Use of genomics to greatly improve CEN variant detection
Increase population based sampling to improve statistical tests
50. David Haussler
Benedict Paten
Jim Kent
(CGL, UCSC Browser,
Haussler Wet Lab)
Sofie Salama
Adam Novak
Maximilian Haeussler
Brian Raney
Ian Fiddes
Yulia Newton (Josh Stuart)
Jason Chin
Volkan Sevim
Creating (and mapping to) a
Universal Reference Genome
Benedict Paten, Adam Novak, David
Haussler, UC Santa Cruz
Acknowledgements
Alex Hastie
Denghong Zhang
Ali Bashir
Thomas Keane
Mark Akeson
Miten Jain
Hugh Olsen