33. GRCh37B Sites for Update: n=1164
Sites with unique successful ctg 1148 (98.6%)
Avg Length 448 bp
Min/Max Success Length 51/791 bp
Avg Coverage 80x
Read Source (all contigs)
High coverage 32%
Low coverage 57%
Exome 10%
Fixing Rare/Incorrect Bases
34. Build sequence contigs based on contigs
defined in TPF (Tiling Path File).
Check for orientation consistencies
Select switch points
Instantiate sequence for further analysis
Switch point
Representative chromosome
sequence
38. NCBI35 (hg17) Tiling Path
GRCh37 (hg19) Tiling Path
Gap Inserted
Moved approximately 2 Mb
distal on chr15
NC_000015.8 (chr15)
NC_000015.9 (chr15)
Removed from assembly
Added to assembly
http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/issue_detail.cgi?id=HG-24
39. Sequences from haplotype 1
Sequences from haplotype 2
Old Assembly model: compress into a consensus
New Assembly model: represent both haplotypes
48. Preview of GRCh38 (scheduled Fall 2013)
TEX28 TKTL1
LOC101060233
(opsin related)
LOC101060234
(TEX28 related)
GRCh37 (current reference assembly)
chrX
49. Hydin: chr16 (16q22.2)
Hydin2: chr1 (1q21.1)
Missing in NCBI35/NCBI36 Unlocalized in GRCh37 Finished in GRCh38
Alignment to Hydin2 Genomic, 300 Kb, 99.4% ID
Alignment to Hydin1 CHM1_1.0, >99.9% ID
Doggett et al., 2006
54. Making the assembly accessible to
existing tools: masking
Query set: 439,109,084 NA12878 HiSeq reads
55. Masking effectively blocks alignments
in regions with high identity
Simulated reads from GRCh37.p9
• Unpaired reads
• 101 bp
• 1x coverage
• Default wgsim parameters
Masking parameters
• Percent Id: 100%
• Step size: 5 bp
• Minimum length: 101 bp
• Center SNPs in unmasked regions
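The masking idea on this slide can be sketched roughly as follows. This is an illustration only, not the GRC masking tool: the function name `mask_identical_runs` and the interval format are hypothetical, and it assumes the stretches of an alt/patch sequence that are 100% identical to the primary assembly have already been found. Only the 101 bp minimum length is taken from the slide; the step size and the centering of SNPs in unmasked regions are not modeled.

```python
def mask_identical_runs(seq, identical_runs, min_length=101):
    """Replace long runs that duplicate the primary assembly with N.

    seq: alt/patch sequence as a string.
    identical_runs: iterable of (start, end) 0-based, half-open intervals
        assumed to be 100% identical to the primary assembly (hypothetical
        input; how these are computed is outside this sketch).
    min_length: only runs at least this long are masked (101 bp on the slide).
    """
    masked = list(seq)
    for start, end in identical_runs:
        end = min(end, len(seq))  # never write past the end of the sequence
        if end - start >= min_length:
            masked[start:end] = "N" * (end - start)
    return "".join(masked)
```

Runs shorter than the read length are left unmasked, so a read that spans a difference between the alt/patch and the primary assembly still has unique sequence to align to.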
57. NA12878 reads whose best
alignment was on an alt/patch in
the masked assembly were
evaluated for their alignment
location when aligned to the
primary assembly alone
Masking effectively reduces the
increase in NA12878 reads that
have alignments with MAPQ=0 that
occurs when the full assembly is
used as an alignment substrate
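A minimal sketch of how the fraction of MAPQ=0 reads could be tallied from SAM text, so the primary-only, full, and masked assemblies can be compared. The function name `mapq0_fraction` is hypothetical, and skipping unmapped, secondary, and supplementary records is an assumption about how reads were counted, not something stated on the slide. In SAM, the flag is column 2 and MAPQ is column 5.

```python
def mapq0_fraction(sam_lines):
    """Return the fraction of primary, mapped alignments with MAPQ == 0."""
    total = 0
    mapq0 = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        # 0x4 = unmapped, 0x100 = secondary, 0x800 = supplementary
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        total += 1
        if int(fields[4]) == 0:
            mapq0 += 1
    return mapq0 / total if total else 0.0
```

Running this on the same read set aligned to each substrate gives directly comparable fractions.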
1000Gs and ENCODE logos: What ties them together? Data analysis was absolutely dependent on the reference assembly.
CtgN50 stats here
Look up how much novel sequence added. Across all patches: 35 Mb of sequence added.
44 SNVs between Ren2 Tx alignment and Primary; 29 of these have rsIDs. Of these 29: 19 Alt base = Ref (likely paralog difference, with no evidence for polymorphism); 9 Alt base = Tx base (SNP and paralog difference?); 1 Alt base != Ref and Alt base != Tx (craziness).
Insert dot matrix alignment- pull from assembly-assembly alignments
Daly paper on VNTR
For the intermediate build GRCh37B, we are updating a subset of the high-confidence bases, about 1000, as our proof-of-principle. This panel shows reads from NA12878 aligned to chr. 19 that identify a base with MAF=0 in the LIN37 locus. This creates a non-consensus splice site. To create accessioned sequence for correcting the reference, we are using cortex_con (Iqbal and Caccamo) to generate mini-contigs (>= 50 bp) from collections of 1kG and RP11 WGS reads, the former selected from random 1kG populations.
In ph1, 1000G identified just over 235K bases with an MAF < 0.05. These may represent wrong or rare bases, and the GRC has been urged to change all of these. The GRC decided to take a conservative approach:
• Focus on a high-confidence subset of these bases (provided by the 1kG analysis group: Poplin, Clarke, Streeter): 54K of these; 5K “wrong”; 1.5K overlap a transcript:
  • In strict accessibility mask
  • Have clone sequence supporting alt base
  • No failed variants within 150 bp of questionable base
• Will fix wrong bases in the set; those cause the most problems for variant analyses.
• Will update only rare bases in the set with functional effects.
• Not updating ph1 indels.
Sanger is also doing some independent analyses for bases and indels. Summer will be spent defining the final collection of bases to be updated.
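The selection logic described in this note can be sketched as a simple filter. This is an illustration under stated assumptions, not the GRC or 1kG pipeline: the `CandidateBase` record, its field names, and `should_update` are hypothetical. It assumes the three listed conditions (strict mask, clone support, no failed variant within 150 bp) define the high-confidence subset, that "wrong" means the reference allele was never observed (frequency 0), and that "rare" means a nonzero frequency below 0.05.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CandidateBase:
    # Hypothetical record; field names are illustrative only.
    chrom: str
    pos: int
    ref_allele_freq: float            # frequency of the reference base
    in_strict_mask: bool              # inside the strict accessibility mask
    clone_supports_alt: bool          # clone sequence supports the alt base
    nearest_failed_variant_bp: Optional[int]  # distance in bp, None if none
    has_functional_effect: bool


def should_update(base, rare_cutoff=0.05, failed_window=150):
    """Return True if the reference base would be changed under these assumptions."""
    if base.ref_allele_freq >= rare_cutoff:
        return False  # reference base is not rare
    if not (base.in_strict_mask and base.clone_supports_alt):
        return False  # not in the high-confidence subset
    dist = base.nearest_failed_variant_bp
    if dist is not None and dist <= failed_window:
        return False  # a failed variant lies too close
    if base.ref_allele_freq == 0:
        return True   # assumed "wrong" base: always fix
    return base.has_functional_effect  # rare base: only if functional
```

Keeping the cutoffs as parameters makes it easy to see how the candidate set changes if the thresholds are relaxed.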
Stats for the mini-contigs built for GRCh37B. This slide shows the correction of the LIN37 issues via insertion of two mini-contigs into the tiling path. In GRCh37B, 56 RefSeqs, corresponding to 26 distinct loci, have had their alignments improved by the addition of the mini-contigs. We have some development left for GRCh38:
• Tweaking the process to build contigs that address clustered bases and/or indels
• Defining the final set of bases
Alignments refer to pairs of sequences. Once you know how a pair of sequences goes together, you can look at stringing the pairs along into a contig. The contig is essentially the consensus sequence that is produced from the components. To create a contig, we use the steps shown on this slide. What are switch points? As you create the consensus sequence of the contig, the switch points tell you where to stop using the sequence from one component and begin using the sequence from the next.
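The role of switch points can be illustrated with a small sketch. This is not the GRC assembly software: `build_contig`, `reverse_complement`, and the data layout are hypothetical, and it assumes each switch point is a pair of coordinates (where to stop in one oriented component, where to start in the next) already chosen from the pairwise alignment.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_contig(components, switch_points):
    """Instantiate a contig sequence from ordered components.

    components: list of (sequence, strand) in tiling-path order,
        strand is "+" or "-".
    switch_points: one (stop, start) pair per adjacent component pair;
        use component i up to `stop` (exclusive), then continue in
        component i+1 from `start`. Coordinates are 0-based on the
        oriented sequences.
    """
    if not components:
        return ""
    if len(switch_points) != len(components) - 1:
        raise ValueError("need exactly one switch point per adjacent pair")
    oriented = [seq if strand == "+" else reverse_complement(seq)
                for seq, strand in components]
    pieces = []
    begin = 0
    for i, seq in enumerate(oriented):
        if i < len(switch_points):
            stop, next_begin = switch_points[i]
        else:
            stop, next_begin = len(seq), 0  # last component runs to its end
        pieces.append(seq[begin:stop])
        begin = next_begin
    return "".join(pieces)
```

For example, two components that overlap by "CCGG" can be joined by switching inside the overlap: `build_contig([("AAAACCCCGG", "+"), ("CCGGTTTT", "+")], [(8, 2)])` uses the first component through position 8 and the second from position 2 onward.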
Adding novel sequence for GRCh38. One source of this novel sequence is the 1kG ph1 decoy sequence. Decoy doesn’t provide chromosome context. Thus, if we can place much of the decoy in chromosome context for GRCh38, that adds even more value to the assembly. This slide shows the breakdown of the decoy: by source (bottom), by alignment to GenBank, and by amount and type of repeat. The GRC intends to assess capture by looking at 1kG reads that used to align to the decoy and seeing where they align in the updated assembly.
Other portions of the decoy likely represent sequence that belongs in reference assembly gaps. We are aligning all HuRef and ALLPATHS scaffolds to the reference assembly to identify sequences that extend into or span gaps. This slide shows how a combination of HuRef WGS and PCR product closes a gap on chr. 16 and provides complete representation for TMEM114. Analysis of GRCh37B shows: 46 of 73 HuRef scaffold insertions involve decoy. 77 ALLPATHS decoy contigs are being added at 46 gaps.
Lastly, some portions of the decoy will represent sequence variants. In these cases, the primary assembly does not need to be changed, but the decoy can be added as a NOVEL patch/alt locus. This slide shows a NOVEL locus that was created to capture a decoy sequence containing 30 kb of additional sequence, which represents a repeat expansion. As of GRCh37.p12, 87 of 781 decoy sequences have been captured in chromosome updates/fix patches or as novel patches/alt loci.
There are several mechanisms we can use for capturing decoy. Much of the decoy represents centromeric repeat sequence. In collaboration with Karen Hayden in Jim Kent’s lab at UCSC, the GRC is planning to include modeled centromeric sequences in GRCh38.
The reference is not just the chromosome sequences of the primary assembly unit, but also includes the alternate loci and patches, which are used to provide additional sequence representations at selected genomic regions. The GRC has been releasing patches to the human assembly on a quarterly cycle, and we’re now at GRCh37.p12. There are two varieties of patches:
• FIX patches correct existing assembly problems: the chromosome will update, and the patches will be integrated in GRCh38
• NOVEL patches add new sequence representations: these will become alternate loci
This ideogram shows the current distribution of patches and alternate loci, and you can see that many regions have changed since GRCh37. Note that approximately 3% of the current public human assembly GRCh37 is associated with a region that is represented by a patch or alternate locus.
Adding NOVEL sequence for GRCh38 doesn’t just mean adding sequence that is completely unrepresented in GRCh37. While many of the NOVEL patches, like the one on the previous slide, represent indels, adding novel sequence also means adding sequence variants for regions too complex to be represented by a single path. There is substantial variation at the LRC/KIR region on chr. 19. As shown on this slide, not only has the GRC replaced the GRCh37 path, which was derived from components from different clone libraries, with a single haplotype path from the CHM1 assembly, it also now has 8 different haplotypes represented as alternate loci. The addition of another 10+ haplotypes at this locus is also under consideration.
The excess of red in the cSRA alignment track comes from secondary alignments. Somewhere in the SAM to cSRA conversion it seems that the secondary alignment CIGAR strings got messed up, resulting in what looks like really bad alignments. There’s no way to turn off the display for just the secondary alignments in Gbench. We will have to try and regenerate the cSRA to get rid of these…