10. Submitted on NCBI35 (hg17)
nsv832911 (nstd68)
http://www.ncbi.nlm.nih.gov/dbvar
11. Moved approximately 2 Mb distal on chr15
NCBI35 (hg17) Tiling Path: NC_000015.8 (chr15)
GRCh37 (hg19) Tiling Path: NC_000015.9 (chr15)
Legend: Gap Inserted; Removed from assembly; Added to assembly
HG-24
12. GRCh37.p10
(160 regions: 2.89% of chromosomes)
111 Fix PATCHES: Chromosome update in GRCh38
(adds >5 Mb of novel sequence to the assembly)
71 Novel PATCHES: Additional sequence added
(adds >800K of novel sequence to the assembly)
Releasing patches quarterly
Summer of 2013
21. The Reference Assembly is Evolving
Centralization of assembly and sequence data facilitates:
Reporting problems
Implementing fixes
Building tools
Data management
22. Thanks!
Genome Reference Consortium
The Genome Institute at Washington University
The Wellcome Trust Sanger Institute
European Bioinformatics Institute
National Center for Biotechnology Information
NCBI
Assembly Group
RefSeq Group
Genome Annotation Group
Editor's Notes
What is variant calling? Identifying differences from a reference.
Technical noise: Ideogram showing gaps in human genome
Technical noise example: an ISCA variant was submitted that lay completely within a gap. Note: most people don’t look at assembly tracks when they review data, so you might never see anything odd about this variant unless you did. (This case is actually what prompted us to start validating variant definitions against the assembly.)
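The kind of validation mentioned in this note can be sketched as an interval containment check. This is a minimal illustration, not the actual validation pipeline: the `Interval` type, the `gap_containing` helper, and all coordinates are hypothetical.

```python
from typing import Iterable, NamedTuple, Optional


class Interval(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def gap_containing(variant: Interval, gaps: Iterable[Interval]) -> Optional[Interval]:
    """Return the first gap that fully contains the variant, or None."""
    for gap in gaps:
        if (gap.chrom == variant.chrom
                and gap.start <= variant.start
                and variant.end <= gap.end):
            return gap
    return None


# Hypothetical coordinates, for illustration only.
gaps = [Interval("chr15", 1_000_000, 1_050_000)]
inside = Interval("chr15", 1_010_000, 1_020_000)       # entirely within the gap
overlapping = Interval("chr15", 1_040_000, 1_060_000)  # extends past the gap

for variant in (inside, overlapping):
    hit = gap_containing(variant, gaps)
    status = "lies completely within a gap" if hit else "is not contained in a gap"
    print(variant, status)
```

A variant flagged this way has no underlying sequence in that assembly version, so it would be a candidate for review before acceptance.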
Technical issue: We get tiny reads (relative to the genome) that we have to align to a reference and interpret. Screenshot of 1000 Genomes data in the 1000 Genomes browser, showing the tenth exon of CDC27. Highlighted are samples that were sequenced using one technology and aligned using two different methods. Note the lack of 1000 Genomes calls in this region, as well as a questionable het SNP (seen in BWA, but not Mosaik).
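One simple way to surface calls like the questionable het SNP in this note is to compare the call sets produced from the same reads by two aligners. The sketch below assumes calls have already been reduced to (chromosome, position, ref, alt) tuples; the `Call` type, the helper name, and the example calls are made up for illustration.

```python
from typing import Iterable, NamedTuple, Set, Tuple


class Call(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


def split_by_concordance(
    calls_a: Iterable[Call], calls_b: Iterable[Call]
) -> Tuple[Set[Call], Set[Call], Set[Call]]:
    """Return (shared, only_in_a, only_in_b) for two call sets."""
    set_a, set_b = set(calls_a), set(calls_b)
    return set_a & set_b, set_a - set_b, set_b - set_a


# Made-up calls; positions do not correspond to real CDC27 coordinates.
bwa_calls = [Call("chr17", 100, "A", "G"), Call("chr17", 250, "C", "T")]
mosaik_calls = [Call("chr17", 100, "A", "G")]

shared, only_bwa, only_mosaik = split_by_concordance(bwa_calls, mosaik_calls)
print("aligner-specific calls to review:", sorted(only_bwa | only_mosaik))
```

Calls that appear with only one aligner are not necessarily wrong, but they are reasonable candidates for closer inspection in regions with paralogous sequence.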
This problem can also arise from population-specific issues. The APOL1 gene seems to be correctly assembled, but there may be an African-specific copy number duplication that causes a SNP to be called; this may not be a SNP but rather a difference between paralogous gene copies.
Focal segmental glomerulosclerosis 4: this shows the utility of having data in a centralized resource. We were able to add annotation from multiple sources onto this genome location, and having variant calls in a central repository allows knowledge to be added over time. Early in the new year we will be adding tracks for Evan’s SUN data and for alignments of known paralogs/pseudogenes to the genome.
To address assembly issues, the GRC centralizes the production of the reference assembly. This gives the community a single point of contact for reporting problems and finding information about the assembly. Additionally, we serve as an aggregator of information: as individual labs find or fix problems, we can integrate this information into the reference assembly so that everyone has access to it.
Region curation slide: We curate the genome region by region, and make this information available to users on a web site (and as downloadable files for integration with browsers).
The GRC releases patches to the human assembly on a quarterly cycle. There are two varieties of patches:
FIX patches correct existing assembly problems: the chromosome will update, and the patches will be integrated in GRCh38.
NOVEL patches add new sequence representations: they will become alternate loci.
Approximately 3% of the current public human assembly, GRCh37, is associated with a region that is represented by a patch or alternate locus.
As you can see, the GRC has been very busy with updating assemblies. I’d now like to talk about the tools and software we use to do this.
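The two patch types described in this note can be modeled with a small data structure. This is only a sketch of the distinction as stated here; the `Patch` class, the function name, and the patch names are hypothetical and do not reflect any GRC file format or API.

```python
from dataclasses import dataclass
from enum import Enum


class PatchType(Enum):
    FIX = "FIX"      # corrects an existing assembly problem
    NOVEL = "NOVEL"  # adds a new sequence representation


@dataclass(frozen=True)
class Patch:
    name: str
    patch_type: PatchType


def fate_at_next_major_release(patch: Patch) -> str:
    """Describe what happens to a patch at the next major release, per the note above."""
    if patch.patch_type is PatchType.FIX:
        return "integrated into the updated chromosome"
    return "becomes an alternate locus"


# Hypothetical patch names, for illustration only.
patches = [Patch("EXAMPLE_FIX_1", PatchType.FIX), Patch("EXAMPLE_NOVEL_1", PatchType.NOVEL)]
for patch in patches:
    print(patch.name, "->", fate_at_next_major_release(patch))
```

Keeping the patch type explicit matters for analysis: a FIX patch signals that the chromosome coordinates in that region are expected to change, while a NOVEL patch leaves the chromosome sequence unchanged.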
If you are not using the entire assembly in your efforts, you may be missing genes in your exome capture reagents.
Example of a fix patch: no one is really screening for these right now, despite their clear importance in neuronal development.
RefSeqGene/LRG screenshot: a stable coordinate system for gene-level reporting, based on gene-centric genomic sequences.