EBI is an Outstation of the European Molecular Biology Laboratory.
Ensembl annotation
Bronwen Aken
21 September 2014
How Ensembl started
• Ewan Birney
• Michele Clamp
• Tim Hubbard
Ensembl’s goals
Annotate
(vertebrate)
genome
Integrate
with other
biological
data
Make
publicly
available
• Stable, automatic
annotation
• High quality
• Regular release cycles
• Open source
“Provide a bioinformatics framework to organise biology around
the sequences of large genomes”
Challenges
1. Find functional elements in a genome
• Data have lots of noise
2. Software / hardware
• Storing and manipulating data
3. Intuitive and comprehensive access to data
• Visualization
GRCh38 annotation in Ensembl
What is Genebuilding?
• Automatic, evidence-based annotation of
genes
• Not ab initio
• Based on sequence alignment
• “Best-in-genome”
• Aim for high specificity
• Prefer to miss a few features rather than heavily over-predict
Automated gene annotation pipeline is designed
around decisions made during manual annotation
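The ‘best-in-genome’ idea above can be sketched as keeping only the top-scoring alignment for each query sequence and dropping weak hits. This is a minimal illustration, not Ensembl's pipeline code: the `Alignment` record, its fields and the `min_score` threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical alignment record; not an Ensembl data type."""
    query: str      # protein or cDNA accession
    region: str     # genomic region the query aligned to
    start: int
    end: int
    score: float    # higher is better


def best_in_genome(alignments, min_score=0.0):
    """Keep only the highest-scoring alignment per query sequence.

    Hits below min_score are dropped first, which favours specificity:
    a query with no good hit produces no model at all.
    """
    best = {}
    for aln in alignments:
        if aln.score < min_score:
            continue
        current = best.get(aln.query)
        if current is None or aln.score > current.score:
            best[aln.query] = aln
    return best


# Made-up example data
hits = [
    Alignment("P1", "17", 100, 900, 95.0),
    Alignment("P1", "3", 500, 1200, 80.0),  # weaker second hit, e.g. a paralogue
    Alignment("P2", "X", 10, 400, 40.0),    # below threshold, dropped
]
chosen = best_in_genome(hits, min_score=50.0)
```

Raising `min_score` in this sketch trades missed features for fewer over-predictions, which is the preference stated on the slide.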
Advantages of re-annotating
• Add new genes to new / fixed genomic regions
• Updated supporting evidence: Remove models built on
data that has been deleted from archives
• Move alignments to regions with better mapping
Gene annotation pipeline – the basics
Identify interesting regions
• Rough alignment of sequences
to genome
Exhaustive alignment to
produce transcript models
Filter models
• Prioritize data sources
Produce ‘best guess’ gene
set
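The ‘prioritize data sources’ step can be sketched as a layered filter: models from a lower-priority source are kept only where no higher-priority model already exists. The layer order, the tuple layout `(region, start, end)` and the function names below are assumptions for illustration, not the behaviour of the real pipeline.

```python
def overlaps(a, b):
    """True if two (region, start, end) models share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def layer_filter(layers):
    """Walk the layers from highest to lowest priority.

    A model is kept only if it does not overlap a model that was
    accepted from a higher-priority layer.
    """
    accepted = []
    for source, models in layers:
        # Snapshot of models kept from earlier (higher-priority) layers only
        higher = [m for _, m in accepted]
        for model in models:
            if not any(overlaps(model, kept) for kept in higher):
                accepted.append((source, model))
    return accepted


# Made-up example: layers listed from highest to lowest priority
layers = [
    ("same_species_protein", [("17", 100, 500)]),
    ("other_species_protein", [("17", 400, 800), ("17", 2000, 2600)]),
    ("cdna", [("17", 2500, 3000), ("17", 5000, 5400)]),
]
result = layer_filter(layers)
```

Here the second and fourth candidate models are discarded because a higher-priority source already covers those loci.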
Protein-coding genebuild (pipeline diagram)
• Repeatmasking
• Inputs: same-species proteins, other-species proteins, cDNAs/ESTs
• Filtering: TranscriptConsensus, LayerAnnotation
• UTR addition
• Final gene set
• Also: small ncRNAs, lincRNAs, pseudogenes
Protein-coding genebuild with RNA-Seq models (pipeline diagram)
• Repeatmasking
• Inputs: same-species proteins, other-species proteins, cDNAs/ESTs, RNA-Seq models
• Filtering
• UTR addition
• Final gene set
• Also: small ncRNAs, lincRNAs, pseudogenes
• MERGE WITH HAVANA
Release cycle
26 September 2014
[Figure: regulation, gene, allele and conserved sequence]
Figure adapted from the ENCODE project www.nature.com/nature/focus/encode/
Genes
• Coding & noncoding
• Protein & mRNA
alignments
• GTF & BAM files
Compara
• Conserved DNA sequence
• Multiple genome
alignments
• Homologues
• Protein families
Regulatory regions
• DNA methylation
• TFBS
• Open chromatin
Variation
• SNPs, indels,
structural variation
• Phenotypes
• QTLs
Integrate with other species
Human and chimpanzee: gene SLC12A1
‘Patch’ annotation in Ensembl
Genome assembly representation
• Coord_system table
• Lists the allowed coordinate systems
• chromosome, scaffold, contig
• With ‘versions’
• GRCh37, GRCh38
• Contigs are shared between assemblies so have no version
• ‘Toplevel’ coordinate system
• Chromosomes + unplaced scaffolds + unlocalized scaffolds
+ alternate sequences
• Most popular means to access the whole genome
• API options for including/excluding alternate sequences and
PAR
Genome assembly representation
[Figure, built up over several slides: GRCh38 chromosome, scaffolds and contigs, with GRCh37 added in the final steps]
• DNA only loaded for contigs
Seq_region names
• Regions of the genome are given a slice name; it’s like an
address
• eg. chromosome:GRCh37:6:133090509:133119701:1
• Users like to say, ‘chromosome 6’
• INSDC coordinates are versioned, but less human-readable
• chromosome:GRCh37:CM000668.1:133090509:133119701:1
assembly
seq_region.
name
coord_system
start
end
strand
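A slice name of the form shown above can be split into its six labelled parts. The sketch below is plain string handling written for illustration; `SliceName` and `parse_slice_name` are hypothetical names, not part of the Ensembl API.

```python
from typing import NamedTuple


class SliceName(NamedTuple):
    coord_system: str
    assembly: str
    seq_region_name: str
    start: int
    end: int
    strand: int


def parse_slice_name(name: str) -> SliceName:
    """Split 'coord_system:assembly:seq_region:start:end:strand' into fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    coord_system, assembly, seq_region_name, start, end, strand = parts
    return SliceName(coord_system, assembly, seq_region_name,
                     int(start), int(end), int(strand))


s = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
length = s.end - s.start + 1  # coordinates are treated as inclusive here
```

The same function accepts the INSDC form, where the seq_region name is `CM000668.1` rather than `6`.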
Alternate sequences
• Assembly_exception table defines ‘bubbles’
• Initially set up to handle Y chromosome PAR
• Adapted to work for MHC haplotypes
• Now also used for GRC patches
• Assumes ‘equivalent’ region will be present in primary
assembly
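The ‘bubble’ idea can be sketched as one record that pairs an alternate region with its equivalent region on the primary assembly. The class below is a simplified, hypothetical stand-in for a row of the assembly_exception table; its field names and the coordinates in the example are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyException:
    """Simplified, hypothetical stand-in for one assembly_exception row."""
    alt_name: str    # alternate sequence, e.g. a patch or haplotype
    alt_start: int
    alt_end: int
    ref_name: str    # equivalent region on the primary assembly
    ref_start: int
    ref_end: int


def source_of(exc: AssemblyException, position: int) -> str:
    """Say which sequence supplies a position on the alternate region.

    Inside the bubble the alternate sequence is used; outside it the
    position falls back to the primary assembly (an assumption of this
    sketch about how the bubble model behaves).
    """
    if exc.alt_start <= position <= exc.alt_end:
        return exc.alt_name
    return exc.ref_name


# Invented coordinates, not a real GRC patch
exc = AssemblyException("PATCH_A", 1000, 2000, "17", 1000, 1800)
inside = source_of(exc, 1500)
outside = source_of(exc, 50)
```

Because the record always names a primary-assembly partner, it carries the slide's assumption that an ‘equivalent’ region exists.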
Gene annotation on a ‘patched’ genome
[Browser screenshots: patch HG183_PATCH and the primary assembly of chromosome 17 around 62.2 Mb to 62.5 Mb (views of 331.04 kb and 276.06 kb), with tracks for genes (GENCODE), contigs, assembly exceptions and H.sap-H.sap lastz alignments]
• Patched chromosome compared with primary chromosome
• TEX2 gene lies across the patch boundary
• PECAM1 is annotated only on patch HG183
• Gap in primary assembly
Gene annotation on patches
[Diagram, built up over several slides: patch shown against primary]
1. Manual annotation
2. Project models to patch
3. Gap-fill with mini genebuild
Ongoing challenges
• How strict should we be when aligning proteins and cDNAs to
the genome?
1. Genome assembly
• Sequencing error (inversion, artificial duplication)
• Assembly incomplete
• Alignments must allow for truncated matches
2. Population variation
• Linear genome is made from ‘one’ individual, whereas protein
databases contain data from many unknown individuals
• Paralogues, gene families, pseudogenes
3. Public databases, e.g. UniProt
• Include suspect data and are incomplete for many species
• When there’s a match, or no match, is it biologically real?
• Aligning proteins from other species must allow for mismatches
Specificity
Sensitivity
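The trade-off between the two can be made concrete with a small calculation. The sketch below uses the convention common in gene prediction, where sensitivity is TP / (TP + FN) and specificity is TP / (TP + FP); the gene labels are made up, and the function name is an assumption for this example.

```python
def sensitivity_specificity(predicted, true):
    """Return (sensitivity, specificity) for a set of predicted features.

    sensitivity = TP / (TP + FN): share of real features that were found
    specificity = TP / (TP + FP): share of predictions that are real
    """
    predicted, true = set(predicted), set(true)
    tp = len(predicted & true)
    fn = len(true - predicted)
    fp = len(predicted - true)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity


true_genes = {"A", "B", "C", "D"}
strict = sensitivity_specificity({"A", "B"}, true_genes)
lenient = sensitivity_specificity({"A", "B", "C", "X", "Y"}, true_genes)
```

With these made-up sets, strict alignment finds fewer real genes but makes no false calls, while lenient alignment finds more real genes at the cost of false ones.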
Funding
European Commission
Framework Programme 7
Ensembl Acknowledgements
Questions?
Reporting data to users
Visualisation and Data querying:
• When browsing the primary assembly, how do we make it obvious to users
when alternate sequences are available?
• How do we show when the alternate genomic sequences are identical or differ
from one another?
• How do we show whether the alternate genome sequences result in identical or
different transcribed / translated products?
• How do we make a qualitative call about which allele is “better” to use? e.g. ABO
• Data download options
• Concept of a ‘canonical’ transcript per gene (per tissue)
Data analysis:
• Linking between alternate alleles (and paralogues?)
• How do we show when data have been mapped from an old to a new assembly,
compared to freshly aligned to a new assembly? When is it right to map instead of
align?
• In a non-linear genome model, how will SNPs (rsIDs) work?
• In a non-linear genome model, what coordinate system should be used?

Orientation, design and principles of polyhouse
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 

Ensembl annotation

  • 1. EBI is an Outstation of the European Molecular Biology Laboratory. Ensembl annotation Bronwen Aken 21 September 2014
  • 2. How Ensembl started • Ewan Birney • Michele Clamp • Tim Hubbard
  • 3. Ensembl’s goals Annotate (vertebrate) genome Integrate with other biological data Make publicly available • Stable, automatic annotation • High quality • Regular release cycles • Open source “Provide a bioinformatics framework to organise biology around the sequences of large genomes”
  • 4. Challenges 1. Find functional elements in a genome • Data have lots of noise 2. Software / hardware • Storing and manipulating data 3. Intuitive and comprehensive access to data • Visualization
  • 6. What is Genebuilding? • Automatic, evidence-based annotation of genes • Not ab initio • Based on sequence alignment • “Best-in-genome” • Aim for high specificity • Prefer to miss a few features than heavily over-predict Automated gene annotation pipeline is designed around decisions made during manual annotation
  • 7. Advantages of re-annotating • Add new genes to new / fixed genomic regions • Updated supporting evidence: Remove models built on data that has been deleted from archives • Move alignments to regions with better mapping
  • 8. Gene annotation pipeline – the basics Identify interesting regions • Rough alignment of sequences to genome Exhaustive alignment to produce transcript models Filter models • Prioritize data sources Produce ‘best guess’ gene set
  • 9. Repeatmasking Same-species proteins Other-species proteins cDNAs/ESTs UTR addition Final gene set Filtering Protein-coding genebuild Filtering TranscriptConsensus LayerAnnotation Also: Small ncRNAs LincRNAs Pseudogenes
  • 10. Repeatmasking Same-species proteins Other-species proteins cDNAs/ESTs UTR addition Final gene set Filtering Protein-coding genebuild Filtering RNA-Seq models Also: Small ncRNAs LincRNAs Pseudogenes MERGE WITH HAVANA
  • 11. Release cycle 26 September 2014 • Figure labels: Regulation, Gene, Allele, Conserved sequence (figure adapted from the ENCODE project, www.nature.com/nature/focus/encode/) Genes • Coding & noncoding • Protein & mRNA alignments • GTF & BAM files Compara • Conserved DNA sequence • Multiple genome alignments • Homologues • Protein families Regulatory regions • DNA methylation • TFBS • Open chromatin Variation • SNPs, indels, structural variation • Phenotypes • QTLs
  • 12. Integrate with other species • Chimpanzee • Human • Gene SLC12A1
  • 14. Genome assembly representation • Coord_system table • Lists the allowed coordinate systems • chromosome, scaffold, contig • With ‘versions’ • GRCh37, GRCh38 • Contigs are shared between assemblies so have no version • ‘Toplevel’ coordinate system • Chromosomes + unplaced scaffolds + unlocalized scaffolds + alternate sequences • Most popular means to access the whole genome • API options for including/excluding alternate sequences and PAR
  • 20. Seq_region names • Regions of the genome are given a slice name; it’s like an address • e.g. chromosome:GRCh37:6:133090509:133119701:1 • Users like to say, ‘chromosome 6’ • INSDC coordinates are versioned, but less human-readable • chromosome:GRCh37:CM000668.1:133090509:133119701:1 • Fields, in order: coord_system : assembly : seq_region name : start : end : strand
  • 21. Alternate sequences • Assembly_exception table defines ‘bubbles’ • Initially set up to handle Y chromosome PAR • Adapted to work for MHC haplotypes • Now also used for GRC patches • Assumes ‘equivalent’ region will be present in primary assembly
  • 22. Gene annotation on a ‘patched’ genome [Figure: Ensembl browser views of patch HG183_PATCH (patched chromosome) and the corresponding region of human chromosome 17 (primary chromosome), roughly 62.2 to 62.5 Mb, with gene, contig and assembly exception tracks] • TEX2 gene lies across the patch boundary • PECAM1 is annotated only on patch HG183 • Gap in primary assembly
  • 23. Gene annotation on a ‘patched’ genome
  • 24. Gene annotation on patches Patch Primary
  • 25. Gene annotation on patches Patch Primary 1. Manual annotation
  • 26. Gene annotation on patches Patch Primary Patch Primary 2. Project models to patch 1. Manual annotation
  • 27. Gene annotation on patches Patch Primary Patch Primary Patch Primary 1. Manual annotation 2. Project models to patch 3. Gap-fill with mini genebuild
  • 28. Ongoing challenges • How strict should we be when aligning proteins and cDNAs to the genome? 1. Genome assembly • Sequencing error (inversion, artificial duplication) • Assembly incomplete • Alignments must allow for truncated matches 2. Population variation • Linear genome is made from ‘one’ individual, whereas protein databases contain data from many unknown individuals • Paralogues, gene families, pseudogenes 3. Public databases, e.g. UniProt • Include suspect data and are incomplete for many species • When there’s a match, or no match, is it biologically real? • Aligning proteins from other species must allow for mismatches • Trade-off: specificity vs sensitivity
  • 31. Reporting data to users Visualisation and data querying: • When browsing the primary assembly, how do we make it obvious to users when alternate sequences are available? • How do we show when the alternate genomic sequences are identical or differ from one another? • How do we show whether the alternate genome sequences result in identical or different transcribed / translated products? • How do we make a qualitative call about which allele is “better” to use? e.g. ABO • Data download options • Concept of a ‘canonical’ transcript per gene (per tissue) Data analysis: • Linking between alternate alleles (and paralogues?) • How do we show when data have been mapped from an old to a new assembly, compared to freshly aligned to a new assembly? When is it right to map instead of align? • In a non-linear genome model, how will SNPs (rsIDs) work? • In a non-linear genome model, what coordinate system should be used?
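
The slice name format from slide 20 (coordinate system, assembly version, seq_region name, start, end and strand, joined by colons) can be illustrated with a small parser. This is a minimal sketch in Python, not the Ensembl API: `SliceName` and `parse_slice_name` are hypothetical names introduced here, and the validation rules (strand limited to 1 or -1, start no greater than end) are simplifying assumptions. Coordinates are treated as 1-based and inclusive.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceName:
    """Hypothetical container for the six fields of a slice name."""
    coord_system: str      # e.g. "chromosome", "scaffold", "contig"
    assembly: str          # e.g. "GRCh37"; may be empty for unversioned contigs
    seq_region_name: str   # e.g. "6" or an INSDC accession such as "CM000668.1"
    start: int
    end: int
    strand: int

    @property
    def length(self) -> int:
        # 1-based inclusive coordinates, so both end points count.
        return self.end - self.start + 1

    def __str__(self) -> str:
        fields = (self.coord_system, self.assembly, self.seq_region_name,
                  self.start, self.end, self.strand)
        return ":".join(str(f) for f in fields)


def parse_slice_name(name: str) -> SliceName:
    """Split a colon-separated slice name into its six fields."""
    parts = name.split(":")
    if len(parts) != 6:
        raise ValueError(f"expected 6 colon-separated fields, got {len(parts)}")
    coord_system, assembly, seq_region_name, start, end, strand = parts
    start_i, end_i, strand_i = int(start), int(end), int(strand)
    # Assumption for this sketch: only forward (1) or reverse (-1) strands.
    if strand_i not in (1, -1):
        raise ValueError(f"unexpected strand: {strand_i}")
    # Assumption for this sketch: no circular regions, so start <= end.
    if start_i > end_i:
        raise ValueError("start must not be greater than end")
    return SliceName(coord_system, assembly, seq_region_name,
                     start_i, end_i, strand_i)


if __name__ == "__main__":
    friendly = parse_slice_name("chromosome:GRCh37:6:133090509:133119701:1")
    insdc = parse_slice_name("chromosome:GRCh37:CM000668.1:133090509:133119701:1")
    print(friendly.seq_region_name, insdc.seq_region_name, friendly.length)
```

Both example names from the slide parse into the same coordinates and differ only in the seq_region name, which is the point the slide makes: the short name is easier to read, while the INSDC accession carries its own version.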