1. Understanding Genome
-Biological Database Overview
Part-1
DAY-2, SESSION-1
(25-10-2010)
Rajendra K. Labala
Biomedical Informatics Centre, NICED, ICMR, Kolkata
2. Major Challenges with Genomes
Scientific challenge of decoding a genome from its
nucleotides to a set of functional elements
Development of software which is capable of
storing, manipulating, and evaluating genomes
Challenge of providing comprehensive and
informative access to a large amount of data in a
user friendly way
3. The Genome Problem
The problem with the genome (particularly human)
is that it is “large, complicated, and opaque to
analysis”
Genome features to identify include:
Genes: protein coding, RNA, pseudogenes
Regulatory elements
SNPs, repeats, etc.
4. Solutions
Ensembl
NCBI
PATRIC
You will learn
Detailed overview
Sequence related information/data mining!
5. The Ensembl Project
Ensembl is a joint project between EMBL-EBI and the Wellcome Trust
Sanger Institute to develop a software system which produces and
maintains automatic annotation on selected
eukaryotic genomes
EMBL: European Molecular Biology Laboratory
EBI: European Bioinformatics Institute (part of EMBL)
WTSI: Wellcome Trust Sanger Institute
6. What is Ensembl
Ensembl is one of three main systems currently
available for annotating and displaying genomic
information
Ensembl
http://www.ensembl.org
UCSC Genome Browser
http://genome.ucsc.edu
NCBI Genome Browser
http://www.ncbi.nlm.nih.gov
Public annotation of mammalian and other genomes
Open source software
Relational database system
7. Genomes and Annotation
Ensembl does not assemble any genome
itself
Works with the sequencing centers that
generate the genome assembly
Ensembl provides high-quality annotation for
genomes that do not have existing annotation
Works with the existing annotation for genomes that already have
high-quality annotation
8. Ensembl Genome Annotation
Utilizes raw DNA sequence data from public sources
Creates a tracking database (the “Ensembl database”)
Joins the sequences, based on a sequence scaffold or “Golden Path”
Automatically finds genes and other features of the sequence
Associates sequence and features with data from other sources
Provides a publicly accessible web-based interface to the database
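The "joins the sequences" step can be pictured with a toy example. The sketch below is not Ensembl's pipeline; it only assumes that a golden path can be written as an ordered list of contig slices and gaps. The function names, the tuple layout, and the sequences are all hypothetical.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def assemble_along_path(contigs, path):
    """Stitch contig slices together in path order.

    Assumed (hypothetical) path layout:
      ("gap", length)                   -> run of N characters
      (contig_name, start, end, strand) -> 1-based, inclusive slice
    """
    pieces = []
    for step in path:
        if step[0] == "gap":
            pieces.append("N" * step[1])
            continue
        name, start, end, strand = step
        piece = contigs[name][start - 1:end]
        if strand == "-":
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


# Invented toy contigs, for illustration only.
contigs = {"c1": "ACGTACGT", "c2": "GGGTTT"}
path = [("c1", 1, 6, "+"), ("gap", 3), ("c2", 2, 5, "-")]
assembled = assemble_along_path(contigs, path)
print(assembled)
```

Real assemblies describe the same idea in dedicated files supplied by the sequencing centers; the list of tuples here only stands in for that.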
11. Ensembl Software System
Makes extensive use of BioPerl (www.bioperl.org)
Uses the free MySQL database
The entire Ensembl code base is freely available under an
Apache open source license.
Mainly written in Perl, extensions in C. Some
viewers have been written in Java (e.g. Apollo).
Software can be accessed by FTP
Possible to set up a mirror of the entire Ensembl
system.
12. Ensembl Databases
4 Main Databases
Ensembl Core Database
Ensembl EST Database
Ensembl Compara Database
Ensembl Variation Database
Ensembl uses MySQL to store information in relational
databases
Ensembl also provides APIs (Application Programming
Interfaces)
These serve as a connection between the databases and specific application
programs
Ensembl has a Perl API and a Java API
The Perl API is more “complete” than the Java API
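To illustrate what "a connection between the databases and specific application programs" means, here is a minimal sketch of an API layer over a relational table. It uses Python's built-in sqlite3 as a stand-in for MySQL, a deliberately simplified table that is not the real Ensembl core schema, and invented gene records. The class name ToyGeneAdaptor is hypothetical.

```python
import sqlite3


def make_toy_db():
    """Create an in-memory database with a simplified, hypothetical gene table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE gene ("
        "stable_id TEXT PRIMARY KEY, name TEXT, chromosome TEXT, "
        "seq_region_start INTEGER, seq_region_end INTEGER, strand INTEGER)"
    )
    # Invented records, for illustration only.
    conn.executemany(
        "INSERT INTO gene VALUES (?, ?, ?, ?, ?, ?)",
        [
            ("TOYG0001", "geneA", "2", 100, 500, 1),
            ("TOYG0002", "geneB", "2", 800, 1200, -1),
            ("TOYG0003", "geneC", "6", 100, 300, 1),
        ],
    )
    return conn


class ToyGeneAdaptor:
    """Hides the SQL so that application code only asks biological questions."""

    def __init__(self, conn):
        self.conn = conn

    def fetch_by_region(self, chromosome, start, end):
        """Return (stable_id, name) for genes overlapping chromosome:start-end."""
        cur = self.conn.execute(
            "SELECT stable_id, name FROM gene "
            "WHERE chromosome = ? AND seq_region_start <= ? AND seq_region_end >= ? "
            "ORDER BY seq_region_start",
            (chromosome, end, start),
        )
        return cur.fetchall()


adaptor = ToyGeneAdaptor(make_toy_db())
hits = adaptor.fetch_by_region("2", 400, 900)
print(hits)
```

The application never writes SQL itself, so the underlying schema can change without breaking programs that only call the adaptor.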
13. Ensembl Databases
Ensembl Core Databases
Species specific Ensembl core databases that store
genome sequence and annotation information
Gene, transcript, and protein models that are annotated by the
Ensembl automated genome analysis
The databases also store information about cDNA and
protein alignments, as well as external references
Example: NCBI accession numbers such as AB012211
14. Ensembl Databases
Ensembl Compara Database
A multi-species database that stores the results of genome-wide species
comparisons
The comparative genomic dataset allows for pairwise whole genome
alignments
The comparative proteomics dataset allows for orthologue predictions
and protein family clusters
Ensembl EST
Species-specific Ensembl EST databases hold an independent EST gene set
provided for all well-characterised species with a suitable amount of
biological evidence. The layout of Ensembl EST Databases is identical to the
Ensembl Core Database schema so that schema descriptions and API access
are equally applicable
Variation
The large amount of genetic variation information is organised in a set of
species-specific Ensembl Variation databases.
15. Data Mining with Ensembl
BioMart
Generic data management system built specifically for use in
Ensembl
Gives users the ability to conduct fast and powerful
searches
It simplifies the task of integrating external data sets (provided
by the user) with the Ensembl databases
Help & Documentation Link
http://asia.ensembl.org/info/index.html
16. Data mining through BioMart
Choose dataset
Choose data to be retrieved (attributes)
Narrow your dataset (filters)
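The three choices above (dataset, attributes, filters) map directly onto the XML query that BioMart accepts. The sketch below only builds such a query string and does not contact any server. The dataset, attribute, and filter names used in the example are assumptions and should be checked against the names listed in the BioMart interface.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, attributes, filters):
    """Build a BioMart-style XML query from a dataset, attributes and filters."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    # Choose dataset.
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    # Narrow the dataset (filters).
    for name, value in filters.items():
        ET.SubElement(ds, "Filter", {"name": name, "value": str(value)})
    # Choose the data to be retrieved (attributes).
    for name in attributes:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed names; verify them in the BioMart web interface before use.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    attributes=["ensembl_gene_id", "chromosome_name",
                "start_position", "end_position"],
    filters={"chromosome_name": "2"},
)
print(query_xml)
```

The web interface can also export the XML for a query built by clicking, which is a convenient way to confirm the exact names.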
20. Try Yourself
Retrieve all SNPs for ‘novel’ human G-protein coupled receptor genes (GPCRs –
IPR000276) on chromosome 2.
Retrieve the sequences of the exons of the human MEFV gene in FASTA format.
Retrieve the gene structure (i.e. start and end coordinates of exons) of the mouse
gene ENSMUSG00000042351.
Retrieve all human disease genes containing transmembrane domains located
between p11.2 and q22.
The file contains a list of probeset IDs from a microarray experiment using the
Affymetrix array HG-U133 Plus 2.0 (human). Retrieve the 500 bp upstream of the
transcripts matching these probeset IDs.
Retrieve the sequences 5 kb upstream of all human ‘known’ genes between D1S2806
and D1S464.
Retrieve all human SNPs that have an ID from The SNP Consortium (TSC), from
chromosome 6 between 15 Mb and 15.2 Mb, with 200 bases flanking sequence.
Retrieve the mouse homologues of Homo sapiens genes CASP1, CASP2, CASP3, and
CASP4.
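Several of the exercises ask for sequence "upstream" of a gene or transcript. What counts as upstream depends on the strand, which the following sketch makes explicit. It assumes 1-based inclusive coordinates and strand values of 1 and -1; the function name and the example coordinates are hypothetical.

```python
def upstream_region(start, end, strand, flank):
    """Return (region_start, region_end) for `flank` bases upstream of a feature.

    Coordinates are 1-based and inclusive. Returns None when the feature
    starts at position 1 on the forward strand, so nothing lies upstream.
    The reverse-strand case is not clipped to the chromosome length, which
    this sketch does not know.
    """
    if strand == 1:
        # Forward strand: upstream lies before the feature start.
        region_start = max(1, start - flank)
        region_end = start - 1
    elif strand == -1:
        # Reverse strand: upstream lies after the feature end.
        region_start = end + 1
        region_end = end + flank
    else:
        raise ValueError("strand must be 1 or -1")
    if region_end < region_start:
        return None
    return region_start, region_end


print(upstream_region(1000, 2000, 1, 500))
print(upstream_region(1000, 2000, -1, 500))
```

For a reverse-strand feature the retrieved bases must also be reverse-complemented before they read 5' to 3'; BioMart does this when it returns flanking sequence.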
21. NCBI
Genome projects
After DNA sequencing, the resulting contigs are
submitted to NCBI through WGS submissions
Whole Genome Shotgun Sequences
WGS List
Download (GenBank format WGS FASTA)
26. NCBI FTP
For downloading sequences/genomes in
the required format.
FAA (amino acid file in fasta format)
FNA (nucleic acid file in fasta format)
FFN (Coding Sequences in fasta format)
GBK (GenBank format)
PTT (CDS file in tab delimited format)
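The FAA, FNA and FFN files all share the FASTA layout: a header line starting with ">" followed by one or more sequence lines. A minimal reader, written as a sketch with a hypothetical function name and invented example records, might look like this:

```python
def parse_fasta(text):
    """Parse FASTA text into a dict mapping record ID to sequence.

    The record ID is taken as the first whitespace-separated word
    after the '>' character.
    """
    records = {}
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                records[header] = "".join(chunks)
            parts = line[1:].split()
            header = parts[0] if parts else ""
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Invented records, for illustration only.
example = """>protein_1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>protein_2 another made-up entry
MSDNE
"""
records = parse_fasta(example)
for name, seq in records.items():
    print(name, len(seq))
```

For real work an established library such as Biopython is usually a safer choice than a hand-written parser.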
28. Genome files in different formats
FAA (amino acid file in fasta format)
FNA (nucleic acid file in fasta format)
FFN (Coding Sequences in fasta format)
GBK (GenBank format)
PTT (CDS file in tab delimited format)
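Of these formats, PTT is the easiest to load into a script because it is tab-delimited. The sketch below assumes the usual layout of two preamble lines (a description and a protein count) followed by a column header row with fields such as Location, Strand, Gene and Product. That layout is an assumption to check against the file you download, and the rows in the example are invented.

```python
import csv
import io


def parse_ptt(text):
    """Parse PTT text into a list of dicts with integer start/end coordinates.

    Assumption: lines 1-2 are preamble and line 3 is the column header row.
    """
    lines = text.splitlines()
    reader = csv.DictReader(
        io.StringIO("\n".join(lines[2:])),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,
    )
    rows = []
    for row in reader:
        # Location is written as "start..end".
        start, end = row["Location"].split("..")
        rows.append({
            "start": int(start),
            "end": int(end),
            "strand": row["Strand"],
            "gene": row["Gene"],
            "product": row["Product"],
        })
    return rows


# Invented example file contents, for illustration only.
columns = ["Location", "Strand", "Length", "PID", "Gene",
           "Synonym", "Code", "COG", "Product"]
example = "\n".join([
    "Toy organism, complete genome - 1..5000",
    "2 proteins",
    "\t".join(columns),
    "\t".join(["190..255", "+", "21", "1001", "toyA", "t0001", "-", "-",
               "toy leader peptide"]),
    "\t".join(["337..2799", "-", "820", "1002", "toyB", "t0002", "-", "-",
               "toy kinase"]),
])
cds_rows = parse_ptt(example)
for cds in cds_rows:
    print(cds["gene"], cds["start"], cds["end"], cds["strand"])
```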
29. PATRIC
WGS annotations download
For details visit the website and the FAQ page
http://www.patricbrc.org/portal/portal/patric/Home