BITS: Overview of important biological databases beyond sequences

Basic bioinformatics concepts,
databases and tools
Module 4
Beyond the sequences

Dr. Joachim Jacob
http://www.bits.vib.be

Updated Nov 2011
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf

To understand life, we need not only
sequences, but many other concepts

Bioinformatics also means storing and analyzing
− Gene information: variations, isoforms, ...
− Expression data
− 3D protein structure data
− Interaction data
− Pathways and networks

“Storing all relevant biological data”

Schematic view II
[Figure: primary database records for GeneA, GeneB and GeneC, each combining sequence annotations, gene expression, pathway, structure, ...; analysis results and additional information sources; other sequence databases]

The indispensable databases

Gene Ontology – structured vocabulary

KEGG – biochemical pathways

PDB – Structure of proteins

IntAct – Interaction data

dbSNP – database of genomic variation

Expression sources – Microarray data

Gene Ontology structures the way we
communicate about life

Example: 'gene translation', 'protein production' and 'protein synthesis' are different names for the same process

http://www.arabidopsis.org/help/tutorials/go1.jsp
http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax

Gene Ontology structures life
http://www.geneontology.org/
Agreement on standardized keywords (often referred to as
'controlled vocabularies'), describing all natural processes in a
hierarchical way (ontology).
Keywords are assigned to genes based on different kinds of evidence
Keywords are ordered in a hierarchical tree-like structure ('directed
acyclic graphs')
Three GO 'trees' exist, describing:
"Biological Process"
"Cellular Component"
"Molecular Function"

A gene can be given
different GO terms

Example, cytochrome c:

molecular function: oxidoreductase activity,

biological process: oxidative phosphorylation and
induction of cell death,

cellular component: mitochondrial matrix and
mitochondrial inner membrane.

In each tree, the terms are organised in a directed acyclic
graph: a network with parent and child terms as nodes and the
relationships between them as lines. A term can have more than one parent.
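
The parent/child organisation described above can be sketched in a few lines of Python. This is only an illustration: the term names and parent links in `PARENTS` are a simplified toy graph, not the actual GO edges, and `ancestors` is a hypothetical helper name.

```python
# Toy parent links, simplified for illustration; not the actual GO edges.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "oxidative phosphorylation": ["cellular metabolic process"],
}


def ancestors(term, parents):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("oxidative phosphorylation", PARENTS)))
```

Because a term can have several parents, a gene annotated with a specific term is implicitly annotated with all of that term's ancestors.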

Different evidence codes can assign a
degree of confidence to the assignment
http://www.geneontology.org/GO.evidence.shtml

Evidence codes can be grouped by:

Experimental (e.g. IDA – inferred from direct assay)

Computational analysis

Author statement

Curator statement

Inferred from electronic annotation (IEA)
If available, each annotation also has a reference
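
As a rough sketch, the grouping above can be expressed as a lookup table. The code lists below are a partial selection of GO evidence codes chosen for illustration (only IDA and IEA appear on the slide), and `evidence_group` is a hypothetical helper name.

```python
# Partial selection of GO evidence codes per group (illustrative, not exhaustive).
EVIDENCE_GROUPS = {
    "Experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "Computational analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "Author statement": {"TAS", "NAS"},
    "Curator statement": {"IC", "ND"},
    "Inferred from electronic annotation": {"IEA"},
}


def evidence_group(code):
    """Return the group name for an evidence code, or None if it is not listed."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code.upper() in codes:
            return group
    return None


print(evidence_group("IDA"))
print(evidence_group("IEA"))
```

A table like this makes it easy to filter annotations, for example keeping only those with experimental support.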


Gene Ontology structures all genes
according to their biological significance
The GO structure and the terms can be explored with a browser
called AmiGO.
QuickGO from EBI offers some nice visualisations
There is an excellent GO wiki for all your questions

GO can be used to retrieve all gene
(products) related to one specific term
You can search broadly, e.g. an AmiGO search for 'diabetes'
leads to the following GO term
http://amigo.geneontology.org/


GO is also useful to analyze and compare
different gene lists
Many tools that work with GO are listed on the GO website.

http://www.geneontology.org/GO.tools.shtml
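
One common way to analyse a gene list with GO is an over-representation test: is a GO term annotated to more genes in the list than expected by chance? The slides do not name a statistical method, so the hypergeometric test below is an assumption, and the function name and the numbers in the usage line are hypothetical.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) when n genes are drawn from N genes, K of which carry the GO term.

    k: genes in the list annotated with the term
    n: size of the gene list
    K: genes in the background annotated with the term
    N: size of the background
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical numbers: 8 of 50 listed genes carry a term found on 200 of 20000 genes.
print(enrichment_p_value(k=8, n=50, K=200, N=20000))
```

In practice many terms are tested at once, so the resulting p-values would also need a multiple-testing correction.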

Some things to know about GO
For analyses, one can make use of reduced GO sets,
the so-called GO slims
– GO slims are cut-down subsets of the GO terms that give a
broad overview (available per species)
– GO ontologies can be downloaded in .obo
format.
Not all information is captured by GO; some needs to be
retrieved from other databases
– Metabolic pathways: KEGG, …
– Phenotypes/diseases
– Mapping files exist, e.g. kegg2go
http://www.geneontology.org/GO.slims.shtml
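
The .obo format mentioned above is plain text made of `[Term]` stanzas with `tag: value` lines. A minimal reading sketch, assuming only the `id`, `name` and `is_a` tags are of interest; `parse_obo_terms` is a hypothetical helper and the two-term sample is a hand-written excerpt, not a full ontology file.

```python
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0008152
name: metabolic process
namespace: biological_process
is_a: GO:0008150 ! biological_process
"""


def parse_obo_terms(text):
    """Collect name and is_a parents for each [Term] stanza, keyed by term id."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Only [Term] stanzas are kept; other stanza types are skipped.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if current is None or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        if key == "id":
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "is_a":
            # Drop the trailing "! human-readable name" comment.
            current["is_a"].append(value.split(" ! ")[0])
    return terms


terms = parse_obo_terms(SAMPLE_OBO)
print(terms["GO:0008152"])
```

For real analyses a maintained OBO parser is a safer choice; this sketch ignores obsolete terms, other relationship types and escaped characters.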

Biological pathways databases organise
genes by molecular reactions
Three important databases on biological pathways

• http://www.kegg.jp/
• http://www.reactome.org/ - EBI
• http://metacyc.org

Proteins with enzymatic function receive
an Enzyme Commission (EC) number
http://www.chem.qmul.ac.uk/iubmb/enzyme/
EC 1 Oxidoreductases
EC 2 Transferases
EC 3 Hydrolases
EC 4 Lyases
EC 5 Isomerases
EC 6 Ligases
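
The first digit of an EC number gives the top-level class listed above. A small sketch, assuming EC numbers are written as four dot-separated fields with an optional "EC" prefix; `ec_class` is a hypothetical helper and the table contains only the six classes from this slide.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number such as 'EC 1.1.1.1'."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    first_field = text.split(".")[0]
    return EC_CLASSES.get(int(first_field), "unknown class")


print(ec_class("EC 1.1.1.1"))
print(ec_class("3.4.21.4"))
```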

IntAct database contains interaction
information of proteins
http://www.ebi.ac.uk/intact
Three types of interactions stored

Protein-protein

Protein-DNA

Protein-small molecule

IntAct database represents all
interactions as binary: caution!
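
Why the caution? When an experiment detects a complex of several proteins, a binary representation has to be derived from it, and the result depends on the expansion model. The sketch below shows two commonly described models, spoke and matrix; which one applies to a given data set is an assumption to check in the database documentation, and the protein names are hypothetical.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair the bait with each co-purified protein (prey)."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Pair every member of the complex with every other member."""
    return list(combinations(members, 2))


bait, preys = "BAIT", ["P1", "P2", "P3"]  # hypothetical complex
print(spoke_expansion(bait, preys))
print(matrix_expansion([bait] + preys))
```

The same four-protein observation yields three pairs under the spoke model and six under the matrix model, and neither proves that any two of the proteins touch each other directly.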

Interaction networks can be analysed on
your computer using Cytoscape

Cytoscape training material on the BITS website

PDB hosts 3-dimensional
structural data on molecules


PDB = Protein DataBank
http://www.pdb.org/pdb/home/home.do
Only experimentally resolved structures: NMR and X-ray
(or other accurate techniques)

Proteins

DNA

RNA

Ligands

Understanding PDB data: tutorial

PDB files can be read by a lot of different
tools to display the structure
Every entry in PDB has its own PDB accession
number (four characters: one digit followed by three letters or digits)
The PDB file contains the 3D coordinates of every
single atom in the structure, together with the occupancy and
variability (B-factor) of that position (last two numeric columns)

http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-
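
To make the atom records concrete: ATOM lines in a PDB file are fixed-width, so fields are read by column position rather than by splitting on spaces. The sketch below assumes the standard ATOM column layout; the atom values are illustrative, and `format_atom` and `parse_atom` are hypothetical helpers. The record is built with explicit field widths so the example does not depend on whitespace surviving copy-paste.

```python
def format_atom(serial, name, res, chain, res_seq, x, y, z, occ, bfac, element):
    """Build a 78-character ATOM record using fixed field widths."""
    return (
        "{:<6}{:>5} {:<4} {:>3} {:1}{:>4}"
        "{:>12.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}{:>12}"
    ).format("ATOM", serial, name, res, chain, res_seq, x, y, z, occ, bfac, element)


def parse_atom(line):
    """Read the main fields of an ATOM record by column position."""
    return {
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Illustrative values for a backbone nitrogen atom.
record = format_atom(1, " N", "THR", "A", 1, 17.047, 14.099, 3.625, 1.00, 13.79, "N")
print(record)
print(parse_atom(record))
```

Visualisation tools use the coordinates to draw the structure; the B-factor is often mapped to colour to show which parts of the structure are less well defined.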

Tools to visualize structures (and some to analyze
them): see the BITS wiki

http://www.bits.vib.be/wiki/index.php/Protein_structure

Finding a structure for your protein
sequence means searching for similarity
Homology modeling
Similarity on sequence level projected to a structure
• BLAST your query against the PDB database with cblast, or at ExPASy
• PSI-BLAST can detect sequences with similar structures
(twilight zone!)
• If still no success: 3D-jury (a meta approach, including fold
recognition and local structure prediction)
Similarity on structural level: aligning structures
• VAST (structure)
• DALI (Distance mAtrix aLIgnment)

BITS training on protein structure analysis
http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf
Tools at EBI http://consurf.tau.ac.il/pe/protexpl/psbiores.htm

Structural information is used to classify
proteins
Database cross-references in a PDB entry


SCOP
Groups proteins based on evolutionary relationships, domain
architecture and structural information.

CATH
Manually curated classification of protein domains

http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.cathdb.info/

dbSNP is a public-domain archive for
simple genetic polymorphisms

Single Nucleotide Polymorphism database (NCBI)

Each dbSNP entry has a code rsxx (RefSNP) or ssxx
(submitted SNP)

Single-base nucleotide substitutions (also known as
single nucleotide polymorphisms or SNPs)

Small-scale multi-base deletions or insertions (also
called deletion insertion polymorphisms or DIPs)

Retroposable element insertions and microsatellite
repeat variations (also called short tandem repeats or
STRs)

Synchronized with new genome builds
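
The rs/ss prefix tells you which kind of dbSNP record an identifier points to. A minimal sketch, assuming identifiers are simply the prefix followed by digits; `classify_dbsnp_id` is a hypothetical helper and the identifiers in the usage lines are arbitrary examples.

```python
import re

_DBSNP_ID = re.compile(r"^(rs|ss)(\d+)$", re.IGNORECASE)
_KINDS = {"rs": "RefSNP", "ss": "submitted SNP"}


def classify_dbsnp_id(identifier):
    """Return (record kind, number) for an rs/ss identifier, or None if it does not match."""
    match = _DBSNP_ID.match(identifier.strip())
    if match is None:
        return None
    return _KINDS[match.group(1).lower()], int(match.group(2))


print(classify_dbsnp_id("rs12345"))
print(classify_dbsnp_id("ss678"))
```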

Expression data can be sequence-based
or hybridisation-based
Sequence-based (ESTs - RNA seq - SAGE)

Digital gene expression/northern
Microarray databases – hybridisation based:

GEO: gene expression omnibus (NCBI)
− Platform: GPLxxxxxxx
− Experiment: GSExxxxxx (= several samples)
− Sample: GSMxxxxxxxx
− Some experiments are curated: GDSxxxxx (online
analysis possible)

ArrayExpress (EBI)
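
The GEO accession prefixes listed above can be recognised programmatically, which helps when a script receives a mix of accession types. A small sketch, assuming accessions are a three-letter prefix followed by digits; `geo_record_type` is a hypothetical helper and the accession numbers in the usage lines are arbitrary.

```python
import re

# Prefixes and meanings as listed on the slide.
GEO_TYPES = {
    "GPL": "Platform",
    "GSE": "Experiment",
    "GSM": "Sample",
    "GDS": "Curated experiment",
}


def geo_record_type(accession):
    """Return the record type for a GEO accession, or None if the prefix is unknown."""
    match = re.fullmatch(r"(GPL|GSE|GSM|GDS)(\d+)", accession.strip().upper())
    return GEO_TYPES[match.group(1)] if match else None


print(geo_record_type("GSE1000"))
print(geo_record_type("GSM25000"))
```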

Example of expression data at GEO

Entrez interconnects the databases at
NCBI for easy querying

UniGene : sequences grouped by gene

PopSet : sequence alignments for population
studies and phylogeny

Structure : 3D structures (PDB)

Genome : genomic maps of chromosomes and
plasmids

UniSTS (Sequence Tagged Sites)

PubMed : literature abstracts (MEDLINE,…)

OMIM (Online Mendelian Inheritance in Man) :
literature reviews

MeSH (Medical Subject Headings) : keywords

Taxonomy

Summarizing most important links to
discover everything you need ...
Protein data
InterPro (heavily integrated with EBI resources)
http://www.interpro.org

Gene data
Entrez at NCBI : 'Entrez Gene'
http://www.ncbi.nlm.nih.gov/Entrez/
EB-eye Search at EBI : excellent for cross-species
http://www.ebi.ac.uk/ebisearch/

Hold back your horses!

Phew, where do I place this all?

Bioinformatics is all about different data,
as versatile as life itself
Due to the strong cross-references between
different databases, new databases and
relevant info are rapidly integrated in existing
databases.
You can discover them by taking time to read the
entries.

New tools are emerging every day to
enable you to browse all data sources...
BioGPS, all in one window!


Integrative resources are increasingly
being organised on a species basis

EMAGE database of in situ gene expression in mouse

OMIM Database of diseases in man

Websites providing an interface to integrate all
this data are increasingly important

Often organized on a species basis
− TAIR
− FlyBase
− WormBase

Organizing biological
information by species

By species, why?
There is one biological information resource which stays
more or less unchanged per species ...
