Module 4 Other relevant biological data sources beyond sequences
Part of training session "Basic Bioinformatics concepts, databases and tools" - http://www.bits.vib.be/training
BITS: Overview of important biological databases beyond sequences
1. Basic bioinformatics concepts,
databases and tools
Module 4
Beyond the sequences
Dr. Joachim Jacob
http://www.bits.vib.be
Updated Nov 2011
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod4-intro_H1_2011_otherRelevantData.pdf
3. To understand life, we need not only
sequences, but many other concepts
Bioinformatics is also storing and analyzing
− gene information: variations, isoforms,...
− Expression data
− 3D protein structure data
− Interaction data
− Pathways and networks
“Storing all relevant biological data”
5. The indispensable databases
Gene Ontology – structuring
KEGG – biochemical pathways
PDB – Structure of proteins
IntAct – Interaction data
dbSNP – database of genomic variation
Expression sources – Microarray data
6. Gene Ontology structures the way we
communicate about life
Gene translation / Protein production / Protein synthesis: different phrases for the same process
http://www.arabidopsis.org/help/tutorials/go1.jsp
http://www.geneontology.org/teaching_resources/tutorials/2005-09_BiB-journal-tutorial_jlomax
7. Gene Ontology structures life
http://www.geneontology.org/
Agreement on standardized keywords (often referred to as
'controlled vocabularies') that describe all natural processes in a
hierarchical way (ontology).
Keywords are assigned to genes based on different kinds of evidence.
Keywords are ordered in a hierarchical, tree-like structure (a 'directed
acyclic graph').
Three GO 'trees' exist, describing:
"Biological Process"
"Cellular Component"
"Molecular Function"
8. A gene can be given
different GO terms
Example, cytochrome c:
molecular function: oxidoreductase activity,
biological process: oxidative phosphorylation and
induction of cell death,
cellular component: mitochondrial matrix and
mitochondrial inner membrane.
In each tree, the terms are organised in a directed acyclic
graph: a network consisting of parents and child-terms (as
nodes) and lines between them as relationships.
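The parent/child structure described above can be sketched as a small graph. This is a simplified illustration, assuming a hand-written toy fragment: the term names and links below are chosen for the example and are not an excerpt of the real ontology. It shows why the structure is a directed acyclic graph rather than a strict tree: a term may have more than one parent.

```python
# Toy fragment of a GO-like graph: each term maps to its direct parents.
# The terms and links are illustrative assumptions, not real GO data.
PARENTS = {
    "biological_process": [],
    "metabolic process": ["biological_process"],
    "cellular process": ["biological_process"],
    # Two parents: this is what makes the graph a DAG and not a tree.
    "cellular metabolic process": ["metabolic process", "cellular process"],
    "translation": ["cellular metabolic process"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    for name in sorted(ancestors("translation")):
        print(name)
```

A gene annotated with a child term is implicitly annotated with all of that term's ancestors, which is why tools walk the graph upwards in this way.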
10. Different evidence codes can assign a
degree of confidence to the assignment
http://www.geneontology.org/GO.evidence.shtml
Evidence codes can be grouped by:
Experimental (e.g. IDA – inferred from direct assay)
Computational analysis
Author statement
Curator statement
Inferred from electronic annotation (IEA)
If available, each annotation also has a reference
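The grouping of evidence codes listed above can be used to filter annotations by confidence. The sketch below is a minimal example, assuming a hand-picked subset of GO evidence codes per group; the annotation tuples (gene names and code assignments) are invented for illustration only.

```python
# A subset of GO evidence codes, grouped as on the slide.
# The grouping table is partial; extend it from the GO evidence code
# documentation before using it on real annotation files.
EVIDENCE_GROUPS = {
    "Experimental": {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP"},
    "Computational analysis": {"ISS", "ISO", "ISA", "ISM", "IGC", "RCA"},
    "Author statement": {"TAS", "NAS"},
    "Curator statement": {"IC", "ND"},
    "Electronic annotation": {"IEA"},
}


def evidence_group(code):
    """Return the group name for an evidence code, or None if unknown."""
    for group, codes in EVIDENCE_GROUPS.items():
        if code in codes:
            return group
    return None


def experimental_only(annotations):
    """Keep only annotations backed by experimental evidence."""
    return [a for a in annotations if evidence_group(a[2]) == "Experimental"]


if __name__ == "__main__":
    # Hypothetical annotations: (gene, GO term name, evidence code).
    annotations = [
        ("geneA", "oxidoreductase activity", "IDA"),
        ("geneA", "mitochondrial matrix", "IEA"),
        ("geneB", "oxidative phosphorylation", "TAS"),
    ]
    for gene, term, code in experimental_only(annotations):
        print(gene, term, code)
```

Filtering out IEA annotations is a common first step when only manually reviewed evidence is wanted.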
12. Gene Ontology structures all genes
according to their biological significance
The GO structure and its terms can be explored with a browser
called AmiGO.
QuickGO from EBI offers some nice visualisations
The GO wiki is an excellent place for all your questions
13. GO can be used to retrieve all gene
(products) related to one specific term
You can search broadly, e.g. an AmiGO search for 'diabetes'
leads to the following GO term
http://amigo.geneontology.org/
16. GO is also useful to analyze and compare
different gene lists
Many tools that work with GO are listed on the GO website.
http://www.geneontology.org/GO.tools.shtml
17. Some things to know about GO
For analyses, one can make use of reduced GO sets,
the so-called GO slims
– GO slims are cut-down subsets of broad, high-level
GO terms (available per species)
– GO ontologies can be downloaded in .obo
format.
Not all information is captured by GO; some needs to be
retrieved from other databases
Metabolic pathways: KEGG, …
Phenotype/diseases
• Mapping files exist, e.g. kegg2go
http://www.geneontology.org/GO.slims.shtml
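To give an idea of what a downloaded .obo file looks like, the sketch below reads [Term] stanzas from a small inline sample. It is a minimal reader written for this example, assuming only the `id`, `name` and `is_a` tags matter; it is not a complete OBO parser, and the function name `parse_obo_terms` is made up here.

```python
# Tiny OBO-style sample: a header line followed by stanzas.
SAMPLE_OBO = """\
format-version: 1.2

[Term]
id: GO:0008150
name: biological_process
namespace: biological_process

[Term]
id: GO:0008152
name: metabolic process
namespace: biological_process
is_a: GO:0008150 ! biological_process

[Typedef]
id: part_of
name: part of
"""


def parse_obo_terms(text):
    """Return {term_id: {"name": str, "is_a": [parent ids]}} for [Term] stanzas."""
    terms = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # Start a new record for [Term]; skip any other stanza type.
            current = {"is_a": []} if line == "[Term]" else None
            continue
        if not line or current is None or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        value = value.split(" ! ", 1)[0]  # drop the trailing "! comment"
        if key == "id":
            terms[value] = current
        elif key == "name":
            current["name"] = value
        elif key == "is_a":
            current["is_a"].append(value)
    return terms


if __name__ == "__main__":
    for term_id, info in parse_obo_terms(SAMPLE_OBO).items():
        print(term_id, info["name"], info["is_a"])
```

For real analyses a maintained OBO library is a safer choice than a hand-rolled reader like this one.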
18. Biological pathways databases organise
genes by molecular reactions
3 important databases on biological pathways
http://www.kegg.jp/
http://www.reactome.org/ - EBI
http://metacyc.org
19. Proteins with enzymatic function receive
an Enzyme Commission (EC) number
http://www.chem.qmul.ac.uk/iubmb/enzyme/
EC 6 Ligases
EC 5 Isomerases
EC 4 Lyases
EC 3 Hydrolases
EC 2 Transferases
EC 1 Oxidoreductases
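The first digit of an EC number identifies one of the top-level classes listed above. A minimal sketch of that lookup follows, assuming EC numbers are written as dot-separated fields with an optional "EC" prefix; the helper name `ec_class` is made up for this example, and the table mirrors only the six classes on the slide.

```python
# Top-level EC classes as listed on the slide.
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def ec_class(ec_number):
    """Return the top-level class name for an EC number, or None if unknown."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    first_field = text.split(".")[0]
    if not first_field.isdigit():
        return None
    return EC_CLASSES.get(int(first_field))


if __name__ == "__main__":
    print(ec_class("EC 1.1.1.1"))  # alcohol dehydrogenase
    print(ec_class("3.4.21.4"))    # trypsin
```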
20. IntAct database contains interaction
information of proteins
http://www.ebi.ac.uk/intact
Three types of interactions stored
Protein-protein
Protein-DNA
Protein-small molecule
24. PDB hosts 3-dimensional
structural data on molecules
PDB = Protein Data Bank
http://www.pdb.org/pdb/home/home.do
Only structures resolved through NMR and X-ray
(or other accurate techniques)
Proteins
DNA
RNA
Ligands
Understanding PDB data: tutorial
25. PDB files can be read by a lot of different
tools to display the structure
Every entry in PDB has its own PDB accession
number (four characters: one digit followed by three letters or digits)
The PDB file contains the 3D coordinates of every
single atom in the structure, together with the occupancy and
the variability of that position (B-factor) in the last two numeric columns
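In the classic fixed-column PDB text format, each ATOM line carries the coordinates followed by two more numeric columns, the occupancy and the temperature factor (B-factor). The sketch below slices one such line by column position. The sample record is synthetic: its values are made up for illustration and it is assembled field by field so the column widths are visible.

```python
def parse_atom_record(line):
    """Extract the main fields of a PDB ATOM line using fixed column positions."""
    return {
        "serial": int(line[6:11]),
        "atom": line[12:16].strip(),
        "residue": line[17:20].strip(),
        "chain": line[21],
        "residue_number": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "b_factor": float(line[60:66]),
    }


# Synthetic ATOM record (made-up values), built from its fixed-width fields.
SAMPLE = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location
    "MET"         # columns 18-20: residue name
    " "
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue number
    "    "
    "  27.340"    # columns 31-38: x
    "  24.430"    # columns 39-46: y
    "   2.614"    # columns 47-54: z
    "  1.00"      # columns 55-60: occupancy
    "  9.67"      # columns 61-66: B-factor
)

if __name__ == "__main__":
    record = parse_atom_record(SAMPLE)
    print(record["atom"], record["x"], record["y"], record["z"], record["b_factor"])
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch without a separating space.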
http://www.bits.vib.be/index.php?option=com_content&view=article&id=17203817:protein-structure-
26. PDB files can be read by a lot of different
tools to display the structure
Tools to visualize (and some to analyze
structures) (see BITS wiki)
http://www.bits.vib.be/wiki/index.php/Protein_structure
27. Finding a structure for your protein
sequence means searching for similarity
Homology modeling
Similarity on sequence level projected to a structure
BLAST your query against the PDB database with cblast, or at ExPASy
PSI-BLAST - can detect sequences with similar structures
(twilight zone!)
If still no success: 3D-jury (a meta approach, including fold
recognition and local structure prediction)
Similarity on structural level: aligning structures
VAST (structure)
DALI (Distance mAtrix aLIgnment)
BITS training on protein structure analysis
http://www.ii.uib.no/~slars/bioinfocourse/PDFs/structpred_tutorial.pdf
Tools at EBI http://consurf.tau.ac.il/pe/protexpl/psbiores.htm
28. Structural information is used to classify
proteins
Database cross-references in PDB entry
SCOP
Groups proteins based on evolutionary, domain
architecture and structural information.
CATH
Manually curated classification of protein domains
http://scop.mrc-lmb.cam.ac.uk/scop/
http://www.cathdb.info/
29. dbSNP is a public-domain archive for
simple genetic polymorphisms
Single Nucleotide Polymorphism database (NCBI)
Each dbSNP entry has a code rsxx (RefSNP) or ssxx
(submitted SNP)
Variation types stored:
− single-base nucleotide substitutions (also known as
single nucleotide polymorphisms or SNPs)
− small-scale multi-base deletions or insertions (also
called deletion insertion polymorphisms or DIPs)
− retroposable element insertions and microsatellite
repeat variations (also called short tandem repeats or
STRs)
Synchronized with new genome builds
30. Expression data can be sequence-based
or hybridisation-based
Sequence-based (ESTs - RNA seq - SAGE)
Digital gene expression/northern
Microarray databases – hybridisation based:
GEO: gene expression omnibus (NCBI)
− Platform: GPLxxxxxxx
− Experiment (called a 'Series' in GEO): GSExxxxxx (= several samples)
− Sample: GSMxxxxxxxx
− Some experiments are curated: GDSxxxxx (online
analysis possible)
ArrayExpress (EBI)
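The GEO accession prefixes listed above can be recognised with a short helper, which is handy when sorting a mixed list of identifiers. This is a minimal sketch, assuming accessions consist of one of the four prefixes followed by digits only; the function name `geo_record_type` and the accession numbers in the example are made up.

```python
import re

# Prefix-to-record-type table, following the list on the slide.
GEO_TYPES = {
    "GPL": "platform",
    "GSE": "experiment (series)",
    "GSM": "sample",
    "GDS": "curated dataset",
}

_ACCESSION = re.compile(r"^(GPL|GSE|GSM|GDS)(\d+)$")


def geo_record_type(accession):
    """Return the GEO record type for an accession, or None if not recognised."""
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        return None
    return GEO_TYPES[match.group(1)]


if __name__ == "__main__":
    # Hypothetical accession numbers, used only to exercise the helper.
    for acc in ["GSE12345", "GSM678", "GPL96", "rs1234"]:
        print(acc, "->", geo_record_type(acc))
```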
38. Summarizing most important links to
discover everything you need ...
Protein data
Interpro (heavily integrated with EBI resources)
http://www.interpro.org
Gene data
Entrez at NCBI : 'Entrez Gene'
http://www.ncbi.nlm.nih.gov/Entrez/
EB-eye Search at EBI: excellent for cross-species searches
http://www.ebi.ac.uk/ebisearch/
40. Bioinformatics is all about different data,
as versatile as life itself
Due to the strong cross-references between
different databases, new databases and
relevant info are rapidly integrated in existing
databases.
You can discover them by taking time to read the
entries.
41. New tools are emerging every day to
enable you to browse all data sources...
BioGPS, all in one window!
43. Integrative resources are increasingly
being organised on a species basis
EMAGE: database of in situ gene expression in mouse
OMIM: database of human diseases
Websites that provide an interface to integrate all
this data are increasingly important
Often organized on a species basis
− TAIR
− FlyBase
− WormBase
44. Organizing biological data
by species
By species, why?
There is one biological information resource which stays
more or less unchanged per species ...
Editor's notes
'translation', whereas another uses the phrase 'protein synthesis',
GO hierarchy can be downloaded (obo format) GO Slim: selection of categories
Different types: Ribbon Cartoon Ball and stick Space filling