Data is raw, unorganized facts that need to be processed.
Example:- Each student's test score is one piece of data.
When data is processed, organized, structured or presented in a given context
so as to make it useful, it is called information.
Example:- The average score of a class or of the entire school is information that
can be derived from the given data.
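The step from data to information can be sketched in a few lines of Python. The student names, the scores and the helper `class_average` are made up for illustration and do not come from the slides.

```python
from statistics import mean

# Data: raw, individual test scores (hypothetical values).
scores = {"Asha": 72, "Ben": 85, "Chen": 91, "Dia": 64}


def class_average(scores_by_student):
    """Process the raw scores into one summary figure (information)."""
    return mean(scores_by_student.values())


average = class_average(scores)
print(f"Class average: {average}")
```

Each dictionary value on its own is data; the single figure computed from all of them is the information.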
A database is a collection of data in an organized
manner, which is accessible in various ways.
Biological Databases serve a critical purpose in the
collection and organization of data related to biological
They provide computational support and a user-friendly
interface to a researcher for a meaningful analysis of
5. A database is a computerized archive used to store and
organize data in such a way that information can be
retrieved easily via a variety of search criteria.
Databases are composed of computer hardware and software
for data management.
The chief objective of the development of a database is to
organize data in a set of structured records to enable easy
retrieval of information.
Each record, also called an entry, should contain a number
of fields that hold the actual data items, for example, fields
for names, phone numbers, addresses, dates.
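A minimal sketch of records, fields and retrieval by search criteria, written in Python. The `Record` class, the `search` helper and all sample values are hypothetical; a real database system would add indexing and persistent storage on top of this idea.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One database entry; each attribute is a field."""
    name: str
    phone: str
    address: str
    date: str  # ISO format, e.g. "2001-05-17"


# Hypothetical entries used only to illustrate the structure.
records = [
    Record("A. Kumar", "555-0101", "12 Hill Road", "2001-05-17"),
    Record("B. Osei", "555-0102", "4 Lake View", "2003-11-02"),
    Record("C. Lindqvist", "555-0103", "12 Hill Road", "2003-11-02"),
]


def search(entries, **criteria):
    """Return the entries whose fields match every given criterion."""
    return [
        entry for entry in entries
        if all(getattr(entry, field) == value for field, value in criteria.items())
    ]


same_address = search(records, address="12 Hill Road")
same_address_and_date = search(records, address="12 Hill Road", date="2003-11-02")
```

Because every record has the same fields, any field or combination of fields can serve as a search criterion.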
8. Different classifications of databases
Type of data
protein sequence patterns or motifs
macromolecular 3D structure
gene expression data
10. Different classifications of databases
Primary or derived databases
Primary databases: experimental results directly
Secondary databases: results of analysis of
Aggregate of many databases
Links to other data items
Combination of data
Consolidation of data
11. Different classifications of databases
Publicly available, no restrictions
Available, but with copyright
Accessible, but not downloadable
Academic, but not freely available
Proprietary, commercial; possibly free for
13. PRIMARY DATABASES
Contains bio-molecular data in its original form.
Experimental results are submitted directly into the database by
researchers, and the data are essentially archival in nature.
Once given a database accession number, the data in primary
databases are never changed.
Examples :- GenBank, EMBL and DDBJ for DNA/RNA sequences,
SWISS-PROT and PIR for protein sequences and PDB for molecular
• Database from NCBI, includes sequences from
publicly available resources.
NCBI and Entrez
One of the largest and most comprehensive
databases belonging to the NIH – National Institutes
of Health (USA)
Entrez is the search engine of NCBI
Search for :
genes, proteins, genomes, structures, diseases,
publications and more.
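Entrez searches can also be issued programmatically through the NCBI E-utilities web interface. The sketch below only builds an `esearch` request URL and makes no network call. The helper name `build_esearch_url` and the search term are illustrative assumptions, and the NCBI documentation should be consulted for current parameters and usage limits.

```python
from urllib.parse import urlencode

# Base address of the E-utilities "esearch" service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, term, max_results=20):
    """Build (but do not send) an Entrez search request URL."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical query: protein records mentioning carbonic anhydrase.
url = build_esearch_url("protein", "carbonic anhydrase")
print(url)
```

Changing the `db` value (for example to `nucleotide` or `pubmed`) directs the same query at a different Entrez database.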
An annotated collection of all publicly
available nucleotide and proteins
Set up in 1979 at the LANL (Los Alamos).
Maintained since 1992 by NCBI (Bethesda).
European Molecular Biology Laboratory
Nucleic acid database from EBI
(European Bioinformatics Institute)
Produced in collaboration with DDBJ and GenBank
Search engine – SRS (Sequence Retrieval System)
DNA Databank of Japan
Started in 1986 in collaboration with GenBank
Produced and maintained at NIG
(National Institute of Genetics)
• Protein Information Resource
• A division of the National Biomedical Research Foundation (NBRF) in the U.S.
• One can search for entries or do a sequence similarity search at the PIR site.
Translated European Molecular Biology Laboratory
Computer-annotated supplement to SWISS-PROT.
Contains all the translations of EMBL nucleotide
sequence entries not yet integrated in SWISS-PROT.
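What "translation of a nucleotide sequence entry" means can be sketched in Python. This is a simplified illustration, not the actual TrEMBL pipeline: it assumes the standard genetic code, reads the sequence from its first base in a single frame, and stops at the first stop codon. The function name `translate` and the input sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence into a protein sequence.

    Unknown codons (e.g. containing N) become 'X'; translation stops
    at the first stop codon.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate("ATGGCCTAA")  # codons: ATG, GCC, TAA (stop)
```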
25. Protein DataBank (PDB)
Important in solving real problems in molecular
Established in 1971 at Brookhaven National Laboratory
Sole international repository of macromolecular
Moved to Research Collaboratory
for Structural Bioinformatics
26. PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
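Each line of a PDB entry begins with a record name (HEADER, COMPND, REMARK and so on), which makes the header easy to group by record type. The Python sketch below works on a few lines copied from the excerpt above, with the trailing `12CA n` serial columns removed for brevity. Real PDB files use fixed column positions; because the whitespace in this excerpt is collapsed, the sketch simply splits on the first space. The helper name `group_records` is hypothetical.

```python
from collections import defaultdict

# A few records from the entry shown above (serial columns removed).
excerpt = """\
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II)
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN
AUTHOR S.K.NAIR,D.W.CHRISTIANSON
REMARK 3 R VALUE 0.170
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS
"""


def group_records(text):
    """Group the lines of a PDB header by their leading record name."""
    records = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        record_type, _, rest = line.partition(" ")
        records[record_type].append(rest.strip())
    return dict(records)


records = group_records(excerpt)
authors = records["AUTHOR"][0].split(",")
```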
27. COMPOSITE DATABASES
Collection of various primary database sequences
Renders sequence searching highly efficient as it searches
Examples :- NRDB (Non Redundant Database), OWL,
MIPSX, SWISS-PROT + TrEMBL
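The idea of a non-redundant composite can be sketched in Python: entries from several source databases are merged, and identical sequences are stored only once while the source identifiers are kept. The function `merge_non_redundant`, the identifiers and the sequences are invented for illustration; real composite databases apply their own, more elaborate redundancy criteria.

```python
def merge_non_redundant(*sources):
    """Merge {entry_id: sequence} mappings, storing each sequence once.

    Returns {sequence: [entry ids that carry this sequence]}.
    """
    merged = {}
    for source in sources:
        for entry_id, sequence in source.items():
            merged.setdefault(sequence.upper(), []).append(entry_id)
    return merged


# Two hypothetical primary sources sharing one identical sequence.
db_a = {"A001": "MKTAYIAK", "A002": "MSHHWGYG"}
db_b = {"B107": "MSHHWGYG", "B108": "MNPLLILT"}

composite = merge_non_redundant(db_a, db_b)
```

A search over `composite` compares each distinct sequence only once, yet still reports every source entry that contains it.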
29. SECONDARY DATABASES
Contains data derived from the results of analysing
Manually created or automatically generated
Contains more relevant and useful information
structured to specific requirements
Example :- PROSITE, PRINTS, BLOCKS, Pfam
Families of proteins
Can search using regular expressions
Similar to unix commands
Families exhibit these patterns
So we can search over families
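A pattern written in PROSITE-style syntax maps closely onto a regular expression. The Python sketch below converts the common elements (`x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends) and then scans a set of sequences. The converter `prosite_to_regex` is a simplified assumption that ignores rarer syntax, and the test sequences are invented.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex = pattern.rstrip(".")                          # optional trailing period
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {..} = excluded residues
    regex = regex.replace("(", "{").replace(")", "}")    # (n) or (n,m) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
    return regex


# Illustrative pattern: two cysteines and two histidines at set spacings.
family_pattern = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
compiled = re.compile(prosite_to_regex(family_pattern))

# Hypothetical sequences; only the first contains the motif.
sequences = {
    "seq1": "GG" + "CAACAAALAAAAAAAAHAAAH" + "GG",
    "seq2": "MKTAYIAKQR",
}
hits = [name for name, seq in sequences.items() if compiled.search(seq)]
```

Running the same compiled pattern over every sequence in a collection is what makes it possible to search over whole families.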