2. Data
Data is raw, unorganized facts that need to be processed.
Example: Each student's test score is one piece of data.
Vartika's Presentation
INFORMATION
When data is processed,
organized, structured or presented in
a given context so as to make it
useful, it is called information.
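As a small illustration of the distinction, the sketch below processes raw test scores (data) into a class summary (information). The student names, the scores and the pass mark of 70 are made-up values used only for this example.

```python
from statistics import mean

# Raw data: each student's test score is one piece of data (hypothetical values).
scores = {"Asha": 72, "Ravi": 85, "Meena": 91, "John": 64}

def summarize(scores):
    """Process raw scores into information: average, top scorer, pass count."""
    return {
        "average": mean(scores.values()),
        "top_student": max(scores, key=scores.get),
        # Assumed pass mark of 70, chosen only for illustration.
        "passed": sum(1 for s in scores.values() if s >= 70),
    }

info = summarize(scores)
print(info)
```

The individual scores say little on their own; the summary puts them in a context (class performance) that makes them useful.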
3. What is a database?
• Databases are convenient systems for properly storing, searching and retrieving any type of data.
• A database helps to easily handle and share large amounts of data and supports large-scale analysis through easy access and data updating.
4. What is a Biological Database?
• Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology and computational analysis.
• They contain information from genomics, proteomics and microarray gene expression.
• Information contained in biological databases includes function, gene structure, localization (both cellular and chromosomal), biological sequences and structures.
5. The major purposes of these databases are:
• Availability of biological data.
• Systematization of data.
• Analysis of computed biological data.
6. History:
1956: the first protein sequence became available when insulin (51 amino acids) was sequenced.
1965: the Atlas of Protein Sequence and Structure by Margaret Dayhoff et al. was published as a printed book.
It became the basis for the Protein Information Resource (PIR).
The first nucleotide sequence was that of a yeast tRNA (77 bases).
During this time the 3D structures of proteins were being studied, and the renowned Protein Data Bank (PDB) was created.
The first genome of a free-living organism to be published was that of the bacterium Haemophilus influenzae, in 1995.
7. Features of Biological Databases:
1) Data heterogeneity
2) High volume data
3) Uncertainty
4) Data Curation
5) Large scale data integration
6) Data sharing
7) Dynamic and subject to change
8. Classification scheme for biological databases:
Data type
Maintenance status
Data access
Data source
Database design
Organism
13. Based on data access
1) Publicly available
2) Available with copyright
3) Browsing only, accessible but not
downloadable
4) Academic but not freely available
5) Proprietary commercial
6) Restricted
17. A Sequence Retrieving and Manipulation Network
[Figure: network diagram linking the following components]
Databases: DNA (NCBI-GenBank, DDBJ, EBI-EMBL); Protein (PIR, SWISS-PROT, ExPASy, PDB)
Software: GCG, SeqWeb, Vector NTI, GenoMAX
Retrieval systems: Entrez, SRS
Formats: GenBank, GCG, FASTA, Staden
Sequence converter
Information: sequence, PDB, image
19. Primary databases
These are the primary sources of data, used to store nucleic acid and protein sequences and structural information on biological macromolecules.
Some primary databases:
• NCBI (The National Center for Biotechnology Information)
• GenBank
• DDBJ (DNA Data Bank of Japan)
• SWISS-PROT (Swiss-Prot)
• PIR (Protein Information Resource)
• PDB (Protein Data Bank)
The sequence collections in these databases are the result of basic research efforts from academic, industrial and sequencing labs.
20. GenBank/EMBL/DDBJ
International Nucleotide Sequence Database
DDBJ: DNA Data Bank of Japan
CIB: Center for Information Biology and DNA Data Bank of Japan
NIG: National Institute of Genetics
IAM: International Advisory Meeting
ICM: International Collaborative Meeting
EMBL: European Molecular Biology Laboratory
EBI: European Bioinformatics Institute
NCBI: National Center for Biotechnology Information
NLM: National Library of Medicine
21. Secondary Database
• A secondary database contains additional information derived from the analysis of data available in primary sources.
• Secondary databases are analysed in a variety of ways and contain different information in different formats.
• Some secondary databases:
• TrEMBL
• Pfam
• PROSITE
• Profiles
• SCOP
• CATH
22. Flat File Storage Data Formats
• When GenBank, EMBL and DDBJ formed a collaboration (1986), sequence
databases had moved to a defined flat file format with a shared feature
table format and annotation standards.
• The flat file formats from the sequence databases are still used to access
and display sequence and annotation. They are also convenient for storage
of local copies.
27. The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information
28. NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
• PubMed: free Medline (3 million searches per day)
• PubMed Central: full-text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest-volume sequence search service (100–200 K searches per day)
• VAST: structure similarity searches
• Software and databases
29. GenBank (Genetic Sequence Databank)
• GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).
• It was established in 1982 and is now maintained by NCBI.
• DNA sequences can be submitted to GenBank using several different methods.
• It contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects.
30. • It has a flat file structure, that is, an ASCII text file readable and downloadable by both humans and computers.
• There are two main ways of making batch sequence submissions to GenBank: NCBI's Barcode Submission Tool (BarSTool) and Sequin.
33. EMBL
• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 22 member states, four prospect and two associate member states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and outstations in Hinxton (the European Bioinformatics Institute (EBI), in England), Grenoble (France), Hamburg (Germany), and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and molecular medicine, as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI).
34. • It incorporates and distributes nucleotide sequences from public sources.
• The database is a part of an international collaboration with DDBJ (Japan) and GenBank (USA).
• Data are exchanged between the collaborating databases on a daily basis.
• The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data.
35. • Automatic submission procedures are used for submission of data from large-scale genome sequencing.
• The latest data collection can be accessed via FTP, email and WWW interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database and other databases.
• All available resources can be accessed via the EBI home page at
43. EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEA
FT                   VSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNL
FT                   KAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPV
FT                   LGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
     cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta       360
     ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca       420
     atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg       480
     gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt       540
45. ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
(blanks) - Sequence data.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
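Because every line starts with a two-letter code, an EMBL flat file is straightforward to read with a short script. The following is a minimal sketch, not an official parser: it assumes the content starts at column 6, uses a shortened excerpt of the X64011 entry from the EMBL format slide, and the function names `parse_embl` and `sq_counts` are hypothetical helpers introduced for illustration.

```python
import re
from collections import defaultdict

# Shortened excerpt of the EMBL entry X64011 (only the first sequence line is kept).
RECORD = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
//
"""

def parse_embl(text):
    """Group line contents by their two-letter line code.

    Sequence lines (which start with blanks) are collected under 'seq';
    'XX' spacer lines and the '//' terminator are skipped.
    """
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code in ("XX", "//"):
            continue
        if code == "  ":
            # Drop the block spacing and the trailing position counter.
            fields["seq"].append(re.sub(r"[\s\d]", "", content))
        else:
            fields[code].append(content)
    return dict(fields)

def sq_counts(sq_line):
    """Read the total length and the per-base counts from an SQ header line."""
    length = int(re.search(r"(\d+) BP", sq_line).group(1))
    counts = {base: int(n) for n, base in re.findall(r"(\d+) ([ACGT]|other);", sq_line)}
    return length, counts

entry = parse_embl(RECORD)
length, counts = sq_counts(entry["SQ"][0])
print(entry["AC"], entry["OS"], length, counts)
```

Line types that occur several times in one entry (here OC) end up as lists with several items, and the base counts in the SQ header can be checked against the stated length.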
46. PubMed
• PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
• The PubMed system was offered free to the public in 1997.
• The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
• PMID is the unique identifier number used in PubMed.
47. • PMIDs are assigned to each article record when it enters the PubMed system.
• The PMID is always found at the end of a PubMed citation.
• PubMed Central (PMC) is a free digital system that archives publicly accessible full-text scholarly articles that have been published in the biomedical and life sciences journal literature.
• A "PubMed Mobile" option, providing access to a mobile
55. Entrez
• WWW-based data retrieval system.
• Developed by NCBI (National Center for Biotechnology Information).
• Integrates information held in different DBs.
56. Databases covered by Entrez are
• Nucleic acid – GenBank, RefSeq, PDB
• Protein seqs – SWISS-PROT, PIR
• 3D structures – MMDB
• Genomes – Many sources
• PopSet – From GenBank
• OMIM – OMIM
• Taxonomy – NCBI taxonomy database
• Books – Bookshelf
• ProbeSet – GEO (Gene Expression Omnibus)
• Literature – PubMed
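Entrez can also be queried from a script. The sketch below is an assumption-laden illustration, not part of the slides: it assumes NCBI's E-utilities ESearch service as the entry point, the helper name `esearch_url` and the example search term are hypothetical, and the code only builds request URLs without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumed base URL of NCBI's E-utilities ESearch service (not taken from the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for one Entrez database (hypothetical helper)."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"

# The same search term can be sent to different Entrez databases.
for db in ("nucleotide", "protein", "pubmed"):
    print(esearch_url(db, "Listeria ivanovii superoxide dismutase"))
```

Keeping the database name as a parameter mirrors the idea of Entrez as one retrieval system over many databases.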
65. SRS
• SRS is the Sequence Retrieval System
  – Data retrieval tool developed by EBI
  – Integrates 80 molecular biology DBs
  – Open-source software (can be installed locally)
• SRS has an associated scripting language called Icarus
• Central resource for molecular biology data
  – More than 250 databanks have been indexed; more than 35 SRS servers are available worldwide over the WWW
66. • Information retrieval
• Easy way to retrieve information from sequence and sequence-related
databases
• Possibility to search for multiple words/other criteria
• Linkage between different databases
• E.g. Find all primary structures with known three-dimensional
• Different types of database in SRS
• Sequence & structure
• DNA, protein, three-dimensional structures
• Sequence-related
• Gene-related
• Genome, mapping, mutations, transcription factors
• SNP
• Bibliographic
67. • SRS main toolbar tabs:
• Top Page: displays databases in different database groups
• Query: displays either the standard or extended query form
• Results or “the query manager”: maintains a history of all the
results obtained during a session
• Projects or “the project manager”: maintains a history of all
queries and views used during a session
• Views: allows a user to define a user specific view for one or
more databases
• Databanks: contains a list and some facts about the databases
available in the system
68. • Search terms in SRS
• SRS indexed fields can be searched using any of the following:
• Single word search
• Multiple word phrases
• Numbers and dates
• Regular expressions
• Wildcards
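These kinds of search term can be illustrated outside SRS. The sketch below is a minimal Python example under stated assumptions: a small in-memory list of made-up description strings stands in for an indexed field, the `search` function is a hypothetical helper, and matching uses Python's `fnmatch` and `re` modules rather than SRS query syntax, which differs.

```python
import re
from fnmatch import fnmatch

# Hypothetical indexed field values (descriptions), standing in for an SRS index.
descriptions = [
    "superoxide dismutase",
    "superoxide reductase",
    "alcohol dehydrogenase",
    "DNA polymerase III",
]

def search(pattern, values, mode="word"):
    """Return the values matching a single word, a wildcard or a regular expression."""
    if mode == "word":
        return [v for v in values if pattern.lower() in v.lower().split()]
    if mode == "wildcard":
        return [v for v in values if fnmatch(v.lower(), pattern.lower())]
    if mode == "regex":
        return [v for v in values if re.search(pattern, v, re.IGNORECASE)]
    raise ValueError(f"unknown mode: {mode}")

print(search("dismutase", descriptions))                                # single word
print(search("super*", descriptions, mode="wildcard"))                  # wildcard
print(search(r"(reduct|dehydrogen)ase$", descriptions, mode="regex"))   # regular expression
```

A wildcard covers simple prefix or suffix matching, while a regular expression can express alternatives and anchoring in one pattern.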
72. LocusLink
• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology Information (NCBI) online resource.
• It is principally intended for use by graduate students and professional researchers in the biomedical sciences.
• It is designed to bring together related information on genetic loci and gene products from several sources.
• LocusLink provides a central point of access for basic biomedical information and molecular data for genes, transcripts, and proteins from model organisms, including human, rat, mouse, fruit fly, and zebrafish.
• It is no longer available at NCBI.