Data base in detail

Data
Data are raw, unorganized facts that need to be processed.
Example: each student's test score is one piece of data.
Vartika's Presentation
Information
When data is processed, organized, structured or presented in a given context so as to make it useful, it is called information.
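The distinction can be illustrated with a minimal sketch. The student names, scores and pass mark below are made-up values, not taken from the slides; the point is only that the raw scores are data, while the summary computed from them is information.

```python
from statistics import mean

# Hypothetical raw data: individual test scores (made-up values).
scores = {"Asha": 72, "Ravi": 85, "Meera": 91}

def summarize(scores, pass_mark=40):
    """Process raw scores into information: average, top scorer, pass count."""
    return {
        "average": round(mean(scores.values()), 1),
        "top_scorer": max(scores, key=scores.get),
        "passed": sum(1 for s in scores.values() if s >= pass_mark),
    }

info = summarize(scores)
print(info)
```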

What is a database?
• A database is a convenient system for properly storing, searching and retrieving any type of data.
• A database makes it easy to handle and share large amounts of data, and supports large-scale analysis through easy access and data updating.
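As a rough illustration of storing, updating, searching and retrieving, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table layout and the DEMO0001 record are hypothetical, chosen only for this example.

```python
import sqlite3

# In-memory database; the table and column names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequences (accession TEXT PRIMARY KEY, organism TEXT, length INTEGER)"
)

# Store records.
conn.executemany(
    "INSERT INTO sequences VALUES (?, ?, ?)",
    [
        ("X64011", "Listeria ivanovii", 756),
        ("DEMO0001", "Hypothetical organism", 1200),  # made-up record
    ],
)

# Update a record.
conn.execute("UPDATE sequences SET length = ? WHERE accession = ?", (1250, "DEMO0001"))

# Search and retrieve.
rows = conn.execute(
    "SELECT accession, length FROM sequences WHERE organism LIKE ?", ("Listeria%",)
).fetchall()
print(rows)
```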

What is a biological database?
• Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology and computational analysis.
• They contain information from research areas such as genomics, proteomics and microarray gene expression.
• Information contained in biological databases includes function, gene structure, localization (both cellular and chromosomal), biological sequences and structures.

The major purposes of these databases are:
• Availability of biological data.
• Systematization of data.
• Analysis of computed biological data.

History:
• 1956: the first sequence collection dates from the sequencing of insulin (51 amino acids).
• 1965: the Atlas of Protein Sequence and Structure, by Margaret Dayhoff et al., was published as a printed book.
• It became the basis for the Protein Information Resource (PIR).
• First nucleotide sequence: yeast tRNA (77 bases).
• During this period the 3D structures of proteins were being studied, and the renowned Protein Data Bank (PDB) was established.
• 1995: the first published genome of a free-living organism was that of the bacterium Haemophilus influenzae.

Features of Biological Data Bases:
1) Data heterogeneity
2) High volume data
3) Uncertainty
4) Data Curation
5) Large scale data integration
6) Data sharing
7) Dynamic and subject to change

Classification scheme for biological databases:
Data type
Maintenance status
Data access
Data source
Database design
Organism

Data types:
Based on data sources

Content Based:
Genome database
Sequence database
Structure database
Microarray database
Chemical database
Pathway database
Enzyme database
Disease database
Literature database

Based on maintenance status
NCBI, EMBL, SIB

Based on data access
1) Publicly available
2) Available with copyright
3) Browsing only: accessible but not downloadable
4) Academic but not freely available
5) Proprietary commercial
6) Restricted

Biological sequence databases

Database architecture
[Figure: layers of an information system, with a library analogy]
• Query system: Google, Entrez, SRS (takes your search keywords)
• Storage system: Oracle, MySQL, PC binary files, Unix text files, bookshelves
• Data: GenBank flat file, PDB file, interaction record, title of a book, book

A Sequence Retrieving and Manipulation Network
[Figure: network linking databases, retrieval systems, software and formats]
• Databases (DNA): NCBI-GenBank, DDBJ, EBI-EMBL
• Databases (protein): PIR, SWISS-PROT, ExPASy, PDB
• Retrieval systems: Entrez, SRS
• Software: GCG, SeqWeb, Vector NTI, GenoMAX
• Formats: GenBank, GCG, FASTA, Staden
• Sequence converter
• Information: sequence, PDB, image

Types of biological databases
• Primary databases
• Secondary databases

Primary databases
These are the primary sources of data, used to store nucleic acid sequences, protein sequences and structural information of biological macromolecules.
Some primary databases:
• NCBI (National Center for Biotechnology Information)
• GenBank
• DDBJ (DNA Data Bank of Japan)
• SWISS-PROT (Swiss-Prot)
• PIR (Protein Information Resource)
• PDB (Protein Data Bank)
The sequence collections in these databases result from the efforts of basic research in academic, industrial and sequencing laboratories.

GenBank/EMBL/DDBJ: the International Nucleotide Sequence Database
• DDBJ: DNA Data Bank of Japan
• CIB: Center for Information Biology and DNA Data Bank of Japan
• NIG: National Institute of Genetics
• IAM: International Advisory Meeting
• ICM: International Collaborative Meeting
• EMBL: European Molecular Biology Laboratory
• EBI: European Bioinformatics Institute
• NCBI: National Center for Biotechnology Information
• NLM: National Library of Medicine

Secondary databases
• A secondary database contains additional information derived from the analysis of data available in primary sources.
• Primary data are analysed in a variety of ways, so secondary databases contain different information in different formats.
• Some secondary databases:
• TrEMBL
• Pfam
• PROSITE
• Profiles
• SCOP
• CATH
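To make "information derived from primary sources" concrete, the sketch below translates a PROSITE-style sequence pattern into a Python regular expression and scans a protein sequence with it. The helper name prosite_to_regex and the short test sequence are hypothetical, and the translation handles only the basic pattern syntax (x, [ ], { }, repeat counts and terminus anchors); it is an illustration, not PROSITE's own scanning software.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.rstrip(".")
    regex = regex.replace("-", "")                       # element separators
    regex = regex.replace("x", ".")                      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
    regex = regex.replace("(", "{").replace(")", "}")    # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")    # N- and C-terminus anchors
    return regex

def scan(pattern, sequence):
    """Return 1-based start positions of non-overlapping matches."""
    return [m.start() + 1 for m in re.finditer(prosite_to_regex(pattern), sequence)]

# N-glycosylation site pattern, applied to a made-up sequence.
hits = scan("N-{P}-[ST]-{P}", "MNGTA")
print(prosite_to_regex("N-{P}-[ST]-{P}"), hits)
```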

Flat File Storage Data Formats
• When GenBank, EMBL and DDBJ formed a collaboration (1986), sequence
databases had moved to a defined flat file format with a shared feature
table format and annotation standards.
• The flat file formats from the sequence databases are still used to access
and display sequence and annotation. They are also convenient for storage
of local copies.

The National Center for Biotechnology Information (Bethesda, MD)
Created in 1988 as a part of the National Library of Medicine at NIH to:
– Establish public databases
– Conduct research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

NCBI Databases and Services
• GenBank: primary sequence database
• Free public access to biomedical literature
– PubMed: free Medline (3 million searches per day)
– PubMed Central: full-text online access
• Entrez: integrated molecular and literature databases
• BLAST: highest-volume sequence search service (100–200 K searches per day)
• VAST: structure similarity searches
• Software and databases

GenBank (Genetic Sequence Databank)
• GenBank® is the genetic sequence database at the National Center for Biotechnology Information (NCBI).
• It was established in 1982 and is now maintained by NCBI.
• DNA sequences can be submitted to GenBank using several different methods.
• It contains publicly available nucleotide sequences for more than 240 000 named organisms, obtained primarily through submissions from individual laboratories and batch submissions from large-scale sequencing projects.

• GenBank has a flat file structure: each entry is an ASCII text file, downloadable and readable by both humans and computers.
• There are two main ways of making batch sequence submissions to GenBank: NCBI's Barcode Submission Tool (BarSTool) and Sequin.

EMBL
• The European Molecular Biology Laboratory (EMBL) is a molecular biology research institution supported by 22 member states, four prospect member states and two associate member states.
• EMBL was created in 1974 and is an intergovernmental organisation funded by public research money from its member states.
• The Laboratory operates from five sites: the main laboratory in Heidelberg, and outstations in Hinxton (the European Bioinformatics Institute (EBI), in England), Grenoble (France), Hamburg (Germany) and Monterotondo (near Rome).
• EMBL groups and laboratories perform basic research in molecular biology and molecular medicine, as well as training for scientists, students and visitors.
• Israel is the only Asian state that has full membership.
• The EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/) is maintained at the European Bioinformatics Institute (EBI).

• The EMBL Nucleotide Sequence Database incorporates and distributes nucleotide sequences from public sources.
• The database is part of an international collaboration with DDBJ (Japan) and GenBank (USA).
• Data are exchanged between the collaborating databases on a daily basis.
• The web-based tool, Webin, is the preferred system for individual submission of nucleotide sequences, including Third Party Annotation (TPA) and alignment data.

• Automatic submission procedures are used for submission of data from large-scale genome sequencing.
• The latest data collection can be accessed via FTP, email and WWW interfaces.
• The EBI's Sequence Retrieval System (SRS) integrates and links the main nucleotide and protein databases as well as many other specialist molecular biology databases.
• For sequence similarity searching, a variety of tools (e.g. FASTA and BLAST) are available that allow external users to compare their own sequences against the data in the EMBL Nucleotide Sequence Database and other databases.
• All available resources can be accessed via the EBI home page.

EMBL format
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
SV   X64011.1
XX
DT   28-APR-1992 (Rel. 31, Created)
DT   30-JUN-1993 (Rel. 36, Last updated, Version 6)
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
KW   sod gene; superoxide dismutase.
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
RN   [1]
RX   MEDLINE; 92140371.
RA   Haas A., Goebel W.;
RT   "Cloning of a superoxide dismutase gene from Listeria ivanovii by
RT   functional complementation in Escherichia coli and characterization of the
RT   gene product.";
RL   Mol. Gen. Genet. 231:313-322(1992).
XX
RN   [2]
RP   1-756
RA   Kreft J.;
RT   ;
RL   Submitted (21-APR-1992) to the EMBL/GenBank/DDBJ databases.
RL   J. Kreft, Institut f. Mikrobiologie, Universitaet Wuerzburg, Biozentrum Am
RL   Hubland, 8700 Wuerzburg, FRG
XX
DR   SWISS-PROT; P28763; SODM_LISIV.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..756
FT                   /db_xref="taxon:1638"
FT                   /organism="Listeria ivanovii"
FT                   /strain="ATCC 19119"
FT   RBS             95..100
FT                   /gene="sod"
FT   terminator      723..746
FT                   /gene="sod"
FT   CDS             109..717
FT                   /db_xref="SWISS-PROT:P28763"
FT                   /transl_table=11
FT                   /gene="sod"
FT                   /EC_number="1.15.1.1"
FT                   /product="superoxide dismutase"
FT                   /protein_id="CAA45406.1"
FT                   /translation="MTYELPKLPYTYDALEPNFDKETMEIHYTKHHNIYVTKLNEA
FT                   VSGHAELASKPGEELVANLDSVPEEIRGAVRNHGGGHANHTLFWSSLSPNGGGAPTGNL
FT                   KAAIESEFGTFDEFKEKFNAAAAARFGSGWAWLVVNNGKLEIVSTANQDSPLSEGKTPV
FT                   LGLDVWEHAYYLKFQNRRPEYIDTFWNVINWDERNKRFDAAK"
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
     cgttatttaa ggtgttacat agttctatgg aaatagggtc tatacctttc gccttacaat        60
     gtaatttctt ttcacataaa taataaacaa tccgaggagg aatttttaat gacttacgaa       120
     ttaccaaaat taccttatac ttatgatgct ttggagccga attttgataa agaaacaatg       180
     gaaattcact atacaaagca ccacaatatt tatgtaacaa aactaaatga agcagtctca       240
     ggacacgcag aacttgcaag taaacctggg gaagaattag ttgctaatct agatagcgtt       300
     cctgaagaaa ttcgtggcgc agtacgtaac cacggtggtg gacatgctaa ccatacttta       360
     ttctggtcta gtcttagccc aaatggtggt ggtgctccaa ctggtaactt aaaagcagca       420
     atcgaaagcg aattcggcac atttgatgaa ttcaaagaaa aattcaatgc ggcagctgcg       480
     gctcgttttg gttcaggatg ggcatggcta gtagtgaaca atggtaaact agaaattgtt       540

ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
   - (blanks) sequence data.
// - Termination line.
Some entries do not contain all of the line types, and some line types occur many times in a single entry. Each entry must begin with an identification line (ID) and end with a terminator line (//).
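Because every line starts with a two-letter code, a flat file entry is easy to read programmatically. The sketch below is a minimal reader, not an official EMBL parser: it groups line contents by line code for a shortened excerpt of the LISOD entry, then checks that the base counts on the SQ line add up to the stated length. The function name parse_embl and the assumption that content starts at column 6 are choices made for this example.

```python
from collections import defaultdict

# Shortened excerpt of the LISOD entry (selected header lines, sequence omitted).
ENTRY = """\
ID   LISOD      standard; DNA; PRO; 756 BP.
XX
AC   X64011; S78972;
XX
DE   L.ivanovii sod gene for superoxide dismutase
XX
OS   Listeria ivanovii
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Listeria.
XX
SQ   Sequence 756 BP; 247 A; 136 C; 151 G; 222 T; 0 other;
//
"""

def parse_embl(text):
    """Group line contents by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "//":   # termination line ends the entry
            break
        if code == "XX":   # spacer line, carries no data
            continue
        fields[code].append(content)
    return dict(fields)

entry = parse_embl(ENTRY)
accessions = [a.strip() for a in " ".join(entry["AC"]).split(";") if a.strip()]
classification = " ".join(entry["OC"]).rstrip(".")

# Check that the base counts in the SQ header add up to the stated length.
parts = entry["SQ"][0].rstrip(";").split(";")
length = int(parts[0].split()[1])
counts = {p.split()[1]: int(p.split()[0]) for p in parts[1:]}
print(accessions, length, sum(counts.values()) == length)
```

Line types that occur several times (here OC) are collected in order, so multi-line fields can simply be joined.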

PubMed
• PubMed is a free search engine accessing primarily the MEDLINE database of references and abstracts on life sciences and biomedical topics.
• The PubMed system was offered free to the public in 1997.
• The United States National Library of Medicine (NLM) at the National Institutes of Health maintains the database as part of the Entrez system of information retrieval.
• PMID is the unique identifier number used in PubMed.

• A PMID is assigned to each article record when it enters the PubMed system.
• The PMID is always found at the end of a PubMed citation.
• PubMed Central (PMC) is a free digital system that archives publicly accessible full-text scholarly articles that have been published within the biomedical and life sciences journal literature.
• A "PubMed Mobile" option provides access to a mobile-friendly version of PubMed.

Entrez
• WWW-based data retrieval system.
• Developed by NCBI (National Center for Biotechnology Information).
• Integrates information held in different databases.

Databases covered by Entrez:
• Nucleic acid sequences: GenBank, RefSeq, PDB
• Protein sequences: SWISS-PROT, PIR
• 3D structures: MMDB
• Genomes: many sources
• PopSet: from GenBank
• OMIM: OMIM
• Taxonomy: NCBI taxonomy database
• Books: Bookshelf
• ProbeSet: GEO (Gene Expression Omnibus)
• Literature: PubMed
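Entrez databases can also be queried programmatically through NCBI's E-utilities web interface, which the slides do not cover. The sketch below only builds the request URLs and does not contact NCBI; the helper names esearch_url and efetch_url are hypothetical, and the database names pubmed and nuccore are given as examples.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term, retmax=20):
    """Build a search URL: returns matching record IDs for a query term."""
    return f"{EUTILS}/esearch.fcgi?" + urlencode(
        {"db": db, "term": term, "retmax": retmax}
    )

def efetch_url(db, ids, rettype, retmode="text"):
    """Build a fetch URL: returns the records for the given IDs."""
    return f"{EUTILS}/efetch.fcgi?" + urlencode(
        {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    )

search = esearch_url("pubmed", "superoxide dismutase Listeria", retmax=5)
fetch = efetch_url("nuccore", ["X64011"], rettype="fasta")
print(search)
print(fetch)
```

The same two-step pattern (search for IDs, then fetch records) applies to any database in the list; only the db value changes.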

SRS
• SRS is a Sequence Retrieval System
– Data retrieval tool developed by EBI
– Integrates 80 molecular biology databases
– Open source software (can be installed locally)
• SRS has an associated scripting language called Icarus
• Central resource for molecular biology data
– More than 250 databanks have been indexed, on more than 35 SRS servers across the WWW (worldwide)

• Information retrieval
– Easy way to retrieve information from sequence and sequence-related databases
– Possibility to search for multiple words or other criteria
– Linkage between different databases
– E.g. find all primary structures with known three-dimensional structures
• Different types of database in SRS
– Sequence & structure: DNA, protein, three-dimensional structures
– Sequence-related
– Gene-related: genome, mapping, mutations, transcription factors
– SNP
– Bibliographic

• SRS main toolbar tabs:
• Top Page: displays databases in different database groups
• Query: displays either the standard or extended query form
• Results or “the query manager”: maintains a history of all the
results obtained during a session
• Projects or “the project manager”: maintains a history of all
queries and views used during a session
• Views: allows a user to define a user specific view for one or
more databases
• Databanks: contains a list and some facts about the databases
available in the system

• Search terms in SRS
• SRS indexed fields can be searched using any of the following:
• Single word search
• Multiple word phrases
• Numbers and dates
• Regular expressions
• Wildcards
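The kinds of search term listed above can be illustrated with plain Python. This is not SRS query syntax: the records, the field names and the search helper are made up for the example, which only shows how each kind of term behaves on an indexed text or numeric field.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical indexed records; only the first is based on the LISOD entry.
RECORDS = [
    {"id": "X64011", "description": "L.ivanovii sod gene for superoxide dismutase", "length": 756},
    {"id": "DEMO0001", "description": "hypothetical kinase gene", "length": 1200},
    {"id": "DEMO0002", "description": "hypothetical dismutase-like protein", "length": 420},
]

def words(record):
    """Split a description into lowercase words."""
    return re.findall(r"[\w.]+", record["description"].lower())

def search(records, predicate):
    """Return the IDs of all records for which the predicate holds."""
    return [r["id"] for r in records if predicate(r)]

single = search(RECORDS, lambda r: "kinase" in words(r))
phrase = search(RECORDS, lambda r: "superoxide dismutase" in r["description"].lower())
numeric = search(RECORDS, lambda r: 500 <= r["length"] <= 1000)
regex = search(RECORDS, lambda r: re.search(r"\bsod\b|kinase", r["description"]) is not None)
wildcard = search(RECORDS, lambda r: any(fnmatchcase(w, "dismut*") for w in words(r)))
print(single, phrase, numeric, regex, wildcard)
```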

LocusLink
• LocusLink (http://www.ncbi.nlm.nih.gov/LocusLink) is a National Center for Biotechnology Information (NCBI) online resource.
• It is principally intended for use by graduate students and professional researchers in the biomedical sciences.
• It is designed to bring together related information on genetic loci and gene products from several sources.
• LocusLink provides a central point of access for basic biomedical information and molecular data for genes, transcripts and proteins from model organisms, including human, rat, mouse, fruit fly and zebrafish.
• LocusLink is no longer available at NCBI.
