1. NCBI
National Center for Biotechnology
Information
Site: www.ncbi.nlm.nih.gov
By Richa Sharma
M.Sc. Biomedical Sciences
Dr. BR Ambedkar Center for Biomedical
Research (ACBR)
2. INTRODUCTION
NCBI was established in 1988 as a part of the
National Library of Medicine at the National Institutes of
Health, Bethesda, Maryland, USA.
4. DIFFERENCES BETWEEN
DATABASE AND TOOL
DATABASE
It is a collection of data that is structured, searchable,
periodically updated and cross-referenced.
Different databases are:
Genome Database
Sequence Database
Protein Database
Literature Database
Disease Database
TOOL
A program that is used to extract or retrieve the desired
information from the database.
Different types of tools are:
Database retrieval tool, i.e. Entrez
BLAST
ORF Finder
e-PCR
Spidey
7. DATABASE RETRIEVAL TOOL-
ENTREZ
Entrez is an integrated database search and retrieval
system that extracts information from DNA and protein
sequence data, population sets, whole genomes,
macromolecular structures, and the biomedical literature
via PubMed.
Entrez provides extensive links within and between
database records.
http://www.ncbi.nlm.nih.gov/gquery/
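Entrez searches can also be issued programmatically. The sketch below builds a request URL for ESearch, one of NCBI's E-utilities, which are not covered on this slide and are brought in here as an assumption. The helper name `esearch_url` and the example search term are hypothetical, and the code only constructs the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed programmatic route into Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that searches one Entrez database for `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


# Illustrative query: up to five PubMed records matching the term.
print(esearch_url("pubmed", "BRCA1 AND human", retmax=5))
```

Changing `db` (for example to `protein` or `nucleotide`) points the same search at a different Entrez database.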
14. BLAST-BASIC LOCAL ALIGNMENT
SEARCH TOOL
The BLAST programs perform sequence-similarity searches
against a variety of sequence databases, returning a set of
gapped alignments with links to full database records, to
UniGene, Gene, the MMDB, or GEO.
The BLAST tools available at NCBI are classified into
different categories.
Two important ones are:
Standard BLAST
MegaBLAST
15. STANDARD BLAST
Standard BLAST includes:
blastn: Compares a nucleotide query sequence
against a nucleotide sequence database.
blastp: Compares an amino acid query sequence against a
protein sequence database.
blastx: Compares a nucleotide query sequence
translated in all reading frames against a protein
sequence database.
tblastn: Compares a protein query sequence against a
nucleotide sequence database translated in all reading
frames.
tblastx: Compares the six-frame translations of a
nucleotide query sequence against the six-frame
translations of a nucleotide sequence database.
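The five standard programs differ only in the type of query, the type of database, and which side is translated. A minimal sketch of that lookup follows; the table `BLAST_PROGRAMS` and the helper `programs_for` are hypothetical names used for illustration and are not part of BLAST itself.

```python
# Each entry records the query type, the database type, and which side
# (if any) is translated in all six reading frames before comparison.
BLAST_PROGRAMS = {
    "blastn": {"query": "nucleotide", "database": "nucleotide", "translated": ()},
    "blastp": {"query": "protein", "database": "protein", "translated": ()},
    "blastx": {"query": "nucleotide", "database": "protein", "translated": ("query",)},
    "tblastn": {"query": "protein", "database": "nucleotide", "translated": ("database",)},
    "tblastx": {"query": "nucleotide", "database": "nucleotide",
                "translated": ("query", "database")},
}


def programs_for(query_type, database_type):
    """List the standard BLAST programs that accept this query/database pairing."""
    return [
        name
        for name, spec in BLAST_PROGRAMS.items()
        if spec["query"] == query_type and spec["database"] == database_type
    ]


print(programs_for("nucleotide", "protein"))     # ['blastx']
print(programs_for("nucleotide", "nucleotide"))  # ['blastn', 'tblastx']
```

A nucleotide query against a nucleotide database has two options: blastn compares the sequences directly, while tblastx compares their translations.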
17. MegaBLAST
MegaBLAST is a program optimized for aligning long,
highly similar sequences.
It works only with DNA sequences, so the only
program it supports is “blastn”.
It is faster than standard blastn but less sensitive.
18. SEQUENCE SUBMISSION TO NCBI
The databases are constantly updated through new
submissions of sequences, which are made using the
following sequence submission tools:
1. BankIt
2. Sequin
19. BankIt
BankIt is a web-based GenBank sequence submission tool.
It is the tool of choice for simple submissions, especially
when only one or a small number of records are to be
submitted. Submitters can also use it to update
their existing GenBank records. Sequence analysis tools are
not required for submission through this process.
20. SEQUIN
Sequin is a stand-alone software tool developed by NCBI
for submitting and updating entries in the
sequence databases. It handles multiple
sequence submissions and provides increased capacity for
complex submissions containing long sequences, multiple
annotations, segmented sets of DNA, or phylogenetic and
population studies.
It also provides graphical viewing and editing options.
26. SPECIALISED TOOLS
Some of the specialized tools for sequence analysis are:
1. ORF Finder
2. e-PCR
3. Spidey
27. Open Reading Frame (ORF)
Finder
ORF Finder is an essential graphical analysis tool, which
finds all open reading frames of a selectable minimum size
in a user’s sequence or in a sequence already in the
database.
It uses the standard or alternative genetic codes to identify
all open reading frames.
This is helpful in preparing complete and accurate
sequence submissions. It is also packaged with the Sequin
sequence submission software.
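The idea of scanning all six reading frames for open reading frames above a minimum size can be sketched as follows. This is a simplified illustration, not NCBI's ORF Finder: it assumes the standard genetic code only (ATG start; TAA, TAG, TGA stops), and the function names and test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=30):
    """Find ORFs (ATG ... stop) of at least `min_len` nucleotides in six frames.

    Returns (strand, frame, start, end, orf) tuples. Positions are 0-based on
    the scanned strand, and `end` is exclusive and includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if i + 3 - start >= min_len:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs


print(find_orfs("CCATGAAATTTGGGTAACC", min_len=12))
# [('+', 2, 2, 17, 'ATGAAATTTGGGTAA')]
```

Raising `min_len` filters out short ORFs, which corresponds to the selectable minimum size mentioned above.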
28. e-PCR (Electronic Polymerase
Chain Reaction)
e-PCR is a computational procedure used to identify
sequence-tagged sites (STSs) within DNA sequences. When
looking for potential STSs, e-PCR searches a DNA sequence
for subsequences that closely match the PCR primers used to
generate known STSs and that have the correct order,
orientation, and spacing. The newer version of e-PCR also
provides a search mode that runs a query sequence against a
sequence database.
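The primer search can be illustrated with a small sketch. It is a simplification rather than the e-PCR program: it assumes exact primer matches (the real tool tolerates close matches), and the function name `find_sts`, the primers and the template are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_sts(sequence, forward, reverse, expected_size, margin=50):
    """Return (start, end) spans that could be amplified by the primer pair.

    A span qualifies when the forward primer is followed by the reverse
    primer's binding site (its reverse complement on this strand) and the
    product size is within `margin` of `expected_size`.
    """
    sequence = sequence.upper()
    fwd = forward.upper()
    rev_site = reverse_complement(reverse.upper())
    hits = []
    start = sequence.find(fwd)
    while start != -1:
        site = sequence.find(rev_site, start + len(fwd))
        while site != -1:
            end = site + len(rev_site)
            if abs((end - start) - expected_size) <= margin:
                hits.append((start, end))
            site = sequence.find(rev_site, site + 1)
        start = sequence.find(fwd, start + 1)
    return hits


# Hypothetical template: forward primer, 10-base spacer, reverse primer site.
template = "TTT" + "ACGTAC" + "G" * 10 + "TGATCC" + "AAA"
print(find_sts(template, "ACGTAC", "GGATCA", expected_size=22, margin=0))
# [(3, 25)]
```

Order and orientation are enforced by searching for the reverse primer site only downstream of the forward primer, and spacing by the size check.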
29. SPIDEY
Spidey is an mRNA-to-genomic alignment program that
uses local alignment tools such as BLAST to find its
alignments. It takes as input a single genomic
sequence and a set of mRNA sequences in FASTA format. Spidey
first defines windows on the genomic sequence and then
performs the mRNA-to-genomic alignment separately within
each window, to avoid including exons from paralogs and
pseudogenes. It has no maximum intron size and does not
favour shorter or longer introns.
30. Databases
Structured collection of information.
Consists of basic units called records or entries.
The perfect database:
Comprehensive but easy to search
Cross referenced
Minimum redundancy
31. NCBI Databases
Nucleotide database
Literature database
Protein database
Gene expression database
Structural database
Chemical database
Other databases
33. Kinds of databases
Primary database
Original submissions by experimentalists.
Database staff organise but don’t add additional information.
Example: GenBank
Derivative databases
Derived from primary data.
Content controlled by a third party.
Examples: RefSeq, SWISS-PROT, UniGene
34. Nucleotide database
GENBANK
NCBI’s primary sequence database.
It is a comprehensive public database of nucleotide
sequences.
GenBank, together with EMBL and DDBJ, makes up the
International Nucleotide Sequence Database (INSD) collaboration.
The three databases exchange data daily
to ensure a uniform and comprehensive collection of
sequence information.
36. Accession numbers are labels for
sequences
DNA sequences and other molecular data are tagged with
accession numbers that are used to identify a sequence or
other record relevant to molecular data.
An accession number is a string of letters and/or numbers that
corresponds to a molecular sequence.
It is shared among the 3 collaborating databases and
remains constant over the lifetime of the record.
The DNA sequence within a GenBank record is also assigned
a unique NCBI identifier called a ‘gi’ that appears on the
VERSION line of flat file records, following the accession
number.
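As a sketch of how these identifiers sit together, the snippet below pulls the accession, its version and the gi out of a flat file VERSION line. The regular expression, the function name `parse_version_line` and the example line are assumptions made for illustration; the values shown are not taken from this presentation.

```python
import re

# Assumed layout: "VERSION", the accession with a ".version" suffix,
# then an optional "GI:" number.
VERSION_LINE = re.compile(
    r"^VERSION\s+(?P<accession>[A-Z_0-9]+)\.(?P<version>\d+)"
    r"(?:\s+GI:(?P<gi>\d+))?"
)


def parse_version_line(line):
    """Split a VERSION line into accession, version number and gi (or None)."""
    match = VERSION_LINE.match(line)
    if match is None:
        raise ValueError("not a VERSION line: " + repr(line))
    return {
        "accession": match["accession"],
        "version": int(match["version"]),
        "gi": int(match["gi"]) if match["gi"] else None,
    }


# Illustrative line in the layout described above.
print(parse_version_line("VERSION     U49845.1  GI:1293613"))
# {'accession': 'U49845', 'version': 1, 'gi': 1293613}
```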
43. NCBI’s Derivative Sequence
Database
RefSeq
It is a non-redundant collection of nucleotide and
protein sequences.
It is derived from the primary submissions available in
GenBank.
RefSeq records can be distinguished from GenBank records
by the format of the accession series.
RefSeq accession numbers are formatted as two alphabetic
characters followed by an underscore ‘_’.
GenBank accession numbers never include an underscore.
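The accession format rule can be checked mechanically. A minimal sketch, assuming upper-case accessions; the function name `is_refseq` and the example accessions are illustrative.

```python
import re

# Two alphabetic characters followed by an underscore mark a RefSeq accession.
REFSEQ_PREFIX = re.compile(r"^[A-Z]{2}_")


def is_refseq(accession):
    """Return True if the accession follows the RefSeq prefix format."""
    return REFSEQ_PREFIX.match(accession) is not None


print(is_refseq("NM_000546.6"))  # True
print(is_refseq("U49845.1"))     # False
```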
44. Literature database
PMC – PubMed Central
It is a digital archive of peer-reviewed journals in the
life sciences providing access to full-text articles.
All PMC free articles are identified in PubMed search
results and PMC itself can be searched using Entrez.
49. Protein database
Entrez Protein is the protein sequence database of NCBI.
The protein sequences in this database come from several
different sources, such as Swiss-Prot and PDB.
There are GenPept translations for each of the coding
sequences within the GenBank nucleotide database.
The Entrez Protein database is cross-linked to the Entrez
Taxonomy database.
It is also linked to the Conserved Domain Database (CDD).
After clicking on an individual search result in Entrez
Protein, the protein sequence is displayed in a particular
format known as GenPept.
50. Expression database
GEO-Gene Expression Omnibus
Distribution and regulation of the transcriptional
products of normal and abnormal cell types.
SAGEmap-Serial Analysis of Gene Expression map.
51. Structural database
MMDB-Molecular Modeling Database.
3D macromolecular structures.
X-ray diffraction (XRD) and NMR are used for experimental structure
determination.
These structures provide a wealth of information about biological
function, the mechanisms linked to that function, its evolutionary
history, and the relationships between macromolecules.
52. Chemical database
PubChem is a database of chemical molecules
maintained by NCBI.
It focuses on the chemical, structural and biological
properties of small molecules with a
molecular mass below 2000 u.
53. Other databases
OMIM-Online Mendelian Inheritance in Man.
It is a comprehensive, authoritative and timely
knowledge base of human genes and genetic disorders.
OMIA-Online Mendelian Inheritance in Animals.
It is a database of genes, inherited disorders and traits
in animal species other than human and mouse.