2. Introduction to Protein Databases
• Protein databases have become a crucial part of modern biology.
• Searching databases is often the first step in the study of a new protein.
• Huge amounts of data for protein structures, functions, and particularly sequences are being
generated which cannot be handled without using computer databases.
• Without the prior knowledge obtained from such searches, known information about the
protein could be missed, or an experiment could be repeated unnecessarily.
• Comparison between proteins and protein classification provide information about the
relationship between proteins within a genome or across different species, and hence offer
much more information than can be obtained by studying only an isolated protein.
3. Protein Databases
• The databases can be classified into the following categories:
Sequence Databases
2D Gel Databases
3D Structure Databases
Polymorphism and Mutation Databases
Chemistry Databases
Enzyme and Pathway Databases
Ontologies, Specialized Protein Databases
Family and Domain Databases
Gene Expression Databases
Genome Annotation Databases
Organism Specific Databases
Phylogenomic Databases
Protein-Protein Interaction Databases
Proteomic Databases
PTM Databases
Other Miscellaneous Databases
5. UniProt
• UniProt provides more annotations than any other sequence database with a minimal level
of redundancy. It has the following three components:
1. UniProt Knowledgebase (UniProtKB): comprises Swiss-Prot (manually annotated and reviewed) and
TrEMBL (automatically annotated and not reviewed).
2. UniRef: sequence clusters for fast sequence similarity searches.
3. UniParc: a sequence archive for keeping track of sequences and their identifiers.
UniProt, as a curated protein sequence database, offers a portal to a wide range of
annotations, covering areas such as function, family, domain parsing, post-translational
modifications, and variants.
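The Swiss-Prot/TrEMBL split is visible in UniProtKB FASTA headers, where the first field is `sp` for reviewed entries and `tr` for unreviewed ones. The following is a minimal sketch of reading that field; the function name `parse_uniprot_header` and the sample header line are illustrative assumptions, not part of any UniProt tool.

```python
import re
from typing import NamedTuple


class UniProtHeader(NamedTuple):
    section: str      # "Swiss-Prot" (reviewed) or "TrEMBL" (unreviewed)
    accession: str
    entry_name: str
    description: str


# >db|Accession|EntryName ProteinName OS=... (db is "sp" or "tr")
_HEADER = re.compile(r"^>(sp|tr)\|([^|]+)\|(\S+)\s*(.*)$")


def parse_uniprot_header(line: str) -> UniProtHeader:
    """Split a UniProtKB FASTA header into its leading fields."""
    match = _HEADER.match(line.strip())
    if match is None:
        raise ValueError(f"not a UniProtKB FASTA header: {line!r}")
    db, accession, entry_name, rest = match.groups()
    # The protein name is everything before the first " OS=" field.
    description = rest.split(" OS=", 1)[0]
    section = "Swiss-Prot" if db == "sp" else "TrEMBL"
    return UniProtHeader(section, accession, entry_name, description)


# Illustrative header line (assumed sample input).
header = ">sp|P69905|HBA_HUMAN Hemoglobin subunit alpha OS=Homo sapiens OX=9606 GN=HBA1 PE=1 SV=2"
print(parse_uniprot_header(header))
```

Only the leading fields are parsed here; the remaining `OS=`, `OX=`, `GN=`, `PE=` and `SV=` fields are left untouched.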
6. RefSeq-NCBI
• The National Center for Biotechnology Information Reference Sequence (NCBI RefSeq) database provides
curated non-redundant sequences of genomic regions, transcripts and proteins for taxonomically diverse
organisms including Archaea, Bacteria, Eukaryotes, and Viruses.
• The RefSeq database is derived from the sequence data available in the redundant archival database GenBank.
• RefSeq records include coding regions, conserved domains, and variations, together with enhanced annotations such
as publications, names, symbols, aliases, Gene IDs, and database cross-references.
• The sequences and annotations are generated using a combined approach of collaboration, automated
prediction, and manual curation.
• RefSeq records can be accessed directly from NCBI web sites by searching the Nucleotide or Protein
databases, by running BLAST searches against selected databases, and through FTP downloads.
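RefSeq accessions carry a two-letter prefix and an underscore that indicate the molecule type and whether the record is curated or a computational model. A minimal sketch of reading that prefix is shown below; the table covers only a subset of the prefixes, and the helper names and example accessions are assumptions made for illustration.

```python
# Subset of RefSeq accession prefixes: prefix -> (molecule type, meaning).
REFSEQ_PREFIXES = {
    "NC_": ("genomic", "complete genomic molecule"),
    "NG_": ("genomic", "incomplete genomic region"),
    "NM_": ("transcript", "protein-coding mRNA"),
    "NR_": ("transcript", "non-coding RNA"),
    "NP_": ("protein", "protein associated with an NM_ or NC_ record"),
    "XM_": ("transcript", "predicted (model) mRNA"),
    "XR_": ("transcript", "predicted (model) non-coding RNA"),
    "XP_": ("protein", "predicted (model) protein"),
    "WP_": ("protein", "non-redundant protein across prokaryotic genomes"),
}


def classify_refseq(accession: str) -> tuple:
    """Return (molecule type, meaning) for a RefSeq accession."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognised RefSeq prefix in {accession!r}")
    return REFSEQ_PREFIXES[prefix]


def is_model(accession: str) -> bool:
    """True for X?_ records, which come from automated prediction."""
    return accession.upper().startswith(("XM_", "XR_", "XP_"))


# Example accessions are used only to exercise the prefix lookup.
for acc in ("NP_000509.1", "XP_000001.1", "NC_000001.11"):
    print(acc, classify_refseq(acc), "model" if is_model(acc) else "curated/known")
```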
8. wwPDB
• The Worldwide Protein Data Bank (wwPDB) was established in 2003 as an
international collaboration to maintain a single and publicly available PDB Archive of
protein structural data.
• The “PDB Archive” is a collection of flat files in three different formats:
(A) Legacy PDB format (B) PDBx/mmCIF format (C) Protein Data Bank Markup Language
(PDBML) format.
• Each member site serves as a deposition, data processing and distribution site for the PDB
Archive, and each provides its own view of the primary data and a variety of tools and
resources.
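The legacy PDB format is column-based: each field of an ATOM record sits at fixed character positions. The sketch below parses a few of those fields by slicing. The `Atom` type and `parse_atom_record` function are hypothetical helpers, and the sample record uses made-up coordinates assembled field by field so the column layout is explicit.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    element: str


def parse_atom_record(line: str) -> Atom:
    """Parse selected fixed-width fields of an ATOM/HETATM record."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        serial=int(line[6:11]),        # columns 7-11
        name=line[12:16].strip(),      # columns 13-16
        res_name=line[17:20].strip(),  # columns 18-20
        chain_id=line[21],             # column 22
        res_seq=int(line[22:26]),      # columns 23-26
        x=float(line[30:38]),          # columns 31-38
        y=float(line[38:46]),          # columns 39-46
        z=float(line[46:54]),          # columns 47-54
        element=line[76:78].strip(),   # columns 77-78
    )


# Sample record with made-up values, built one field at a time.
SAMPLE = (
    "ATOM  "      # 1-6   record name
    "    1"       # 7-11  atom serial number
    " "           # 12    unused
    " N  "        # 13-16 atom name
    " "           # 17    alternate location indicator
    "MET"         # 18-20 residue name
    " "           # 21    unused
    "A"           # 22    chain identifier
    "   1"        # 23-26 residue sequence number
    "    "        # 27-30 insertion code + unused
    "  27.340"    # 31-38 x coordinate
    "  24.430"    # 39-46 y coordinate
    "   2.614"    # 47-54 z coordinate
    "  1.00"      # 55-60 occupancy
    "  9.67"      # 61-66 temperature factor
    "          "  # 67-76 unused
    " N"          # 77-78 element symbol
)
print(parse_atom_record(SAMPLE))
```

Slicing by position rather than splitting on whitespace matters because adjacent fields in real files can run together without a separating space.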
9. SCOP
• SCOP (Structural Classification of Proteins) contains information about the classification
of protein structures along with their sequence information.
• It classifies proteins hierarchically into the following levels, each defined by a feature:
1. Class - Global characteristics
2. Fold - Similar “topology”
3. Superfamily - Clear structural homology
4. Family - Clear sequence homology
5. Protein - Functionally identical
6. Species - Unique sequences
It aims to provide an accurate, detailed, and comprehensive description of the structural and
evolutionary relationships amongst all proteins of known structure.
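SCOP summarises the top four levels of this hierarchy in a concise classification string (sccs) of the form class.fold.superfamily.family, for example `a.1.1.2`. A minimal sketch of using such strings to find the deepest level two domains share is given below; the function names and the example strings are assumptions for illustration, and the Protein and Species levels are not encoded in the string.

```python
from typing import NamedTuple, Optional

# Letters of the four main SCOP classes.
SCOP_CLASSES = {
    "a": "All alpha proteins",
    "b": "All beta proteins",
    "c": "Alpha and beta proteins (a/b)",
    "d": "Alpha and beta proteins (a+b)",
}


class Sccs(NamedTuple):
    scop_class: str
    fold: int
    superfamily: int
    family: int


def parse_sccs(sccs: str) -> Sccs:
    """Parse a string such as 'a.1.1.2' into its four levels."""
    parts = sccs.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected class.fold.superfamily.family, got {sccs!r}")
    scop_class, fold, superfamily, family = parts
    return Sccs(scop_class, int(fold), int(superfamily), int(family))


def shared_level(first: str, second: str) -> Optional[str]:
    """Return the deepest level at which two sccs strings agree, or None."""
    levels = ("class", "fold", "superfamily", "family")
    deepest = None
    for level, x, y in zip(levels, parse_sccs(first), parse_sccs(second)):
        if x != y:
            break
        deepest = level
    return deepest


print(shared_level("a.1.1.2", "a.1.1.3"))  # same superfamily, different family
print(shared_level("a.1.1.2", "b.1.1.2"))  # different class
```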
11. Pfam
• Pfam is a database of protein families represented as multiple sequence alignments and Hidden
Markov Models (HMMs).
• Pfam entries are classified as Family (related protein regions), Domain (a protein structural unit),
Repeat (multiple short protein structural units), or Motif (a short protein structural unit outside
globular domains).
• Related Pfam entries are grouped into clans based on sequence, structure or profile HMM
similarity.
• The Pfam web site provides search interfaces for querying by sequence, keyword, domain
architecture, and taxonomy, and browse interfaces for analyzing protein sequences for Pfam matches
and viewing Pfam annotations in domain architectures, sequence alignments, interactions, species,
and protein structures in PDB.
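To give a feel for how a multiple sequence alignment summarises a family, the sketch below computes per-column residue frequencies and a consensus sequence. This is a deliberate simplification and not a profile HMM: it has no insert or delete states and no transition probabilities. The toy alignment and function names are assumptions for illustration.

```python
from collections import Counter

GAP_CHARS = "-."


def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column (gaps ignored)."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("alignment must be non-empty with equal-length rows")
    profile = []
    for column in zip(*alignment):
        residues = [r for r in column if r not in GAP_CHARS]
        counts = Counter(residues)
        total = len(residues)
        profile.append({r: n / total for r, n in counts.items()} if total else {})
    return profile


def consensus(alignment):
    """Most frequent residue per column; '-' where a column is all gaps."""
    return "".join(
        max(col, key=col.get) if col else "-"
        for col in column_frequencies(alignment)
    )


# Toy alignment (made-up sequences).
toy = ["MKV-LA", "MRVELA", "MKIELG"]
print(consensus(toy))
print(column_frequencies(toy)[1])
```

A profile HMM built from the same alignment would additionally model where insertions and deletions tend to occur, which is what lets it detect more distant family members.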
12. PANTHER
• PANTHER is a database of gene families, including a phylogenetic tree for each family in which nodes of the
tree are annotated with gene attributes.
• The main goal of PANTHER is the accurate inference (and practical application) of gene and protein function
over large sequence databases, using phylogenetic trees to extrapolate from the relatively sparse experimental
information from a few model organisms.
• The three types of gene attribute currently annotated in PANTHER are:
(A) Subfamily membership (B) Protein class and (C) Gene function
• The PANTHER website provides tools for functional analysis of lists of genes or proteins.
• PANTHER now includes stable database identifiers for inferred ancestral genes, which are used to associate
inferred gene attributes with particular genes in the common ancestral genomes of extant species.
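The idea of attaching attributes to ancestral nodes and letting descendant genes inherit them can be sketched with a small tree structure. This is only an illustration of the concept, not PANTHER's inference procedure (which, for instance, would also need to handle loss of function along a branch); the class, node names, and annotations are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class TreeNode:
    name: str
    annotations: set = field(default_factory=set)
    parent: Optional["TreeNode"] = field(default=None, repr=False)
    children: list = field(default_factory=list, repr=False)

    def add_child(self, child: "TreeNode") -> "TreeNode":
        """Attach a child node and return it for chaining."""
        child.parent = self
        self.children.append(child)
        return child

    def inherited_annotations(self) -> set:
        """Union of this node's annotations and those of all its ancestors."""
        result = set()
        node = self
        while node is not None:
            result |= node.annotations
            node = node.parent
        return result


# Hypothetical family: annotations placed on ancestral nodes reach the leaf.
root = TreeNode("ancestral_gene", {"kinase activity"})
subfamily = root.add_child(TreeNode("subfamily_A", {"ATP binding"}))
leaf = subfamily.add_child(TreeNode("extant_gene_1"))
print(sorted(leaf.inherited_annotations()))
```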
13. PROSITE
• PROSITE is a database of documentation entries describing protein domains, families and
functional sites as well as associated patterns and profiles to identify them.
• The entries are derived from multiple alignments of homologous sequences and have the
advantage of identifying distant relationships between sequences.
• PROSITE includes a collection of ProRules based on profiles and patterns of functionally
and/or structurally critical amino acids that can be used to increase PROSITE’s
discriminatory power.
• The PROSITE web site provides keyword-based search and allows browsing by
documentation entry, ProRule description, taxonomic scope and number of positive hits.
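PROSITE patterns use a compact syntax: elements are separated by `-`, `x` matches any residue, `[ST]` lists allowed residues, `{P}` lists forbidden residues, `(n)` or `(n,m)` gives a repeat count, and `<` / `>` anchor the pattern to the N- or C-terminus. A minimal sketch of translating this syntax into a Python regular expression follows. The function names are assumptions, and less common constructs (such as `>` inside square brackets) are not handled.

```python
import re

# One pattern element: [..], {..}, a residue letter or x, plus optional (n) or (n,m).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}.' into a regex."""
    body = pattern.strip().rstrip(".")
    start = end = ""
    if body.startswith("<"):
        start, body = "^", body[1:]
    if body.endswith(">"):
        end, body = "$", body[:-1]
    parts = []
    for element in body.split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return start + "".join(parts) + end


def find_sites(pattern: str, sequence: str):
    """Return (0-based start, matched text) for every, possibly overlapping, hit."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


# N-glycosylation site pattern applied to a made-up sequence.
print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(find_sites("N-{P}-[ST]-{P}.", "AANGSAANPSA"))
```

Short patterns like this one match frequently by chance, which is one reason PROSITE pairs patterns with profiles and ProRules.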
14. Proteomics – An Introduction
• Proteomics is a relatively recent branch of molecular biology concerned with the study of
the proteome.
• The term “proteome” was coined in 1994, and “proteomics” came into use soon afterward.
• It plays many roles in molecular biology, such as the study of protein structure and
function, the determination of 3D protein structures, and the qualitative and quantitative
analysis of proteins.
• It has many applications, including clinical research, drug discovery, biomarkers, and
neurology.
16. FunRich
• FunRich is open-access software that facilitates the analysis of
proteomics data, providing tools for functional enrichment and interaction
network analysis of genes and proteins.
• FunRich is a reinterpretation of proteomic software, a standalone tool
combining ease of use with customizable databases, free access, and
graphical representations.
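Functional enrichment analysis asks whether a category (for example, a pathway) appears in a gene or protein list more often than expected by chance. A commonly used statistic is the hypergeometric test, sketched below with the standard library only. This is a generic illustration under that assumption; the slide does not state which statistic FunRich applies, and the function name and numbers are made up.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(X >= hits) when drawing `sample` items without replacement from a
    population of `population` items, `annotated` of which carry the category."""
    if not (0 <= annotated <= population and 0 <= sample <= population):
        raise ValueError("annotated and sample must lie between 0 and population")
    lower = max(hits, 0)
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(lower, upper + 1)
    )
    return tail / total


# Made-up numbers: 20,000 background proteins, 200 in the category,
# a list of 100 proteins of which 8 fall in the category.
print(enrichment_p_value(20000, 200, 100, 8))
```

When many categories are tested at once, the resulting p-values would normally be corrected for multiple testing before interpretation.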
17. ProHits
• ProHits is a complete open-source software solution for mass spectrometry (MS)-based
interaction proteomics that manages the entire pipeline from raw MS data files to fully
annotated protein-protein interaction data sets.
• It was designed to provide an intuitive user interface from the biologist’s perspective and
can accommodate multiple instruments within a facility, multiple user groups, multiple
laboratory locations and any number of parallel projects.
• ProHits can manage all project scales and supports common experimental pipelines,
including those using gel-based separation, gel-free analysis and multidimensional protein
or peptide separation.
18. ProteoWizard
• ProteoWizard provides a modular and extensible set of open-source, cross-platform tools
and libraries.
• The tools perform proteomics data analyses; the libraries enable rapid tool creation by
providing a robust, pluggable development framework that simplifies and unifies data file
access, and performs standard chemistry and LC-MS dataset computations.
• The primary goal of ProteoWizard is to eliminate the existing barriers to proteomic
software development so that researchers can focus on the development of new analytic
approaches, rather than having to dedicate significant resources to mundane (if important)
tasks, like reading data files.
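As an example of the kind of standard chemistry computation such libraries take care of, the sketch below calculates the monoisotopic mass of an unmodified peptide and its m/z at a given charge. It is written in plain Python and does not use the ProteoWizard API; the function names are assumptions, and the residue masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses in daltons (amino acid minus one water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added for the free N- and C-termini
PROTON = 1.00728   # mass of a proton, used for charge states


def peptide_mass(sequence: str) -> float:
    """Monoisotopic mass of an unmodified peptide."""
    try:
        return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER
    except KeyError as err:
        raise ValueError(f"unknown residue {err.args[0]!r}") from None


def peptide_mz(sequence: str, charge: int) -> float:
    """m/z of the peptide carrying `charge` protons."""
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    return (peptide_mass(sequence) + charge * PROTON) / charge


print(round(peptide_mass("PEPTIDE"), 4))
print(round(peptide_mz("PEPTIDE", 2), 4))
```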
20. PPDB
• PPDB is a Plant Proteome DataBase for Arabidopsis thaliana and maize (Zea
mays).
• Initially PPDB was dedicated to plant plastids, but has now expanded to the
whole plant proteome – hence it was renamed from Plastid PDB to Plant PDB
in November 2007.
• The PPDB stores experimental data from in-house proteome and mass
spectrometry analyses, along with curated information about protein function,
protein properties, and subcellular localization.
21. PRIDE
• The PRoteomics IDEntifications database (PRIDE) is a repository for mass spectrometry-based
proteomics data, including identifications of proteins, peptides, and post-translational
modifications that have been described in the scientific literature, together with supporting
mass spectra and related technical and biological metadata.
• PRIDE supports tandem MS (MS/MS) and peptide fingerprinting datasets, together with the
search and analysis workflows originally used by the submitters.
• PRIDE provides several services such as the Protein Identifier Cross-Reference (PICR),
the Ontology Lookup Service (OLS) and Database on Demand.
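Peptide fingerprinting identifies a protein by comparing measured peptide masses with those predicted from an in-silico digest of candidate sequences. The digest step can be sketched as below, assuming the usual trypsin rule of cleaving after K or R unless the next residue is P. The function name and test sequence are made up, and this code is unrelated to any PRIDE service.

```python
import re


def trypsin_digest(sequence: str, missed_cleavages: int = 0):
    """Return peptides from cleaving after K/R (not before P), allowing up to
    `missed_cleavages` skipped sites per peptide."""
    if not sequence:
        return []
    # Cut positions fall immediately after each cleavable K or R.
    sites = [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]
    bounds = [0] + [s for s in sites if s < len(sequence)] + [len(sequence)]
    peptides = []
    last = len(bounds) - 1
    for i in range(last):
        for j in range(i + 1, min(i + 1 + missed_cleavages, last) + 1):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


# Made-up sequence: the R at position 3 is followed by P, so it is not cleaved.
print(trypsin_digest("MKRPAKG"))
print(trypsin_digest("MKRPAKG", missed_cleavages=1))
```

Each resulting peptide would then be converted to a mass and compared with the observed peak list within a chosen tolerance.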
22. ProteomicsDB
• ProteomicsDB (DB for database) is an effort of the Technische Universität
München (TUM).
• It is dedicated to expediting the identification of the human proteome
and its use across the scientific community.