This document describes a web-based application to survey properties of homologous proteins. It provides four web pages that give access to: 1) function-related annotation from UniProtKB/Swiss-Prot, 2) features of the protein group, 3) conservation score, and 4) a phylogenetic tree. The application allows users to identify similarities and differences between protein sequences, estimate functional information transfer between characterized and uncharacterized proteins, and identify errors in protein annotations.
Web-based application to survey properties of homologous proteins
1. Web-based application to survey properties of homologous proteins
Candidate: Diego Poggioli
Supervisor: Prof. Rita Casadio
Co-supervisor: Dr. Brigitte Boeckmann
2. • Bio-problem: visualizing and interacting with
biological data, and performing comparative protein
analysis
• Info-solution: web application (CGI)
The portal gives access to four web pages:
1) Function-related annotation derived from UniProtKB/Swiss-Prot;
2) Features of the protein group;
3) Conservation scores;
4) Phylogenetic tree.
3. Members of a protein family normally perform
a general biochemical function in common,
but one or more subgroups may evolve a
slightly different function, such as different
substrate specificity.
4. By comparing groups and subgroups of proteins it is possible
to identify or estimate:
• similarities and differences between the protein sequences,
as well as in the information available for the given protein
group;
• the range within which functional information can be
transferred from experimentally characterized proteins
to their homologs in poorly studied organisms;
• errors in protein annotations.
6. Available from any PC
System- and browser-independent
PHP, CGI
Dynamic pages
HTML, JavaScript, PHP, Perl, Python, Ajax, ASP, Ruby…
7. Form filling and data type
ID AVID_CHICK Reviewed; 152 AA.
AC P02701; Q91958; Q98SH4;
DT 21-JUL-1986, integrated into
DT 11-SEP-2007, sequence version
DT 10-JUN-2008, entry version 87.
DE Avidin precursor.
GN Name=AVD;
OS Gallus gallus (Chicken).
OC Eukaryota; Metazoa; Chordata
OC Archosauria; Dinosauria
OC Neognathae; Galliformes
OX NCBI_TaxID=9031;
RN [1]
RP NUC
RX MEDLINE=87203384; PubMed
RA Gope M.L., Keinaenen R.A.,
RA Zarucki-Schulz T., O'Malley B.
RT "Molecular cloning of the chic
RL Nucleic Acids Res. 15:3595
RN [2]
RP NUCLEOTIDE SEQUENCE [MR
RX MEDLINE=90355928; PubMed
RA Chandra G., Gray J.G.;
RT "Cloning and expression of
RL Methods Enzymol. 184:70
…
AVID_CHICK
AVR2_CHICK
AVR4_CHICK
AVR1_CHICK
AVR3_CHICK
AVR6_CHICK
AVR7_CHICK
P02701
P56732
P56734
O13153
P56733
P56735
P56736
8.
9. BioView
• overview of biological information
• taxonomic descriptive statistics
A compact summary view of the biological information on
a protein group is important, especially for a large
dataset. It makes it possible to observe, compare and
count all common and dissimilar characteristics, and to
examine in detail the entries that share the same feature.
- gene name, functional (catalytic activity, enzyme regulation, pathway…) and general
descriptive information;
- organism classification (OC) and organism species (OS);
- non-experimental qualifiers (by similarity, putative or probable).
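The kind of counting such a summary view relies on can be sketched as follows. This is a minimal illustration, not the application's code: the record layout (entry name, annotation, qualifier), the function name `summarize` and the toy entries are all assumptions made for the example.

```python
from collections import defaultdict

# Qualifier spellings follow the slide's wording; treat them as an assumption.
NON_EXPERIMENTAL = {"by similarity", "putative", "probable"}

def summarize(records):
    """Count entries per distinct annotation, split by evidence type.

    records: iterable of (entry_name, annotation, qualifier) tuples, where
    qualifier is None when the annotation carries no non-experimental flag.
    """
    seen = defaultdict(lambda: {"entries": set(), "non_experimental": set()})
    for entry_name, annotation, qualifier in records:
        row = seen[annotation]
        row["entries"].add(entry_name)
        if qualifier is not None and qualifier.lower() in NON_EXPERIMENTAL:
            row["non_experimental"].add(entry_name)

    summary = {}
    for annotation, row in seen.items():
        n_entries = len(row["entries"])
        n_non_exp = len(row["non_experimental"])
        summary[annotation] = {
            "entries": n_entries,
            "non_experimental": n_non_exp,
            "experimental": n_entries - n_non_exp,
        }
    return summary

# Toy records (made up for illustration only).
records = [
    ("ENTRY_A", "function X", None),
    ("ENTRY_B", "function X", "By similarity"),
    ("ENTRY_C", "function X", "Probable"),
    ("ENTRY_C", "pathway Y", None),
]
summary = summarize(records)
```

Each key of `summary` is one non-redundant annotation, with the number of entries that carry it and how many of those are flagged as non-experimental.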
11. Number of entries
Non-redundant annotation
Number of entries with non-experimental qualifier
Number of entries with annotated experimental qualifier
12. On mouse-click the relevant entry names are listed
Expand the whole hierarchy
13.
14. FeatureView
• Interactive interface for visualizing
function-related features on the protein
sequence and 3D structure
• This page should let the user analyze
sequence and structure together across a
broad set of data, showing as much of the
available information as possible in a
clear and intuitive way.
15. Function-related features derived from the FT lines of
UniProtKB:
active sites, binding sites, domains, transmembrane
regions, DNA-binding domains…
are mapped onto the alignment and highlighted to give a
clear and compact presentation of the relevant
information. The features are mapped onto the
structure in the same way, making it possible to identify
conserved regions and sites.
Sequence FT Structure
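Mapping a feature onto the alignment requires converting a position in the ungapped sequence into an alignment column. A minimal sketch of that conversion is given here; the function name `sequence_to_column` and the toy aligned row are hypothetical and not taken from the application.

```python
def sequence_to_column(aligned_row, position, gap="-"):
    """Return the 0-based alignment column of a 1-based sequence position.

    aligned_row: one row of the multiple sequence alignment, gaps included.
    position:    residue number in the ungapped sequence (as in FT lines).
    """
    residues_seen = 0
    for column, symbol in enumerate(aligned_row):
        if symbol != gap:
            residues_seen += 1
            if residues_seen == position:
                return column
    raise ValueError(f"position {position} is beyond the end of the sequence")

# Toy aligned row (made up): residues M, K, L, V with three gap columns.
row = "--MK-LV"
column_of_third_residue = sequence_to_column(row, 3)
```

Here the third residue (L) sits in column 5, because two leading gaps and one internal gap shift it to the right.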
16. FeatureView
• Choose the best structure
• Alignment
• Mapping the feature on the alignment and
on the structure
17. Choose the best structure
*
...
'91 ' => '91',
'25 ' => '25',
'92 ' => '92',
'81 ' => '82',
'71 ' => '71',
'21 ' => '23',
'-' => 'x',
'61 ' => '61',
'37 ' => '37',
'68 ' => '68',
'50 ' => '50',
'18 ' => '15',
...
F.P.A. David and Y.L. Yip. SSMap*: a new UniProt-PDB mapping resource for the curation of structural-related
information in the UniProt/Swiss-Prot Knowledgebase. Submitted
18. Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/
19.
20.
21.
22. FeatureView
• Choose the best structure
• Alignment
• Mapping the feature on the alignment and
on the structure
23. Alignment
Input file
Edgar, Robert C. (2004), MUSCLE: multiple sequence alignment with high accuracy and
high throughput, Nucleic Acids Research 32(5), 1792-97.
24. FeatureView
• Choose the best structure
• Alignment
• Mapping the feature on the alignment
and on the structure
26. FT (Feature Table) lines
I group: ('CA_BIND', 'NP_BIND', 'MOTIF', 'ACT_SITE', 'METAL', 'BINDING', 'SITE',
'NON_STD', 'MOD_RES', 'LIPID', 'CARBOHYD', 'DISULFID', 'CROSSLINK');
distinct font color and a toolbox containing the description of the feature
(entry name, feature key, sequence position, description)
II group: ('PEPTIDE', 'TOPO_DOM', 'TRANSMEM', 'DOMAIN', 'REPEAT', 'ZN_FING',
'DNA_BIND', 'REGION', 'COILED');
different background colour and a toolbox with the content as described above.
- overlaps within the first group are represented in the toolbox.
- overlaps within the second group are shown with a different background color.
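The two-group rule can be expressed as a small lookup. The sketch below copies the feature keys as they appear on the slide; the function name `style_for` and the returned labels are hypothetical and only illustrate the idea.

```python
# Feature keys copied from the slide, spelled as they appear there.
GROUP_I = {
    "CA_BIND", "NP_BIND", "MOTIF", "ACT_SITE", "METAL", "BINDING", "SITE",
    "NON_STD", "MOD_RES", "LIPID", "CARBOHYD", "DISULFID", "CROSSLINK",
}
GROUP_II = {
    "PEPTIDE", "TOPO_DOM", "TRANSMEM", "DOMAIN", "REPEAT", "ZN_FING",
    "DNA_BIND", "REGION", "COILED",
}

def style_for(feature_key):
    """Return how a feature is highlighted, or None if it is not displayed."""
    if feature_key in GROUP_I:
        return "font-color"        # group I: distinct font color
    if feature_key in GROUP_II:
        return "background-color"  # group II: distinct background color
    return None

styles = {key: style_for(key) for key in ("ACT_SITE", "TRANSMEM", "CHAIN")}
```

Keeping the groups as sets makes the membership test cheap and keeps the display rule in one place.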
27. ATOM 1817 N MET B 3 -31.380 87.126 39.296 1.0 100.00
ATOM 1818 CA MET B 3 -30.684 88.400 39.176 1.0 100.00
ATOM 1819 C MET B 3 -30.858 88.967 37.771 1.0 100.00
ATOM 1820 O MET B 3 -30.195 88.514 36.832 1.0 100.00
ATOM 1821 CB MET B 3 -29.190 88.285 39.498 1.0 100.00
ATOM 1822 CG MET B 3 -28.465 89.628 39.501 1.0 100.00
ATOM 1823 SD MET B 3 -26.671 89.415 39.661 1.0 100.00
ATOM 1824 CE MET B 3 -26.312 90.705 40.863 1.0 100.00
ATOM 1825 N GLU B 4 -31.750 89.938 37.638 1.0 50.00
ATOM 1826 CA GLU B 4 -31.927 90.498 36.300 1.0 50.00
… … … … … … … … … … …
100.00
Alignment position 00.00
50.00
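The ATOM records above appear to carry a per-residue value in their last column, which a structure viewer can then use for coloring. One common way to do this, sketched here as an assumption rather than as the application's method, is to overwrite the temperature-factor field (columns 61-66 of a fixed-width PDB ATOM record). The helpers `make_atom_line`, `set_b_factor` and `color_by_score` are hypothetical names.

```python
def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z, occ, b):
    """Build a fixed-width PDB ATOM record (helper for this example only)."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{occ:6.2f}{b:6.2f}")

def set_b_factor(atom_line, value):
    """Return the record with columns 61-66 replaced by `value`."""
    padded = atom_line.rstrip("\n").ljust(66)
    return padded[:60] + f"{value:6.2f}" + padded[66:]

def color_by_score(pdb_lines, scores, chain):
    """Write per-residue scores into the B-factor column of one chain.

    scores: dict mapping residue number -> value (e.g. 0.00 to 100.00).
    Residues without a score are left unchanged.
    """
    result = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")) and line[21] == chain:
            residue_number = int(line[22:26])
            if residue_number in scores:
                line = set_b_factor(line, scores[residue_number])
        result.append(line)
    return result

# Coordinates taken from the first record on the slide; B-factor is a placeholder.
original = make_atom_line(1817, "N", "MET", "B", 3, -31.380, 87.126, 39.296, 1.0, 20.0)
recolored = color_by_score([original], {3: 100.0}, "B")[0]
```

Only the temperature-factor field changes, so the coordinates remain valid for any viewer that reads the file.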
30. Conservation
• Interactive interface for visualizing the
structural conservation of protein groups
on the protein sequence and 3D structure
• Highlight positions and regions conserved
in the group of proteins
• Conservation scores are mapped onto the
multiple sequence alignment (MSA) and
onto the 3D structure
32. Scoring methods
Method name     Type of score                               Description
basicmdm        Sum-of-Pairs (SP), matrix score             Simplest SP score possible
entropynorm7    Entropic                                    Normalized Shannon entropy with 7 symbol types
entropynorm21   Entropic                                    Normalized Shannon entropy with 21 symbol types
trident         Entropic, matrix score, sequence weighted   Mixed model score
valdar01        SP, matrix score, sequence weighted         Score used in Valdar & Thornton 2001
0.000 # ---S--------
0.000 # ---T--------
0.000 # ---S--------
0.000 # ---T--------
0.000 # ---S--------
0.024 # ---TM-M-----
0.320 # MMMSV-VVMM--
0.278 # VVVDHMHHGGG-
0.500 # LLLYLLWWLLL-
0.603 # SSSSTTTSSSS-
0.391 # PAAAPAAEDDD-
0.424 # AAAAEEEVGGQT
0.809 # DDDDEEEEEEEE
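To give an idea of what an entropic score measures, here is a minimal sketch of a normalized Shannon entropy score for one alignment column. It is an assumption about the general technique, not a reimplementation of `entropynorm21`: it treats the gap as one of 21 symbol types, uses no sequence weighting, and is not expected to reproduce the values listed above. The function name `column_conservation` is hypothetical.

```python
import math
from collections import Counter

def column_conservation(column, n_symbols=21):
    """Return 1 - H / log(n_symbols) for one alignment column.

    column:    string with one symbol per sequence (gap counted as a symbol).
    n_symbols: alphabet size used for normalization (assumed: 20 amino
               acids plus the gap).
    A fully conserved column scores 1.0; more varied columns score lower.
    """
    total = len(column)
    counts = Counter(column)
    entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
    return 1.0 - entropy / math.log(n_symbols)

# Two columns from the slide's output plus a fully conserved toy column.
scores = {
    column: column_conservation(column)
    for column in ("DDDDEEEEEEEE", "AAAAEEEVGGQT", "AAAAAAAAAAAA")
}
```

A column with a single residue type has zero entropy and therefore the maximum score, while a column spread over many residue types approaches zero.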
33.
34.
35. At the moment it is an integrated framework for developing
the visualization of information such as annotation, and of
sites that differ in conservation between protein
subgroups.
• develop a method to compare two or more protein subgroups
• profile
Input file
37. Software for phylogenetic tree visualization and manipulation
http://bioinfo.unice.fr/biodiv/Tree_editors.html
- Treedyn: works on a local machine but not on the server side (graphical applet needed)
- Phylodendron: trouble with the CGI script
- phyfi: private program that cannot be installed on one's own server; possibly usable via URL
request
- nexplorer: NEXUS format needed, and it cannot be installed on one's own server
- dnd2svg.pl: strict sequence number; output only in SVG format
- TreeFam: only a private program
ATV 1.92
38. Input file
Gascuel O. (1997). BIONJ: an improved version of the NJ algorithm based on a
simple model of sequence data. Molecular Biology and Evolution, 14:685-695.
Tree in Newick format
((((ACADM_HUMAN:0.000925,ACADM_PANTR:0.003941):0.014922,ACADM_MACFA:0.021579):0.041621,((ACADM_MOUSE:0.015113,ACADM_RAT:0.029420):0.051559,(ACADM_DROME:0.187088,((ACAD8_MOUSE:0.049728,ACAD8_HUMAN:0.052753):0.013706,ACAD8_BOVIN:0.104627):1.146493):0.149078):0.010918):0.015504,ACADM_PIG:0.057735,ACADM_BOVIN:0.023577);
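A Newick string like the one above can be inspected with very little code, for example to list the leaves (entry names) and their branch lengths before handing the tree to a viewer. The sketch below uses a regular expression rather than a full Newick parser, so it assumes unquoted leaf names and a branch length on every leaf; `newick_leaves` is a hypothetical name.

```python
import re

# A leaf is a name that directly follows "(" or "," and precedes ":length".
LEAF_PATTERN = re.compile(r"[(,]([^():,;]+):([0-9.eE+-]+)")

def newick_leaves(newick):
    """Return {leaf name: branch length} in the order leaves appear."""
    text = "".join(newick.split())  # remove line breaks and spaces
    return {name: float(length) for name, length in LEAF_PATTERN.findall(text)}

# First clade of the tree shown above, wrapped over two lines on purpose.
subtree = ("((ACADM_HUMAN:0.000925,ACADM_PANTR:0.003941):0.014922,\n"
           "ACADM_MACFA:0.021579);")
leaves = newick_leaves(subtree)
```

Internal branch lengths follow a closing parenthesis, so the pattern skips them and returns only the named leaves.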
http://www.phylosoft.org/atv/
Zmasek C.M. and Eddy S.R. (2001) ATV: display
and manipulation of annotated phylogenetic trees.
Bioinformatics, 17, 383-384.
http://www.jalview.org/
Clamp, M., Cuff, J., Searle, S. M. and
Barton, G. J. (2004). The Jalview Java
Alignment Editor. Bioinformatics, 20, 426-7
39.
40. Future plans
• Normalize HTML pages according to the W3C standard
• Improve the use of CSS
• Test the application on different web browsers
• Write the application in a server-side language
• Integrate the application with other databases
• Ensure concurrent access to the application and keep an
analysis history
• Develop a view of phylogenetic tree to show and to
interact with additional information
• Hierarchical phylogeny-based classification in UniProtKB
43. Acknowledgements
• Brigitte Boeckmann & Rita Casadio
• Swiss-Prot lab, Biocomputing group
• Fabrice David & Marco Vassura
• All my friends and Fra
• Dolores and Davide
And now?
44. practical examples
- identify similarities and differences between the protein
sequences, as well as in the information available for the given
protein group;
- estimate the range within which functional information
can be transferred from experimentally
characterized proteins to their homologs in poorly studied
organisms;
- identify errors in protein annotations.
45. A compact summary view of the biological information on a protein group is important,
especially for a large dataset. It makes it possible to observe,
compare and count all common and dissimilar characteristics, and to
examine in detail the entries that share the same feature.
Acetylglutamate kinase family