4. UniProt consortium
EBI : European Bioinformatics Institute (UK)
SIB : Swiss Institute of Bioinformatics (CH)
PIR : Protein Information Resource (US)
7. UniProtKB: protein sequence knowledgebase, 2 sections
UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast,
download) (~15 million entries)
UniParc: protein sequence archive (ENA equivalent at the protein
level). Each entry contains a protein sequence with cross-
links to other databases where you find the sequence
(active or not). Not annotated (query, Blast, download) (~25 million entries)
UniRef: 3 sets of protein sequence clusters at 100, 90 and
50% identity; useful to speed up sequence similarity
searches (BLAST) (query, Blast, download) (UniRef100: 10 million entries; UniRef90: 7 million
entries; UniRef50: 3.3 million entries)
UniMES: protein sequences derived from metagenomic
projects (mostly Global Ocean Sampling (GOS)) (download)
(8 million entries, included in UniParc)
9. UniProtKB
an encyclopedia on proteins
composed of 2 sections
UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
unreviewed and reviewed
automatically annotated and manually annotated
released every 4 weeks
10. UniProtKB
Origin of protein sequences
UniProtKB protein sequences are mainly derived from
- INSDC (translated submitted coding sequences - CDS)
- Ensembl (gene prediction) and RefSeq sequences
- Sequences of PDB structures
- Direct submission or sequences scanned from literature
Notes: - UniProt does not perform any gene prediction
- Most non-germline immunoglobulins, T-cell receptors, most patent sequences,
highly over-represented data (e.g. viral antigens) and pseudogene sequences are
excluded from UniProtKB but stored in UniParc
- Data from the PIR database have been integrated in UniProtKB since 2003.
15 %
85 %
13. One protein sequence
One species
Automated annotation
Keywords
and
Gene Ontology
Automated annotation
Function, Subcellular location,
Catalytic activity,
Sequence similarities…
Automated annotation
transmembrane domains,
signal peptide…
Cross-references
to over 125 databases
References
Protein and gene names
Taxonomic information
UniProtKB/TrEMBL
www.uniprot.org
14. UniProtKB/TrEMBL
Automatic annotation
Protein sequence
- The quality of the protein sequences depends on the information
provided by the submitter of the original nucleotide entry (CDS) or by the
gene prediction pipeline (e.g. Ensembl).
- 100% identical sequences (same length, same organism) are merged
automatically.
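The merging rule above can be sketched in a few lines. This is only an illustration, assuming that each record is a (source id, organism, sequence) tuple; the function name `merge_identical` and the toy records are hypothetical and do not describe the actual TrEMBL pipeline.

```python
from collections import defaultdict

def merge_identical(records):
    """Group records whose organism and sequence are both identical.

    records: iterable of (source_id, organism, sequence) tuples (toy data).
    Returns a dict mapping (organism, sequence) -> list of source ids.
    """
    merged = defaultdict(list)
    for source_id, organism, sequence in records:
        # Identical strings necessarily have identical length, so this key
        # covers "100% identical, same length, same organism".
        merged[(organism, sequence.upper())].append(source_id)
    return dict(merged)

# Hypothetical toy records, not real database entries.
records = [
    ("cds_1", "Homo sapiens", "MKTAYIAK"),
    ("cds_2", "Homo sapiens", "MKTAYIAK"),   # merged with cds_1
    ("cds_3", "Mus musculus", "MKTAYIAK"),   # same sequence, other organism
    ("cds_4", "Homo sapiens", "MKTAYIAKQ"),  # one residue longer
]
entries = merge_identical(records)
```

Only the first two records end up in the same entry; a different organism or a single extra residue keeps the records apart.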
Biological information
Sources of annotation
- Provided by the submitter (EMBL, PDB, TAIR…)
- From automated annotation (automatically generated annotation rules (e.g.
SAAS) and/or manually generated annotation rules (e.g. UniRule))
17. Example of fully automatic annotation: SAAS
• Rules are derived from the UniProtKB/Swiss-Prot manual annotation.
• Fully automated rule generation based on C4.5 decision tree algorithm.
• One annotation, one rule.
• High stringency: a rule must reach an estimated precision of 99% or greater
(tested on UniProtKB/Swiss-Prot) to generate annotation.
• Rules are produced, updated and validated at each release.
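The stringency criterion can be illustrated with a small filter. This sketch does not implement C4.5 or the real SAAS pipeline; it assumes that each candidate rule has already been evaluated against the reviewed reference set and is summarised by hypothetical true-positive and false-positive counts.

```python
def estimated_precision(true_positives, false_positives):
    """Precision = TP / (TP + FP); 0.0 when the rule never fires."""
    fired = true_positives + false_positives
    return true_positives / fired if fired else 0.0

def select_rules(candidates, threshold=0.99):
    """Keep the candidate rules whose estimated precision meets the threshold.

    candidates: dict of rule name -> (true_positives, false_positives),
    counted on the reviewed reference set (toy numbers here).
    """
    return sorted(
        name
        for name, (tp, fp) in candidates.items()
        if estimated_precision(tp, fp) >= threshold
    )

# Hypothetical evaluation counts for four candidate rules.
candidates = {
    "rule_A": (995, 5),   # precision 0.995: kept
    "rule_B": (980, 20),  # precision 0.980: rejected
    "rule_C": (500, 0),   # precision 1.000: kept
    "rule_D": (0, 0),     # never fires: rejected
}
kept = select_rules(candidates)
```

Re-running such a filter at each release is one simple way to picture how rules could be "produced, updated and validated" as the reference set changes.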
UniProtKB/TrEMBL
22. The displayed protein sequence:
…canonical, representative, consensus…
+
alternative sequences (described within the entry)
1 entry <-> 1 gene (1 species)
UniProtKB/Swiss-Prot
a gene-centric view of the protein space
23. What is the current status?
• At least 20% of Swiss-Prot entries required a minimal
amount of curation effort so as to obtain the “correct”
sequence.
• Typical problems
– unsolved conflicts
– uncorrected initiation sites
– frameshifts
– wrong gene prediction
– other ‘problems’
26. UniProtKB/Swiss-Prot gathers data from multiple sources:
- publications (literature/Pubmed)
- prediction programs (Prosite, TMHMM, …)
- contacts with experts
- other databases
- nomenclature committees
An evidence attribution system makes it easy to trace the
source of each annotation
Extract literature information
and protein sequence analysis
maximum usage of controlled vocabulary
31. Non-experimental qualifiers
UniProtKB/Swiss-Prot considers both experimental and predicted
data and makes a clear distinction between the two
Type of evidence                                  Qualifier
Strong experimental evidence                      None or Ref.X
Light experimental evidence                       Probable
Inferred by similarity with homologous protein    By similarity
Inferred by prediction                            Potential
32. Find all the proteins localized in
the cytoplasm (experimentally
proven) which are phosphorylated
on a serine (experimentally proven)
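A query of this kind relies on the qualifiers: an annotation counts as experimentally proven only when it carries none of the non-experimental qualifiers. The sketch below applies that logic to toy records. The record layout, the protein names and the choice to treat "Probable" as non-experimental (following the slide title "Non-experimental qualifiers") are assumptions made for illustration.

```python
# Qualifiers listed on the "Non-experimental qualifiers" slide.
NON_EXPERIMENTAL = {"Probable", "By similarity", "Potential"}

def is_experimental(qualifier):
    """True when the annotation has no qualifier or only a reference (Ref.X)."""
    return qualifier is None or qualifier not in NON_EXPERIMENTAL

def find_proteins(entries):
    """Names of entries with an experimentally proven cytoplasmic location
    and an experimentally proven phosphoserine."""
    hits = []
    for entry in entries:
        location_ok = any(
            loc == "Cytoplasm" and is_experimental(q)
            for loc, q in entry["locations"]
        )
        ptm_ok = any(
            ptm == "Phosphoserine" and is_experimental(q)
            for ptm, q in entry["ptms"]
        )
        if location_ok and ptm_ok:
            hits.append(entry["name"])
    return hits

# Hypothetical toy entries: (value, qualifier) pairs per annotation type.
entries = [
    {"name": "PROT_A", "locations": [("Cytoplasm", None)],
     "ptms": [("Phosphoserine", "Ref.3")]},
    {"name": "PROT_B", "locations": [("Cytoplasm", "By similarity")],
     "ptms": [("Phosphoserine", None)]},
    {"name": "PROT_C", "locations": [("Cytoplasm", None)],
     "ptms": [("Phosphoserine", "Potential")]},
    {"name": "PROT_D", "locations": [("Nucleus", None), ("Cytoplasm", None)],
     "ptms": [("Phosphothreonine", None), ("Phosphoserine", None)]},
]
hits = find_proteins(entries)
```

PROT_B and PROT_C are excluded because one of the two required annotations is only inferred, which is exactly the distinction the qualifiers are meant to preserve.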
33. • The ‘Protein existence’ tag indicates the type of evidence
supporting the existence of a given protein;
• Different qualifiers:
– 1. Evidence at protein level (~18%)
(MS, western blot (tissue specificity), immuno (subcellular
location),…)
– 2. Evidence at transcript level (~19%)
– 3. Inferred from homology (~58%)
– 4. Predicted (~5%)
– 5. Uncertain (mainly in TrEMBL)
‘Protein existence’ tag
http://www.uniprot.org/docs/pe_criteria
36. 2D gel
2DBase-Ecoli
ANU-2DPAGE
Aarhus/Ghent-2DPAGE (no server)
COMPLUYEAST-2DPAGE
Cornea-2DPAGE
DOSAC-COBS-2DPAGE
ECO2DBASE (no server)
OGP
PHCI-2DPAGE
PMMA-2DPAGE
Rat-heart-2DPAGE
REPRODUCTION-2DPAGE
Siena-2DPAGE
SWISS-2DPAGE
UCD-2DPAGE
World-2DPAGE
Family and domain
Gene3D
HAMAP
InterPro
PANTHER
Pfam
PIRSF
PRINTS
ProDom
PROSITE
SMART
SUPFAM
TIGRFAMs
Organism-specific
AGD
ArachnoServer
CGD
ConoServer
CTD
CYGD
dictyBase
EchoBASE
EcoGene
euHCVdb
EuPathDB
FlyBase
GeneCards
GeneDB_Spombe
GeneFarm
GenoList
Gramene
H-InvDB
HGNC
HPA
LegioList
Leproma
MaizeGDB
MGI
MIM
neXtProt
Orphanet
PharmGKB
PseudoCAP
RGD
SGD
TAIR
TubercuList
WormBase
Xenbase
ZFIN
Protein family/group
Allergome
CAZy
MEROPS
PeroxiBase
PptaseDB
REBASE
TCDB
Genome annotation
Ensembl
EnsemblBacteria
EnsemblFungi
EnsemblMetazoa
EnsemblPlants
EnsemblProtists
GeneID
GenomeReviews
KEGG
NMPDR
TIGR
UCSC
VectorBase
Enzyme and pathway
BioCyc
BRENDA
Pathway_Interaction_DB
Reactome
Other
BindingDB
DrugBank
NextBio
PMAP-CutDB
Sequence
EMBL
IPI
PIR
RefSeq
UniGene
3D structure
DisProt
HSSP
PDB
PDBsum
ProteinModelPortal
SMR
PTM
GlycoSuiteDB
PhosphoSite
PhosSite
UniProtKB/Swiss-Prot:
129 explicit links
and 14 implicit links!
Proteomic
PeptideAtlas
PRIDE
ProMEX
PPI
DIP
IntAct
MINT
STRING
Phylogenomic dbs
eggNOG
GeneTree
HOGENOM
HOVERGEN
InParanoid
OMA
OrthoDB
PhylomeDB
ProtClustDB
Polymorphism
dbSNP
Gene expression
ArrayExpress
Bgee
CleanEx
Genevestigator
GermOnline
Ontologies
GO
37. The UniProt web site
www.uniprot.org
• Powerful search engine, Google-like and easy to use, which also
supports very directed field searches
• Scoring mechanism presenting relevant matches first
• Entry views, search result views and downloads are customizable
• The URL of a result page reflects the query; all pages and queries
are bookmarkable, supporting programmatic access
• Search, Blast, Align, Retrieve, ID mapping
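Because the URL of a result page reflects the query, a search can be built and read back with a few lines of code. The sketch below uses only the Python standard library; the base URL, the `query` and `format` parameter names and the example query are assumptions for illustration and should be checked against the UniProt help pages.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumed base URL and parameter names, for illustration only.
BASE_URL = "http://www.uniprot.org/uniprot/"

def build_search_url(query, fmt="tab"):
    """Encode a text query into a bookmarkable result-page URL."""
    return BASE_URL + "?" + urlencode({"query": query, "format": fmt})

def read_query(url):
    """Recover the original query string from a result-page URL."""
    return parse_qs(urlsplit(url).query)["query"][0]

original = "kinase AND organism:9606"
url = build_search_url(original)
recovered = read_query(url)
```

The round trip from query to URL and back is what makes result pages bookmarkable and easy to call from scripts.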
38. Search
A very powerful text search tool with
autocompletion and refinement options
for finding UniProt entries and
documentation by biological information
53. Retrieve
A UniProt-specific tool for retrieving a list of
entries from several standard identifier formats.
You can then query your ‘personal database’ with the
UniProt search tool.
55. ID Mapping
Maps the identifiers of a given protein between
different databases
56. These identifiers all point to a TP53 (p53) protein sequence!
● P04637, NP_000537, NP_001119584.1, NP_001119585.1,
● NP_001119584.1, NP_001119584.1, NP_001119584.1,
● NP_001119584.1, ENSG00000141510, CCDS11118,
● UPI000002ED67, IPI00025087, etc.
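Before mapping, it helps to know which database each identifier comes from. The sketch below guesses the source from the shape of the identifier. The regular expressions are simplified assumptions derived from the examples on this slide (the UniProtKB pattern follows the usual accession format); they are not an official specification.

```python
import re

# Simplified, assumed patterns; first match wins.
PATTERNS = [
    ("UniProtKB accession",
     r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"),
    ("RefSeq protein", r"NP_\d+(?:\.\d+)?"),
    ("Ensembl gene", r"ENSG\d{11}"),
    ("CCDS", r"CCDS\d+"),
    ("UniParc", r"UPI[0-9A-F]{10}"),
    ("IPI", r"IPI\d{8}"),
]

def classify(identifier):
    """Return the name of the first pattern matching the whole identifier."""
    for label, pattern in PATTERNS:
        if re.fullmatch(pattern, identifier):
            return label
    return "unknown"

identifiers = ["P04637", "NP_000537", "NP_001119584.1", "ENSG00000141510",
               "CCDS11118", "UPI000002ED67", "IPI00025087"]
sources = {identifier: classify(identifier) for identifier in identifiers}
```

A classification step like this only recognises the identifier type; the actual correspondence between identifiers still comes from the ID mapping tool.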
61. A few words on the UniProt
‘complete proteome’
sequence sets…
62. 2’747 complete proteomes
Genome completely sequenced
Proteins mapped to the genome
Entries tagged with the KW ‘Complete proteome’
UniProtKB/Swiss-Prot isoform sequences are available
in FASTA format only
Fully manually reviewed (e.g. S. cerevisiae)
Partially manually reviewed (e.g. Homo sapiens)
Unreviewed (e.g. Acinetobacter baumannii (strain 1656-2))
UniProtKB - complete proteomes
63. Can be downloaded:
From our complete proteome page
www.uniprot.org/taxonomy/complete-proteomes
From the ‘FTP download’ page
By querying UniProtKB + download
Query: organism:93062 AND keyword:"complete proteome"
UniProtKB - complete proteomes
Additional information: www.uniprot.org/faq/15
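The query shown above can be assembled from field/value pairs. In this minimal sketch, the helper name `build_query` is hypothetical, and the rule of quoting any value that contains whitespace is an assumption inferred from the example query.

```python
def build_query(terms):
    """Join (field, value) pairs into a fielded query combined with AND."""
    parts = []
    for field, value in terms:
        value = str(value)
        # Assumed rule: multi-word values are wrapped in double quotes.
        if any(ch.isspace() for ch in value):
            value = '"' + value + '"'
        parts.append(f"{field}:{value}")
    return " AND ".join(parts)

query = build_query([("organism", 93062), ("keyword", "complete proteome")])
```

Changing the taxonomy identifier in the first pair is enough to target the complete proteome of another organism.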
70. The UniProt Consortium
SIB
Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-
Claude Blatter, Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel
Breuza, Alan Bridge, Edouard de Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael
Doche, Dolnide Dornevil, Severine Duvaud, Anne Estreicher, Livia Famiglietti, Marc Feuermann,
Sebastien Gehant, Elisabeth Gasteiger, Alain Gateau, Vivienne Gerritsen, Arnaud Gos, Nadine Gruaz-
Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller,
Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat,
Salvo Paesano, Ivo Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole
Redaschi, Catherine Rivoire, Bernd Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson,
Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala Sundaram, Michael Tognolli, Laure Verbregue,
Anne-Lise Veuthey
EBI
Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo
Antunes, Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer,
Francesco Fazzini, Alexander Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius
Jacobsen, Michael Kleen, Duncan Legge, Wudong Liu, Jie Luo, Sandra Orchard, Samuel Patient,
Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony Sawford, Harminder Sehra,
Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
PIR
Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen,
Pratibha Dubey, Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale,
Thanemozhi G. Natarajan, Jules Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka
Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang, Lai-Su Yeh and Jian Zhang
www.uniprot.org
71. UniProt is mainly supported by the National
Institutes of Health (NIH) grant 1 U41 HG006104-
01. Additional support for the EBI's involvement in
UniProt comes from the NIH grant 2P41 HG02273-07.
Swiss-Prot activities at the SIB are supported by the
Swiss Federal Government through the Federal
Office of Education and Science and the European
Commission contracts SLING (226073), Gen2Phen
(200754) and MICROME (222886). PIR activities are
also supported by the NIH grants 5R01GM080646-04,
3R01GM080646-04S2, 1G08LM010720-01, and
3P20RR016472-09S2, and NSF grant DBI-0850319.
73. Thank you for your attention
http://education.expasy.org/cours/Prague2011/
Editor's notes
This Science cover clearly shows the well-known discrepancy between the amount of data and the amount of knowledge available. This is a first challenge, but there is a second one: how to link the two together?
The mission of UniProt is to link the protein sequences (data) with the biological knowledge (functional information)
The UniProt databases and web site are maintained by the UniProt consortium, which is composed of:
Screen shot of the web page
UniProt provides 4 databases, the central one being UniProtKB.
Computer prediction: if there is no other evidence from this protein or a similar protein, the keyword is not assigned.
dbSNP is NOT in DR lines!!! => not included in the release notes statistics.
Note : Replaces BuruList, ListiList, MypuList, PhotoList, SagaList and SubtiList
3 groups working together
Encyclopedia of protein function in biology and life science
Considered by the life science community as the GOLD standard in annotation practices
Over 600’000 users per month originating from 149 countries. Is it uniprot or swiss-prot?
Used by life science scientists (biologists, MDs), but also by chemists, engineers in nanotechnologies;
Bioinformaticians;
Used by pharma and biotechnology industry;