Event: Plant and Animal Genomes conference 2012
Speaker: Michel Schneider
The UniProt Knowledgebase consists of two sections: UniProtKB/Swiss-Prot, which contains manually annotated protein sequences enriched with functional information added by expert human curators, and UniProtKB/TrEMBL, which contains unreviewed records that are enhanced with information provided by automated rule-based annotation systems. The majority of UniProtKB records are based on automatic translation of coding sequences (CDS) provided by submitters at the time of initial deposition to the nucleotide sequence databases. To provide the complete proteome of Arabidopsis thaliana, a complementary curation pipeline for the import of protein sequences from TAIR has been developed. As the complete genome reannotation proposed in the TAIR10 release contains most of the sequences already in UniProtKB, these existing sequences have to be reconciled with those imported. Around 7% of them have a different gene model and have to be checked manually. Based on these comparisons, we improved over 200 of our predicted proteins. In exchange, we provide TAIR with the gene model corrections that we introduce on the basis of our trans-species family annotation. This approach allows the identification of data that can be seamlessly transferred from one site to the other and the development of common annotations. With the significant increase in the number of complete genomes sequenced (the sequencing of 1001 Arabidopsis cultivars is currently under way!), organizing this data in a convenient way is critical. UniProt has selected a set of “reference proteomes”, including A. thaliana cv. Columbia, which provide broad coverage of the tree of life and constitute a representative cross-section of the taxonomic diversity found within UniProtKB.
The annotation of plant proteins in UniProtKB
1. The annotation of plant proteins in UniProtKB
Michel Schneider
Plant protein annotation program, Swiss-Prot group
Swiss Institute of Bioinformatics
Geneva, Switzerland
Michel.Schneider@isb-sib.ch
2. 1. The UniProt consortium and its products
2. Content of an entry in UniProtKB and manual curation
3. Complete proteomes and reference proteomes
4. Synchronization between UniProtKB and TAIR
5. Some statistics
“Pioneers at the Heart of Science” 1998 – 2008
PAG XX, San Diego, January 15, 2012
3. The UniProt consortium
4. The missions of the UniProt consortium
Provide the scientific community with a resource of protein
sequence and functional annotation which has to be …
comprehensive
high quality
and freely accessible
5. Four components to fulfill specific demands
UniProtKB: Protein Knowledgebase
UniProtKB/Swiss-Prot: Reviewed (533’657 entries), manual curation
UniProtKB/TrEMBL: Unreviewed (19 million entries), automated annotation
UniRef: Sequence clusters (UniRef100, UniRef90, UniRef50)
UniMES: Metagenomic and environmental sample sequences
UniParc – Sequence archive contains current and obsolete sequences (29.6 million sequences)
6. UniProtKB, the expertly curated
component of UniProt
The high-quality curated protein knowledge database
where data becomes structured knowledge
7. UniProtKB, the expertly curated
component of UniProt
Shigeo Fukuda
15. Origin of the sequences in UniProtKB
International Nucleotide Sequence Database Collection
(INSDC)
Ensembl or EnsemblGenomes
RefSeq
Direct submissions (protein sequences)
Literature
Protein Data Bank
16. The process of manual sequence curation
1. Select entry/gene (priorities)
2. Identify entries from same gene and homologs using BLAST against UniProtKB
3. Merge entries from the same gene and same species into a single record
4. Select a canonical sequence
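Steps 3 and 4 above can be pictured with a minimal sketch. It assumes a toy record layout (accession, gene, species, sequence) and a simple "most frequent, then longest" rule for choosing the canonical sequence. The accessions, gene name, sequences and the selection rule are all hypothetical and only illustrate the idea of merging; they are not the criteria curators actually apply.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    accession: str  # hypothetical identifier, for illustration only
    gene: str
    species: str
    sequence: str


def merge_entries(entries):
    """Group entries by (gene, species) and pick one canonical sequence per group."""
    groups = defaultdict(list)
    for entry in entries:
        groups[(entry.gene, entry.species)].append(entry)

    merged = {}
    for key, members in groups.items():
        counts = Counter(m.sequence for m in members)
        # Assumed rule: prefer the sequence seen most often, then the longest one.
        canonical = max(counts, key=lambda s: (counts[s], len(s), s))
        merged[key] = {
            "accessions": sorted(m.accession for m in members),
            "canonical": canonical,
            "alternatives": sorted(s for s in counts if s != canonical),
        }
    return merged


# Made-up entries: three submissions for the same gene in the same species.
entries = [
    Entry("X0001", "geneA", "Arabidopsis thaliana", "MKTAYIAK"),
    Entry("X0002", "geneA", "Arabidopsis thaliana", "MKTAYIAK"),
    Entry("X0003", "geneA", "Arabidopsis thaliana", "MKTAYIAKQR"),
]
record = merge_entries(entries)[("geneA", "Arabidopsis thaliana")]
```

In this toy case the three submissions collapse into one record that keeps all three accessions, with the differing sequence retained as an alternative rather than discarded.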
17. Critical analysis and report of sequence discrepancies
QPCT_ARATH (Q84WV9) Glutaminyl-peptide cyclotransferase (At4g25720)
18. Critical analysis and report of sequence discrepancies
QPCT_ARATH (Q84WV9) Glutaminyl-peptide cyclotransferase (At4g25720)
20. Literature-based curation
Identify relevant papers through searching literature databases
Read full text of papers and extract and summarize relevant information
21. Literature-based curation
22. Literature-based curation
23. Literature-based curation
24. Controlled vocabularies
• Keywords provide a summary of the entry content
• We annotate using the Gene Ontology (GO)
25. UniProtKB, complete proteome sequence sets
• Genome completely sequenced
• Proteins mapped to the genome
2’902 complete proteomes
Fully manually reviewed (e.g. S. cerevisiae)
Partially manually reviewed (e.g. A. thaliana)
Unreviewed (e.g. Chlorella variabilis)
26. UniProtKB, reference proteome sequence sets
A reference proteome is the complete proteome of a representative, well-studied model organism or an organism of interest for biomedical research.
509 reference proteomes
28. Arabidopsis thaliana
The building of the complete proteome sequence set:
• Based on the re-annotation of complete genome by TAIR:
27’416 protein coding genes
29. UniProtKB – TAIR synchronization
cDNAs, ESTs, genomic sequences
Nucleic acid databases
UniProtKB/TrEMBL: Unreviewed (40’574 entries)
UniProtKB/Swiss-Prot: Reviewed (10’340 entries)
release 2011_03 - Mar 08, 2011
30. UniProtKB – TAIR synchronization
cDNAs, ESTs, genomic sequences
Nucleic acid databases
UniProtKB/TrEMBL: Unreviewed (40’574 entries)
UniProtKB/Swiss-Prot: Reviewed (10’340 entries)
Genome re-annotation: 35’386 gene products
Temporary TrEMBL set: 33’341 entries
31. UniProtKB – TAIR synchronization
cDNAs, ESTs, genomic sequences
Nucleic acid databases
UniProtKB/TrEMBL: Unreviewed (40’574 entries)
UniProtKB/Swiss-Prot: Reviewed (10’340 entries)
Genome re-annotation: 35’386 gene products
Temporary TrEMBL set: 33’341 entries
11’508 sequences
Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs
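The comparison step on this slide can be sketched as follows. The sketch assumes each source is reduced to a mapping from gene identifier to a single protein sequence, which is a simplification: real entries carry isoforms and annotation. The function name, gene identifiers and sequences are made up for illustration, and the alignment with orthologs and paralogs is not covered.

```python
def reconcile(existing, imported):
    """Compare per-gene translations from two sources.

    existing, imported: dicts mapping a gene identifier to a protein sequence.
    Returns (merged, discrepancies, new_genes).
    """
    merged, discrepancies, new_genes = [], [], []
    for gene, seq in sorted(imported.items()):
        if gene not in existing:
            new_genes.append(gene)      # no counterpart yet: candidate new entry
        elif existing[gene] == seq:
            merged.append(gene)         # 100 % identical: can be merged
        else:
            # Translations differ: record the first differing position
            # (0-based) so the pair can be checked manually.
            pos = next(
                (i for i, (a, b) in enumerate(zip(existing[gene], seq)) if a != b),
                min(len(existing[gene]), len(seq)),
            )
            discrepancies.append((gene, pos))
    return merged, discrepancies, new_genes


# Made-up data: one identical pair, one discrepancy, one gene only in the import.
existing = {"g1": "MKT", "g2": "MKTAY"}
imported = {"g1": "MKT", "g2": "MKTAV", "g3": "MAA"}
merged, discrepancies, new_genes = reconcile(existing, imported)
```

Only exact identity triggers a merge here, matching the "merge if 100 % identical" rule; everything else is reported rather than resolved automatically.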
32. UniProtKB – TAIR synchronization
cDNAs, ESTs, genomic sequences
Nucleic acid databases
UniProtKB/TrEMBL: Unreviewed
UniProtKB/Swiss-Prot: Reviewed
Genome re-annotation
Temporary TrEMBL set
Compare translations from the same gene, merge if 100 % identical, report sequence discrepancies, align with orthologs and paralogs
Feedback to TAIR: 90 gene models
Correct gene models or add new isoforms: 283 corrections
33. UniProtKB – TAIR synchronization
cDNAs, ESTs, genomic sequences
Nucleic acid databases
UniProtKB/TrEMBL: Unreviewed
UniProtKB/Swiss-Prot: Reviewed
Genome re-annotation
Temporary TrEMBL set
Cleaned set of new TrEMBL entries (21’656 entries)
34. UniProtKB – TAIR synchronization
cDNAs, ESTs, genomic sequences
Nucleic acid databases
UniProtKB/TrEMBL: Unreviewed (44’628 entries)
UniProtKB/Swiss-Prot: Reviewed (10’875 entries)
release 2011_12 - Dec 14, 2011
Genome re-annotation
Temporary TrEMBL set
Cleaned set of new TrEMBL entries (21’656 entries) + UniProtKB/Swiss-Prot Reviewed (10’865 entries)
Arabidopsis thaliana, cv. Columbia
Complete proteome: 32’521 entries
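The totals on this slide can be cross-checked with simple arithmetic. The snippet below assumes, as the figures suggest, that the complete proteome is the sum of the cleaned set of new TrEMBL entries and the 10’865 reviewed entries; that reading is inferred from the numbers and is not stated explicitly on the slide.

```python
# Entry counts as given on the slide (release 2011_12).
cleaned_trembl = 21_656
reviewed_in_proteome = 10_865
complete_proteome = 32_521

# The two sets add up exactly to the complete proteome.
assert cleaned_trembl + reviewed_in_proteome == complete_proteome

# Fraction of the complete proteome that is manually reviewed.
reviewed_share = reviewed_in_proteome / complete_proteome
```

Under this reading, roughly one third of the A. thaliana complete proteome consists of reviewed entries.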
35. 1001 Arabidopsis genomes
• Deposited to INSDC?
• Fully annotated? With CDS?
• Should we still merge all the identical sequences together?
• If they are not merged but kept separate, how do we get relevant BLAST results?
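One way to picture the "merge or keep separate" question: records can stay separate while identical sequences are grouped into clusters for searching, loosely in the spirit of a 100 % identity cluster. The sketch below is only an illustration under that assumption. The strain and gene names are hypothetical, the sequences are made up, and exact string identity stands in for whatever clustering a real resource applies.

```python
from collections import defaultdict


def cluster_identical(records):
    """Group (strain, gene, sequence) records that share exactly the same sequence."""
    clusters = defaultdict(list)
    for strain, gene, seq in records:
        clusters[seq].append((strain, gene))
    # One representative per cluster: the first member in sorted order.
    return {
        seq: {"representative": sorted(members)[0], "members": sorted(members)}
        for seq, members in clusters.items()
    }


# Made-up records: the same gene sequenced in three hypothetical strains.
records = [
    ("strain1", "geneA", "MKTAY"),
    ("strain2", "geneA", "MKTAY"),
    ("strain3", "geneA", "MKTAV"),
]
clusters = cluster_identical(records)
```

A search against one representative per cluster returns fewer redundant hits, but the member list is still needed to tell which strains actually carry a given sequence.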
36. Some UniProtKB/Swiss-Prot Statistics
concerning plant entries
(UniProt release 2011_12 - Dec 14, 2011)
• 31,959 entries of Viridiplantae
• from 1,924 species
• 10,875 entries from Arabidopsis thaliana (with 1,219 isoforms)
• 2,823 entries from Oryza sativa subsp. japonica
• 11,897 plant entries with an EC number
• 966 different complete EC numbers
• 5,744 putative transporters or proteins involved in transport
37. Summary
UniProtKB/Swiss-Prot, the manually curated knowledgebase:
• Protein sequence database covering all kingdoms of life (533’657
sequence entries; 12’664 species)
• Manually annotated
• Non-redundant: all products of one gene in one species in a single entry
• Highly cross-referenced (links to ~130 databases).
Plant protein annotation:
• Complete proteome for Arabidopsis thaliana
• Synchronization with TAIR
38. We need your feedback and your collaboration!
help@uniprot.org
39. Acknowledgements
SIB
Ioannis Xenarios, Lydie Bougueleret, Andrea Auchincloss, Kristian Axelsen, Delphine Baratin, Marie-Claude Blatter,
Brigitte Boeckmann, Jerven Bolleman, Laurent Bollondi, Emmanuel Boutet, Lionel Breuza, Alan Bridge, Edouard de
Castro, Lorenzo Cerutti, Elisabeth Coudert, Béatrice Cuche, Mikael Doche, Dolnide Dornevil, Severine Duvaud, Anne
Estreicher, Livia Famiglietti, Marc Feuermann, Sebastien Gehant, Elisabeth Gasteiger, Vivienne Gerritsen, Arnaud Gos,
Nadine Gruaz-Gumowski, Ursula Hinz, Chantal Hulo, Nicolas Hulo, Janet James, Florence Jungo, Guillaume Keller,
Vicente Lara, Philippe Lemercier, Damien Lieberherr, Xavier Martin, Patrick Masson, Anne Morgat, Salvo Paesano, Ivo
Pedruzzi, Sandrine Pilbout, Sylvain Poux, Monica Pozzato, Manuela Pruess, Nicole Redaschi, Catherine Rivoire, Bernd
Roechert, Michel Schneider, Christian Sigrist, Karin Sonesson, Sylvie Staehli, Eleanor Stanley, André Stutz, Shyamala
Sundaram, Michael Tognolli, Laure Verbregue and Anne-Lise Veuthey
EBI
Rolf Apweiler, Maria Jesus Martin, Claire O'Donovan, Michele Magrane, Yasmin Alam-Faruque, Ricardo Antunes,
Benoit Bely, Mark Bingley, David Binns, Lawrence Bower, Wei Mun Chan, Emily Dimmer, Francesco Fazzini, Alexander
Fedotov, John Garavelli, Leyla Garcia Castro, Rachael Huntley, Julius Jacobsen, Michael Kleen, Duncan Legge, Wudong
Liu, Jie Luo, Sandra Orchard, Samuel Patient, Klemens Pichler, Diego Poggioli, Nikolas Pontikos, Steven Rosanoff, Tony
Sawford, Harminder Sehra, Edward Turner, Matt Corbett, Mike Donnelly and Pieter van Rensburg
PIR
Cathy H. Wu, Cecilia N. Arighi, Leslie Arminski, Winona C. Barker, Chuming Chen, Yongxing Chen, Pratibha Dubey,
Hongzhan Huang, Kati Laiho, Raja Mazumder, Peter McGarvey, Darren A. Natale, Thanemozhi G. Natarajan, Jules
Nchoutmboube, Natalia V. Roberts, Baris E. Suzek, Uzoamaka Ugochukwu, C. R. Vinayaka, Qinghua Wang, Yuqi Wang,
Lai-Su Yeh and Jian Zhang
www.uniprot.org
40. UniProt is mainly supported by the National Institutes of
Health (NIH) grant 1 U41 HG006104-01. Additional support for
the EBI's involvement in UniProt comes from the NIH grant
2P41 HG02273-07. Swiss-Prot activities at the SIB are
supported by the Swiss Federal Government through the
Federal Office of Education and Science and the European
Commission contracts SLING (226073), Gen2Phen (200754)
and MICROME (222886). PIR activities are also supported by
the NIH grants 5R01GM080646-04, 3R01GM080646-04S2,
1G08LM010720-01, and 3P20RR016472-09S2, and NSF grant
DBI-0850319.
Editor's notes
Alignment of sequences deduced from 2 genomic DNAs, one cDNA and one EST. Annotation of erroneous gene model predictions.
Annotation of isoforms.
Information about how to reconstruct all isoforms. Access to the sequences of all isoforms. Can apply various tools.
The sequencing of 1001 Arabidopsis genomes raises several questions and we have to find new solutions. If entries are not merged, one solution for BLAST is to use UniRef, but this is only valid for functional annotation, not for finding out whether a homologous protein is already known in a given species.