2. Hands-on exercise
Design a manually curated data resource that will enable
the description of species-agnostic protein complexes,
to act as a reference resource in the same way that
UniProt does for proteins. Use as examples:
1. Human Haemoglobin
2. Arabidopsis Light harvesting complex
3. Designing a new resource - what else is out
there?
• Before starting to design a resource, assess what else is
out there – re-inventing the wheel causes community
fragmentation and confusion as well as being a waste of
limited funds
• Is it needed – what gap in the market is it designed to fill?
• Investigate possibilities for collaboration, rather than
competition
• If another resource exists, does it meet your/consumer
demands – can you contribute to it and improve it?
4. Designing a new resource
• How will researchers use it, what information do they
want? Conduct extensive user requirement studies
before starting the design process.
• How will users search it? This will impact on data
entry/annotation.
• Data visualisation – again, what do users want? Usability
studies are critical
• Long term plans – will it survive the first grant renewal?
5. Complex Portal - what else was out there?
• Information on protein complexes scattered between
multiple resources but no unifying resource
• MIPS catalogued yeast complexes in 2000
• Corum – human complexes, project terminated in 2009
• Decision – use as starting point or start again?
6. Information content and presentation
• User consultation – design what they need, not what you
want to give them
• Don’t get too attached to your first paper prototype – be
prepared to sacrifice your concept to community need
• Develop a beta site, then observe researchers using it.
• Keep testing, react to new demands, novel use cases
7. Use of community standards
• Use of community standards enables:
• Data merger across multiple resources – contribute to a
greater community effort
• Data re-use and longevity
• Immediate access to existing tool suites
8. Use of Community standards – Complex
Portal
• Established standard formats for molecular interactions (PSI-MI
XML/MITAB)
• PSI-MI XML 2.5 was designed for experimental data, so curated complex
data are not a perfect fit – worked with the PSI-MI workgroup to
produce a new version
• MITAB was designed for binary pairs, not complexes –
ComplexTAB will be presented to the MI workgroup for adoption
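To illustrate why a binary-pair format is an awkward fit for complexes, here is a minimal sketch of the two usual ways of flattening a complex into pairs (often called spoke and matrix expansion). The subunit labels are placeholders, and nothing here reflects the actual MITAB or ComplexTAB column layout.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Pair one chosen component (the 'bait') with every other component."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(components):
    """Pair every component with every other component."""
    return list(combinations(components, 2))


# Hypothetical four-subunit complex; labels are placeholders, not real accessions.
members = ["subunit_A", "subunit_B", "subunit_C", "subunit_D"]

spoke_pairs = spoke_expansion(members[0], members[1:])
matrix_pairs = matrix_expansion(members)

print(len(spoke_pairs), "pairs from spoke expansion")
print(len(matrix_pairs), "pairs from matrix expansion")
```

Either expansion turns one curated complex into several rows, and neither records that the pairs belong to a single assembly or how many copies of each subunit are present, which is the kind of information a complex-level format needs to carry.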
9. Use of Community standards – Complex
Portal
• Used existing identifiers for components (UniProtKB,
ChEBI, RNAcentral)
– enables import of additional information using resource APIs,
for example the website can be searched using gene synonyms
– not organism-specific, which enables us to describe complexes in
a range of species, including non-model organisms
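A minimal sketch of what a complex record built on existing identifiers might look like, using the human haemoglobin exercise as the example. The class and field names are assumptions made for illustration, not the Complex Portal data model, and the ChEBI accession is deliberately left as a placeholder.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Component:
    database: str        # source resource, e.g. "uniprotkb", "chebi", "rnacentral"
    accession: str       # identifier owned by that resource
    stoichiometry: int = 1


@dataclass
class ComplexRecord:
    name: str
    taxon_id: int        # NCBI taxonomy identifier keeps the record species-agnostic
    components: list = field(default_factory=list)

    def accessions(self, database):
        """Return the accessions of all components drawn from one resource."""
        return [c.accession for c in self.components if c.database == database]


haemoglobin = ComplexRecord(
    name="Haemoglobin (illustrative record)",
    taxon_id=9606,  # Homo sapiens
    components=[
        Component("uniprotkb", "P69905", 2),         # haemoglobin subunit alpha
        Component("uniprotkb", "P68871", 2),         # haemoglobin subunit beta
        Component("chebi", "CHEBI:PLACEHOLDER", 4),  # haem cofactor; placeholder, not a real ID
    ],
)

print(haemoglobin.accessions("uniprotkb"))
```

Because each component is only a pointer into another resource, names, synonyms and sequences can be fetched from that resource rather than re-curated, and the same structure works for any species with a taxonomy identifier.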
10. Use of Community standards enables use of
existing tools
• Community standards have encouraged tool
development by users, software often open-source and
freely available – often can be incorporated directly into
websites with little/no additional development
• Complex Portal viewer originally written to visualise
cross-linking data
11. Use of Community standards enables use of
existing tools
• Look for initiatives which make open-source tools,
apps/plug-ins, visualizers and widgets freely available
e.g. BioJS, BioPerl, Cytoscape……
12. Free text vs Ontologies
Free text
Pros – versatile, fully descriptive, flexible
Cons – can be difficult to interpret, long winded,
error-prone, difficult to search
Controlled vocabularies (CVs)
Pros – structured, consistent, concise
Cons – may not deal well with ‘odd’ cases, lack of
information
Consider using both!
13. Use of controlled vocabularies
• Again, re-use rather than re-invent
• Use of CVs enables searches across resources, but also
can make intelligent searches within resources easy to
implement
For example, you can search for
• all transcription factors
• all complexes involved in respiration
• all mitochondrial complexes
14. Use of controlled vocabularies
In the Complex Portal you can search for
1. All enzymes - GO:0003824 (catalytic activity)
2. All transferases - GO:0016740 (transferase activity)
3. All protein kinases - GO:0004672 (protein kinase activity)
4. All cyclin-dependent protein kinases - GO:0097472
(cyclin-dependent protein kinase activity)
Similarly, you can use the ChEBI ontology, e.g. search on porphyrin
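The reason a search for "all enzymes" also finds kinases is that the query term is expanded through the ontology hierarchy. A minimal sketch follows, assuming a simplified single-parent chain between the four GO terms on the slide (the real GO graph has intermediate terms and multiple parents) and hypothetical complex identifiers.

```python
# Simplified is_a edges (child -> parent). Intermediate GO terms are omitted,
# so this is an illustration of the idea, not the real GO structure.
IS_A = {
    "GO:0016740": "GO:0003824",  # transferase activity -> catalytic activity
    "GO:0004672": "GO:0016740",  # protein kinase activity -> transferase activity
    "GO:0097472": "GO:0004672",  # cyclin-dependent protein kinase activity -> protein kinase activity
}

# Hypothetical complexes and their most specific GO annotation.
ANNOTATIONS = {
    "complex_cdk": ["GO:0097472"],
    "complex_transferase": ["GO:0016740"],
    "complex_structural": [],
}


def ancestors(term):
    """Walk up the is_a chain and collect every ancestor of a term."""
    found = []
    while term in IS_A:
        term = IS_A[term]
        found.append(term)
    return found


def search(annotations, query):
    """Return complexes annotated with the query term or any of its descendants."""
    return sorted(
        complex_id
        for complex_id, terms in annotations.items()
        if any(term == query or query in ancestors(term) for term in terms)
    )


print(search(ANNOTATIONS, "GO:0003824"))  # all enzymes
print(search(ANNOTATIONS, "GO:0004672"))  # all protein kinases
```

Curators only need to attach the most specific term; broader searches come for free from the ontology.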
15. Linking to external resources
• Extensive cross-referencing is time consuming but enables
subsequent pulling in of data from other resources
16. Make this the ‘go to’ resource for your
community
• Must fit community need, be easy to search and deliver
the results the user wants
• Outreach – publications, conferences, talks, etc.
• Collaborate on a high impact analysis paper, with your
resource playing a key role
• Protocols, tutorials, videos, hands-on training courses
• Use social media
19. What is InterPro
• InterPro provides functional analysis of proteins by
classifying them into families and predicting domains
and important sites.
• Combines protein signatures from a number of member
databases into a single searchable resource.
• Has resulted in an integrated database and diagnostic
tool (InterProScan).
20. Protein signatures
Model the pattern of conserved amino acids at specific positions within
a multiple sequence alignment
• Patterns
• Profiles
• Profile HMMs
Use these models (signatures) to infer relationships with the
characterised sequences from which the alignment was constructed
Approach used by a variety of databases: Pfam, TIGRFAMs,
PANTHER, PROSITE, etc.
22. Introduction to InterPro
How are protein signatures made?
[Figure: workflow. Protein family/domain -> multiple sequence alignment ->
build model -> search -> significant matches -> refine -> protein signature]
Example alignment shown in the figure:
ITWKGPVCGLDGKTYRNECALL
AVPRSPVCGSDDVTYANECELK
SVPRSPVCGSDGVTYGTECDLK
HPPPGPVCGTDGLTYDNRCELR
Example E-values of significant matches shown in the figure:
1e-49, 3e-42, 5e-39, 6e-10
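A minimal sketch of the "build model" step, using the four aligned sequences from the slide. It derives a pattern (conserved columns as literals, variable columns as wildcards) and a very simple frequency profile. Real member databases use far richer models such as profile HMMs, so the scoring here is an illustrative assumption, not the method used by any InterPro member database.

```python
import re
from collections import Counter

# The four aligned sequences shown on the slide (all the same length, no gaps).
ALIGNMENT = [
    "ITWKGPVCGLDGKTYRNECALL",
    "AVPRSPVCGSDDVTYANECELK",
    "SVPRSPVCGSDGVTYGTECDLK",
    "HPPPGPVCGTDGLTYDNRCELR",
]


def derive_pattern(alignment):
    """Literal residue where a column is fully conserved, '.' (any residue) otherwise."""
    return "".join(
        column[0] if len(set(column)) == 1 else "."
        for column in zip(*alignment)
    )


def build_profile(alignment):
    """Per-column residue frequencies observed in the alignment."""
    depth = len(alignment)
    return [
        {residue: count / depth for residue, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def profile_score(profile, sequence):
    """Sum the observed frequency of each residue of the sequence; higher means more similar."""
    return sum(column.get(residue, 0.0) for column, residue in zip(profile, sequence))


PATTERN = derive_pattern(ALIGNMENT)
PROFILE = build_profile(ALIGNMENT)

print(PATTERN)
print(all(re.fullmatch(PATTERN, seq) for seq in ALIGNMENT))
print(profile_score(PROFILE, ALIGNMENT[0]) > profile_score(PROFILE, "A" * 22))
```

A pattern gives a yes/no answer, whereas a profile gives a graded score, which is why profiles and profile HMMs can pick up more distant relatives.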
28. Why automatic annotation is needed
• data growth in UniProtKB is fast:

Release   Section of database      No. of entries   Growth
2015_10   reviewed (Swiss-Prot)    ~0.5 million     slow
2015_10   unreviewed (TrEMBL)      >50 million      rapid

• manual curation is time-consuming
• experimental data are unavailable for many
sequences/organisms
• organisms’ genomes are sequenced but often no
biochemical characterization is conducted
29. The Concepts in GO
1. Molecular Function: an elemental activity or task or job
• protein kinase activity
• insulin receptor activity
2. Biological Process: a commonly recognised series of events
• cell division
3. Cellular Component: where a gene product is located
• mitochondrion
• mitochondrial matrix
• mitochondrial inner membrane
30. The relationship between InterPro and GO
(InterPro2GO)
• Curators manually add relevant GO terms to InterPro entries
• InterPro entry specificity determines the GO terms assigned
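Example from the figure (two entries of differing specificity):
Entry with three GO terms:
GO:0007186 G-protein coupled receptor signaling
GO:0016021 integral to membrane
GO:0007601 visual perception
Entry with two GO terms:
GO:0007186 G-protein coupled receptor signaling
GO:0016021 integral to membrane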
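The mapping can be pictured as a lookup from InterPro entry to GO terms, with a protein receiving the union of the terms of every entry it matches. In this minimal sketch the entry labels are hypothetical stand-ins (not real InterPro accessions), and assigning the extra "visual perception" term to the more specific entry is an assumption drawn from the bullet about entry specificity.

```python
# Hypothetical entry labels mapped to the GO terms listed on the slide.
INTERPRO2GO = {
    "ENTRY_GENERAL_RECEPTOR": {"GO:0007186", "GO:0016021"},
    "ENTRY_SPECIFIC_VISUAL": {"GO:0007186", "GO:0016021", "GO:0007601"},
}


def go_terms_for(matched_entries):
    """Union of the GO terms of every InterPro entry a protein matches."""
    terms = set()
    for entry in matched_entries:
        terms |= INTERPRO2GO.get(entry, set())
    return terms


# A protein matching only the general entry gets the two general terms.
print(sorted(go_terms_for(["ENTRY_GENERAL_RECEPTOR"])))
# A protein that also matches the specific entry gains the more specific term.
print(sorted(go_terms_for(["ENTRY_GENERAL_RECEPTOR", "ENTRY_SPECIFIC_VISUAL"])))
```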
32. Using InterPro for annotation
• InterPro is the world’s major source of GO terms:
~90 million GO terms for ~30 million distinct UniProtKB sequences
• Also underlies the system adding annotation to UniProtKB/TrEMBL
• Provides matches to ~40 million proteins (approx. 80% of UniProtKB)
Annotation consistency:
• Using InterPro and GO for annotation allows direct comparison
of proteins in UniProtKB
33. Automatic Annotation in UniProtKB

System    Rule creation   Trigger               Annotations        Scope
SAAS      automatic       taxonomy,             protein names,     all taxa
                          InterPro              EC numbers,
                                                comments, KW,
                                                GO terms
UniRule   manual          taxonomy,             protein names,     all taxa
                          InterPro*,            EC numbers,
                          proteome property,    gene names,
                          sequence length       comments,
                                                features**, KW,
                                                GO terms

* flexibility to create custom signatures, which are submitted to InterPro as required
** predictors for signal, transmembrane and coiled-coil features; alignment for positional ones
34. Components of a rule: conditions
Restrict application of rules to those unreviewed UniProtKB entries
that fulfil the conditions
Types of conditions:
• InterPro signatures
  – functional classification of proteins using predictive models (signatures)
• taxonomy
• sequence features, e.g. length
• proteome features, e.g. outer membrane: yes (bacterial sequences)
35. Components of a rule: annotations
If an unreviewed UniProtKB entry fulfils the conditions of a rule, the annotations
in that rule are propagated to the entry.
Types of annotations:
• protein names, including enzyme classification (EC) numbers
• functional annotation, e.g. catalytic activities
• gene ontology terms
• keywords
• sequence features, e.g. active sites, transmembrane domains
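The two components of a rule, conditions and annotations, can be sketched as follows. This is a minimal illustration under stated assumptions: the class names, field names, signature labels and annotation values are all hypothetical and do not reflect the actual SAAS or UniRule rule format.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    accession: str
    reviewed: bool
    lineage: set       # taxa the organism belongs to
    signatures: set    # signatures the sequence matches
    length: int        # sequence length
    annotations: dict = field(default_factory=dict)


@dataclass
class Rule:
    signatures: set    # condition: all of these signatures must match
    taxon: str         # condition: taxon that must be in the lineage
    min_length: int    # condition: sequence length bounds
    max_length: int
    annotations: dict  # annotations propagated when all conditions hold

    def matches(self, entry):
        """True only for unreviewed entries that fulfil every condition."""
        return (
            not entry.reviewed
            and self.signatures <= entry.signatures
            and self.taxon in entry.lineage
            and self.min_length <= entry.length <= self.max_length
        )

    def apply(self, entry):
        """Propagate the rule's annotations to the entry if it matches."""
        if not self.matches(entry):
            return False
        entry.annotations.update(self.annotations)
        return True


# Hypothetical rule and entries, for illustration only.
rule = Rule(
    signatures={"SIGNATURE_X"},
    taxon="Bacteria",
    min_length=200,
    max_length=400,
    annotations={"protein_name": "Example kinase", "keywords": ["Kinase"]},
)
unreviewed = Entry("ENTRY_1", False, {"Bacteria"}, {"SIGNATURE_X", "SIGNATURE_Y"}, 310)
reviewed = Entry("ENTRY_2", True, {"Bacteria"}, {"SIGNATURE_X"}, 310)
too_short = Entry("ENTRY_3", False, {"Bacteria"}, {"SIGNATURE_X"}, 90)

for entry in (unreviewed, reviewed, too_short):
    print(entry.accession, rule.apply(entry), entry.annotations)
```

Keeping conditions and annotations separate means a curator can tighten a condition (for example the length range) without touching what gets propagated.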
42. Attributing evidence
It needs to be made clear to the user when information is
1. experimentally based
2. predicted
3. transferred from a related species
Use of evidence codes gives this information
Evidence Code Ontology
http://www.ebi.ac.uk/ols/ontologies/eco
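One way to make the origin of each statement explicit is to store an evidence code alongside every annotation. The sketch below is illustrative: pairing the three categories above with these particular ECO terms is an assumption made for the example, and the right code for a given resource should be chosen from the ontology itself.

```python
from dataclasses import dataclass

# Illustrative choice of ECO terms for the three categories on the slide.
EVIDENCE = {
    "experimental": "ECO:0000269",  # experimental evidence used in manual assertion
    "predicted": "ECO:0000256",     # match to sequence model evidence used in automatic assertion
    "transferred": "ECO:0000250",   # sequence similarity evidence used in manual assertion
}


@dataclass(frozen=True)
class Annotation:
    statement: str
    evidence_code: str


def with_evidence(annotations, kind):
    """Return only the annotations supported by the given kind of evidence."""
    code = EVIDENCE[kind]
    return [a for a in annotations if a.evidence_code == code]


# Hypothetical annotations on one entry.
annotations = [
    Annotation("binds oxygen", EVIDENCE["experimental"]),
    Annotation("protein kinase activity", EVIDENCE["predicted"]),
    Annotation("located in mitochondrion", EVIDENCE["transferred"]),
]

for a in with_evidence(annotations, "experimental"):
    print(a.statement, a.evidence_code)
```

With the code stored per annotation, a user interface can let researchers filter to experimentally supported statements only.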