2. ‘Omics’ Data Challenges
Advances in protein science are a major catalyst for the
exploding availability of bioinformatics data
We have already discussed the dimensions of omics
data:
Molecular components, interactions, and phenotype
observations
Data from large-scale experiments are no longer
published conventionally but stored in a database
Protein sequence databases are one of the most
comprehensive information resources for scientists
3. Protein Sequence Databases
Universal protein sequence databases cover all species
Specialized protein databases are particular to a protein
family or organism
Sequence repositories
A simple registry of sequence records
No annotations
Curated protein databases
Enrich sequence information with links to various sources
(scientific literature primarily)
4. Informatics Challenges
A standard data integration challenge is the lack of
common conventions
Applies not just to notation but also to:
Use of identifiers
Representation of cross-references
Framework for defining terms and relationships between
them
Links between omics sources are another important
component of data integration
5. What is UniProt?
A comprehensive repository of protein sequences and
their functional annotations
Curators add value to raw data by annotating it against
the scientific literature
Objective: the creation and maintenance of stable,
comprehensive, and high-quality protein databases
with a high level of accessibility, to facilitate
cross-database information retrieval
Makes use of Semantic Web technologies to address its
challenges
6. UniProt: Core Activities
Sequence archiving
Manual (peer-reviewed) and automated curation of
sequences
Development of a human- and machine-readable UniProt web
site
Interaction with other protein-related databases for
expanding cross references
7. UniProt: Components
UniProtKB – Protein sequence annotations and metadata:
Protein name, function, taxonomy, enzyme-specific
information, domains, sites, subcellular location, interactions,
relationships to disease etc.
Links to external sources: DNA sequence repositories, protein
structure databases, protein domain and family databases, and
species & function-specific data collections
UniRef – Clusters sequences to compress the sequence space at
different resolutions
Parameterized by the percent identity between two sequences or
sub-sequences (100, 90, 50)
UniParc – Non-redundant database of all publicly
available protein sequences
Manages globally unique identifiers, the sequence, information
on the source database, and a CRC checksum
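The non-redundant archive idea can be sketched in a few lines of Python. This is only an illustration, not the UniParc implementation: the class name, the `SEQ…` identifier format, and the source database names are hypothetical, and the standard library's CRC-32 stands in for the CRC check number.

```python
import zlib


class SequenceArchive:
    """Toy non-redundant archive: one record per distinct sequence."""

    def __init__(self):
        self._records = {}  # sequence string -> record dict

    def add(self, sequence, source_db):
        """Register a sequence seen in source_db and return its archive ID."""
        record = self._records.get(sequence)
        if record is None:
            record = {
                # Hypothetical ID format, not the real UniParc scheme.
                "id": f"SEQ{len(self._records) + 1:06d}",
                "sequence": sequence,
                "sources": set(),
                # CRC-32 is an assumption standing in for the CRC check number.
                "crc": zlib.crc32(sequence.encode("ascii")),
            }
            self._records[sequence] = record
        record["sources"].add(source_db)
        return record["id"]

    def sources(self, sequence):
        """List the source databases in which this sequence was seen."""
        return sorted(self._records[sequence]["sources"])

    def __len__(self):
        return len(self._records)


archive = SequenceArchive()
first = archive.add("MKTAYIAKQR", "source_db_a")
second = archive.add("MKTAYIAKQR", "source_db_b")  # same sequence, other source
third = archive.add("MSLLTEVETY", "source_db_a")
```

Submitting the same sequence from two sources yields one record with one identifier and two source entries, which is what keeps the archive non-redundant.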
8. Semantic Web Technologies
Set of standards for managing web-based content in a way
that emphasizes use by an automaton
Automaton: a machine that performs a function according to
a predetermined set of coded instructions
The architectural vision (the Semantic Web) is to extend the
standards and best practices behind the World Wide Web with
new standards that emphasize meaning over structure of
data.
Common data formats
Provide a means to make assertions about the world such that
an automaton can reason about it through them
The vision is often confused with the tools meant to achieve
it (i.e., the set of standards)
10. RDF: Data Model
Standardized format for representing arbitrary
information as a labelled, directed graph
Composed of statements (triples): subject, predicate, object
Terms in statements can be Uniform Resource
Identifiers (URIs), Blank Nodes (anonymous entities), or
Literals
Abstract data model: a labelled, directed graph
Various serializations: XML-based and text-based
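A minimal Python sketch of the data model may help: statements are (subject, predicate, object) tuples, and each term is a URI, a blank node, or a literal. The serializer emits a simplified N-Triples-style text form (only backslashes and quotes are escaped). The `example.org` predicate and the blank node label are hypothetical; the protein URI and its label come from the naming slides later in this deck.

```python
from collections import namedtuple

URI = namedtuple("URI", "value")
BlankNode = namedtuple("BlankNode", "label")
Literal = namedtuple("Literal", "text")

RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def serialize_term(term):
    """Render one term: URIs in angle brackets, blank nodes as _:label, literals quoted."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    if isinstance(term, BlankNode):
        return f"_:{term.label}"
    if isinstance(term, Literal):
        escaped = term.text.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"not an RDF term: {term!r}")


def serialize(triples):
    """Render each statement on its own line, terminated by ' .'."""
    return "\n".join(
        " ".join(serialize_term(term) for term in triple) + " ."
        for triple in triples
    )


protein = URI("http://purl.uniprot.org/uniprot/P04926")
note = BlankNode("a1")  # anonymous node, label chosen for illustration

triples = [
    (protein, URI(RDFS + "label"), Literal("Malaria protein EX-1")),
    (protein, URI("http://example.org/vocab/annotation"), note),  # hypothetical predicate
    (note, URI(RDFS + "comment"), Literal("illustrative annotation")),
]

text = serialize(triples)
```

The same three statements could equally be written in an XML-based serialization; the graph they describe does not change.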
12. Modelling vocabulary: RDFS/OWL
RDF Schema (RDFS)
Simple, minimal schema language for RDF
Web Ontology Language (OWL)
Vocabulary for defining classes, relationships, and various
constraints that limit how RDF is interpreted
More powerful modeling language
Tools for constraining & defining reality that can be
used to codify scientific understanding
Gene Ontology is modelled in this way to capture our
understanding of macromolecular reality
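As a rough illustration of how a schema lets a machine derive statements that were never written down, the Python sketch below applies the RDFS subclass rule: if a resource has type C and C is a subclass of D, the resource also has type D. The class names (`ex:Kinase`, `ex:Enzyme`, `ex:Protein`) are hypothetical and are not taken from Gene Ontology or the UniProt vocabulary.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the rdf:type statements entailed by rdfs:subClassOf but not stated."""
    superclasses = {}
    for subject, predicate, obj in triples:
        if predicate == SUBCLASS_OF:
            superclasses.setdefault(subject, set()).add(obj)

    stated = {(s, o) for s, p, o in triples if p == RDF_TYPE}
    inferred = set()
    frontier = list(stated)
    while frontier:
        instance, cls = frontier.pop()
        for parent in superclasses.get(cls, ()):
            pair = (instance, parent)
            if pair not in stated and pair not in inferred:
                inferred.add(pair)
                frontier.append(pair)  # keep climbing the class hierarchy
    return {(instance, RDF_TYPE, cls) for instance, cls in inferred}


schema_and_data = [
    ("ex:Kinase", SUBCLASS_OF, "ex:Enzyme"),   # hypothetical classes
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:p1", RDF_TYPE, "ex:Kinase"),
]

entailed = infer_types(schema_and_data)
```

OWL adds further constructs (for example symmetric or transitive properties) that support richer inferences of the same kind.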
14. Query Language: SPARQL
Provides a common graph-matching language for
querying RDF data
Similar to SQL in many respects
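Graph matching can be illustrated without a SPARQL engine. The Python sketch below evaluates a list of triple patterns against a small set of triples, binding terms that start with `?` as variables, which is roughly what a SPARQL basic graph pattern does. It is a toy matcher, not SPARQL itself, and the `ex:` names and data are hypothetical.

```python
def unify(pattern, triple, bindings):
    """Match one triple pattern against one triple; return extended bindings or None."""
    extended = dict(bindings)
    for pattern_term, triple_term in zip(pattern, triple):
        if pattern_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(pattern_term, triple_term) != triple_term:
                return None
        elif pattern_term != triple_term:
            return None
    return extended


def match(patterns, triples):
    """Return every variable binding that satisfies all patterns together."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for bindings in solutions:
            for triple in triples:
                extended = unify(pattern, triple, bindings)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


triples = [
    ("ex:P1", "ex:organism", "ex:organismA"),
    ("ex:P2", "ex:organism", "ex:organismB"),
    ("ex:organismA", "ex:name", "Organism A"),
]

# Roughly: SELECT * WHERE { ?protein ex:organism ?org . ?org ex:name ?name }
patterns = [
    ("?protein", "ex:organism", "?org"),
    ("?org", "ex:name", "?name"),
]

solutions = match(patterns, triples)
```

The shared variable `?org` plays the role that a join condition plays in SQL, which is where the similarity between the two languages is most visible.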
15. Nature of UniProt Data
Very large number of cross references to external
resources
Cross-reference topology is that of a graph, not a tree
Automated and manual annotation require storage of
provenance information (how / when data was
acquired)
Requires a framework for both data and metadata
(data about data)
17. UniProt: Data Conventions
All outbound RDF statements are grouped together
(statements about the same subject)
Each dataset (a node in the previous graph) is distributed as a
single file
Only stated data is stored, not entailed data
For instance, relationships involving symmetric properties
are only stored in one direction
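A consumer that wants the entailed direction of a symmetric property has to derive it. A minimal Python sketch, assuming a hypothetical symmetric property `ex:interactsWith` and made-up resources:

```python
def entailed_by_symmetry(stated, symmetric_properties):
    """Return the reversed statements implied by symmetric properties but not stated."""
    stated = set(stated)
    reversed_statements = {
        (obj, predicate, subject)
        for subject, predicate, obj in stated
        if predicate in symmetric_properties
    }
    return reversed_statements - stated


stated = {
    ("ex:P1", "ex:interactsWith", "ex:P2"),  # stored in one direction only
    ("ex:P1", "ex:encodedBy", "ex:G1"),      # not symmetric, so never reversed
}

entailed = entailed_by_symmetry(stated, {"ex:interactsWith"})
```

Storing only the stated direction keeps the distributed files smaller and leaves the choice of inference rules to the consumer.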
19. UniProt: Naming Conventions
Generally, in semiotics: a symbol denotes a referent.
In Web architecture, URIs identify resources
URIs that can be resolved over the web are URLs
UniProt URIs identify:
Resources that correspond to database entries
Modeling vocabulary that uses standard namespaces: RDFS
and OWL
Classes and properties used by UniProt
For example: http://purl.uniprot.org/core/Gene
Resources without stable identifiers (from their source)
20. The Omics Identification Problem
UniProt uses a templated naming convention:
http://purl.uniprot.org/{database}/{identifier}
http://purl.uniprot.org/uniprot/{protein_identifier}
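The template can be applied and reversed mechanically. A small Python sketch, in which the helper names `build_uri` and `split_uri` are illustrative and not part of any UniProt tooling:

```python
PURL_BASE = "http://purl.uniprot.org/"


def build_uri(database, identifier):
    """Fill in the {database}/{identifier} template."""
    return f"{PURL_BASE}{database}/{identifier}"


def split_uri(uri):
    """Recover (database, identifier) from a URI that follows the template."""
    if not uri.startswith(PURL_BASE):
        raise ValueError(f"not a purl.uniprot.org URI: {uri}")
    database, separator, identifier = uri[len(PURL_BASE):].partition("/")
    if not (database and separator and identifier):
        raise ValueError(f"URI does not match the template: {uri}")
    return database, identifier


protein_uri = build_uri("uniprot", "P04926")
parts = split_uri(protein_uri)
```

Because the pattern is fixed, anyone who knows a database name and an entry identifier can construct the URI without looking it up.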
Problem
http://purl.uniprot.org/uniprot/P04926 denotes the Malaria
protein EX-1
If loading that address in a browser returns a web page, can an
automaton infer that Malaria protein EX-1 is a web page?
How do you identify abstract concepts vs. digital media?
21. The PURL Solution
A Persistent Uniform Resource Locator (PURL) service is a public
URI management service that allocates a ‘URI space’:
a mapping of identifiers (aliases) to resources that the
service is not itself responsible for
PURLs are web addresses that act as permanent
identifiers in the face of a dynamic and changing Web
infrastructure
A request to a PURL returns an HTTP 303 (See Other) status code
and a location:
303 indicates that a response to the request can be found under
the returned location
22. The PURL Solution: Continued
Can use PURL addresses to identify abstract concepts
Redirect requests to such addresses to an informative
web page (for humans) with a means for machines to
extract other formats
RDF statements are about proteins, machines can
reason about proteins, and humans resolve protein
identifiers to view informative web pages
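The redirect behaviour can be modelled as a pure function in Python, with no network access. This is a sketch of the idea only: the target locations under `www.example.org` and the choice of format by `Accept` value are assumptions, since the slides do not give the real redirect targets.

```python
# Hypothetical document locations; the real redirect targets are not given here.
DOCUMENT_BASE = "http://www.example.org/entries/"
PROTEIN_PREFIX = "http://purl.uniprot.org/uniprot/"
FORMATS = {"text/html": ".html", "application/rdf+xml": ".rdf"}


def resolve(purl, accept="text/html"):
    """Return the (status, headers) a PURL-style resolver might send."""
    if not purl.startswith(PROTEIN_PREFIX):
        return 404, {}
    identifier = purl[len(PROTEIN_PREFIX):]
    extension = FORMATS.get(accept, ".html")  # fall back to the human-readable page
    # 303: the URI names the protein; a document about it lives elsewhere.
    return 303, {"Location": f"{DOCUMENT_BASE}{identifier}{extension}"}


human = resolve("http://purl.uniprot.org/uniprot/P04926")
machine = resolve("http://purl.uniprot.org/uniprot/P04926", accept="application/rdf+xml")
```

The protein URI itself never returns a document, so an automaton has no reason to conclude that the protein is a web page; it only learns where documents describing the protein can be found.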
26. Serendipitous Re-use
Having a rich repository of protein sequence metadata,
annotations, and taxonomic classification in a
distributed, standard format encourages scientific
collaboration
27. General UniProt Re-Use Scenario
User A refers to protein P1 in their dataset
User A’s dataset doesn’t include statements about P1 (the
host organism for instance)
User B comes across this dataset and (in order to find
out more about protein P1) puts the URI of protein P1
in their browser and pulls up human-readable
information about it (including the host organism)
Automaton C comes across the same dataset, fetches
the web page, fetches the RDF about P1 and has access
to the same information as user B and can reason about
the major taxon the host organism belongs to
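The scenario for Automaton C can be sketched in Python as a merge of two sets of triples that share the URI of P1. All data here is made up: `P1` is the placeholder from the scenario, and the `ex:` properties and taxon names are hypothetical.

```python
P1 = "http://purl.uniprot.org/uniprot/P1"  # placeholder URI for protein P1

# User A's dataset mentions P1 but says nothing about its host organism.
dataset_a = {("ex:experiment1", "ex:measures", P1)}

# Statements the automaton would obtain by fetching the RDF about P1.
fetched = {
    (P1, "ex:organism", "ex:organismX"),
    ("ex:organismX", "ex:partOf", "ex:genusY"),
    ("ex:genusY", "ex:partOf", "ex:majorTaxonZ"),
}


def objects(graph, subject, predicate):
    """All objects of statements with the given subject and predicate."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


def lineage(graph, taxon):
    """Follow ex:partOf links upward from a taxon, guarding against cycles."""
    chain, seen = [], {taxon}
    parents = objects(graph, taxon, "ex:partOf")
    while parents and parents[0] not in seen:
        taxon = parents[0]
        seen.add(taxon)
        chain.append(taxon)
        parents = objects(graph, taxon, "ex:partOf")
    return chain


merged = dataset_a | fetched  # same URI for P1, so the two graphs join up
host = objects(merged, P1, "ex:organism")[0]
taxa = lineage(merged, host)
```

No mapping step is needed before the merge: because both sources use the same identifier for P1, a plain set union is enough to connect them.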
28. References
Wu, C. et al., “The Universal Protein Resource
(UniProt): an expanding universe of protein
information”. Nucleic Acids Research, vol. 34, 2006.
Swiss Institute of Bioinformatics, “UniProt RDF (project
page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/
Redaschi, N. and UniProt Consortium, “UniProt in RDF:
Tackling Data Integration and Distributed Annotation”.
Nature Precedings, 3rd International Biocuration
Conference, April 2009.
http://precedings.nature.com/documents/3193/version/1