UniProt and the Semantic Web

Chimezie Ogbuji

‘Omics’ Data Challenges
 Advances in protein science are a major catalyst for the
exploding availability of bioinformatics data
 We have already discussed the dimensions of omics
data:
 Molecular components, interactions, and phenotype
observations

 Data from large-scale experiments are no longer
published conventionally but stored in a database
 Protein sequence databases are among the most
comprehensive information resources for scientists

Protein Sequence Databases
 Universal protein sequence databases cover all species

 Specialized protein databases are particular to a protein
family or organism

 Sequence repositories
 A simple registry of sequence records
 No annotations

 Curated protein databases
 Enrich sequence information with links to various sources
(scientific literature primarily)

Informatics Challenges
 A standard data integration challenge is the lack of
common conventions

 Applies not just to notation but also to:
 Use of identifiers
 Representation of cross-references
 Framework for defining terms and relationships between
them

 Links between omics sources are another important
component of data integration

What is UniProt?
 A comprehensive repository of protein sequences and
their functional annotations

 Curators add value to raw data by annotating it against
the scientific literature

 Objective: the creation and maintenance of stable,
comprehensive, and high-quality protein databases,
with a high level of accessibility, to facilitate
cross-database information retrieval

 Makes use of Semantic Web technologies to address its
challenges

UniProt: Core Activities
 Sequence archiving

 Manual (peer-reviewed) and automated curation of
sequences

 Development of a human- and machine-readable UniProt
website

 Interaction with other protein-related databases for
expanding cross references

UniProt: Components
 UniProtKB – Protein sequence annotations and metadata:
 Protein name, function, taxonomy, enzyme-specific
information, domains, sites, subcellular location, interactions,
relationships to disease etc.
 Links to external sources: DNA sequence repositories, protein
structure databases, protein domain and family databases, and
species & function-specific data collections
 UniRef – Compresses sequences at different resolutions
 Parameterized by percent sequence identity between two sequences
or sub-sequences (100, 90, 50).
 UniParc – Non-redundant database of all publicly
available protein sequences
 Manages globally unique identifiers, the sequence, information
on the source database, and a CRC checksum.
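
To make the UniRef identity thresholds concrete, the following is a minimal sketch of a position-wise percent identity calculation. It assumes the two sequences are already aligned to the same length, which is a simplification; the function names `percent_identity` and `uniref_levels` and the sample sequences are hypothetical, and this is not a description of how UniRef itself builds its clusters.

```python
def percent_identity(a: str, b: str) -> float:
    """Naive position-wise identity for two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


def uniref_levels(identity: float) -> list:
    """Return which of the thresholds 100, 90, 50 an identity value satisfies."""
    return [threshold for threshold in (100, 90, 50) if identity >= threshold]


# Two made-up ten-residue sequences that differ in the last position only.
identity = percent_identity("MKTAYIAKQR", "MKTAYIAKQK")
levels = uniref_levels(identity)
```

Under these assumptions the two sample sequences would fall together at the 90 and 50 levels but not at the 100 level.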

Semantic Web Technologies
 Set of standards for managing web-based content in a way
that emphasizes use by an automaton
 Automaton: a machine that performs a function according to
a predetermined set of coded instructions
 The architectural vision (the Semantic Web) is to extend the
standards and best practices behind the World Wide Web with
new standards that emphasize meaning over structure of
data.
 Common data formats
 Provide a means to make assertions about the world such that
an automaton can reason about it through them
 The vision is often confused with the tools meant to achieve
it (i.e., set of standards)

RDF: Data Model
 Standardized format for representing arbitrary
information as a labelled, directed graph

 Composed of statements: subject, predicate, object

 Terms in statements can be Uniform Resource
Identifiers (URIs), Blank Nodes (anonymous entities), or
Literals

 Abstract data model: a labelled, directed graph

 Various serializations: XML-based and text-based
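
The data model above can be sketched in a few lines of Python. This is an illustrative sketch only: the three term classes, the `http://example.org/vocab/` predicates, and the blank node label are assumptions made for the example, and the output follows the text-based N-Triples serialization in simplified form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    text: str


def term_to_nt(term) -> str:
    """Render one term in N-Triples-style syntax."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    if isinstance(term, BlankNode):
        return f"_:{term.label}"
    if isinstance(term, Literal):
        escaped = (term.text.replace("\\", "\\\\")
                            .replace('"', '\\"')
                            .replace("\n", "\\n"))
        return f'"{escaped}"'
    raise TypeError(f"not an RDF term: {term!r}")


def to_ntriples(statements) -> str:
    """Render (subject, predicate, object) statements, one per line."""
    return "\n".join(
        f"{term_to_nt(s)} {term_to_nt(p)} {term_to_nt(o)} ." for s, p, o in statements
    )


protein = URI("http://purl.uniprot.org/uniprot/P04926")
VOCAB = "http://example.org/vocab/"  # hypothetical vocabulary for illustration

statements = [
    (protein, URI(VOCAB + "label"), Literal("Malaria protein EX-1")),
    (protein, URI(VOCAB + "annotation"), BlankNode("a1")),
    (BlankNode("a1"), URI(VOCAB + "comment"), Literal("an anonymous annotation")),
]

print(to_ntriples(statements))
```

Each statement is one edge of the labelled, directed graph: the subject and object are nodes and the predicate is the edge label.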

Modelling vocabulary: RDFS/OWL
 RDF Schema (RDFS)
 Simple, minimal schema language for RDF

 Web Ontology Language (OWL)
 Vocabulary for defining classes, relationships, and various
constraints that limit how RDF is interpreted
 More powerful modeling language

 Tools for constraining & defining reality that can be
used to codify scientific understanding
 Gene Ontology is modelled in this way to capture our
understanding of macromolecular reality
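
As a small illustration of how a schema constrains interpretation, the sketch below applies one RDFS rule: an instance of a class is also an instance of every superclass. The `ex:` class names and the function `infer_types` are hypothetical, and the prefixed names stand in for full URIs.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Return the rdf:type triples entailed by rdfs:subClassOf but not stated."""
    superclasses = {}
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            superclasses.setdefault(s, set()).add(o)

    def ancestors(cls):
        # Walk the subclass hierarchy upward, guarding against cycles.
        seen = set()
        stack = [cls]
        while stack:
            current = stack.pop()
            for parent in superclasses.get(current, ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    inferred = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            for parent in ancestors(o):
                inferred.add((s, RDF_TYPE, parent))
    return inferred - set(triples)


# Hypothetical mini-ontology and one typed instance.
triples = {
    ("ex:Enzyme", SUBCLASS_OF, "ex:Protein"),
    ("ex:Protein", SUBCLASS_OF, "ex:Macromolecule"),
    ("ex:p1", RDF_TYPE, "ex:Enzyme"),
}
entailed = infer_types(triples)
```

Here `entailed` holds the two statements that `ex:p1` is also an `ex:Protein` and an `ex:Macromolecule`, neither of which was stated explicitly.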

Query Language: SPARQL
 Provides a common graph-matching language for
querying RDF data

 Similar to SQL in many respects
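
Graph matching can be sketched without a SPARQL engine. The toy matcher below handles only a conjunction of triple patterns, with variables written as `?name`; it is an assumption-laden illustration of the idea, not a SPARQL implementation, and the `ex:` terms are hypothetical.

```python
def match_pattern(pattern, triple, bindings):
    """Extend bindings so that pattern matches triple; return None on mismatch."""
    result = dict(bindings)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return result


def query(graph, patterns):
    """Return every variable binding that satisfies all triple patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            for extended in [match_pattern(pattern, triple, bindings)]
            if extended is not None
        ]
    return solutions


graph = {
    ("ex:p1", "ex:organism", "ex:org1"),
    ("ex:org1", "ex:name", "Example organism"),
    ("ex:p2", "ex:organism", "ex:org2"),
}

# Roughly equivalent SPARQL:
#   SELECT ?protein ?name
#   WHERE { ?protein ex:organism ?org . ?org ex:name ?name . }
results = query(graph, [
    ("?protein", "ex:organism", "?org"),
    ("?org", "ex:name", "?name"),
])
```

The shared variable `?org` plays the role a join condition plays in SQL, which is one sense in which the two languages are similar.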

Nature of UniProt Data
 Very large number of cross references to external
resources

 Cross-reference topology is that of a graph, not a tree

 Automated and manual annotation both require storage of
provenance information (how / when data was
acquired)

 Requires a framework for both data and metadata
(data about data)

UniProt: Data Conventions
 All outbound RDF statements are grouped together
(statements about the same subject)

 Each dataset (a node in the previous graph) is distributed
as a single file

 Only stores stated data, not entailed data.
 For instance, relationships involving symmetric properties
are only stored in one direction
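
The distinction between stated and entailed data can be illustrated with symmetric properties. In this sketch a consumer recovers the reverse direction that the distribution leaves out; the property `ex:interactsWith` and the function name are hypothetical.

```python
def entail_symmetric(stated, symmetric_properties):
    """Return the stated triples plus the reverse of each symmetric statement."""
    entailed = set(stated)
    for s, p, o in stated:
        if p in symmetric_properties:
            entailed.add((o, p, s))
    return entailed


stated = {
    ("ex:p1", "ex:interactsWith", "ex:p2"),  # stored in one direction only
    ("ex:p1", "ex:name", "P1"),
}
full = entail_symmetric(stated, {"ex:interactsWith"})
```

Storing only the stated direction halves the size of such relationships in the file, at the cost of requiring the consumer to apply the rule.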

UniProt: Naming Conventions
 Generally, in semiotics: a symbol denotes a referent.

 In Web architecture, URIs identify resources
 URIs that can be resolved over the web are URLs

 UniProt URIs identify:
 Resources that correspond to database entries
 Modeling vocabulary that uses standard namespaces: RDFS
and OWL
 Classes and properties used by UniProt
 For example: http://purl.uniprot.org/core/Gene
 Resources without stable identifiers (from their source)

The Omics Identification Problem
 UniProt uses a templated naming convention:
 http://purl.uniprot.org/{database}/{identifier}
 http://purl.uniprot.org/uniprot/{protein_identifier}
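
The template above can be expressed as a pair of helper functions. This is a minimal sketch; the function names are hypothetical and the percent-encoding of path segments is an assumption added for safety rather than something the slides specify.

```python
from urllib.parse import quote, unquote

PURL_BASE = "http://purl.uniprot.org"


def uniprot_uri(database: str, identifier: str) -> str:
    """Fill in the template http://purl.uniprot.org/{database}/{identifier}."""
    return f"{PURL_BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"


def parse_uniprot_uri(uri: str):
    """Split a templated URI back into (database, identifier)."""
    prefix = PURL_BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a purl.uniprot.org URI: {uri}")
    database, sep, identifier = uri[len(prefix):].partition("/")
    if not sep or not database or not identifier:
        raise ValueError(f"URI does not match the template: {uri}")
    return unquote(database), unquote(identifier)


uri = uniprot_uri("uniprot", "P04926")
```

Because the mapping is purely syntactic, a dataset can mint a reference to any entry without contacting the server.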

 Problem
 http://purl.uniprot.org/uniprot/P04926 denotes the Malaria
protein EX-1
 If loading that address in a browser returns a web page, can an
automaton infer that Malaria protein EX-1 is a web page?
 How do you identify abstract concepts vs. digital media?

The PURL Solution
 Persistent Uniform Resource Locator (PURL) is a public
URI management service for allocating a ‘URI space’ as
a mapping of identifiers (aliases) for resources they are
not immediately responsible for
 PURLs are web addresses that act as permanent
identifiers in the face of a dynamic and changing Web
infrastructure
 A request to a PURL returns a 303 HTTP status code and
a location:
 303 indicates that a response can be found under the
returned location
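
How an automaton might act on the 303 response can be sketched as a pure function over a status code and headers. The interpretation of a 2xx response as "the URI identifies a document" follows common Web architecture guidance and is an assumption here, not something the slides state; the function name and return shape are hypothetical, and headers are looked up case-sensitively for brevity.

```python
def interpret_lookup(status: int, headers: dict) -> dict:
    """Decide what a URI may denote from the response to a GET request."""
    if status == 303:
        location = headers.get("Location")
        if location is None:
            raise ValueError("303 response without a Location header")
        # The URI may denote anything, including an abstract concept such as
        # a protein; a description of it lives at the returned location.
        return {"uri_denotes": "any resource", "description_at": location}
    if 200 <= status < 300:
        # The server answered directly, so the URI identifies the document.
        return {"uri_denotes": "web document", "description_at": None}
    return {"uri_denotes": "unknown", "description_at": None}


result = interpret_lookup(
    303, {"Location": "http://www.uniprot.org/uniprot/P04926"}
)
```

With this reading, the automaton never has to conclude that the protein itself is a web page: the redirect separates the thing identified from the page that describes it.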

The PURL Solution: Continued
 Can use PURL addresses to identify abstract concepts

 Redirect requests to such addresses to an informative
web page (for humans) with a means for machines to
extract other formats

 RDF statements are about proteins, machines can
reason about proteins, and humans resolve protein
identifiers to view informative web pages

 RDF/XML link:

 http://www.uniprot.org/uniprot/P04926.rdf

Serendipitous Re-use
 Having a rich repository of protein sequence metadata,
annotations, and taxonomic classification in a
distributed, standard format encourages scientific
collaboration

General UniProt Re-Use Scenario
 User A refers to protein P1 in their dataset
 User A’s dataset doesn’t include statements about P1 (the
host organism for instance)

 User B comes across this dataset and (in order to find
out more about protein P1) puts the URI of protein P1
in their browser and pulls up human-readable
information about it (including the host organism)
 Automaton C comes across the same dataset, fetches
the web page, fetches the RDF about P1 and has access
to the same information as user B and can reason about
the major taxon the host organism belongs to
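
The scenario for automaton C can be sketched with in-memory data standing in for the Web. Everything here is hypothetical: the `ex:` identifiers, the `published` dictionary that simulates dereferencing a URI, and the assumption that each taxon has a single parent.

```python
def fetch_rdf(uri, published):
    """Stand-in for dereferencing a URI: return the triples published about it."""
    return published.get(uri, set())


def lineage(graph, taxon, parent_property):
    """Follow parent links upward from a taxon and return the ancestors in order."""
    chain = []
    seen = {taxon}
    current = taxon
    while True:
        parents = sorted(o for s, p, o in graph if s == current and p == parent_property)
        if not parents or parents[0] in seen:
            return chain
        current = parents[0]
        seen.add(current)
        chain.append(current)


# User A's dataset mentions P1 but says nothing about its host organism.
dataset_a = {("ex:sample1", "ex:mentions", "ex:P1")}

# What the protein database would return for P1 (simulated).
published = {
    "ex:P1": {
        ("ex:P1", "ex:organism", "ex:org1"),
        ("ex:org1", "ex:parentTaxon", "ex:genus1"),
        ("ex:genus1", "ex:parentTaxon", "ex:kingdom1"),
    }
}

# Automaton C merges both graphs, then reasons over the combined statements.
merged = dataset_a | fetch_rdf("ex:P1", published)
organism = next(o for s, p, o in merged if s == "ex:P1" and p == "ex:organism")
ancestors = lineage(merged, organism, "ex:parentTaxon")
```

Merging is a plain set union because both sources use the same URI for P1, which is what makes the re-use serendipitous rather than planned.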

References

 Wu, C. et al., “The Universal Protein Resource
(UniProt): an expanding universe of protein
information”. Nucleic Acids Research, vol. 34, 2006.

 Swiss Institute of Bioinformatics, “UniProt RDF (project
page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/

 Redaschi, N. and UniProt Consortium, “UniProt in RDF:
Tackling Data Integration and Distributed Annotation”.
Nature Precedings, 3rd International Biocuration
Conference, April 2009.
http://precedings.nature.com/documents/3193/version/1
