UniProt and the Semantic Web
                     Chimezie Ogbuji
‘Omics’ Data Challenges
 Advances in protein science are a major catalyst in the
  exploding availability of bioinformatics data
 We have already discussed the dimensions of omics
  data:
   Molecular components, interactions, and phenotype
    observations

 Data from large-scale experiments are no longer
  published conventionally but stored in a database
 Protein sequence databases are one of the most
  comprehensive information resources for scientists
Protein Sequence Databases
 Universal protein sequence databases cover all species

 Specialized protein databases are particular to a protein
  family or organism

 Sequence repositories
   A simple registry of sequence records
   No annotations

 Curated protein databases
   Enrich sequence information with links to various sources
    (scientific literature primarily)
Informatics Challenges
 A standard data integration challenge is the lack of
  common conventions

 This applies not just to notation but also to:
   Use of identifiers
   Representation of cross-references
   Framework for defining terms and relationships between
    them

 Links between omics sources are another important
  component of data integration
What is UniProt?
 A comprehensive repository of protein sequences and
  their functional annotations

 Curators add value to raw data by annotating it against
  the scientific literature

 Objective: the creation and maintenance of stable,
  comprehensive, and high-quality protein databases,
  with a high level of accessibility, to facilitate
  cross-database information retrieval

 Makes use of Semantic Web technologies to address its
  challenges
UniProt: Core Activities
 Sequence archiving

 Manual (peer-reviewed) and automated curation of
  sequences

 Development of a human- and machine-readable UniProt
  website

 Interaction with other protein-related databases for
  expanding cross references
UniProt: Components
  UniProtKB – Protein sequence annotations and metadata:
    Protein name, function, taxonomy, enzyme-specific
     information, domains, sites, subcellular location, interactions,
     relationships to disease etc.
    Links to external sources: DNA sequence repositories, protein
     structure databases, protein domain and family databases, and
     species & function-specific data collections
  UniRef – Compresses sequences at different resolutions
    Parameterized by the percent identity between two sequences
     or sub-sequences (100, 90, 50).
  UniParc – Non-redundant database of all publicly
   available protein sequences
    Manages globally unique identifiers, the sequence, information
     on the source database, and a CRC checksum.
Semantic Web Technologies
 Set of standards for managing web-based content in a way
  that emphasizes use by an automaton
   Automaton: a machine that performs a function according to
    a predetermined set of coded instructions
 The architectural vision (the Semantic Web) is to extend the
  standards and best practices behind the World Wide Web with
  new standards that emphasize the meaning over the structure of
  data.
   Common data formats
   Provide a means to make assertions about the world such that
     an automaton can reason about it through them
 The vision is often confused with the tools meant to achieve
  it (i.e., the set of standards)
RDF: Data Model
 Standardized format for representing arbitrary
  information as a labelled, directed graph

 Composed of statements (triples): subject, predicate, object

 Terms in statements can be Uniform Resource
  Identifiers (URIs), Blank Nodes (anonymous entities), or
  Literals

 Abstract data model: a labelled, directed graph

 Various serializations: XML-based and text-based
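
As a rough illustration of the data model, the sketch below holds a few statements as (subject, predicate, object) tuples in plain Python. The `URI`, `BlankNode`, and `Literal` classes, the `example.org` property names, and the literal value are hypothetical stand-ins chosen for illustration; they are not UniProt's actual vocabulary or data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


EX = "http://example.org/vocab/"  # hypothetical namespace, not a real vocabulary

protein = URI("http://purl.uniprot.org/uniprot/P04926")
gene = BlankNode("g1")  # an anonymous entity with no global name

# A graph is simply a set of statements (triples).
graph = {
    (protein, URI(EX + "encodedBy"), gene),
    (gene, URI(EX + "name"), Literal("hypothetical-gene-name")),
}


def outgoing(graph, subject):
    """Return the (predicate, object) pairs of all statements about `subject`."""
    pairs = [(p, o) for s, p, o in graph if s == subject]
    return sorted(pairs, key=lambda pair: pair[0].value)


for predicate, obj in outgoing(graph, protein):
    print(predicate.value, "->", obj)
```

Each tuple is one labelled, directed edge: the subject and object are nodes, and the predicate is the edge label.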
Information About John Smith
Modelling vocabulary: RDFS/OWL
 RDF Schema (RDFS)
   Simple, minimal schema language for RDF

 Web Ontology Language (OWL)
   Vocabulary for defining classes, relationships, and various
    constraints that limit how RDF is interpreted
   More powerful modeling language

 Tools for constraining & defining reality that can be
  used to codify scientific understanding
 Gene Ontology is modelled in this way to capture our
  understanding of macromolecular reality
Query Language: SPARQL
 Provides a common graph-matching language for
  querying RDF data

 Similar to SQL in many respects
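
To make "graph matching" concrete, here is a minimal sketch of how a basic graph pattern can be evaluated against a list of triples. It is a toy matcher written for illustration, not a SPARQL engine, and the `ex:` identifiers and data are hypothetical. Terms beginning with `?` play the role of SPARQL variables.

```python
def match_pattern(pattern, triple, bindings):
    """Try to extend `bindings` so that `pattern` matches `triple`.

    Returns the extended bindings, or None if the triple does not match.
    """
    extended = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def query(graph, patterns):
    """Return every set of variable bindings that satisfies all patterns."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            if (extended := match_pattern(pattern, triple, bindings)) is not None
        ]
    return solutions


# Hypothetical data, for illustration only.
graph = [
    ("ex:P1", "ex:organism", "ex:T1"),
    ("ex:P2", "ex:organism", "ex:T2"),
    ("ex:T1", "ex:name", "Organism one"),
]

# Roughly corresponds to the SPARQL query:
#   SELECT ?protein ?name
#   WHERE { ?protein ex:organism ?taxon . ?taxon ex:name ?name }
results = query(graph, [
    ("?protein", "ex:organism", "?taxon"),
    ("?taxon", "ex:name", "?name"),
])
```

The shared variable `?taxon` acts like a join key, which is one sense in which SPARQL resembles SQL.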
Nature of UniProt Data
 Very large number of cross references to external
  resources

 Cross-reference topology is that of a graph, not a tree

 Automated and manual annotation require storage of
  provenance information (how / when data was
  acquired)

 Requires a framework for both data and metadata
  (data about data)
UniProt Distribution
UniProt: Data Conventions
 All outbound RDF statements are grouped together
  (statements about the same subject)

 Each dataset (a node in the previous graph) is distributed as a
  single file

 Only stores stated data, not entailed data.
   For instance, relationships involving symmetric properties
    are only stored in one direction
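
A consumer that needs the entailed statements can derive them after loading the data. The sketch below shows one way to do this for symmetric properties; the `ex:interactsWith` property and the protein identifiers are hypothetical and stand in for whatever symmetric properties the vocabulary actually declares.

```python
def with_symmetric_entailments(triples, symmetric_properties):
    """Return the stated triples plus the reverse of every symmetric one."""
    entailed = set(triples)
    for subject, predicate, obj in triples:
        if predicate in symmetric_properties:
            entailed.add((obj, predicate, subject))
    return entailed


# Hypothetical stated data: the relationship is stored in one direction only.
stated = {
    ("ex:P1", "ex:interactsWith", "ex:P2"),
    ("ex:P1", "ex:organism", "ex:T1"),
}

closure = with_symmetric_entailments(stated, {"ex:interactsWith"})
```

Storing only the stated direction keeps the distributed files smaller, at the cost of pushing this step onto the consumer.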
UniProt: Naming Conventions
 Generally, in semiotics: a symbol denotes a referent.

 In Web architecture, URIs identify resources
   URIs that can be resolved over the web are URLs

 UniProt URIs identify:
   Resources that correspond to database entries
   Modeling vocabulary that uses standard namespaces: RDFS
    and OWL
   Classes and properties used by UniProt
     For example: http://purl.uniprot.org/core/Gene
   Resources without stable identifiers (from their source)
The Omics Identification Problem
 UniProt uses a templated naming convention:
   http://purl.uniprot.org/{database}/{identifier}
   http://purl.uniprot.org/uniprot/{protein_identifier}

 Problem
     http://purl.uniprot.org/uniprot/P04926 denotes the Malaria
      protein EX-1
     If loading that address in a browser returns a web page, can an
      automaton infer that Malaria protein EX-1 is a web page?
     How do you identify abstract concepts vs. digital media?
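
The templated naming convention above can be sketched as a pair of small helpers. The function names are hypothetical, and percent-encoding the path segments is an assumption made here for safety rather than something the slides specify.

```python
from urllib.parse import quote, unquote

PURL_BASE = "http://purl.uniprot.org"


def uniprot_purl(database, identifier):
    """Fill in the template http://purl.uniprot.org/{database}/{identifier}."""
    # Percent-encoding is an assumption, not part of the stated convention.
    return f"{PURL_BASE}/{quote(database, safe='')}/{quote(identifier, safe='')}"


def split_purl(uri):
    """Recover (database, identifier) from a URI that follows the template."""
    prefix = PURL_BASE + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not a UniProt PURL: {uri}")
    database, _, identifier = uri[len(prefix):].partition("/")
    return unquote(database), unquote(identifier)


print(uniprot_purl("uniprot", "P04926"))
print(split_purl("http://purl.uniprot.org/core/Gene"))
```

Building URIs from a single template makes identifiers predictable, but it does not by itself settle what a URI denotes; that is the problem posed above.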
The PURL Solution
 A Persistent Uniform Resource Locator (PURL) service is a public
  URI management service that allocates a ‘URI space’: a
  mapping of identifiers (aliases) to resources that the
  service is not immediately responsible for
 PURLs are web addresses that act as permanent
  identifiers in the face of a dynamic and changing Web
  infrastructure
 A request to a PURL returns a 303 HTTP status code and
  a location:
   303 (See Other) indicates that a response can be found under the
    returned location
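
The exchange can be simulated without touching the network. In the sketch below, the `REDIRECTS` table and both function names are hypothetical, and the target address is an assumption about where the PURL points; a real client would issue an HTTP request and read the `Location` header instead.

```python
from http import HTTPStatus

# Hypothetical redirect table standing in for the PURL service.
REDIRECTS = {
    "http://purl.uniprot.org/uniprot/P04926":
        "http://www.uniprot.org/uniprot/P04926",  # assumed target address
}


def resolve_purl(uri):
    """Simulate a PURL lookup: return (status, location)."""
    location = REDIRECTS.get(uri)
    if location is None:
        return HTTPStatus.NOT_FOUND, None
    return HTTPStatus.SEE_OTHER, location


def describe(uri):
    """Interpret the response the way a careful automaton might."""
    status, location = resolve_purl(uri)
    if status == HTTPStatus.SEE_OTHER:
        # 303 means the document lives elsewhere, so the URI itself
        # need not denote a web page.
        return f"{uri} names a resource described at {location}"
    return f"{uri} could not be resolved"


print(describe("http://purl.uniprot.org/uniprot/P04926"))
```

Because the response is a redirect rather than a page, the identifier and the document describing it remain two distinct things.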
The PURL Solution: Continued
 Can use PURL addresses to identify abstract concepts

 Redirect requests to such addresses to an informative
  web page (for humans) with a means for machines to
  extract other formats

 RDF statements are about proteins, machines can
  reason about proteins, and humans resolve protein
  identifiers to view informative web pages
 RDF/XML link:

    http://www.uniprot.org/uniprot/P04926.rdf
UniProt: Protein Class
UniProt: Annotation Hierarchy
Serendipitous Re-use
 Having a rich repository of protein sequence metadata,
  annotations, and taxonomic classification in a
  distributed, standard format encourages scientific
  collaboration
General UniProt Re-Use Scenario
 User A refers to protein P1 in their dataset
   User A’s dataset doesn’t include statements about P1 (the
    host organism for instance)

 User B comes across this dataset and (in order to find
  out more about protein P1) puts the URI of protein P1
  in their browser and pulls up human-readable
  information about it (including the host organism)
 Automaton C comes across the same dataset, fetches
  the web page, fetches the RDF about P1 and has access
  to the same information as user B and can reason about
  the major taxon the host organism belongs to
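
The last step of this scenario, reasoning from a host organism up to its major taxon, can be sketched as a walk over parent links. All identifiers and the two lookup tables below are hypothetical placeholders for statements an automaton would extract from the fetched RDF.

```python
def ancestors(taxon, parent_of):
    """Follow parent links upward and return the chain of broader taxa."""
    chain = []
    seen = {taxon}
    while taxon in parent_of:
        taxon = parent_of[taxon]
        if taxon in seen:  # guard against accidental cycles in the data
            break
        seen.add(taxon)
        chain.append(taxon)
    return chain


# Hypothetical statements about protein P1 and its host organism.
organism_of = {"ex:P1": "ex:species-A"}
parent_of = {
    "ex:species-A": "ex:genus-A",
    "ex:genus-A": "ex:major-taxon-A",
}

host = organism_of["ex:P1"]
lineage = ancestors(host, parent_of)
major_taxon = lineage[-1] if lineage else host
```

User A's dataset never mentioned the organism; the automaton reaches the taxon only because the protein's URI resolves to machine-readable statements.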
References

 Wu, C. et al., “The Universal Protein Resource
  (UniProt): an expanding universe of protein
  information”. Nucleic Acids Research, vol. 34, 2006.

 Swiss Institute of Bioinformatics, “UniProt RDF (project
  page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/

 Redaschi, N. and UniProt Consortium, “UniProt in RDF:
  Tackling Data Integration and Distributed Annotation”.
  Nature Precedings, 3rd International Biocuration
  Conference, April 2009.
  http://precedings.nature.com/documents/3193/version/1

Culture and Health Disorders Social change.pptx
Culture and Health Disorders Social change.pptxCulture and Health Disorders Social change.pptx
Culture and Health Disorders Social change.pptx
 
Lippincott Microcards_ Microbiology Flash Cards-LWW (2015).pdf
Lippincott Microcards_ Microbiology Flash Cards-LWW (2015).pdfLippincott Microcards_ Microbiology Flash Cards-LWW (2015).pdf
Lippincott Microcards_ Microbiology Flash Cards-LWW (2015).pdf
 
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic AnalysisVarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
VarSeq 2.6.0: Advancing Pharmacogenomics and Genomic Analysis
 
Biomechanics- Shoulder Joint!!!!!!!!!!!!
Biomechanics- Shoulder Joint!!!!!!!!!!!!Biomechanics- Shoulder Joint!!!!!!!!!!!!
Biomechanics- Shoulder Joint!!!!!!!!!!!!
 
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptxSYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
SYNDESMOTIC INJURY- ANATOMICAL REPAIR.pptx
 
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
Wessex Health Partners Wessex Integrated Care, Population Health, Research & ...
 
PULMONARY EDEMA AND ITS MANAGEMENT.pdf
PULMONARY EDEMA AND  ITS  MANAGEMENT.pdfPULMONARY EDEMA AND  ITS  MANAGEMENT.pdf
PULMONARY EDEMA AND ITS MANAGEMENT.pdf
 
Hematology and Immunology - Leukocytes Functions
Hematology and Immunology - Leukocytes FunctionsHematology and Immunology - Leukocytes Functions
Hematology and Immunology - Leukocytes Functions
 
Basic principles involved in the traditional systems of medicine PDF.pdf
Basic principles involved in the traditional systems of medicine PDF.pdfBasic principles involved in the traditional systems of medicine PDF.pdf
Basic principles involved in the traditional systems of medicine PDF.pdf
 
Let's Talk About It: To Disclose or Not to Disclose?
Let's Talk About It: To Disclose or Not to Disclose?Let's Talk About It: To Disclose or Not to Disclose?
Let's Talk About It: To Disclose or Not to Disclose?
 
The next social challenge to public health: the information environment.pptx
The next social challenge to public health:  the information environment.pptxThe next social challenge to public health:  the information environment.pptx
The next social challenge to public health: the information environment.pptx
 
PERFECT BUT PAINFUL TKR -ROLE OF SYNOVECTOMY.pptx
PERFECT BUT PAINFUL TKR -ROLE OF SYNOVECTOMY.pptxPERFECT BUT PAINFUL TKR -ROLE OF SYNOVECTOMY.pptx
PERFECT BUT PAINFUL TKR -ROLE OF SYNOVECTOMY.pptx
 
See the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy PlatformSee the 2,456 pharmacies on the National E-Pharmacy Platform
See the 2,456 pharmacies on the National E-Pharmacy Platform
 
PULMONARY EMBOLISM AND ITS MANAGEMENTS.pdf
PULMONARY EMBOLISM AND ITS MANAGEMENTS.pdfPULMONARY EMBOLISM AND ITS MANAGEMENTS.pdf
PULMONARY EMBOLISM AND ITS MANAGEMENTS.pdf
 
Presentació "Real-Life VR Integration for Mild Cognitive Impairment Rehabilit...
Presentació "Real-Life VR Integration for Mild Cognitive Impairment Rehabilit...Presentació "Real-Life VR Integration for Mild Cognitive Impairment Rehabilit...
Presentació "Real-Life VR Integration for Mild Cognitive Impairment Rehabilit...
 
SWD (Short wave diathermy)- Physiotherapy.ppt
SWD (Short wave diathermy)- Physiotherapy.pptSWD (Short wave diathermy)- Physiotherapy.ppt
SWD (Short wave diathermy)- Physiotherapy.ppt
 
Introduction to Sports Injuries by- Dr. Anjali Rai
Introduction to Sports Injuries by- Dr. Anjali RaiIntroduction to Sports Injuries by- Dr. Anjali Rai
Introduction to Sports Injuries by- Dr. Anjali Rai
 
Statistical modeling in pharmaceutical research and development.
Statistical modeling in pharmaceutical research and development.Statistical modeling in pharmaceutical research and development.
Statistical modeling in pharmaceutical research and development.
 

UniProt and the Semantic Web

  • 1. UniProt and the Semantic Web (Chimezie Ogbuji)
  • 2. ‘Omics’ Data Challenges
      • Advances in protein science are a major catalyst in the exploding availability of bioinformatics data
      • We have already discussed the dimensions of omics data:
          • Molecular components, interactions, and phenotype observations
      • Data from large-scale experiments are no longer published conventionally but stored in a database
      • Protein sequence databases are among the most comprehensive information resources for scientists
  • 3. Protein Sequence Databases
      • Universal protein sequence databases cover all species
      • Specialized protein databases are particular to a protein family or organism
      • Sequence repositories
          • A simple registry of sequence records
          • No annotations
      • Curated protein databases
          • Enrich sequence information with links to various sources (primarily scientific literature)
  • 4. Informatics Challenges
      • The standard data integration challenge is the lack of common conventions
      • This applies not just to notation but also to:
          • Use of identifiers
          • Representation of cross-references
          • Framework for defining terms and the relationships between them
      • Links between omics sources are another important component of data integration
  • 5. What is UniProt?
      • A comprehensive repository of protein sequences and their functional annotations
      • Curators add value to raw data by annotating it against the scientific literature
      • Objective: the creation and maintenance of stable, comprehensive, and high-quality protein databases, with a high level of accessibility, to facilitate cross-database information retrieval
      • Makes use of Semantic Web technologies to address its challenges
  • 6. UniProt: Core Activities
      • Sequence archiving
      • Manual (peer-reviewed) and automated curation of sequences
      • Development of a human- and machine-readable UniProt web site
      • Interaction with other protein-related databases to expand cross-references
  • 7. UniProt: Components
      • UniProtKB: protein sequence annotations and metadata
          • Protein name, function, taxonomy, enzyme-specific information, domains, sites, subcellular location, interactions, relationships to disease, etc.
          • Links to external sources: DNA sequence repositories, protein structure databases, protein domain and family databases, and species- and function-specific data collections
      • UniRef: compresses sequences at different resolutions
          • Parameterized by the percent identity between two sequences or sub-sequences (100, 90, 50)
      • UniParc: non-redundant database of all publicly available protein sequences
          • Manages globally unique identifiers, the sequence, information on the source database, and a CRC check number
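The UniParc bullet above describes a non-redundant archive: each distinct sequence is stored once, with an identifier, a checksum, and a record of where it was seen. A minimal sketch of that idea follows. The class name, the `SEQ000001` identifier format, the source database names, and the use of CRC-32 from `zlib` are all assumptions made for illustration; they are not UniParc's actual schema or checksum.

```python
import zlib


class SequenceArchive:
    """Toy non-redundant sequence registry (hypothetical, not UniParc's schema)."""

    def __init__(self):
        self._by_sequence = {}  # sequence text -> record dict
        self._next_id = 1

    def register(self, sequence, source_db):
        """Store a sequence once; later sightings only add a source database."""
        record = self._by_sequence.get(sequence)
        if record is None:
            record = {
                "id": f"SEQ{self._next_id:06d}",  # made-up identifier format
                "sequence": sequence,
                # CRC-32 is a stand-in for the archive's real check number.
                "crc": zlib.crc32(sequence.encode("ascii")),
                "sources": set(),
            }
            self._by_sequence[sequence] = record
            self._next_id += 1
        record["sources"].add(source_db)
        return record


archive = SequenceArchive()
first = archive.register("MKTAYIAKQR", "database_a")
second = archive.register("MKTAYIAKQR", "database_b")
```

Registering the same sequence from two sources yields one record whose `sources` set lists both, which is the sense in which the archive is non-redundant.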
  • 8. Semantic Web Technologies
      • A set of standards for managing web-based content in a way that emphasizes use by an automaton
          • Automaton: a machine that performs a function according to a predetermined set of coded instructions
      • The architectural vision (the Semantic Web) is to extend the standards and best practices behind the World Wide Web with new standards that emphasize the meaning of data over its structure
          • Common data formats
          • Provide a means to make assertions about the world such that an automaton can reason about it through them
      • The vision is often confused with the tools meant to achieve it (i.e., the set of standards)
  • 9.
  • 10. RDF: Data Model
      • Standardized format for representing arbitrary information as a labelled, directed graph
      • Composed of statements: subject, predicate, object
      • Terms in statements can be Uniform Resource Identifiers (URIs), blank nodes (anonymous entities), or literals
      • Abstract data model: a labelled, directed graph
      • Various serializations: XML-based and text-based
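The data model above can be sketched in a few lines: three kinds of terms, statements made of three terms, and a line-per-statement text serialization. This is an illustration only, not an RDF library. The `http://example.org/vocab/` namespace, the `encodedBy` and `label` property names, and the gene label are hypothetical; only the protein URI comes from the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class URI:
    value: str


@dataclass(frozen=True)
class BlankNode:
    label: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Statement:
    subject: object   # URI or BlankNode
    predicate: URI
    obj: object       # URI, BlankNode, or Literal


EX = "http://example.org/vocab/"  # hypothetical vocabulary namespace
protein = URI("http://purl.uniprot.org/uniprot/P04926")
gene = BlankNode("g1")  # anonymous entity: no URI of its own

graph = [
    Statement(protein, URI(EX + "encodedBy"), gene),
    Statement(gene, URI(EX + "label"), Literal("hypothetical gene label")),
]


def term_text(term):
    """Render one term in a simple line-based text syntax."""
    if isinstance(term, URI):
        return f"<{term.value}>"
    if isinstance(term, BlankNode):
        return f"_:{term.label}"
    escaped = term.value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_lines(statements):
    """Serialize each statement as 'subject predicate object .'"""
    return [
        f"{term_text(s.subject)} {term_text(s.predicate)} {term_text(s.obj)} ."
        for s in statements
    ]


for line in to_lines(graph):
    print(line)
```

The two statements share the blank node, so together they form a small directed graph: the protein points to an anonymous gene, which in turn points to a literal label.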
  • 12. Modelling vocabulary: RDFS/OWL
      • RDF Schema (RDFS)
          • Simple, minimal schema language for RDF
      • Web Ontology Language (OWL)
          • Vocabulary for defining classes, relationships, and various constraints that limit how RDF is interpreted
          • More powerful modeling language
      • Tools for constraining and defining reality that can be used to codify scientific understanding
          • The Gene Ontology is modelled in this way to capture our understanding of macromolecular reality
  • 13.
  • 14. Query Language: SPARQL
      • Provides a common graph-matching language for querying RDF data
      • Similar to SQL in many respects
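To make "graph matching" concrete, the sketch below evaluates a list of triple patterns against a small in-memory graph, with `?`-prefixed strings acting as variables. It illustrates the matching idea only and is not a SPARQL engine; the `ex:` identifiers and the organism names are hypothetical data.

```python
def match(pattern, triple, bindings):
    """Extend bindings so that pattern equals triple; return None on conflict."""
    extended = dict(bindings)
    for wanted, actual in zip(pattern, triple):
        if wanted.startswith("?"):
            # A variable must keep the value it was first bound to.
            if extended.setdefault(wanted, actual) != actual:
                return None
        elif wanted != actual:
            return None
    return extended


def query(graph, patterns):
    """Return every variable binding that satisfies all patterns together."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [
            extended
            for bindings in solutions
            for triple in graph
            if (extended := match(pattern, triple, bindings)) is not None
        ]
    return solutions


# Hypothetical data: two proteins, each linked to a named organism.
graph = [
    ("ex:P1", "ex:organism", "ex:T1"),
    ("ex:P2", "ex:organism", "ex:T2"),
    ("ex:T1", "ex:name", "Organism one"),
    ("ex:T2", "ex:name", "Organism two"),
]

# Roughly: SELECT ?protein ?name WHERE { ?protein ex:organism ?taxon .
#                                        ?taxon ex:name ?name }
results = query(graph, [
    ("?protein", "ex:organism", "?taxon"),
    ("?taxon", "ex:name", "?name"),
])
```

The shared variable `?taxon` plays the role a join condition plays in SQL, which is one of the respects in which the two languages are similar.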
  • 15. Nature of UniProt Data
      • Very large number of cross-references to external resources
      • The cross-reference topology is that of a graph, not a tree
      • Automated and manual annotation require storing provenance information (how and when data was acquired)
      • Requires a framework for both data and metadata (data about data)
  • 17. UniProt: Data Conventions
      • All outbound RDF statements (statements about the same subject) are grouped together
      • Datasets (nodes in the previous graph) are each distributed as a single file
      • Only stated data is stored, not entailed data
          • For instance, relationships involving symmetric properties are stored in one direction only
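The last convention means a consumer has to derive the reverse direction of a symmetric relationship itself. A minimal sketch of that entailment step, assuming triples are plain tuples and using a hypothetical symmetric property `ex:interactsWith`:

```python
def entail_symmetric(stated, symmetric_properties):
    """Return the stated triples plus the reverse of every symmetric one."""
    entailed = set(stated)
    for subject, predicate, obj in stated:
        if predicate in symmetric_properties:
            entailed.add((obj, predicate, subject))
    return entailed


# Hypothetical stated data: the interaction is stored in one direction only.
stated = {
    ("ex:P1", "ex:interactsWith", "ex:P2"),
    ("ex:P1", "ex:organism", "ex:T1"),
}

full = entail_symmetric(stated, {"ex:interactsWith"})
```

Here `full` holds three triples: the two stated ones and the derived `("ex:P2", "ex:interactsWith", "ex:P1")`. The non-symmetric `ex:organism` statement is left as stated.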
  • 18.
  • 19. UniProt: Naming Conventions
      • Generally, in semiotics, a symbol denotes a referent
      • In Web architecture, URIs identify resources
          • URIs that can be resolved over the web are URLs
      • UniProt URIs identify:
          • Resources that correspond to database entries
          • Modeling vocabulary that uses standard namespaces: RDFS and OWL
          • Classes and properties used by UniProt
              • For example: http://purl.uniprot.org/core/Gene
          • Resources without stable identifiers (from their source)
  • 20. The Omics Identification Problem
      • UniProt uses a templated naming convention:
          • http://purl.uniprot.org/{database}/{identifier}
          • http://purl.uniprot.org/uniprot/{protein_identifier}
      • Problem
          • http://purl.uniprot.org/uniprot/P04926 denotes the Malaria protein EX-1
          • If loading that address in a browser returns a web page, can an automaton infer that Malaria protein EX-1 is a web page?
          • How do you distinguish identifiers for abstract concepts from identifiers for digital media?
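The templated convention above can be expressed as a pair of small helpers that build and take apart such URIs. The function names are hypothetical; the template string and the example identifier are the ones given on the slide.

```python
from urllib.parse import quote, unquote, urlparse

PURL_TEMPLATE = "http://purl.uniprot.org/{database}/{identifier}"


def build_uri(database, identifier):
    """Fill the template, percent-encoding both path segments."""
    return PURL_TEMPLATE.format(
        database=quote(database, safe=""),
        identifier=quote(identifier, safe=""),
    )


def parse_uri(uri):
    """Split a templated URI back into (database, identifier)."""
    parts = urlparse(uri)
    segments = parts.path.strip("/").split("/")
    if parts.netloc != "purl.uniprot.org" or len(segments) != 2:
        raise ValueError(f"not a templated UniProt PURL: {uri}")
    database, identifier = (unquote(segment) for segment in segments)
    return database, identifier


protein_uri = build_uri("uniprot", "P04926")
print(protein_uri)
print(parse_uri(protein_uri))
```

Note that these helpers only manipulate names. They say nothing about what the URI denotes, which is exactly the problem the slide raises.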
  • 21. The PURL Solution
      • Persistent Uniform Resource Locator (PURL) is a public URI management service: it lets a maintainer allocate a ‘URI space’ of identifiers (aliases) that map to resources the maintainer is not immediately responsible for
      • PURLs are web addresses that act as permanent identifiers in the face of a dynamic and changing Web infrastructure
      • A request to a PURL returns a 303 HTTP status code and a location:
          • 303 indicates that a response can be found under the returned location
  • 22. The PURL Solution: Continued
      • PURL addresses can be used to identify abstract concepts
      • Requests to such addresses are redirected to an informative web page (for humans) that offers machines a means to extract other formats
      • RDF statements are about proteins, machines can reason about proteins, and humans resolve protein identifiers to view informative web pages
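The 303 behaviour described on these two slides can be modelled without any network access. The sketch below is a simulation under stated assumptions, not the PURL server's implementation: the redirect table is hypothetical, and the page address is inferred from the RDF/XML link given on the next slide by dropping its `.rdf` suffix.

```python
# Hypothetical redirect table: identifier of the concept -> page describing it.
REDIRECTS = {
    "http://purl.uniprot.org/uniprot/P04926": "http://www.uniprot.org/uniprot/P04926",
}


def resolve(purl, redirects=REDIRECTS):
    """Model the resolver's answer as (status code, headers)."""
    target = redirects.get(purl)
    if target is None:
        return 404, {}
    # 303 "See Other": the description lives elsewhere, so the identifier
    # itself is never equated with the document that describes it.
    return 303, {"Location": target}


def describe(purl, redirects=REDIRECTS):
    """Say what a client may conclude from the response."""
    status, headers = resolve(purl, redirects)
    if status == 303:
        page = headers["Location"]
        return {"identifies": purl, "human_page": page, "rdf": page + ".rdf"}
    return {"identifies": purl, "human_page": None, "rdf": None}


info = describe("http://purl.uniprot.org/uniprot/P04926")
```

Because the response is a redirect rather than a page, an automaton has no grounds to infer that the protein is a web page: the protein's identifier and the page's address stay distinct.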
  • 23. RDF/XML link: http://www.uniprot.org/uniprot/P04926.rdf
  • 26. Serendipitous Re-use
      • Having a rich repository of protein sequence metadata, annotations, and taxonomic classification in a distributed, standard format encourages scientific collaboration
  • 27. General UniProt Re-Use Scenario
      • User A refers to protein P1 in their dataset
      • User A’s dataset doesn’t include statements about P1 (the host organism, for instance)
      • User B comes across this dataset and, to find out more about protein P1, puts the URI of P1 in their browser and pulls up human-readable information about it (including the host organism)
      • Automaton C comes across the same dataset, fetches the web page, then fetches the RDF about P1; it has access to the same information as User B and can reason about the major taxon the host organism belongs to
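Automaton C's part of the scenario can be sketched as follows: starting from a protein identifier found in someone else's dataset, look up the statements published about it and walk up the taxonomy from the host organism. Everything here is an assumption made for illustration. The `ex:` identifiers, the property names, and the in-memory `DESCRIPTIONS` table (a stand-in for fetching RDF over the web) are all hypothetical.

```python
# Stand-in for the web: what dereferencing each identifier would return.
DESCRIPTIONS = {
    "ex:P1": [("ex:P1", "ex:organism", "ex:species1")],
    "ex:species1": [("ex:species1", "ex:subClassOf", "ex:genus1")],
    "ex:genus1": [("ex:genus1", "ex:subClassOf", "ex:majorTaxon1")],
}


def fetch_description(uri):
    """Return the triples published about uri (empty if nothing is known)."""
    return DESCRIPTIONS.get(uri, [])


def objects(uri, predicate):
    """Objects of statements with the given subject and predicate."""
    return [o for s, p, o in fetch_description(uri) if s == uri and p == predicate]


def lineage(protein_uri):
    """Host organism of the protein followed by its ancestor taxa, nearest first."""
    chain, seen = [], set()
    frontier = objects(protein_uri, "ex:organism")
    while frontier:
        taxon = frontier.pop(0)
        if taxon in seen:  # guard against cycles in the published data
            continue
        seen.add(taxon)
        chain.append(taxon)
        frontier.extend(objects(taxon, "ex:subClassOf"))
    return chain


# User A's dataset mentions ex:P1 but says nothing about its organism.
taxa = lineage("ex:P1")
```

The automaton ends up with the same organism information User B read on the web page, plus the ancestor taxa, even though User A's dataset contained none of it.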
  • 28. References
      • Wu, C. et al., “The Universal Protein Resource (UniProt): an expanding universe of protein information”. Nucleic Acids Research, vol. 34, 2006.
      • Swiss Institute of Bioinformatics, “UniProt RDF (project page)”. http://dev.isb-sib.ch/projects/uniprot-rdf/
      • Redaschi, N. and UniProt Consortium, “UniProt in RDF: Tackling Data Integration and Distributed Annotation”. Nature Precedings, 3rd International Biocuration Conference, April 2009. http://precedings.nature.com/documents/3193/version/1