Reference Data Integration: A Strategy for the Future
1. Reference Data Integration: A Strategy for the Future
Barry Smith
National Center for Ontological Research
University at Buffalo
presented at FIMA, March 21, 2012
2. Who am I?
National Center for Biomedical Ontology
based in Stanford Medical School, the Mayo Clinic
and Buffalo Department of Philosophy
• Cleveland Clinic Semantic Database
• Duke University Health System
• University of Pittsburgh Medical Center
• German Federal Ministry of Health
• European Union eHealth Directorate
• Plant Genome Research Resource
• Protein Information Resource
3. Who am I?
National Center for Ontological Research (http://ncor.us)
• Joint Warfighting Center, US Joint Forces Command
• Intelligence and Information Warfare Directorate
(I2WD)
• US Department of the Army Net-Centric Data
Strategy Center of Excellence
• NextGen (Next Generation Air Transportation
System) Ontology Team
• National Nuclear Security Administration (NNSA),
Department of Energy
4. Some questions
• How to find data?
• How to understand data when you find it?
• How to use data when you find it?
• How to compare and integrate with other data?
• How to avoid data silos?
5. The Web (net-centricity) as part of the
solution
• You build a site
• Others discover the site and they link to it
• The more they link, the more well known the
page becomes (Google …)
• Your data becomes discoverable
6. The roots of Semantic Technology
1. Make your data available in a standard way
on the Web
2. Use controlled vocabularies (‘ontologies’) to
capture common meanings, in ways
understandable to both humans and
computers – Web Ontology Language
(OWL)
3. Build links among the datasets to create a
‘web of data’
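The three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: every identifier (the `ex:`, `a:` and `b:` prefixes, `sec42`, `Acme Corp`) is hypothetical, and plain Python tuples stand in for RDF triples rather than any real Semantic Web library.

```python
from collections import defaultdict

# Step 2: a tiny controlled vocabulary shared by both publishers (hypothetical).
VOCAB = {
    "ex:Bond": "a debt security",
    "ex:Issuer": "an organization that issues a security",
}

# Step 1: two datasets published separately as (subject, predicate, object) triples.
dataset_a = [
    ("a:sec42", "rdf:type", "ex:Bond"),
    ("a:sec42", "ex:issuedBy", "a:org7"),
]
dataset_b = [
    ("b:acme", "rdf:type", "ex:Issuer"),
    ("b:acme", "ex:name", "Acme Corp"),
]

# Step 3: one link stating that both publishers describe the same organization.
links = [("a:org7", "owl:sameAs", "b:acme")]


def merge(*datasets):
    """Union the triples and index them by subject."""
    index = defaultdict(list)
    for triples in datasets:
        for s, p, o in triples:
            index[s].append((p, o))
    return index


def issuer_name(security, index):
    """Follow issuedBy, then sameAs, to reach a name held in the other dataset."""
    for p, o in index[security]:
        if p == "ex:issuedBy":
            for p2, o2 in index[o]:
                if p2 == "owl:sameAs":
                    for p3, o3 in index[o2]:
                        if p3 == "ex:name":
                            return o3
    return None


web = merge(dataset_a, dataset_b, links)
print(issuer_name("a:sec42", web))  # Acme Corp
```

Without the single `owl:sameAs` link the lookup returns `None`: the two datasets stay separate even though both use the shared vocabulary.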
7. Controlled vocabularies for tagging
(‘annotating’) data
• Hardware changes rapidly
• Organizations rapidly forming and
disbanding
• Data is exploding
• Meanings of common words change slowly
• Use web architecture to annotate exploding
data stores using ontologies to capture
these common meanings in a stable way
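A minimal sketch of what tagging data with a controlled vocabulary might look like. The term IDs, local labels and record IDs are all hypothetical; the point is only that local labels vary between sources while the term identifiers stay fixed.

```python
# Stable term identifiers with human-readable names (all hypothetical).
TERMS = {
    "T:0001": "immune response",
    "T:0002": "lipid metabolism",
}

# Local labels differ between sources; the term IDs they map to do not.
LOCAL_TO_TERM = {
    "immune resp.": "T:0001",
    "Immunantwort": "T:0001",
    "lipid metab": "T:0002",
}


def annotate(record, label_field="label"):
    """Return a copy of the record tagged with a term ID (None if unmapped)."""
    term = LOCAL_TO_TERM.get(record[label_field])
    return {**record, "term": term}


def find(records, term_id):
    """IDs of all annotated records tagged with the given term."""
    return [r["id"] for r in records if r["term"] == term_id]


lab_a = [{"id": "a1", "label": "immune resp."}]
lab_b = [{"id": "b1", "label": "Immunantwort"}, {"id": "b2", "label": "lipid metab"}]

annotated = [annotate(r) for r in lab_a + lab_b]
print(find(annotated, "T:0001"))  # ['a1', 'b1']
```

Records from both sources are found with one query because the annotation, not the local label, carries the meaning.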
8. Where we stand today
• increasing availability of semantically enhanced
data and semantic software
• increasing use of XML, RDF, OWL in attempts to
create useful integration of on-line data and
information
• “Linked Open Data” the New Big Thing
11. The problem: the more Semantic
Technology is successful, the more it fails
The original idea was to break down silos via
common controlled vocabularies for the tagging
of data
The very success of the approach leads to the
creation of ever new controlled vocabularies –
semantic silos – as ever more ontologies are
created in ad hoc ways
The Semantic Web framework as currently
conceived and governed by the W3C yields
minimal standardization
Multiplying (Meta)data registries are creating
data cemeteries
15. Reasons for this effect
• Low incentives for reuse of existing ontologies
• Each organization wants its own ontology
• Poor licensing regime, poor standards, poor
training
• People think: Information technology (hardware)
is changing constantly, so it’s not worth the effort
of getting things right
• People have egos: “We have done it this way for
30 years, we are not going to change now”
16. Why should you care?
• when there are many ad hoc systems, average
quality will be low
• constant need for ad hoc repair through
manual effort
• DoD alone spends $6 billion per annum on
this problem
• regulatory agencies are recognizing the need
for common controlled vocabularies
17. So now people are scrambling
• to learn how to create ontologies
• serious lag in creating trained expertise
• poor quality coding leads to poor quality
ontologies
• poor quality ontology management
18. How to do it right?
• how to create an incremental, evolutionary
process, where what is good survives?
• how to bring about ontology death?
A success story from biology
21. Ontology in PubMed
[Chart: yearly values for 2000 to 2010 on a vertical axis running from 0 to 1200; axis title not recoverable]
22. By far the most successful: GO (Gene Ontology)
23. The Gene Ontology is not an ontology of genes
what cellular component?
what molecular function?
what biological process?
24. Microarray data shows changed expression of thousands of genes.
How will you spot the patterns?
[Figure: expression heat map over time, attacked vs. control, with gene clusters labeled: defense response, immune response, response to stimulus, Toll regulated genes, JAK-STAT regulated genes, puparial adhesion, molting cycle, hemocyanin, amino acid catabolism, lipid metabolism, peptidase activity, protein catabolism]
25. Why is GO successful?
• built by bench biologists
• multi-species, multi-disciplinary, open source
• compare use of kilograms, meters, seconds in
formulating experimental results
• natural language and logical definitions for all
terms
• initially low-tech to ensure aggressive use and
testing
26. Now used not just in biology but also in hospital research
27. Lab / pathology data
EHR data
Clinical trial data
Family history data
Medical imaging
Microarray data
Model organism data
Flow cytometry
Mass spec
Genotype / SNP data
How will you spot the patterns?
How will you find the data you
need?
28. over 11 million annotations relating
UniProt, Ensembl and other databases to terms in
the GO
30. ~$200 million invested in the GO so far
A new kind of biomedical research
Over 11 million GO annotations to biomedical
research literature freely available on the web
Powerful software tool support for navigating
this data means that what used to take
researchers months of data comparison effort,
can now be performed in milliseconds
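A rough sketch of why annotations make such comparisons fast: once each gene product is tagged with ontology terms, grouping a list of changed genes by term is a dictionary lookup. The hierarchy, gene names and annotations below are hypothetical and far simpler than the real GO.

```python
from collections import defaultdict

# Toy is_a hierarchy (child -> parents); hypothetical, much simpler than the GO.
IS_A = {
    "immune response": ["response to stimulus"],
    "defense response": ["response to stimulus"],
    "response to stimulus": [],
}

# Hypothetical annotations: (gene product, term).
ANNOTATIONS = [
    ("geneA", "immune response"),
    ("geneB", "defense response"),
    ("geneC", "immune response"),
]


def ancestors(term):
    """The term itself plus every term reachable from it via is_a."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(IS_A.get(t, []))
    return seen


def build_index(annotations):
    """Map each term to the gene products annotated to it or to a subtype."""
    index = defaultdict(set)
    for gene, term in annotations:
        for t in ancestors(term):
            index[t].add(gene)
    return index


INDEX = build_index(ANNOTATIONS)


def summarize(changed_genes):
    """Count how many of the changed genes fall under each term."""
    return {t: len(genes & changed_genes) for t, genes in INDEX.items()}


print(sorted(INDEX["response to stimulus"]))  # ['geneA', 'geneB', 'geneC']
print(summarize({"geneA", "geneC"})["immune response"])  # 2
```

The index is built once from the annotations; every later question about a gene list reuses it.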
31. If controlled vocabularies are to serve
to remove silos
they have to be respected by many owners of
data as resources that ensure accurate
description of their data
– GO maintained not by computer scientists but
by biologists
they have to be willingly used in annotations by
many owners of data
they have to be maintained by persons who are
trained in common principles of ontology
maintenance
33. GO has been amazingly successful
Has created a community consensus
Has created a web of feedback loops where
users of the GO can easily report errors
and gaps
Has identified principles for successful
ontology management
Indispensable to every drug company and
every biology lab
34. But GO is limited in its scope
it covers only generic biological entities of three
sorts:
– cellular components
– molecular functions
– biological processes
no diseases, symptoms, disease
biomarkers, protein interactions, experimental
processes …
35. Extending the GO methodology to
other domains of biology and
medicine
36. OBO (Open Biomedical Ontology) Foundry proposal
(Gene Ontology in yellow)
Columns, by RELATION TO TIME: CONTINUANT (INDEPENDENT, DEPENDENT) and OCCURRENT
Rows, by GRANULARITY:
ORGAN AND ORGANISM
– independent continuant: Organism (NCBI Taxonomy); Anatomical Entity (FMA, CARO)
– dependent continuant: Organ Function (FMP, CPRO); Phenotypic Quality (PaTO)
– occurrent: Biological Process (GO)
CELL AND CELLULAR COMPONENT
– independent continuant: Cell (CL); Cellular Component (FMA, GO)
– dependent continuant: Cellular Function (GO)
MOLECULE
– independent continuant: Molecule (ChEBI, SO, RnaO, PrO)
– dependent continuant: Molecular Function (GO)
– occurrent: Molecular Process (GO)
37. The strategy of orthogonal modules
[The granularity by relation-to-time matrix of slide 36, repeated]
38. Ontology | Scope | URL | Custodians
Cell Ontology (CL) | cell types from prokaryotes to mammals | obo.sourceforge.net/cgi-bin/detail.cgi?cell | Jonathan Bard, Michael Ashburner, Oliver Hofman
Chemical Entities of Biological Interest (ChEBI) | molecular entities | ebi.ac.uk/chebi | Paula Dematos, Rafael Alcantara
Common Anatomy Reference Ontology (CARO) | anatomical structures in human and model organisms | (under development) | Melissa Haendel, Terry Hayamizu, Cornelius Rosse, David Sutherland
Foundational Model of Anatomy (FMA) | structure of the human body | fma.biostr.washington.edu | JLV Mejino Jr., Cornelius Rosse
Functional Genomics Investigation Ontology (FuGO) | design, protocol, data instrumentation, and analysis | fugo.sf.net | FuGO Working Group
Gene Ontology (GO) | cellular components, molecular functions, biological processes | www.geneontology.org | Gene Ontology Consortium
Phenotypic Quality Ontology (PaTO) | qualities of anatomical structures | obo.sourceforge.net/cgi-bin/detail.cgi?attribute_and_value | Michael Ashburner, Suzanna Lewis, Georgios Gkoutos
Protein Ontology (PrO) | protein types and modifications | (under development) | Protein Ontology Consortium
Relation Ontology (RO) | relations | obo.sf.net/relationship | Barry Smith, Chris Mungall
RNA Ontology (RnaO) | three-dimensional RNA structures | (under development) | RNA Ontology Consortium
Sequence Ontology (SO) | properties and features of nucleic sequences | song.sf.net | Karen Eilbeck
39. How to recreate the success of the
GO in other areas
1. create a portal for sharing of information
about existing controlled vocabularies, needs
and institutions operating in a given area
2. create a library of ontologies in this area
3. create a consortium of developers of these
ontologies who agree to pool their efforts to
create a single set of non-overlapping
ontology modules
– one ontology for each sub-area
40. NextGen Ontology Portal
• Two-Tiered Registry
– NextGen Ontology – consists of vetted ontologies
– Ontology Library – open to the wider community
• Ontology Metadata
– Ontology owner, domain, and location
• Ontology Search*
– Support ontology discovery
[Diagram labels: Ontology Portal, Communities, Ontology Library, NextGen Enterprise Search]
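One way a two-tiered registry with owner, domain and location metadata might be modeled. This is a sketch under assumptions, not the NextGen portal's actual design: the class names, field names, entries and `example.org` locations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OntologyEntry:
    name: str
    owner: str
    domain: str
    location: str
    vetted: bool = False  # True = vetted tier, False = open library tier


class Registry:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        """Add an entry; names must be unique across both tiers."""
        if entry.name in self._entries:
            raise ValueError(f"{entry.name} is already registered")
        self._entries[entry.name] = entry

    def search(self, domain=None, vetted_only=False):
        """Names of matching ontologies, sorted for stable output."""
        hits = []
        for e in self._entries.values():
            if domain is not None and e.domain != domain:
                continue
            if vetted_only and not e.vetted:
                continue
            hits.append(e.name)
        return sorted(hits)


registry = Registry()
registry.register(OntologyEntry("weather", "team-a", "meteorology",
                                "https://example.org/weather", vetted=True))
registry.register(OntologyEntry("airport", "team-b", "aviation",
                                "https://example.org/airport", vetted=True))
registry.register(OntologyEntry("draft-runway", "team-c", "aviation",
                                "https://example.org/runway"))

print(registry.search(domain="aviation"))                    # ['airport', 'draft-runway']
print(registry.search(domain="aviation", vetted_only=True))  # ['airport']
```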
41. The OBO Foundry: a step-by-step,
principles-based approach
Developers commit in advance to
collaborating with developers of ontologies
in adjacent domains and
to working to ensure that, for each
domain, there is community convergence on
a single ontology
http://obofoundry.org
42. OBO Foundry Principles
Common governance
Common training
Robust versioning
Common architecture
43. OBO Foundry Modular Organization
top level: Basic Formal Ontology (BFO)
mid-level: Information Artifact Ontology (IAO); Ontology for Biomedical Investigations (OBI); Ontology of General Medical Science (OGMS)
domain level:
– Anatomy Ontology (FMA*, CARO)
– Environment Ontology (EnvO)
– Infectious Disease Ontology (IDO*)
– Cell Ontology (CL)
– Cellular Component Ontology (FMA*, GO*)
– Subcellular Anatomy Ontology (SAO)
– Phenotypic Quality Ontology (PaTO)
– Biological Process Ontology (GO*)
– Sequence Ontology (SO*)
– Molecular Function (GO*)
– Protein Ontology (PRO*)
44. Extension Strategy
top level UCore 2.0 / UCore SL
mid-level
domain level
Military domain ontologies as extensions of the
Universal Core Semantic Layer
45. Existing efforts to create modular
ontology suites
NASA SWEET Ontologies
Military Intelligence Ontology Foundry
Planned OMG efforts:
• OMG (CIA) Financial Event Ontology
• Semantic Layer for ISO 20022 (Financial
Industry Message Scheme)
48. Basic principles of ontology
development
– for formulating definitions
– of modularity
– of user feedback for error correction and gap
identification
– for ensuring compatibility between modules
– for using ontologies to annotate legacy data
– for using ontologies to create new data
– for developing user-specific views
49. Modularity designed to ensure
• non-redundancy
• annotations can be additive
• division of labor among SMEs
• lessons learned in one module can benefit work on
other modules
• transferrable training
• motivation of SME users
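Two of these goals, non-redundancy and additive annotations, can be checked mechanically. A minimal sketch, assuming each module is just a set of term labels; the module names, terms and record IDs are hypothetical.

```python
from itertools import combinations

# Hypothetical modules: name -> set of term labels the module defines.
MODULES = {
    "anatomy": {"heart", "liver"},
    "process": {"immune response", "cell division"},
    "quality": {"enlarged", "inflamed"},
}


def overlaps(modules):
    """Return {(module_a, module_b): shared_terms} for every overlapping pair."""
    found = {}
    for (a, terms_a), (b, terms_b) in combinations(sorted(modules.items()), 2):
        shared = terms_a & terms_b
        if shared:
            found[(a, b)] = shared
    return found


def combine(*annotation_sets):
    """Annotations are additive: merging per-module tags is a plain union."""
    merged = {}
    for annotations in annotation_sets:
        for record, terms in annotations.items():
            merged.setdefault(record, set()).update(terms)
    return merged


print(overlaps(MODULES))  # {}

# Two teams annotate the same record independently, each from its own module.
by_anatomy_team = {"case-1": {"heart"}}
by_quality_team = {"case-1": {"enlarged"}}
print(sorted(combine(by_anatomy_team, by_quality_team)["case-1"]))  # ['enlarged', 'heart']
```

If a second module also defined "heart", `overlaps` would report the pair, which is the redundancy that orthogonal modules are meant to rule out.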
50. How should the FIMA Reference Data
community solve this problem?
Major financial institutions
Major software vendors
Major data management companies
EDMC and government principals
should:
– pool information about the controlled vocabularies
which already exist
– create a common library of these controlled vocabularies
– create a subset of thought leaders who agree to pool their
efforts in the creation of a suite of ontology modules for
common use
– create a strategy to disseminate and evolve the selected
modules
– create a governance strategy to manage the modules over time
– allow bad ontologies to die
51. Urgent need for trained ontologists
Severe shortage of persons with the needed
expertise
University at Buffalo Online Training and
Certification Program for Ontologists
for details: phismith@buffalo.edu