1. How to Build a Biomedical Ontology
Success Stories
The Gene Ontology (GO)
SNOMED, ICD and other controlled vocabularies
Ontology Design Principles
Ontology Applications
Barry Smith
http://ontology.buffalo.edu/smith
7. Gene Ontology
$100 million invested in literature and database
curation using the Gene Ontology (GO)
based on the idea of annotation
over 11 million annotations relating gene
products (proteins) described in the UniProt,
Ensembl and other databases to terms in the
GO
multiple secondary uses – because the
ontology was not built to meet one specific
set of requirements
8. GO provides a controlled system of terms
for use in annotating (describing, tagging)
data
• multi-species, multi-disciplinary, open
source
• contributing to the cumulativity of
scientific results obtained by distinct
research communities
• compare use of kilograms, meters,
seconds in formulating experimental
results
21. annotation with Gene Ontology
supports reusability of data
supports search of data by humans
supports comparison of data
supports aggregation of data
supports reasoning with data by humans
and machines
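A minimal sketch of why annotation with a shared vocabulary supports search and aggregation. The gene product identifiers and the two source lists are hypothetical; only the idea of tagging data from different databases with common GO terms comes from the slides.

```python
from collections import defaultdict

# Hypothetical annotations from two independent databases.
# Each record pairs a gene product identifier with a GO term label.
db_one = [("geneprod_1", "nucleus"), ("geneprod_2", "mitochondrion")]
db_two = [("geneprod_3", "nucleus"), ("geneprod_2", "mitochondrion")]

def aggregate(*sources):
    """Merge annotation lists into a term -> set of gene products index."""
    index = defaultdict(set)
    for source in sources:
        for gene_product, term in source:
            index[term].add(gene_product)
    return index

index = aggregate(db_one, db_two)
print(sorted(index["nucleus"]))
```

Because both sources use the same term for the same type, the merge needs no translation step; with two private vocabularies it would.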
23. The goal: virtual science
• consistent (non-redundant) annotation
• cumulative (additive) annotation
yielding, by incremental steps, a
virtual map of the entirety of reality
that is accessible to computational
reasoning
24. This goal is realizable if we have a
common ontology framework
data is retrievable
data is comparable
data is integratable
only to the degree that it is annotated
using a common controlled vocabulary
– compare the role of seconds, meters,
kilograms … in unifying science
25. To achieve this end we have to engage
in something like philosophy (?)
is this the right way to organize the top level of this
portion of the GO?
how does the top level of this ontology relate to
the top levels of other, neighboring ontologies?
26. Strategy for doing this
see the world as organized via
types/universals/categories which are
hierarchically organized
and in relation to which statements
can be formulated which are
universally true of all instances:
cell membrane part_of cell
27. Foundational Model of Anatomy Ontology
[Figure: hierarchy of is_a and part_of links. Upper levels: Anatomical Structure, Anatomical Space, Organ, Organ Cavity, Organ Part, Serous Sac, Serous Sac Cavity, Organ Component, Organ Subdivision, Organ Cavity Subdivision, Serous Sac Cavity Subdivision, Tissue. Lower levels: Pleural Sac, Pleural Cavity, Pleura (Wall of Sac), Parietal Pleura, Visceral Pleura, Interlobar recess, Mediastinal Pleura, Mesothelium of Pleura]
30. the problem of continuity of care:
patients move around
with thanks to http://dbmotion.com
31. synchronic and diachronic problems of
semantic interoperability
(across space and across time)
32. [Figure: EHR 1 and EHR 2]
how can we link EHR 1 to EHR 2 in a
reliable, trustworthy, useful way, which
both systems can understand?
33. [Figure: EHR 1 and EHR 2, with ICD]
the ideal solution:
WHO International Classification of
Diseases
34. ICD
PRO:
De facto US billing standard
Multilanguage
CON:
De facto US billing standard (corrupts data)
No definitions of terms, and so difficult to
judge accuracy of hierarchy and of coding
Inconsistent hierarchies
Hard to reason with results
Hence few secondary uses e.g. for research
35. ICD 11
The (ontology-based) plan
multiple views including
◦ billing
◦ public health statistics
◦ research
◦ SNOMED compatibility
36. [Figure: EHR 1 and EHR 2, with SNOMED-CT]
the ideal solution:
a single universal clinical vocabulary
38. SNOMED CT
CON
Huge (but redundant ... and gappy)
Contains many examples of false synonymy
Still in need of work
◦ No consistent interpretation of relations
◦ Many erroneous relation assertions
◦ Many idiosyncratic relations
◦ Mixes ontology with epistemology
◦ It contains numerous compound terms (e.g., test for X)
without the constituent terms (here: X), even where the
latter are of obvious salience
39. SNOMED CT
Coding with SNOMED-CT is unreliable and
inconsistent
Multi-stage multi-committee process for adding
terms that follows intuitive rules and not formal
principles
Does there exist a strategy for evolutionary
improvement?
40. [Figure: EHR 1 and EHR 2, with SNOMED-CT]
above all: SNOMED CT cannot solve the
problem of continuity of care because it has
too much redundancy
41. [Figure: EHR 1 and EHR 2, with SNOMED-CT]
AND because it is used only in certain
countries
42. [Figure: EHR 1 and EHR 2, with Unified Medical Language System (UMLS)]
link EHR 1 to EHR 2 through a snapshot of
the patient’s condition which both systems
can understand
43. Unified Medical Language System (UMLS)
UMLS is not unified, not a language, not a
system (and not only medical); it is an
aggregation
If we use something like UMLS as reference
terminology, we will not solve the translation
problem
EN
DE
44. R T U New York State
Center of Excellence in
Bioinformatics & Life
Sciences
UMLS approach to countering silo formation
– By ‘linking between different clinical or biomedical
vocabularies’
– However: ‘… the Metathesaurus does not represent a
comprehensive NLM-authored ontology of biomedicine or a
single consistent view of the world. The Metathesaurus
preserves the many views of the world present in its source
vocabularies because these different views may be useful for
different tasks.’
http://www.nlm.nih.gov/pubs/factsheets/umlsmeta.html
46. Prospective standardization is a
good thing
Prospective standardization is the only thing
which will work in mission critical domains
Prospective standardization means that
certain limits to tolerance must be imposed,
Need for top-down governance to ensure
common architecture and resolution of
border disputes in areas of overlap between
domains
48. Problem of ensuring sensible
cooperation in a massively
interdisciplinary community
Consider multiple uses of technical terms
such as
− type
− concept
− instance
− model
− representation
− data
49. Three Levels
L3. Words, models (published
representations, ontologies, databases ...)
L2. Ideas (concepts, thoughts, memories, ...)
L1. Things (cells, planets, processes of cell
division ...)
50. Entity =def
anything which exists, including things and
processes, functions and qualities, beliefs
and actions, documents and software
(entities on levels 1, 2 and 3)
51. First basic distinction among entities
type vs. instance
(science text vs. diary)
(human being vs. Tom Cruise)
52. For ontologies
it is generalizations that are
important = types, universals,
kinds, species
53. Catalog vs. inventory
A 515287 DC3300 Dust Collector Fan
B 521683 Gilmer Belt
C 521682 Motor Drive Belt
54. An ontology is a representation
of types
We learn about types in reality from looking
at the results of scientific experiments in the
form of scientific theories
experiments relate to what is particular
science describes what is general
55. Ontology =def.
a representational artifact whose representational
units (which may be drawn from a natural or from
some formalized language) are intended to represent
1. types in reality
2. those relations between these types which
obtain universally (= for all instances)
lung is_a anatomical structure
lobe of lung part_of lung
in accordance with our best current established science
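The definition above can be sketched as a small data structure: representational units (plain strings here) standing for types, plus relations asserted to hold universally. This is an illustrative sketch, not how any particular ontology is stored. The `left lung is_a lung` assertion is added for illustration and is not on the slide.

```python
# Asserted type-level relations. The first IS_A entry is added for illustration;
# the other two assertions are the slide's own examples.
IS_A = {"left lung": "lung", "lung": "anatomical structure"}
PART_OF = {"lobe of lung": "lung"}

def ancestors(term, relation):
    """Follow a single-parent relation upward and list every ancestor in order."""
    found = []
    while term in relation:
        term = relation[term]
        found.append(term)
    return found

print(ancestors("left lung", IS_A))
print(ancestors("lobe of lung", PART_OF))
```

Because each assertion is meant to hold for all instances, a machine can chain them: whatever is a left lung is a lung, and therefore an anatomical structure.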
57. Domain =def
a portion of reality that forms the subject-
matter of a single science or technology or
mode of study or administrative practice:
proteomics
epidemiology
C2
M&S
59. Ontologies are representational
artifacts
comparable to science texts
and subject to the same sorts of
constraints (including need for
update)
60. Representational units =def
terms, icons, alphanumeric identifiers ...
which refer, or are intended to refer, to
entities
and which are minimal (atoms)
61. Composite representation =def
representation
(1) built out of representational units
which
(2) form a structure that mirrors, or is intended
to mirror, the entities in some domain
68. How do we know which general
terms designate types?
Types are repeatables:
cell, electron, weapon, F16 ...
Instances are one-off:
Bill Clinton, this laptop, this handwave
69. Problem
The same general term can be used to
refer both to types and to collections of
particulars. Consider:
HIV is an infectious retrovirus
HIV is spreading very rapidly through Asia
70. Class =def
a maximal collection of particulars
determined by a general term
(‘cell’, ‘electron’ but also: ‘restaurant in
Palo Alto’, ‘Italian’)
the class A
= the collection of all particulars x for
which ‘x is A’ is true
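The definition of a class reads almost directly as a set comprehension. In this sketch the particulars and their identifiers are hypothetical, and "x is A is true" is modelled simply as membership of the general term in a set of terms recorded for x.

```python
# Hypothetical particulars, each listed with the general terms true of it.
particulars = {
    "cell_0017": {"cell"},
    "electron_42": {"electron"},
    "chez_example": {"restaurant in Palo Alto"},
}

def class_of(general_term):
    """The class A: the collection of all particulars x for which 'x is A' is true."""
    return {x for x, terms in particulars.items() if general_term in terms}

print(class_of("cell"))
```

Any general term determines a class in this sense, including an empty one, which is why classes are more numerous than types.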
71. types vs. their extensions
[Figure: types and their extensions, i.e. collections of particulars]
78. types < classes < ‘concepts’ ?
Cases of ‘concepts’ which, some people say,
do not correspond to classes:
‘Cancelled oophorectomy’
‘Absent nipple’
‘Unlocalized ligand’
A cancelled oophorectomy is not a special
kind of conceptual oophorectomy
Use: Information Artifact Ontology (IAO)
79. Principle of Low Hanging Fruit
Include even absolutely trivial assertions
(assertions you know to be universally true)
pneumococcal virus is_a virus
Computers need to be led by the hand
80. Example: MeSH
MeSH Descriptors
Index Medicus Descriptor
Anthropology, Education, Sociology and
Social Phenomena (MeSH Category)
Social Sciences
Political Systems
National Socialism
National Socialism is_a Political Systems
National Socialism is_a Anthropology ...
81. Principle of Singular Nouns
Terms in ontologies represent types
Goal: Each term in an ontology should
represent exactly one type
Thus every term should be a singular noun
82. Principle: do not commit the use-
mention confusion
mouse =def. common name for the species
Mus musculus
swimming is healthy and has eight letters
83. Principle: do not commit the use-
mention confusion
Avoid confusing words with things
Avoid confusing concepts in our
minds with entities in reality
Recommendation: avoid the word ‘concept’
entirely
85. Trialbank
‘Heparin therapy’ is an instance of ‘written or
spoken designation of a concept’
What are the problems here?
1. misuse of quotation marks
2. confusion of instances and types
3. confusion of concept and reality
86. Principle: beware of
terminological baggage
For the sake of interoperability with other
ontologies, do not give special meanings to
terms with established general meanings
(Don’t use ‘cell’ when you mean ‘plant cell’)
87. ICNP: International Classification for
Nursing Practice (old version)
water =def. a type of Nursing Phenomenon
of Physical Environment with the specific
characteristics: clear liquid compound of
hydrogen and oxygen that is essential for
most plant and animal life influencing life
and development of human beings.
88. Principle of definitions
Supply definitions for every term
1. human-understandable natural language
definition
2. an equivalent formal definition
89. Principle: definitions must be unique
Each term should have exactly one definition
it may have both natural-language and
formal versions
(issue with ontologies which exist with
different levels of expressivity)
90. The Problem of Circularity
A Person =def. A person with an identity
document
Hemolysis =def. The causes of hemolysis
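Direct circularity of the kind shown above can be caught mechanically. The following is a rough lexical check, assuming a definition counts as circular when the defined term reappears in it as a whole word; it will not catch circles that run through several definitions.

```python
import re

def is_circular(term, definition):
    """True if the defined term reappears as a whole word in its own definition."""
    pattern = r"\b" + re.escape(term.lower()) + r"\b"
    return re.search(pattern, definition.lower()) is not None

print(is_circular("person", "A person with an identity document"))
print(is_circular("hemolysis", "The causes of hemolysis"))
print(is_circular("human being", "An animal which is rational"))
```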
92. Example: HL7
‘stopping a medication’ = def.
change of state in the record of a
Substance Administration Act from
Active to Aborted
93. Principle of Increase in
Understandability
A definition should use only terms which are
easier to understand than the term defined
Definitions should not make simple things
more difficult than they are
94. Generalized Tarski principle
(a good, general constraint on a
theory of meaning)
For each linguistic expression ‘E’
‘E’ means E
‘snow’ means: snow
‘pneumonia’ means: pneumonia
95. HL7 Reference Information Model
‘medication’ does not mean: medication
rather it means:
the record of medication in an information
system
‘disease’ does not mean: disease
rather it means:
the observation of a disease
96. Principle of Acknowledging Primitives
In every ontology some terms and some
relations are primitive = they cannot be
defined (on pain of infinite regress)
Examples of primitive relations:
identity
instance_of
97. Principle of Aristotelian Definitions
Use Aristotelian definitions
An A is a B which C’s.
A human being is an animal which is rational
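The "An A is a B which C's" pattern can be held as structured data, so that the natural-language definition is generated from its parts rather than written freehand. A minimal sketch; the class and field names are hypothetical.

```python
from dataclasses import dataclass

def article(noun):
    """Pick 'a' or 'an' by first letter (a rough rule, enough for this sketch)."""
    return "an" if noun[0].lower() in "aeiou" else "a"

@dataclass(frozen=True)
class AristotelianDefinition:
    term: str         # A: the term being defined
    genus: str        # B: the parent type, which is also the is_a parent
    differentia: str  # C: what distinguishes As from other Bs

    def text(self):
        head = f"{article(self.term).capitalize()} {self.term}"
        return f"{head} is {article(self.genus)} {self.genus} which {self.differentia}"

human = AristotelianDefinition("human being", "animal", "is rational")
print(human.text())
```

Storing the genus explicitly means every definition names exactly one parent, so the definitions and the is_a hierarchy cannot drift apart.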
98. Rules for Formulating Terms
Avoid abbreviations even when it is clear in
context what they mean (‘breast’ for
‘breast tumor’)
Avoid acronyms
Avoid mass terms (‘tissue’, ‘brain mapping’,
‘clinical research’ ...)
Treat each term ‘A’ in an ontology as
shorthand for a term of the form ‘the type
A’
99. Univocity
Terms should have the same meanings on
every occasion of use.
(= They should refer to the same types)
Basic ontological relations such as is_a and
part_of should be used in the same way
by all ontologies
101. Universality
Often, order will matter:
We can assert
adult transformation_of child
but not
child transforms_into adult
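Why order matters can be made concrete under one assumption about what a type-level assertion means: that "A R B" holds only if every instance of A stands in R to some instance of B. That all-some reading is an assumption of this sketch, and the instance data are invented.

```python
# Hypothetical instances of each type, and which adult developed from which child.
instances = {
    "adult": {"ann_2020", "bob_2020"},
    "child": {"ann_1990", "bob_1990", "cy_1990"},
}
transformation_of = {("ann_2020", "ann_1990"), ("bob_2020", "bob_1990")}

def holds_for_all(type_a, type_b, pairs):
    """All-some check: every instance of type_a is related to some instance of type_b."""
    return all(
        any((a, b) in pairs for b in instances[type_b])
        for a in instances[type_a]
    )

# The inverse relation at the instance level.
transforms_into = {(child, adult) for adult, child in transformation_of}

print(holds_for_all("adult", "child", transformation_of))
print(holds_for_all("child", "adult", transforms_into))
```

Every adult here was once a child, so the first assertion holds universally. Not every child becomes an adult, so the second does not.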
102. Universality
viral pneumonia caused by virus
but not
virus causes pneumonia
pneumococcal virus causes pneumonia
103. Principle of Universality
results analysis later_than protocol-design
but not
protocol-design earlier_than results
analysis
104. Principle of Positivity
Complements of types are not themselves
types.
Terms such as
non-mammal
non-membrane
other metalworker in New Zealand
do not designate types in reality
105. Generalized Anti-Boolean Principle
There are no conjunctive and disjunctive
types:
anatomic structure, system, or substance
musculoskeletal and connective tissue
disorder
106. Objectivity
Which types exist in reality is not a function
of our knowledge.
Terms such as
unknown
unclassified
unlocalized
arthropathies not otherwise specified
do not designate types in reality.
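The positivity, anti-Boolean and objectivity principles all flag terms that can often be spotted from the label alone. The following is a crude lexical linter under that assumption. The pattern lists are illustrative and incomplete, and a match is a prompt for human review, not proof of an error.

```python
import re

# Illustrative patterns only; real term labels will need a curated rule set.
RULES = {
    "positivity": re.compile(r"^(non-|other )", re.I),
    "anti-Boolean": re.compile(r"\b(and|or)\b", re.I),
    "objectivity": re.compile(
        r"\b(unknown|unclassified|unlocalized|not otherwise specified)\b", re.I
    ),
}

def lint(label):
    """Return the names of the principles a term label appears to violate."""
    return [name for name, pattern in RULES.items() if pattern.search(label)]

for label in ["non-mammal", "anatomic structure, system, or substance",
              "arthropathies not otherwise specified", "cell membrane"]:
    print(label, "->", lint(label))
```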
107. Keep Epistemology Separate from
Ontology
If you want to say that
We do not know where A’s are located
do not invent a new class of
A’s with unknown locations
(A well-constructed ontology should grow
linearly; it should not need to delete classes
or relations because of increases in
knowledge)
108. Keep Sentences Separate from
Terms
If you want to say
I surmise that this is a case of pneumonia
do not invent a new class of surmised
pneumonias
Confusion of ‘findings’ in medical terminologies
109. Single Inheritance
No kind in a classificatory hierarchy
should be asserted to have more
than one is_a parent on the
immediate higher level
111. Multiple Inheritance
is a source of errors
encourages laziness
serves as obstacle to integration with
neighboring ontologies
hampers use of Aristotelian methodology for
defining terms
hampers use of statistical search tools
113. Principle of asserted single
inheritance
Each reference ontology module should be
built as an asserted monohierarchy (a
hierarchy in which each term has at most
one parent)
Asserted hierarchy vs. inferred hierarchy
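Whether an asserted hierarchy is a monohierarchy is easy to test. A minimal sketch, assuming the asserted is_a links are available as (child, parent) pairs; the example edges are hypothetical.

```python
from collections import defaultdict

# Hypothetical asserted is_a edges as (child, parent) pairs.
asserted_is_a = [
    ("parietal pleura", "pleura"),
    ("visceral pleura", "pleura"),
    ("red car", "car"),
    ("red car", "red thing"),
]

def multiple_parents(edges):
    """Return each term asserted to have more than one is_a parent."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)
    return {c: sorted(p) for c, p in parents.items() if len(p) > 1}

print(multiple_parents(asserted_is_a))
```

Terms reported here are candidates for having one parent moved out of the asserted hierarchy and left to be inferred.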
115. Principle of instantiability
A term should be included in an ontology
only if there is evidence that instances to
which that term refers exist or have existed
or can exist in reality.
Fist
Crowd
116. Avoid mass nouns
Count nouns = an organism, a planet, a
handshake
Mass nouns = tissue, information, discourse
Mass nouns almost always go hand in hand
with ontological confusion
117. is_a Overloading
The success of ontology alignment
demands that ontological relations (is_a,
part_of, ...) have the same meanings in the
different ontologies to be aligned.
119. How to solve this problem
Create two ontologies:
of cars
of colors
Link the two together via cross-products
(= factoring, normalization, modularization)
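The cross-product idea can be sketched with two small term lists kept separate and combined only on demand. The car and color terms, and the `bearer` and `quality` keys, are hypothetical names chosen for this example.

```python
from itertools import product

# Two small, independent ontologies (hypothetical contents).
cars = ["sedan", "convertible"]
colors = ["red", "blue"]

def cross_product(qualities, bearers):
    """Compose compound terms, recording which term each part came from."""
    return {
        f"{q} {b}": {"bearer": b, "quality": q}
        for q, b in product(qualities, bearers)
    }

compound = cross_product(colors, cars)
print(compound["red sedan"])
```

Each source ontology keeps a single hierarchy, and a compound such as "red sedan" carries explicit links back to both, so nothing has to be asserted twice.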
120. Compositionality
The meanings of compound terms should
be determined
1. by the meanings of component terms
together with
2. the rules governing syntax
121. User feedback principle
An ontology should evolve on the basis of
feedback derived from those who are using
the ontology, for example for purposes of
annotation.
Problem example: ‘chromosome’ in Sequence Ontology and in Cell Component Ontology means different things
Current solution: two distinct terms involved (qualified by respective namespace)
There is no species called ‘non-rabbit’
There is no biological species: unknown rabbit. See discussion below.