The need to recognise biomedical and clinical concepts in free text has been driven by demand for semantic information retrieval and decision support. Comprehensive, large-scale ontologies, such as the Foundational Model of Anatomy (FMA) and the Disease Ontology (DO), form the building blocks of the Unified Medical Language System (UMLS) and are the basis of dictionary-based biomedical concept recognisers such as MetaMap. However, these tools typically require substantial computing resources in terms of disk space, memory and processing time to execute. Recently, regular-expression (regex) based concept recognisers such as mGrep have begun to address this shortcoming, but a method that allows researchers to create their own concept recogniser from a given ontology remains unexplained.
This presentation describes a method for semantic decomposition of biomedical ontologies, applied to the FMA and DO to create a high-performance tool for identifying anatomical and disease concepts in free text. The method involves 1) tokenizing each ontology into distinct words, 2) extracting free and bound morphemes from the word list, 3) classifying each morpheme according to semantic type or grammatical role, 4) generating regexes over each morpheme set, and 5) applying simple grammatical rules over the regexes to identify potential concepts. We evaluate its precision and recall against manually annotated clinical and biomedical corpora, and compare the results with the performance of 1) direct ontology lookup and 2) MetaMap against the same corpora.
As measured by the Mann-Whitney rank sum test, the method demonstrates a significant (p < 0.01) improvement in accuracy over direct ontology lookup. Against MetaMap, it also demonstrates a measurable improvement in accuracy, although this is not statistically significant (p > 0.05); its main benefit over MetaMap is a reduction in processing time of roughly two orders of magnitude.
Semantic decomposition of ontologies for creation of flexible biomedical concept recognisers
1. Semantic decomposition of ontological
resources for the creation of flexible, high-
performance biomedical concept recognisers
26 June 2012
Phil Gooch
Centre for Health Informatics
2. Overview
●
Why identify biomedical concepts in free text?
●
How ontologies can help
●
Problems with using ontologies for concept identification
●
Potential solutions
●
Application of method to two ontologies: Foundational Model of
Anatomy and Disease Ontology
●
Evaluation against a small corpus of 163 clinical discharge
summaries, surgical, pathology and radiology reports
3. Why identify biomedical concepts in free text?
●
Indexing MedLine abstracts for semantic search
– Identifying 'hypertension' as being of semantic type 'disease',
moreover being a cardiovascular disease
●
Literature based knowledge discovery
– Disease D associated with increase in physiological function F
– Substance S inhibits F
– => S might be a treatment for D
●
Decision support
– What treatment recommendations do clinical guideline
documents provide for hypertension in pregnancy?
– What were the findings of the pathology report?
– 50% of clinically important information resides in the free text of
the patient record, rather than in structured fields (Sittig 2007)
4. Ontologies
●
Define the concepts of a given domain, their properties and their
relationships
– Provide canonical names for terms
– Classification hierarchy, whole-part relations and synonyms
●
Can function as a dictionary, a lookup list of terms for concept
identification via string matching
●
Or defined properties can be used to infer concepts
– A Company issues Shares
– 'shares in Abc fell' => 'Abc' is a Company
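The two uses of an ontology described on this slide can be sketched as follows. This is a minimal illustration in Python, not the tool described in this talk: the `TERMS` dictionary, its entries and the `ISSUER` pattern are hypothetical and are not taken from the FMA or DO.

```python
import re

# Hypothetical mini-dictionary: surface form -> canonical name.
# Entries are illustrative only, not drawn from any real ontology.
TERMS = {
    "heart": "Heart",
    "hypertension": "Hypertension",
    "high blood pressure": "Hypertension",  # synonym of the canonical term
}

# Longest alternatives first, so multi-word terms win over shorter overlaps.
_alternatives = sorted(TERMS, key=len, reverse=True)
LOOKUP = re.compile(
    r"\b(" + "|".join(re.escape(t) for t in _alternatives) + r")\b",
    re.IGNORECASE,
)

def find_terms(text):
    """Dictionary use: return (matched text, canonical name) pairs."""
    return [(m.group(1), TERMS[m.group(1).lower()]) for m in LOOKUP.finditer(text)]

# Property use: 'A Company issues Shares', so whatever follows
# 'shares in' is inferred to be a Company (assumed surface pattern).
ISSUER = re.compile(r"\bshares in ([A-Z]\w*)")

def infer_companies(text):
    return ISSUER.findall(text)
```

For example, `find_terms("History of high blood pressure; heart sounds normal.")` maps the synonym to its canonical name, and `infer_companies("shares in Abc fell")` picks out 'Abc' without it being listed anywhere.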
5. Problems with biomedical ontologies for concept identification
●
Often very large
– Foundational Model of Anatomy > 200MB, 150K+ terms
– Even when expressed in a compact data structure (e.g. Trie),
potentially large RAM overhead when used to match strings
●
May not be complete: how to identify potentially new terms,
classes
●
May not contain all synonyms or other ways of expressing terms,
e.g. abbreviations
– Separate lists of word variations often compiled (e.g. NLM
SPECIALIST lexical variant generation tools)
6. Some solutions
●
Hearst patterns (Hearst 1992)
– Identify hypernymic (class-member) relations
– 'Bruises, cuts, and other injuries'
– 'Diseases such as atherosclerosis'
– High precision, but low recall
●
Bootstrapping
– 'scaphoid, lunate, triquetral and pisiform'
– If we know that the scaphoid and lunate are bones of the wrist,
we can infer that the others in this list are also
– Improves recall, but reduces precision (Maynard 2009)
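Both ideas on this slide can be sketched with a few lines of Python. The patterns below are simplified versions of two Hearst (1992) patterns, restricted to single-word items for illustration. The `bootstrap` helper and its `min_seeds` threshold are assumptions made for this sketch, not a published algorithm.

```python
import re

# Simplified Hearst-style patterns over single-word items (an assumption;
# real noun phrases would need a chunker).
SUCH_AS = re.compile(
    r"(\w+) such as ((?:\w+, )*(?:\w+,? (?:and|or) )?\w+)", re.IGNORECASE)
AND_OTHER = re.compile(
    r"((?:\w+, )*\w+),? (?:and|or) other (\w+)", re.IGNORECASE)

_SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")

def split_list(items):
    """Split 'a, b and c' into ['a', 'b', 'c']."""
    return [w for w in _SEPARATOR.split(items) if w]

def hearst_pairs(text):
    """Return (member, class) pairs found by the two patterns."""
    pairs = []
    for m in SUCH_AS.finditer(text):
        pairs += [(w.lower(), m.group(1).lower()) for w in split_list(m.group(2))]
    for m in AND_OTHER.finditer(text):
        pairs += [(w.lower(), m.group(2).lower()) for w in split_list(m.group(1))]
    return pairs

def bootstrap(list_text, known, min_seeds=2):
    """If at least min_seeds items of a list are already known members of a
    class, propose the remaining items as new members (hypothetical rule)."""
    items = [w.lower() for w in split_list(list_text)]
    if sum(w in known for w in items) < min_seeds:
        return []
    return [w for w in items if w not in known]
```

With the slide's own examples, `hearst_pairs("Bruises, cuts, and other injuries")` yields bruises and cuts as members of 'injuries', and `bootstrap("scaphoid, lunate, triquetral and pisiform", {"scaphoid", "lunate"})` proposes the two remaining bones. The bootstrap rule accepts any list item unchecked, which is one way to see why recall improves at the cost of precision.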
7. Some solutions
●
Domain-specific linguistic features
– Neoclassical combining forms
– Biomedical and clinical terms often composed of or contain well-
defined Latin and Greek roots, suffixes and prefixes
– -osis, -itis, -opathy => disease
– cardi-, ileo- => anatomy
– High precision, but low recall (Gooch & Roudsari 2011)
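A minimal Python sketch of classification by combining form, using only the affixes listed on this slide. The function name `classify` and the choice to return a set of types are assumptions for illustration.

```python
import re

# Suffixes and prefixes from the slide; a real list would be much longer.
DISEASE_SUFFIX = re.compile(r"\w+(?:osis|itis|opathy)", re.IGNORECASE)
ANATOMY_PREFIX = re.compile(r"(?:cardi|ileo)\w+", re.IGNORECASE)

def classify(word):
    """Return the semantic types suggested by the word's combining forms."""
    types = set()
    if DISEASE_SUFFIX.fullmatch(word):
        types.add("disease")
    if ANATOMY_PREFIX.fullmatch(word):
        types.add("anatomy")
    return types
```

A word such as 'cardiomyopathy' carries both an anatomical prefix and a disease suffix, so it receives both types, while a word with no listed affix receives none, which reflects the low recall noted above.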
8. Some solutions
●
NLM MetaMap (Aronson 2010): uses neoclassical combining
forms + lexical variant generation + ontologies
– Comprehensive, but heavyweight (4GB+ RAM, 10GB+ install)
●
mGrep (Meng 2009) radix trie-based lookup over ontologies
– Fast, higher precision but lower recall than MetaMap (Shah
2009)
– Still requires the complete source ontologies
– Requires substantial preprocessing of input text via the NCBO
web service (NCBO Support 2011)
10. Semantic decomposition of ontologies
●
Provide a systematic method of reducing the size of large
ontologies to make their use for concept identification feasible
●
Reproducible method so that concept recognisers for new
ontologies can be quickly developed
●
Has spin-off benefits for ontology quality assurance
– E.g. identification of spelling errors and lexical inconsistencies in
biomedical ontologies (Gooch 2011)
11. Semantic decomposition of ontologies
●
Little published work in this area
●
Tong et al (2008) decomposed the Gene Ontology into individual
tokens (words) and calculated the positional entropy of each token
via the probability of token t appearing at position p in a given
ontology term
●
Could be applied to identifying potential ontology terms in free text,
but wasn't evaluated
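The positional entropy idea can be sketched as below. The exact formulation used by Tong et al is not given in these slides, so this sketch assumes the Shannon entropy of the distribution of word positions for each token, with positions counted from the start of the term. The example terms are hypothetical.

```python
import math
from collections import Counter, defaultdict

def positional_entropy(terms):
    """For each token, compute the Shannon entropy (in bits) of the
    positions at which it occurs across all terms.
    Assumed formulation: H(t) = -sum_p P(p|t) * log2 P(p|t)."""
    positions = defaultdict(Counter)
    for term in terms:
        for p, token in enumerate(term.lower().split()):
            positions[token][p] += 1
    entropy = {}
    for token, counts in positions.items():
        total = sum(counts.values())
        entropy[token] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values())
    return entropy

# Hypothetical anatomy-like terms, for illustration only.
h = positional_entropy(["left lung", "left kidney", "lung lobe", "lobe of left lung"])
```

A token that always appears in the same slot has entropy 0, while a token spread evenly over three positions has entropy log2(3). Low-entropy tokens are therefore good candidates for fixed slots in a term pattern.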
12. Semantic decomposition of ontologies
●
Initial focus on Foundational Model of Anatomy (FMA) (Rosse
2003) as anatomical terms are central to the identification of
– location of disease, morbidity
– location of symptoms
– location of procedures – surgery, pathology and radiology
reports
– administration route of medication
●
Apply the method to the Disease Ontology (Osborne et al 2009) to
see how well it generalises
13. Semantic decomposition of ontologies
●
Extend Tong et al's idea but classify each token according to its part of
speech (noun, adjective etc) and its semantic type
●
Reduce the set of tokens further by identifying words (free
morphemes) sharing common roots and suffixes (bound morphemes)
●
Morpheme – smallest linguistic unit that has meaning (cephalon,
-derm, -ium, -rrhea)
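One way to picture the reduction step is to split words into a root plus a known suffix, then group the roots that share a suffix. This is a sketch under stated assumptions: the suffix list is a hypothetical subset based on the examples above, and the longest-suffix rule is a simplification, not the morpheme extraction procedure used in this work.

```python
import re
from collections import defaultdict

# Hypothetical bound morphemes (suffixes) taken from the slide's examples.
SUFFIXES = ("derm", "ium", "rrhea")

def decompose(words, suffixes=SUFFIXES):
    """Split each word into root + longest known suffix.
    Words with no known suffix are kept whole as free morphemes."""
    by_suffix, free = defaultdict(set), set()
    ordered = sorted(suffixes, key=len, reverse=True)
    for word in (w.lower() for w in words):
        suffix = next(
            (s for s in ordered if word.endswith(s) and len(word) > len(s)), None)
        if suffix:
            by_suffix[suffix].add(word[: -len(suffix)])
        else:
            free.add(word)
    return dict(by_suffix), free

def to_regex(by_suffix, free):
    """Recombine the morpheme sets into a single compact pattern."""
    parts = ["(?:" + "|".join(sorted(roots)) + ")" + suffix
             for suffix, roots in sorted(by_suffix.items())]
    parts += sorted(free)
    return r"\b(?:" + "|".join(parts) + r")\b"

WORDS = ["ectoderm", "endoderm", "mesoderm", "epithelium", "endothelium", "cephalon"]
by_suffix, free = decompose(WORDS)
pattern = to_regex(by_suffix, free)
```

Six words collapse to two suffix groups plus one free morpheme, and the generated pattern still matches every original word.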
14. Regular expressions
●
Used to match sequences of characters against some input
●
Written in a formal language that describes the patterns in the input
that we wish to match
●
For this task, we precompile sets of regular expressions (regex)
generated from the set of morphemes extracted from the ontology
●
We write recombination rules over the regexes which include stop-
words (determiners, prepositions) to identify candidate noun phrases
and prepositional phrases that look like ontology terms
16. Regular expression and pattern generation
●
Create regexes from the union of entries (with morphological variants)
in each set
– nounPattern = … macula | malleus | mandible |
manubri(um|a) | manus ...
●
Top and tail with word boundaries, with optional plurality
– noun = "\b(" + nounPattern + ")s?\b"
– adjective = "\b(" + adjPattern + ")\b"
●
Combine regex output with patterns
– NP = adjective{0,5} (noun | properNoun){1,5}
– PP = NP “of|on” NP
– Term = NP | PP
●
Test by running the patterns against the complete ontology – all terms
should be matched
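The pattern generation steps above can be sketched in Python as follows. The noun entries come from the slide. The adjective entries are hypothetical, `properNoun` is omitted, and folding the NP and PP rules into a single regex with whitespace separators is an assumption of this sketch, not necessarily how the rules are applied in the actual tool.

```python
import re

# Noun alternatives from the slide; adjective alternatives are hypothetical.
NOUN_PATTERN = "macula|malleus|mandible|manubri(?:um|a)|manus"
ADJ_PATTERN = "left|right|anterior"

# Top and tail with word boundaries; nouns take an optional plural 's'.
noun = r"\b(?:" + NOUN_PATTERN + r")s?\b"
adjective = r"\b(?:" + ADJ_PATTERN + r")\b"

# NP = adjective{0,5} noun{1,5}; PP = NP (of|on) NP; Term = NP | PP.
NP = r"(?:" + adjective + r"\s+){0,5}" + noun + r"(?:\s+" + noun + r"){0,4}"
PP = NP + r"\s+(?:of|on)\s+" + NP
# PP is tried first so that the longer reading wins where both apply.
TERM = re.compile(PP + "|" + NP, re.IGNORECASE)

def find_terms(text):
    """Return candidate ontology-like terms found in free text."""
    return [m.group(0) for m in TERM.finditer(text)]

def unmatched(ontology_terms):
    """Self-test from the slide: every ontology term should be matched
    in full, so this list should be empty for a complete pattern set."""
    return [t for t in ontology_terms if not TERM.fullmatch(t)]
```

Running `find_terms` on 'Fracture of the left mandible and the manubria of malleus.' returns a noun phrase and a prepositional phrase. `unmatched` reports any term the morpheme sets cannot yet express, which is the completeness check described in the last bullet.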
18. Evaluation
●
Corpus of discharge summaries, progress notes, and surgical,
radiology and pathology reports (Savova et al 2011)
●
Manually annotated for mentions of anatomical and disease
concepts
●
Compare manually identified terms against system-generated
terms via semantic decomposition/recombination pattern approach
vs direct ontology lookup vs MetaMap
●
Calculate precision (tp / (tp + fp)), recall (tp / (tp + fn)) and F-measure
(2 * P * R / (P + R)), and the Mann-Whitney U test between approaches
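The evaluation measures can be written out as a short Python sketch. The counts and score lists in the usage example are hypothetical. The slides do not say which samples were compared, so treating them as per-document scores is an assumption. Only the U statistic is computed here, since a p-value needs its sampling distribution, which is normally taken from a statistics library.

```python
def precision(tp, fp):
    """Fraction of system-generated terms that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp, fn):
    """Fraction of manually annotated terms that the system found."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

def mann_whitney_u(xs, ys):
    """U statistic for xs against ys: the number of (x, y) pairs
    with x > y, counting ties as one half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r = precision(90, 10), recall(90, 30)
f = f_measure(p, r)
# Hypothetical per-document scores for two approaches.
u = mann_whitney_u([0.9, 0.8, 0.7], [0.6, 0.5, 0.7])
```

Note the parenthesisation: precision divides tp by the sum (tp + fp), and the F-measure divides by (P + R).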
19. Results – Anatomical terms
Method          P             R      F             Time
Semantic        0.36 (0.89)   0.91   0.51 (0.90)   19s
Direct lookup   0.22 (0.54)   0.73   0.34 (0.62)   10s
MetaMap         0.30 (0.75)   0.86   0.44 (0.80)   2239s
Figures in parentheses denote results after corpus correction
Semantic vs direct lookup: significant increase in P and R (p < 0.01)
Semantic vs MetaMap: increase in P and R, but not significant (p > 0.05)
20. Error analysis – Anatomical terms
●
Many false positives (87.9%) were in fact correct terms – missing
from the manually annotated corpus
●
Adding these missing annotations increased precision from 0.36 to
0.89
●
Remaining FPs were partial matches, e.g. 'nonspecific bowel', 'a
haploidentical bone marrow', 'normal sinus', and non-specific
anatomical areas, e.g. 'multifocal areas', 'particular organ site',
'pruritic areas'.
●
Phrases not in the ontology as discrete terms picked up by
semantic method, e.g. 'angiolymphatic space', 'dentate line'
21. Results – Disease terms
Method          P      R      F      Time
Semantic        0.58   0.68   0.62   12s
Direct lookup   0.69   0.27   0.37   9s
MetaMap         0.46   0.83   0.59   1748s
Semantic vs direct lookup: significant increase in R (p << 0.01), significant
decrease in P (p < 0.01), overall significant increase in F (p < 0.01)
Semantic vs MetaMap: significant increase in P (p << 0.01), but significant
decrease in R (p < 0.01), overall increase in F but not significant (p > 0.05)
22. Error analysis – Disease terms
●
Factors affecting recall:
– Abbreviations (e.g. COPD)
– Definite descriptors ('the disease', 'her infirmity')
– Symptoms annotated as disease ('mood changes', 'double
vision')
●
Factors affecting precision
– Terms manually annotated as Symptoms being marked as
Disease e.g. 'difficulty walking'
– Some inconsistent manual annotation of negated terms, family
history etc
23. Conclusion
●
Semantic decomposition and regex/pattern-based recombination
of ontology terms is slightly slower than directly looking up terms
and synonyms extracted from the ontology, but leads to
significantly increased accuracy that balances precision and recall
●
Against MetaMap, the improvements are measurable but not
statistically significant for anatomical terms, but precision is
significantly improved for disease terms. However, processing is
roughly two orders of magnitude faster (19s vs 2239s; 12s vs 1748s).
●
Our findings are comparable to Shah et al (2009) for mGrep vs
MetaMap, but we now have a systematic method for creating new
concept recognisers from scratch
24. Further work
●
Calculate positional entropy of each morpheme and use these to
help generate patterns (e.g. some morphemes are more likely to
occur at the start or end of a pattern)
●
Improve lookup performance by using a radix trie (better for
morpheme sets that share long prefixes and suffixes) rather than
standard java.util.regex
●
Apply method to other biomedical ontologies
●
Evaluate against other corpora, e.g. annotated MedLine abstracts
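To illustrate the trie-based lookup proposed above, here is a minimal Python sketch. It is a plain, uncompressed character trie, not a radix trie (a radix trie would also merge single-child chains into one edge), and the class and method names are hypothetical.

```python
class Trie:
    """Plain character trie with longest-prefix lookup."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True  # empty-string key marks the end of a stored entry

    def longest_prefix(self, text, start=0):
        """Return the longest stored entry that text[start:] begins with."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if "" in node:
                best = text[start:i + 1]
        return best

trie = Trie(["cardi", "cardio", "ileo"])
```

Entries that share a prefix, such as 'cardi' and 'cardio', share a path, so one pass over the input finds the longest matching morpheme.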