Semantic decomposition of ontological resources for the creation of
flexible, high-performance biomedical concept recognisers
26 June 2012
Phil Gooch
Centre for Health Informatics
Overview
●
Why identify biomedical concepts in free text?
●
How ontologies can help
●
Problems with using ontologies for concept identification
●
Potential solutions
●
Application of method to two ontologies: Foundational Model of
Anatomy and Disease Ontology
●
Evaluation against a small corpus of 163 clinical discharge
summaries, surgical, pathology and radiology reports
Why identify biomedical concepts in free text?
●
Indexing MedLine abstracts for semantic search
– Identifying 'hypertension' as being of semantic type 'disease',
moreover being a cardiovascular disease
●
Literature based knowledge discovery
– Disease D associated with increase in physiological function F
– Substance S inhibits F
– => S might be a treatment for D
●
Decision support
– What treatment recommendations do clinical guideline
documents provide for hypertension in pregnancy?
– What were the findings of the pathology report?
– 50% of clinically important information resides in the free text of
the patient record, rather than in structured fields (Sittig 2007)
Ontologies
●
Define the concepts of a given domain, their properties and their
relationships
– Provide canonical names for terms
– Classification hierarchy, whole-part relations and synonyms
●
Can function as a dictionary: a lookup list of terms for concept
identification via string matching
●
Or defined properties can be used to infer concepts
– A Company issues Shares
– 'shares in Abc fell' => 'Abc' is a Company
Problems with biomedical ontologies for concept identification
●
Often very large
– Foundational Model of Anatomy > 200MB, 150K+ terms
– Even when expressed in a compact data structure (e.g. Trie),
potentially large RAM overhead when used to match strings
●
May not be complete: how to identify potentially new terms,
classes
●
May not contain all synonyms or other ways of expressing terms,
e.g. abbreviations
– Separate lists of word variations often compiled (e.g. NLM
SPECIALIST lexical variant generation tools)
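A minimal sketch of dictionary-style lookup, to make the trade-off above concrete. It assumes a word-level trie built from a tiny, hypothetical term list (the names `TrieNode`, `build_trie` and `find_terms` are illustrative, not from the slides) and uses Python rather than the Java implementation the slides refer to. Every ontology term has to be held in the trie, and a phrase absent from the ontology is simply missed.

```python
from typing import Dict, Iterable, List


class TrieNode:
    """One node of a word-level trie; children are keyed by token."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.is_term = False


def build_trie(terms: Iterable[str]) -> TrieNode:
    """Insert each term, token by token, into a trie."""
    root = TrieNode()
    for term in terms:
        node = root
        for token in term.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.is_term = True
    return root


def find_terms(text: str, root: TrieNode) -> List[str]:
    """Return the longest dictionary term starting at each token position."""
    tokens = text.lower().split()
    matches: List[str] = []
    i = 0
    while i < len(tokens):
        node, j, longest = root, i, 0
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.is_term:
                longest = j - i
        if longest:
            matches.append(" ".join(tokens[i:i + longest]))
            i += longest
        else:
            i += 1
    return matches


# Hypothetical mini-dictionary standing in for an ontology export.
trie = build_trie(["bone", "bone marrow", "left ventricle"])
found = find_terms("biopsy of the bone marrow and left ventricle", trie)
missed = find_terms("involvement of the dentate line", trie)
```

Here `found` holds the two dictionary terms, while `missed` is empty because 'dentate line' is not in the list: the completeness problem in miniature.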
Some solutions
●
Hearst patterns (Hearst 1992)
– Identify hypernymic (class-member) relations
– 'Bruises, cuts, and other injuries'
– 'Diseases such as atherosclerosis'
– High precision, but low recall
●
Bootstrapping
– 'scaphoid, lunate, triquetral and pisiform'
– If we know that the scaphoid and lunate are bones of the wrist,
we can infer that the others in this list are also
– Improves recall, but reduces precision (Maynard 2009)
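The two techniques on this slide can be sketched roughly as follows. This is an illustrative approximation in Python, not the implementation from the cited papers: it covers only two Hearst patterns ('X such as Y' and 'Y and other X'), treats every list item as a single word, and uses a hypothetical `min_seeds` threshold for bootstrapping.

```python
import re
from typing import Dict, List, Tuple

# A list item is a single word that is not the conjunction itself.
ITEM = r"(?!(?:and|or)\b)\w+"
SUCH_AS = re.compile(
    rf"\b(?P<hypernym>\w+) such as "
    rf"(?P<hyponyms>{ITEM}(?:, {ITEM})*(?:,? (?:and|or) {ITEM})?)",
    re.IGNORECASE,
)
AND_OTHER = re.compile(
    rf"\b(?P<hyponyms>{ITEM}(?:, {ITEM})*),? (?:and|or) other (?P<hypernym>\w+)",
    re.IGNORECASE,
)
SEPARATOR = re.compile(r"\s*,\s*(?:(?:and|or)\s+)?|\s+(?:and|or)\s+")


def split_list(text: str) -> List[str]:
    """Split 'a, b and c' (with or without an Oxford comma) into items."""
    return [item for item in SEPARATOR.split(text) if item]


def extract_relations(text: str) -> List[Tuple[str, str]]:
    """Return (member, class) pairs found by the two Hearst patterns."""
    pairs = []
    for pattern in (SUCH_AS, AND_OTHER):
        for match in pattern.finditer(text):
            hypernym = match.group("hypernym").lower()
            for hyponym in split_list(match.group("hyponyms")):
                pairs.append((hyponym.lower(), hypernym))
    return pairs


def bootstrap(coordinated: str, known: Dict[str, str],
              min_seeds: int = 2) -> Dict[str, str]:
    """If enough items in a list share one known class, label the rest."""
    items = [item.lower() for item in split_list(coordinated)]
    seed_classes = [known[item] for item in items if item in known]
    if len(seed_classes) >= min_seeds and len(set(seed_classes)) == 1:
        return {item: seed_classes[0] for item in items if item not in known}
    return {}


inferred = bootstrap(
    "scaphoid, lunate, triquetral and pisiform",
    {"scaphoid": "wrist bone", "lunate": "wrist bone"},
)
```

The bootstrapping step labels 'triquetral' and 'pisiform' as wrist bones here, but it would just as readily label a non-bone that happened to appear in the same list, which is where the precision loss comes from.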
Some solutions
●
Domain-specific linguistic features
– Neoclassical combining forms
– Biomedical and clinical terms often composed of or contain well-
defined Latin and Greek roots, suffixes and prefixes
– -osis, -itis, -opathy => disease
– cardi-, ileo- => anatomy
– High precision, but low recall (Gooch & Roudsari 2011)
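As a rough illustration of the combining-form idea, the sketch below maps the affixes listed on this slide to a semantic type. It uses only those five affixes, and the function name `classify` is hypothetical; a realistic affix inventory would be much larger.

```python
import re
from typing import Set

# Affixes taken from the slide; anything beyond these is out of scope here.
DISEASE_SUFFIX = re.compile(r"\w+(?:osis|itis|opathy)", re.IGNORECASE)
ANATOMY_PREFIX = re.compile(r"(?:cardi|ileo)\w+", re.IGNORECASE)


def classify(word: str) -> Set[str]:
    """Return the semantic types suggested by a word's combining forms."""
    types = set()
    if DISEASE_SUFFIX.fullmatch(word):
        types.add("disease")
    if ANATOMY_PREFIX.fullmatch(word):
        types.add("anatomy")
    return types
```

For example, `classify("cardiomyopathy")` returns both types (an anatomical root plus a disease suffix), while `classify("hypertension")` returns an empty set: the word is a disease term but carries none of the listed affixes, which is the low-recall problem noted above.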
Some solutions
●
NLM MetaMap (Aronson 2010): uses neoclassical combining
forms + lexical variant generation + ontologies
– Comprehensive, but heavyweight (4GB+ RAM, 10GB+ install)
●
mGrep (Meng 2009): radix trie-based lookup over ontologies
– Fast, higher precision but lower recall than MetaMap (Shah
2009)
– Still requires the complete source ontologies
– Requires substantial preprocessing of input text via the NCBO
web service (NCBO Support 2011)
Semantic decomposition of ontologies
●
Provide a systematic method of reducing the size of large
ontologies to make their use for concept identification feasible
●
Reproducible method so that concept recognisers for new
ontologies can be quickly developed
●
Has spin-off benefits for ontology quality assurance
– E.g. identification of spelling errors and lexical inconsistencies in
biomedical ontologies (Gooch 2011)
Semantic decomposition of ontologies
●
Little published work in this area
●
Tong et al (2008) decomposed the Gene Ontology into individual
tokens (words) and calculated the positional entropy of each token
via the probability of token t appearing at position p in a given
ontology term
●
Could be applied to identifying potential ontology terms in free text,
but wasn't evaluated
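One way to read 'positional entropy' is sketched below. It assumes the Shannon entropy of each token's distribution over absolute positions within a term, H(t) = -Σ_p P(p|t) log2 P(p|t); the exact formulation used by Tong et al may differ, and the three terms are made up for illustration.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable


def positional_entropy(terms: Iterable[str]) -> Dict[str, float]:
    """Entropy (in bits) of each token's position distribution across terms."""
    positions = defaultdict(Counter)
    for term in terms:
        for position, token in enumerate(term.lower().split()):
            positions[token][position] += 1

    entropy = {}
    for token, counts in positions.items():
        total = sum(counts.values())
        entropy[token] = -sum(
            (count / total) * math.log2(count / total)
            for count in counts.values()
        )
    return entropy


# Hypothetical terms, not taken from the Gene Ontology.
h = positional_entropy(["left ventricle", "left atrium", "wall of left ventricle"])
```

A token that always occupies the same slot ('atrium' here) has zero entropy; a token that moves around ('ventricle', seen at two positions equally often) has higher entropy and is a weaker positional cue.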
Semantic decomposition of ontologies
●
Initial focus on Foundational Model of Anatomy (FMA) (Rosse
2003) as anatomical terms are central to the identification of
– location of disease, morbidity
– location of symptoms
– location of procedures – surgery, pathology and radiology
reports
– administration route of medication
●
Apply the method to the Disease Ontology (Osborne et al 2009) to
see how well it generalises
Semantic decomposition of ontologies
●
Extend Tong et al's idea but classify each token according to its part of
speech (noun, adjective etc) and its semantic type
●
Reduce the set of tokens further by identifying words (free
morphemes) sharing common roots and suffixes (bound morphemes)
●
Morpheme – smallest linguistic unit that has meaning (cephalon,
-derm, -ium, -rrhea)
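The reduction step might look something like the sketch below, which folds tokens sharing a root into one compact alternation. The suffix list, the minimum root length and the function name `compress` are all assumptions made for illustration; the slides do not specify how roots and suffixes were identified.

```python
import re
from collections import defaultdict
from typing import Iterable, Sequence

# Hypothetical inflectional endings; a real list would come from the ontology.
SUFFIXES = ("um", "a", "us", "i")


def compress(tokens: Iterable[str], suffixes: Sequence[str] = SUFFIXES) -> str:
    """Group tokens by shared root and emit a single regex alternation."""
    endings_by_root = defaultdict(set)
    ordered = sorted(suffixes, key=len, reverse=True)  # longest suffix first
    for token in tokens:
        for suffix in ordered:
            # Require a root of at least three characters (an assumption).
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                endings_by_root[token[:-len(suffix)]].add(suffix)
                break
        else:
            endings_by_root[token].add("")

    parts = []
    for root in sorted(endings_by_root):
        endings = sorted(endings_by_root[root])
        if endings == [""]:
            parts.append(re.escape(root))
        else:
            alternatives = "|".join(re.escape(e) for e in endings)
            parts.append(f"{re.escape(root)}(?:{alternatives})")
    return "|".join(parts)


tokens = ["manubrium", "manubria", "malleus", "mallei", "mandible"]
pattern = compress(tokens)
```

Five tokens collapse into three alternatives, and every original token still matches the resulting pattern in full.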
Regular expressions
●
Used to match sequences of characters against some input
●
Written in a formal language that describes the patterns in the input
that we wish to match
●
For this task, we precompile sets of regular expressions (regex)
generated from the set of morphemes extracted from the ontology
●
We write recombination rules over the regexes which include stop-
words (determiners, prepositions) to identify candidate noun phrases
and prepositional phrases that look like ontology terms
Regular expression and pattern generation
●
Create regexes from the union of entries (with morphological variants)
in each set
– nounPattern = … macula | malleus | mandible |
manubri(um|a) | manus ...
●
Top and tail with word boundaries, with optional plurality
– noun = \b( + nounPattern + )s?\b
– adjective = \b( + adjPattern + )\b
●
Combine regex output with patterns
– NP = adjective{0,5} (noun | properNoun){1,5}
– PP = NP “of|on” NP
– Term = NP | PP
●
Test by running the patterns against the complete ontology – all terms
should be matched
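The generation and recombination steps on this slide can be sketched as below. This is a simplified approximation using Python's `re` module in place of `java.util.regex`: the noun and adjective lists are tiny hypothetical samples, proper nouns are omitted, and the optional determiner 'the' inside the prepositional phrase is an assumption based on the stop-word rule mentioned earlier.

```python
import re
from typing import Iterable, List

# Hypothetical samples; in practice these sets are extracted from the ontology.
NOUNS = ["macula", "malleus", "mandible", "manubri(?:um|a)", "manus",
         "bone", "wrist"]
ADJECTIVES = ["left", "right", "anterior", "posterior"]

# Top and tail with word boundaries; nouns take an optional plural 's'.
noun = r"\b(?:" + "|".join(NOUNS) + r")s?\b"
adjective = r"\b(?:" + "|".join(ADJECTIVES) + r")\b"

# NP = adjective{0,5} noun{1,5}
noun_phrase = (r"(?:" + adjective + r"\s+){0,5}" + noun
               + r"(?:\s+" + noun + r"){0,4}")
# PP = NP "of|on" NP, allowing an optional determiner (assumption).
prep_phrase = noun_phrase + r"\s+(?:of|on)\s+(?:the\s+)?" + noun_phrase
# Term = PP | NP; PP is tried first so the longer phrase wins.
TERM = re.compile(prep_phrase + "|" + noun_phrase, re.IGNORECASE)


def find_terms(text: str) -> List[str]:
    """Return candidate ontology-like terms found in free text."""
    return [match.group(0) for match in TERM.finditer(text)]


def unmatched(ontology_terms: Iterable[str]) -> List[str]:
    """Self-test: list the ontology terms the patterns fail to match in full."""
    return [term for term in ontology_terms if TERM.fullmatch(term) is None]
```

Calling `find_terms("anterior bones of the wrist were intact")` yields the whole prepositional phrase as one candidate, even though that exact phrase appears in no term list. `unmatched` corresponds to the self-test in the last bullet: it should return an empty list when run over the ontology the sets were derived from.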
Evaluation
●
Corpus of discharge summaries, progress notes, and surgical,
radiology and pathology reports (Savova et al 2011)
●
Manually annotated for mentions of anatomical and disease
concepts
●
Compare manually identified terms against system-generated
terms via semantic decomposition/recombination pattern approach
vs direct ontology lookup vs MetaMap
●
Calculate precision (tp / (tp + fp)), recall (tp / (tp + fn)), and F-measure
(2 * P * R / (P + R)), and Mann-Whitney U between approaches
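For reference, the metrics named above can be computed as in this sketch. The counts and scores in the example are hypothetical, not taken from the evaluation, and the Mann-Whitney function returns only the two U statistics (by direct pair counting, with ties counted as one half); turning U into a p-value is left out.

```python
from typing import Sequence, Tuple


def precision(tp: int, fp: int) -> float:
    """Fraction of system-generated terms that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of manually annotated terms that the system found."""
    return tp / (tp + fn) if tp + fn else 0.0


def f_measure(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0


def mann_whitney_u(xs: Sequence[float],
                   ys: Sequence[float]) -> Tuple[float, float]:
    """Return (U_x, U_y): how often a value in one sample beats the other."""
    u_x = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u_x += 1.0
            elif x == y:
                u_x += 0.5
    return u_x, len(xs) * len(ys) - u_x


# Hypothetical counts for one document: 8 correct, 2 spurious, 8 missed.
p = precision(tp=8, fp=2)
r = recall(tp=8, fn=8)
f = f_measure(p, r)
```

With these counts, precision is 0.8 and recall 0.5, giving an F-measure of about 0.62. Per-document scores from two approaches could then be passed to `mann_whitney_u` as the two samples.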
Results – Anatomical terms
Method          P             R      F             Time
Semantic        0.36 (0.89)   0.91   0.51 (0.90)   19s
Direct lookup   0.22 (0.54)   0.73   0.34 (0.62)   10s
MetaMap         0.30 (0.75)   0.86   0.44 (0.80)   2239s
Figures in parentheses denote results after corpus correction
Semantic vs direct lookup: significant increase in P and R (p < 0.01)
Semantic vs MetaMap: increase in P and R, but not significant (p > 0.05)
Error analysis – Anatomical terms
●
Many false positives (87.9%) were in fact correct terms – missing
from the manually annotated corpus
●
Adding these missing annotations increased precision from 0.36 to
0.89
●
Remaining FPs were partial matches, e.g. 'nonspecific bowel', 'a
haploidentical bone marrow', 'normal sinus', and non-specific
anatomical areas, e.g. 'multifocal areas', 'particular organ site',
'pruritic areas'.
●
Phrases not in the ontology as discrete terms picked up by
semantic method, e.g. 'angiolymphatic space', 'dentate line'
Results – Disease terms
Method          P      R      F      Time
Semantic        0.58   0.68   0.62   12s
Direct lookup   0.69   0.27   0.37   9s
MetaMap         0.46   0.83   0.59   1748s
Semantic vs direct lookup: significant increase in R (p << 0.01), significant
decrease in P (p < 0.01), overall significant increase in F (p < 0.01)
Semantic vs MetaMap: significant increase in P (p << 0.01), but significant
decrease in R (p < 0.01), overall increase in F but not significant (p > 0.05)
Error analysis – Disease terms
●
Factors affecting recall:
– Abbreviations (e.g. COPD)
– Definite descriptors ('the disease', 'her infirmity')
– Symptoms annotated as disease ('mood changes', 'double
vision')
●
Factors affecting precision
– Terms manually annotated as Symptoms being marked as
Disease e.g. 'difficulty walking'
– Some inconsistent manual annotation of negated terms, family
history etc
Conclusion
●
Semantic decomposition and regex/pattern-based recombination
of ontology terms is slightly slower than directly looking up terms
and synonyms extracted from the ontology, but yields a
significantly higher F-measure, balancing precision and recall
●
Against MetaMap, the improvements are measurable but not
statistically significant for anatomical terms; for disease terms,
precision is significantly improved but recall is significantly lower.
Processing is also roughly two orders of magnitude faster (19s vs
2239s for anatomical terms, 12s vs 1748s for disease terms).
●
Our findings are comparable to Shah et al (2009) for mGrep vs
MetaMap, but we now have a systematic method for creating new
concept recognisers from scratch
Further work
●
Calculate positional entropy of each morpheme and use these to
help generate patterns (e.g. some morphemes are more likely to
occur at the start or end of a pattern)
●
Improve lookup performance by using a radix trie (better for
morpheme sets that share long prefixes and suffixes) rather than
the standard java.util.regex package
●
Apply method to other biomedical ontologies
●
Evaluate against other corpora, e.g. annotated MedLine abstracts

Semantic decomposition of ontologies for creation of flexible biomedical concept recognisers

  • 1. Semantic decomposition of ontological resources for the creation of flexible, high- performance biomedical concept recognisers 26 June 2012 Phil Gooch Centre for Health Informatics
  • 2. Overview ● Why identify biomedical concepts in free text? ● How ontologies can help ● Problems with using ontologies for concept identification ● Potential solutions ● Application of method to two ontologies: Foundation Model of Anatomy and Disease Ontology ● Evaluation against a small corpus of 163 clinical discharge summaries, surgical, pathology and radiology reports
  • 3. Why identify biomedical concepts in free text? ● Indexing MedLine abstracts for semantic search – Identifying 'hypertension' as being of semantic type 'disease', moreover being a cardiovascular disease ● Literature based knowledge discovery – Disease D associated with increase in physiological function F – Substance S inhibits F – => S might be a treatment for D ● Decision support – What treatment recommendations do clinical guideline documents provide for hypertension in pregnancy? – What were the findings of the pathology report? – 50% of clinically important information resides in the free text of the patient record, rather than in structured fields (Sittig 2007)
  • 4. Ontologies ● Define the concepts of a given domain, their properties and their relationships – Provide canonical names for terms – Classification hierarchy, whole-part relations and synonyms ● Can function as dictionary, a lookup list of terms for concept identification via string matching ● Or defined properties can be used to infer concepts – A Company issues Shares – 'shares in Abc fell' => 'Abc' is a Company
  • 5. Problems with biomedical ontologies for concept identification ● Often very large – Foundational Model of Anatomy > 200MB, 150K+ terms – Even when expressed in a compact data structure (e.g. Trie), potentially large RAM overhead when used to match strings ● May not be complete: how to identify potentially new terms, classes ● May not contain all synonyms or other ways of expressing terms, e.g. abbreviations – Separate lists of word variations often compiled (e.g. NLM SPECIALIST lexical variant generation tools)
  • 6. Some solutions ● Hearst patterns (Hearst 1992) – Identify hypernomic (class-member) relations – 'Bruises, cuts, and other injuries' – 'Diseases such as atherosclerosis' – High precision, but low recall ● Boostrapping – 'scaphoid, lunate, triquetral and pisiform' – If we know that the scaphoid and lunate are bones of the wrist, we can infer that the others in this list are also – Improves recall, but reduces precision (Maynard 2009)
  • 7. Some solutions ● Domain-specific linguistic features – Neoclassical combining forms – Biomedical and clinical terms often composed of or contain well- defined Latin and Greek roots, suffixes and prefixes – -osis, -itis, -opathy => disease – cardi-, ileo- => anatomy – High precision, but low recall (Gooch & Roudsari 2011)
  • 8. Some solutions ● NLM MetaMap (Aronson 2010): uses neoclassical combining forms + lexical variant generation + ontologies – Comprehensive, but heavyweight (4GB+ RAM, 10GB+ install) ● mGrep (Meng 2009) radix trie-based lookup over ontologies – Fast, higher precision but lower recall than MetaMap (Shah 2009) – Still requires the complete source ontologies – Requires substantial preprocessing of input text via the NCBO web service (NCBO Support 2011)
  • 9.
  • 10. Semantic decomposition of ontologies ● Provide a systematic method of reducing the size of large ontologies to make their use for concept identification feasible ● Reproducible method so that concept recognisers for new ontologies can be quickly developed ● Has spin-off benefits for ontology quality assurance – E.g. identification of spelling errors and lexical inconsistencies in biomedical ontologies (Gooch 2011)
  • 11. Semantic decomposition of ontologies ● Little published work in this area ● Tong et al (2008) decomposed the Gene Ontology into individual tokens (words) and calculated the positional entropy of each token via the probability of token t appearing at position p in a given ontology term ● Could be applied to identifying potential ontology terms in free text, but wasn't evaluated
  • 12. Semantic decomposition of ontologies ● Initial focus on Foundational Model of Anatomy (FMA) (Rosse 2003) as anatomical terms are central to the identification of – location of disease, morbidity – location of symptoms – location of procedures – surgery, pathology and radiology reports – administration route of medication ● Apply the method to the Disease Ontology (Osborne et al 2009) to see how well it generalises
  • 13. Semantic decomposition of ontologies ● Extend Tong et al's idea but classify each token according to its part of speech (noun, adjective etc) and its semantic type ● Reduce the set of tokens further by identifying words (free morphemes) sharing common roots and suffixes (bound morphemes) ● Morpheme – smallest linguistic unit that has meaning (cephalon, -derm, -ium, -rrhea)
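One way the root-and-suffix reduction might look in practice, as a rough sketch: words that share a root after stripping a known suffix are collapsed into a single alternation. The suffix list, the minimum root length and the function name `compact_patterns` are assumptions made for illustration; the talk does not specify how morphemes were segmented.

```python
from collections import defaultdict

# Hypothetical suffix inventory; longer suffixes are tried first.
SUFFIXES = ("um", "us", "a", "i")

def compact_patterns(words, suffixes=SUFFIXES, min_root=3):
    """Collapse words sharing a root into root(suffix1|suffix2) patterns."""
    roots = defaultdict(list)
    ordered = sorted(suffixes, key=len, reverse=True)
    for word in sorted(set(words)):
        for suf in ordered:
            if word.endswith(suf) and len(word) - len(suf) >= min_root:
                roots[word[: -len(suf)]].append(suf)
                break
        else:
            roots[word].append("")  # no known suffix: keep the word whole
    patterns = []
    for root, sufs in roots.items():
        if len(sufs) == 1:
            patterns.append(root + sufs[0])
        else:
            patterns.append(root + "(" + "|".join(sufs) + ")")
    return sorted(patterns)
```

For example, `compact_patterns(["manubrium", "manubria", "malleus", "mandible"])` yields a single `manubri(a|um)` entry alongside the two unmerged words.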
  • 14. Regular expressions ● Used to match sequences of characters against some input ● Written in a formal language that describes the patterns in the input that we wish to match ● For this task, we precompile sets of regular expressions (regex) generated from the set of morphemes extracted from the ontology ● We write recombination rules over the regexes which include stop- words (determiners, prepositions) to identify candidate noun phrases and prepositional phrases that look like ontology terms
  • 15.
  • 16. Regular expression and pattern generation ● Create regexes from the union of entries (with morphological variants) in each set – nounPattern = … macula | malleus | mandible | manubri(um|a) | manus ... ● Top and tail with word boundaries, with optional plurality – noun = \b( + nounPattern + )s?\b – adjective = \b( + adjPattern + )\b ● Combine regex output with patterns – NP = adjective{0,5} (noun | properNoun){1,5} – PP = NP (of|on) NP – Term = NP | PP ● Test by running the patterns against the complete ontology – all terms should be matched
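The pattern construction above can be sketched in Python roughly as follows. This is an approximation, not the original implementation (which, per the further-work slide, used Java regexes): the word lists are tiny stand-ins, 'lobe', 'lung' and the adjectives are added for illustration, properNoun is omitted, and all variable names are hypothetical.

```python
import re

# Stand-in morpheme sets; the real sets are extracted from the ontology.
NOUNS = ["macula", "malleus", "mandible", "manubri(?:um|a)", "manus", "lobe", "lung"]
ADJECTIVES = ["left", "right", "upper", "lower"]

# Top and tail with word boundaries; nouns allow an optional plural 's'.
noun = r"\b(?:" + "|".join(NOUNS) + r")s?\b"
adjective = r"\b(?:" + "|".join(ADJECTIVES) + r")\b"

# NP = adjective{0,5} noun{1,5}, with whitespace between tokens.
np_ = rf"(?:{adjective}\s+){{0,5}}{noun}(?:\s+{noun}){{0,4}}"
# PP = NP (of|on) NP
pp = rf"{np_}\s+(?:of|on)\s+{np_}"
# Term = NP | PP; PP is tried first so the longer phrase wins.
term = re.compile(rf"{pp}|{np_}", re.IGNORECASE)

text = "Mass in the upper lobe of left lung near the mandible."
found = term.findall(text)

# Sanity check in the spirit of the slide: every source term should match in full.
toy_ontology = ["left lung", "upper lobe of left lung", "manubrium"]
all_matched = all(term.fullmatch(t) for t in toy_ontology)
```

Putting PP before NP in the alternation matters because Python's `re` takes the first alternative that matches, not the longest.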
  • 17.
  • 18. Evaluation ● Corpus of discharge summaries, progress notes, and surgical, radiology and pathology reports (Savova et al 2011) ● Manually annotated for mentions of anatomical and disease concepts ● Compare manually identified terms against system-generated terms: semantic decomposition/recombination pattern approach vs direct ontology lookup vs MetaMap ● Calculate precision P = tp / (tp + fp), recall R = tp / (tp + fn), and F-measure F = 2 * P * R / (P + R), and apply the Mann-Whitney U test between approaches
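For reference, the three measures can be computed from raw counts as in this small sketch. The function name `prf` and the zero-denominator handling are illustrative choices, and the counts in the usage note are made up rather than taken from the study.

```python
def prf(tp, fp, fn):
    """Return (precision, recall, F-measure) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

With hypothetical counts of 8 true positives, 2 false positives and 2 false negatives, `prf(8, 2, 2)` gives precision, recall and F-measure of approximately 0.8 each.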
  • 19. Results – Anatomical terms
      Method          P             R      F             Time
      Semantic        0.36 (0.89)   0.91   0.51 (0.90)   19s
      Direct lookup   0.22 (0.54)   0.73   0.34 (0.62)   10s
      MetaMap         0.30 (0.75)   0.86   0.44 (0.80)   2239s
    Figures in parentheses denote results after corpus correction
    Semantic vs direct lookup: significant increase in P and R (p < 0.01)
    Semantic vs MetaMap: increase in P and R, but not significant (p > 0.05)
  • 20. Error analysis – Anatomical terms ● Many false positives (87.9%) were in fact correct terms – missing from the manually annotated corpus ● Adding these missing annotations increased precision from 0.36 to 0.89 ● Remaining FPs were partial matches, e.g. 'nonspecific bowel', 'a haploidentical bone marrow', 'normal sinus', and non-specific anatomical areas, e.g. 'multifocal areas', 'particular organ site', 'pruritic areas'. ● Phrases not in the ontology as discrete terms picked up by semantic method, e.g. 'angiolymphatic space', 'dentate line'
  • 21. Results – Disease terms
      Method          P      R      F      Time
      Semantic        0.58   0.68   0.62   12s
      Direct lookup   0.69   0.27   0.37   9s
      MetaMap         0.46   0.83   0.59   1748s
    Semantic vs direct lookup: significant increase in R (p << 0.01), significant decrease in P (p < 0.01), overall significant increase in F (p < 0.01)
    Semantic vs MetaMap: significant increase in P (p << 0.01), but significant decrease in R (p < 0.01), overall increase in F but not significant (p > 0.05)
  • 22. Error analysis – Disease terms ● Factors affecting recall: – Abbreviations (e.g. COPD) – Definite descriptors ('the disease', 'her infirmity') – Symptoms annotated as disease ('mood changes', 'double vision') ● Factors affecting precision: – Terms manually annotated as Symptoms being marked as Disease, e.g. 'difficulty walking' – Some inconsistent manual annotation of negated terms, family history etc
  • 23. Conclusion ● Semantic decomposition and regex/pattern-based recombination of ontology terms is slightly slower than directly looking up terms and synonyms extracted from the ontology, but gives a significantly higher F-measure, balancing precision and recall ● Against MetaMap, the improvements for anatomical terms are measurable but not statistically significant, while precision for disease terms is significantly improved. Processing is also roughly two orders of magnitude faster (19s vs 2239s for anatomical terms, 12s vs 1748s for disease terms). ● Our findings are comparable to Shah et al (2009) for mGrep vs MetaMap, but we now have a systematic method for creating new concept recognisers from scratch
  • 24. Further work ● Calculate the positional entropy of each morpheme and use it to help generate patterns (e.g. some morphemes are more likely to occur at the start or end of a pattern) ● Improve lookup performance by using a radix trie (better for morpheme sets that share long prefixes and suffixes) rather than the standard java.util.regex ● Apply the method to other biomedical ontologies ● Evaluate against other corpora, e.g. annotated MedLine abstracts
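To give a feel for trie-based morpheme lookup, here is a rough sketch using a plain character trie. Note that this is not a radix trie: a radix trie additionally merges single-child chains into one edge, which saves memory but follows the same lookup idea. The class and method names are hypothetical.

```python
class MorphemeTrie:
    """Plain character trie supporting longest-prefix lookup of stored morphemes."""

    _END = ""  # marker key; never collides with a real character

    def __init__(self):
        self.root = {}

    def add(self, morpheme):
        node = self.root
        for ch in morpheme:
            node = node.setdefault(ch, {})
        node[self._END] = True  # a stored morpheme ends at this node

    def longest_prefix(self, text, start=0):
        """Return the longest stored morpheme that text[start:] begins with, or None."""
        node, best = self.root, None
        for i in range(start, len(text)):
            node = node.get(text[i])
            if node is None:
                break
            if self._END in node:
                best = text[start : i + 1]
        return best

trie = MorphemeTrie()
for m in ("cardi", "cardio", "ileo"):
    trie.add(m)
```

Morphemes sharing a prefix, such as 'cardi' and 'cardio', share a path in the trie, so lookup cost depends on the length of the input word rather than on the number of stored morphemes.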