Presentation at IDEAL 2008

Building a Spanish MMTx by
using Automatic Translation and
Biomedical Ontologies
Francisco Carrero 1,2 ; José Carlos Cortizo 1,2 ; José Mª Gómez 3
1 Wipley, Social Gaming Platform
http://www.wipley.com
2 Universidad Europea de Madrid
http://www.esp.uem.es/gsi
3 Optenet
http://www.esp.uem.es/gsi

Outline

The MIRCAT project
The challenge
English MetaMap, a big effort
Approaching a Spanish MetaMap
Experiments
Discussion of the Results and Future Work
Francisco Carrero Garcia

The MIRCAT Project
The Interface


The MIRCAT Project
System’s Architecture


The Challenge
Our Goal

English docs

Medical record

Spanish docs


The Challenge
The problem

We can extract UMLS concepts from English texts using
MetaMap...
...but there is no Spanish version of MetaMap
Is it difficult to construct a tool like MetaMap?


English MetaMap
A big Effort

∼3 years!!


Approaching Spanish MetaMap
Two Main Approaches Considered


Approaching Spanish MetaMap
Our Approach: Translation and Reuse

Optional


Experimental Design
Text Collections

MedLine Plus medical News
http://www.nlm.nih.gov/medlineplus/newsbydate.html
Excellent online resource
2000 news items, some in English, some in Spanish
600 of them available in both languages


Experiments
Experimental Design

MetaMap extracts concepts, allowing multiple representations
A => uses compound concepts
B => uses simple concepts
1 => resolves ambiguity by adding all the candidate concepts
2 => ignores ambiguity by choosing the first candidate
4 representations: A1, A2, B1, B2
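
The four representations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input structure (a list of phrases, each with a list of candidate mappings holding one "compound" concept and a list of "simple" concepts) is a hypothetical simplification, not MetaMap's actual output format, and the function names and concept identifiers are made up for the example.

```python
from typing import Dict, List

# Hypothetical, simplified stand-in for MetaMap output (an assumption, not the
# real format): each phrase is a list of candidate mappings, and more than one
# candidate means the phrase is ambiguous. Each candidate holds one "compound"
# concept identifier and a list of "simple" concept identifiers.
Candidate = Dict[str, object]
Phrase = List[Candidate]


def build_representation(phrases: List[Phrase], compound: bool, keep_all: bool) -> List[str]:
    """Return the concept identifiers of one document for one representation."""
    concepts: List[str] = []
    for candidates in phrases:
        # Option 1 keeps every candidate; option 2 keeps only the first one.
        chosen = candidates if keep_all else candidates[:1]
        for cand in chosen:
            if compound:
                concepts.append(cand["compound"])   # option A
            else:
                concepts.extend(cand["simple"])     # option B
    return concepts


def all_representations(phrases: List[Phrase]) -> Dict[str, List[str]]:
    """Build A1, A2, B1 and B2 for the same document."""
    return {
        "A1": build_representation(phrases, compound=True, keep_all=True),
        "A2": build_representation(phrases, compound=True, keep_all=False),
        "B1": build_representation(phrases, compound=False, keep_all=True),
        "B2": build_representation(phrases, compound=False, keep_all=False),
    }
```

With an ambiguous first phrase and an unambiguous second one, A1 and B1 contain the concepts of every candidate, while A2 and B2 contain only those of the first candidate of each phrase.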

Experiments
Filtering

Data representations with a very large number of features do not
usually perform well in text tasks
Many classifiers lose prediction accuracy when faced with
many irrelevant, redundant or correlated features (“curse
of dimensionality”)
We apply Zipf’s Law to filter the attributes
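
The slides do not say how the Zipf-based filter is parameterised. A common reading of this kind of filter is to drop the attributes at both ends of the frequency ranking: the few very frequent ones and the long tail of very rare ones. The sketch below follows that reading; the function name and the thresholds `min_count` and `drop_top` are assumptions chosen for illustration, not values from the presentation.

```python
from collections import Counter
from typing import List, Set, Tuple


def zipf_filter(docs: List[List[str]], min_count: int = 2,
                drop_top: int = 1) -> Tuple[List[List[str]], Set[str]]:
    """Drop the rarest and the most frequent concepts from every document.

    min_count and drop_top are illustrative thresholds (assumptions).
    """
    # Frequency of each concept over the whole collection.
    counts = Counter(concept for doc in docs for concept in doc)
    # Concepts ranked from most to least frequent.
    ranked = [concept for concept, _ in counts.most_common()]
    too_common = set(ranked[:drop_top])
    kept = {c for c, n in counts.items() if n >= min_count and c not in too_common}
    filtered = [[c for c in doc if c in kept] for doc in docs]
    return filtered, kept
```

Both thresholds would normally be tuned on the collection, since cutting too aggressively removes concepts the classifier needs.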


Experiments Results
Number of concepts for each representation


Experiments Results
Average Similarities


Experiments Results
Last Experiments (not in IDEAL paper)


Discussion of the Results
Translation

The worst similarity results are obtained with the most
complex representation, the one closest to human reading: A1
B1 is less complex and produces the best results
=> Our model seems better suited to a plain bag-of-concepts
representation
This is similar to the bag-of-words representation, widely used in
text processing tasks
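
A bag-of-concepts representation, and the comparison of two such bags, can be sketched as follows. The slides do not name the similarity measure used in the experiments, so cosine similarity is used here purely as an assumption; the function names are also illustrative.

```python
import math
from collections import Counter
from typing import Iterable


def bag_of_concepts(concepts: Iterable[str]) -> Counter:
    """Count concept occurrences, ignoring their order in the text."""
    return Counter(concepts)


def cosine_similarity(bag_a: Counter, bag_b: Counter) -> float:
    """Cosine similarity between two bags (an assumed measure, not the paper's)."""
    shared = bag_a.keys() & bag_b.keys()
    dot = sum(bag_a[c] * bag_b[c] for c in shared)
    norm_a = math.sqrt(sum(v * v for v in bag_a.values()))
    norm_b = math.sqrt(sum(v * v for v in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is treated as sharing nothing
    return dot / (norm_a * norm_b)
```

In this setting, one bag would come from the original English text and the other from the translated Spanish text, so a value near 1 means both versions yield nearly the same concepts.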

Discussion of the Results
Classification

All results are comparable to classification on the original English
texts
In some cases they are even better
Best results using A2+Zipf, +7.8% in AUC
UNMKD representations never achieve worse classification than
English


Conclusions and Future Work

The “easy way” to construct a Spanish MetaMap is promising
Google Translation seems a good tool for adapting English resources
to other languages (such as Spanish)
We should try other translation tools
We are working on applying this approach to other text tasks
(such as Information Retrieval and Filtering)


Ending...

Thank you very much for your attention


Any questions?

