Material presented at the TKE (Terminology and Knowledge Engineering) Conference 2010, Dublin, Ireland.
Download paper at http://hal.archives-ouvertes.fr/hal-00544403
Institutions: Laboratoire d'Informatique de Nantes Atlantique (LINA), Lingua et Machina.
Dealing with Lexicon Acquired from Comparable Corpora: post-edition and exchange
1. Dealing with Lexicon Acquired from
Comparable Corpora
Post-edition and Exchange
Estelle Delpech, Lingua et Machina
Béatrice Daille, U. de Nantes - LINA
2. Working with lexicon acquired from
comparable corpora
I. Terminology acquisition from
comparable corpora : quick overview
II. A tool for terminology post-edition
III. Data exchange : a TBX variant for
automatically acquired lexicons
IV. Future work
4. Terminology acquisition from
comparable corpora
Comparable corpora:
“Two corpora, in two languages l1 and l2 respectively, are said to be
‘comparable’ if there exists a substantial part of the
vocabulary of the corpus in language l1 whose translation can
be found in the corpus in language l2.”
(my translation of [Déjean and Gaussier, 2002])
Advantages :
Availability
Real usage
5. Terminology acquisition from
comparable corpora
Terminology extraction : a contextual analysis
Compare contexts of source and target terms
If contexts are similar, there's a good chance
source and target terms are translations of each
other, ex :
mastectomy : reconstruction, prophylactic, treat,
undergo, removal
mastectomie : reconstruction, prophylactique,
traiter, subir, ablation
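The context comparison above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how contexts are compared, so the seed dictionary, the cosine measure and the function names below are hypothetical choices, and the context of "opération" is invented for contrast.

```python
from collections import Counter
from math import sqrt

# Context words of the source term, taken from the mastectomy example.
source_context = Counter(["reconstruction", "prophylactic", "treat", "undergo", "removal"])

# Context words of two target candidates. The second one is invented for contrast.
target_contexts = {
    "mastectomie": Counter(["reconstruction", "prophylactique", "traiter", "subir", "ablation"]),
    "opération": Counter(["subir", "chirurgien", "hôpital"]),
}

# Hypothetical seed dictionary used to carry source context words into the target language.
seed_dictionary = {
    "reconstruction": "reconstruction",
    "prophylactic": "prophylactique",
    "treat": "traiter",
    "undergo": "subir",
    "removal": "ablation",
}


def translate_context(context, dictionary):
    """Map each source context word to its target-language equivalent, if known."""
    translated = Counter()
    for word, count in context.items():
        if word in dictionary:
            translated[dictionary[word]] += count
    return translated


def cosine(a, b):
    """Cosine similarity between two bag-of-words context vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(context, candidates, dictionary):
    """Return (score, target term) pairs, most similar context first."""
    translated = translate_context(context, dictionary)
    scored = [(cosine(translated, ctx), term) for term, ctx in candidates.items()]
    return sorted(scored, reverse=True)


ranking = rank_candidates(source_context, target_contexts, seed_dictionary)
for score, term in ranking:
    print(f"{score:.2f} {term}")
```

With these toy contexts, "mastectomie" ranks first because its context matches the translated source context word for word.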
6. Terminology acquisition from
comparable corpora
Outputs one-to-many alignments
– Evaluation : precision on the TopNBest alignments
Results for “mastectomy” (score, candidate) :
0.92 ablation
0.89 mastectomie
0.48 opération
Not as good as acquisition from parallel corpora !
Fung (1997) : 30 % accuracy on the Top20
candidates
Morin et al. (2004) : translation is usually the 34th for
complex terms
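Precision on the top-N candidates can be computed as sketched below. This is a minimal illustration, not the evaluation code used for the figures cited above: the function name is hypothetical, the reference translations are assumed for the example, and the "lumpectomy" row is invented.

```python
def precision_at_n(ranked_candidates, reference, n):
    """Share of source terms with a reference translation among their top-n candidates."""
    if not ranked_candidates:
        return 0.0
    hits = 0
    for source, candidates in ranked_candidates.items():
        expected = reference.get(source, set())
        if any(candidate in expected for candidate in candidates[:n]):
            hits += 1
    return hits / len(ranked_candidates)


# Candidates sorted by decreasing score, as in the one-to-many output above.
ranked = {
    "mastectomy": ["ablation", "mastectomie", "opération"],
    "lumpectomy": ["opération", "ablation", "tumorectomie"],  # invented row
}

# Assumed reference translations, for illustration only.
reference = {
    "mastectomy": {"mastectomie"},
    "lumpectomy": {"tumorectomie"},
}

for n in (1, 2, 3):
    print(n, precision_at_n(ranked, reference, n))
```

A source term counts as a hit as soon as one reference translation appears among its first n candidates, which is why the score can only grow with n.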
8. A tool for post-edition
Existing Tools :
ArayaTermExtractor (Waldhör 2006)
iView (Merkel and Foo, 2007)
Xerox Terminology Suite ®
Our needs :
Deal with one-to-many alignments
Non-aligned contexts
Allow non binary annotation
Display useful information to help finding the right
candidate in the corpus
9. “Useful” information
→ Knowledge that helps capture the in vivo
behavior of terms
→ Text-driven, term-oriented approach
Useful information :
Variants
Collocations
Distributional neighbors
Contexts
→ To be harvested during the term extraction /
alignment process
10. Useful information : example
Mastectomy
risk-reducing ~, simple ~
Tumorectomy, Lumpectomy, Oophorectomy
...patient may choose to have risk-reducing bilateral mastectomy if they have a strong family history of breast cancer...
Mastectomie
~ préventive, ~ simple
Tumorectomie, Ablation, Opération
...la mastectomie préventive pourrait supprimer la grande majorité du risque de développer un cancer...
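One possible way to hold this information next to one-to-many alignments and a non-binary annotation is sketched below. It is an assumption for illustration, not the data model of the prototype: every class name, field name and judgement value is hypothetical, the assignment of items to fields is a guess, and the judgements given to the candidates are invented. Only the scores and example terms come from the slides.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Judgement(Enum):
    """Hypothetical non-binary annotation scale."""
    UNREVIEWED = "unreviewed"
    REJECTED = "rejected"
    ACCEPTABLE = "acceptable"
    PREFERRED = "preferred"


@dataclass
class TermRecord:
    """A term with the information harvested during extraction."""
    lemma: str
    language: str
    variants: List[str] = field(default_factory=list)
    collocations: List[str] = field(default_factory=list)
    neighbors: List[str] = field(default_factory=list)
    contexts: List[str] = field(default_factory=list)


@dataclass
class CandidateAlignment:
    """One candidate translation with its automatically estimated score."""
    target: TermRecord
    score: float
    judgement: Judgement = Judgement.UNREVIEWED


@dataclass
class LexiconEntry:
    """A source term and its one-to-many candidate alignments."""
    source: TermRecord
    candidates: List[CandidateAlignment] = field(default_factory=list)

    def ranked(self):
        """Candidates sorted by decreasing score."""
        return sorted(self.candidates, key=lambda c: c.score, reverse=True)

    def validated(self):
        """Ranked candidates that the post-editor did not reject or skip."""
        kept = (Judgement.ACCEPTABLE, Judgement.PREFERRED)
        return [c for c in self.ranked() if c.judgement in kept]


source = TermRecord(
    lemma="mastectomy",
    language="en",
    collocations=["risk-reducing ~", "simple ~"],
    neighbors=["tumorectomy", "lumpectomy", "oophorectomy"],
)
entry = LexiconEntry(
    source=source,
    candidates=[
        # Judgements are invented; scores come from the results slide.
        CandidateAlignment(TermRecord("mastectomie", "fr"), 0.89, Judgement.PREFERRED),
        CandidateAlignment(TermRecord("ablation", "fr"), 0.92, Judgement.ACCEPTABLE),
        CandidateAlignment(TermRecord("opération", "fr"), 0.48, Judgement.REJECTED),
    ],
)
print([c.target.lemma for c in entry.validated()])
```

Keeping the score and the human judgement as separate fields lets the post-editor overrule the ranking without losing the automatically estimated value.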
13. Quick introduction to TBX (1)
TBX : Term Base eXchange
Open, XML-based standard for exchanging
structured terminological data
approved as an international standard by LISA
and ISO (norm 30042)
Maps to TMF data model
Subset of MARTIF
Designed for various use cases
Customizable
14. Quick introduction to TBX (2)
2 components :
Structure : core structure based on TMF
metamodel
Content : formalism to express data-categories
and their constraints
[Figure: the core DTD/Schema (form) is combined with an XCS file (content) to give a TBX variant: Default XCS → Default TBX, XCS1 → TBX variant 1, ..., XCSn → TBX variant n. Adapted from ISO norm 30042:2008, Fig. 4, p.30]
15. Quick introduction to TBX (3)
Form defined in DTD
Content defined in XCS
Data categories shown in the example : respPerson, responsibility, reliabilityCode, partOfSpeech, corpusTrace, termType, usageNote
Taken from ISO norm 30042:2008, Fig. 1, p.9
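The split between form and content can be illustrated with a short Python sketch that builds a TBX-style entry. It is a sketch under assumptions: the element names (termEntry, langSet, tig, term, termNote) follow my reading of the TBX core structure and should be checked against ISO 30042, the function name is hypothetical, and the entry is not validated against any DTD or XCS file.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute in ElementTree notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_term_entry(entry_id, terms):
    """Build a termEntry with one langSet per (language, term, part of speech)."""
    entry = ET.Element("termEntry", {"id": entry_id})
    for lang, term, part_of_speech in terms:
        lang_set = ET.SubElement(entry, "langSet", {XML_LANG: lang})
        tig = ET.SubElement(lang_set, "tig")
        ET.SubElement(tig, "term").text = term
        # The element name is form; the value of "type" is a data category (content).
        note = ET.SubElement(tig, "termNote", {"type": "partOfSpeech"})
        note.text = part_of_speech
    return entry


entry = build_term_entry(
    "c1",
    [("en", "mastectomy", "noun"), ("fr", "mastectomie", "noun")],
)
xml_text = ET.tostring(entry, encoding="unicode")
print(xml_text)
```

The element names stay the same whatever the variant, while the values allowed in the type attribute are what an XCS file would constrain.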
20. Feed-back on TBX
TBX is made for stable terminologies with little
uncertainty about the status of translations, not for
machine-generated lexicons of “candidate
translations” :
difficult to separate a term and its properties from its
alignments
no data category specific to automatically estimated
reliability
Difficult to make text-driven, term-oriented
knowledge fit in a concept-oriented format
no definition category that would apply to a single term
and not the whole concept
22. Future work
Integration of prototype in Libellex
TBX import / export
edition of linguistic properties
User testing (ergonomics)
Evaluation of added-value for translation
Explore new ways of :
aligning terms
selecting contexts
23. References
Post-edition prototype on line : http://80.82.238.151/Metricc/InterfaceValidation/ user “test”,
no password
Metricc project : http://www.metricc.com/
Lingua et Machina : http://www.lingua-et-machina.com/
Comparable corpora : Déjean, H., Gaussier, É. (2002) : “Une nouvelle approche à
l'extraction de lexiques bilingues à partir de corpus comparables”, In Lexicometrica,
Alignement Lexical dans les corpus multilingues, pp.1-22.
ArayaTermExtractor : http://www.heartsome.de
Xerox Terminology Suite : http://www.temis.com/
iView : Nyström, M., Merkel, M., Ahrenberg, L., Zweigenbaum, P., Petersson, H. and
Åhlfeldt, H. (2006) : “Creating a medical English-Swedish dictionary using interactive word
alignment”, In BMC Medical Informatics and Decision Making, 2006, 6:35
TMF : ISO 16642 - Terminological markup framework
TBX : ISO 30042 - Systems to manage terminology, knowledge and content -- TermBase
eXchange (TBX)
Data categories : ISO 12620 - Terminology and other language and content resources -
Specification of data categories and management of a Data Category Registry for language
resources