conference: cicling 2013 - samos
Knowledge-poor and Knowledge-rich Approaches for Multilingual Terminology Extraction
Béatrice Daille (1), Helena Blancafort (2)
(1) University of Nantes – Lina, (2) Syllabs
Abstract
We present two terminology extraction tools in order to compare a knowledge-poor and a knowledge-rich
approach. Both tools process single-word terms (SWT) and multi-word terms (MWT) and are designed to handle
multilingualism. We run an evaluation on 6 languages and 2 domains using crawled comparable corpora and
hand-crafted reference term lists (RTL). We discuss the 3 main results achieved for terminology extraction.
The first two evaluation scenarios concern the knowledge-rich framework. Scenario 1 (S1) compares performance
for each language depending on the ranking applied: specificity score vs. number of occurrences. Scenario 2
(S2) examines whether term variant identification improves the precision of the ranking for each language.
Scenario 3 (S3) compares both tools and shows that a probabilistic term extraction approach, developed with
minimal effort, achieves satisfactory results when compared to a rule-based method.
Knowledge-Rich Framework (KR)
1. Linguistic processing: tokenization, POS tagging, lemmatization (TreeTagger)
2. Rule-based candidate term (CT) extraction based on POS tags
3. Hand-crafted rules for the grouping of variants
4. Multilingual framework: 6 languages
Knowledge-Poor Approach (KP)
1. Training of a pseudo POS tagger (Clark, 2003) with raw corpora (2.5 GB to 250 MB)
2. CRFs (Lafferty et al., 2001) to train a term candidate extractor (Sha et al., Guégan and Loupy, 2011) on small manually annotated corpora with CT (300 to 600 sentences)
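Rule-based candidate term extraction from POS tags (step 2 of the knowledge-rich framework) can be illustrated with a minimal sketch. It assumes the text has already been tagged and lemmatized, uses a simplified tagset (ADJ, NOUN) rather than TreeTagger's own tags, and applies a single illustrative pattern; the names `TAG_CODES`, `PATTERN` and `extract_candidates` are hypothetical and the actual rules of the KR tool are not given in the poster.

```python
import re

# Hypothetical one-letter codes for a simplified tagset (not TreeTagger's own tags).
TAG_CODES = {"ADJ": "A", "NOUN": "N"}

# Illustrative pattern: any run of adjectives/nouns that ends in a head noun.
PATTERN = re.compile(r"[AN]*N")

def extract_candidates(tagged):
    """Return candidate terms from a list of (lemma, tag) pairs."""
    # Encode the tag sequence as a string so a regex can scan it.
    codes = "".join(TAG_CODES.get(tag, "x") for _, tag in tagged)
    candidates = []
    for match in PATTERN.finditer(codes):
        lemmas = [lemma for lemma, _ in tagged[match.start():match.end()]]
        candidates.append(" ".join(lemmas))
    return candidates

# Toy tagged sentence, invented for illustration.
sentence = [("the", "DET"), ("small", "ADJ"), ("wind", "NOUN"),
            ("turbine", "NOUN"), ("produce", "VERB"), ("energy", "NOUN")]
print(extract_candidates(sentence))
```

The same function yields both SWT ("energy") and MWT ("small wind turbine"); a real rule set would need language-specific patterns for each of the 6 languages.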
Shared Features
Extraction of SWT and MWT, ranking based on specificity score
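The poster does not define the specificity score. As a rough sketch, one common formulation is assumed here: the ratio between a term's relative frequency in the domain corpus and its relative frequency in a general-language corpus. The add-one smoothing, the function name `specificity` and the toy counts are all assumptions made for illustration.

```python
from collections import Counter

def specificity(term, domain_counts, general_counts):
    """Ratio of relative frequencies: domain corpus vs. general corpus."""
    domain_rel = domain_counts[term] / sum(domain_counts.values())
    # Add-one smoothing (an assumption) so terms unseen in the general
    # corpus do not cause a division by zero.
    general_rel = (general_counts[term] + 1) / (sum(general_counts.values()) + 1)
    return domain_rel / general_rel

# Toy counts, invented for illustration only.
domain = Counter({"year": 50, "wind turbine": 40, "rotor blade": 10})
general = Counter({"year": 900, "house": 99, "wind turbine": 1})

by_specificity = sorted(domain, key=lambda t: specificity(t, domain, general), reverse=True)
by_occurrences = sorted(domain, key=lambda t: domain[t], reverse=True)
print(by_specificity)   # domain terms first
print(by_occurrences)   # the frequent general word comes first
```

On these toy counts, ranking by occurrences puts the general word "year" on top, while ranking by specificity pushes it to the bottom, which is the contrast that scenario S1 evaluates.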
Resources
• Hand-crafted RTL of 103 to 159 terms with variants
• Monolingual crawled comparable corpora of 220K to 474K tokens
Term Variation
wind energy
wind turbine energy, onshore wind energy
energy from wind, small-scale wind energy
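The hand-crafted variant rules of the KR tool are not listed in the poster. The sketch below is a crude stand-in that assumes "wind energy" is the base term of the example and treats a candidate as a variant when it contains all content words of that base term. The stopword list and the names `content_words` and `group_variants` are hypothetical.

```python
STOPWORDS = {"from", "of"}  # hypothetical function-word list

def content_words(term):
    """Lower-cased content words of a term, ignoring function words."""
    return frozenset(w for w in term.lower().split() if w not in STOPWORDS)

def group_variants(base_terms, candidates):
    """Attach each candidate to every base term whose content words it contains."""
    groups = {base: [] for base in base_terms}
    for cand in candidates:
        for base in base_terms:
            if cand != base and content_words(base) <= content_words(cand):
                groups[base].append(cand)
    return groups

candidates = ["wind turbine energy", "onshore wind energy", "energy from wind",
              "small-scale wind energy", "solar energy"]
groups = group_variants(["wind energy"], candidates)
print(groups)
```

A subset test like this over-generates on real data; it only shows how grouping lets the variants of one term be counted together instead of competing in the ranking.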
Evaluation (Wind Energy Domain)
[Figure: S1: F-measure based on specificity]
[Figure: S1: F-measure based on occurrences]
[Figure: S2: F-measure based on specificity ranking of CT with and without variants]
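The poster does not state how the F-measure is computed from a ranked list. A minimal sketch, assuming exact string matching against the RTL and a cutoff at the top N candidates, could look as follows; the function name `f_measure` and the toy data are illustrative only.

```python
def f_measure(ranked_candidates, reference_terms, top_n):
    """Balanced F-measure of the top-N candidates against a reference term list."""
    top = set(ranked_candidates[:top_n])
    reference = set(reference_terms)
    hits = len(top & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(top)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)

# Toy ranking and reference list, invented for illustration only.
ranked = ["wind turbine", "year", "rotor blade", "wind energy"]
rtl = {"wind turbine", "rotor blade", "wind energy", "wind farm"}

for n in (2, 4):
    print(n, round(f_measure(ranked, rtl, n), 3))
```

Computing the score at several cutoffs gives one curve per ranking, which is how two rankings (for example specificity vs. occurrences) can be compared on the same candidate list.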
Conclusions
S1: Specificity ranking outperforms ranking by frequency of occurrence
S2: The handling of term variants improves the ranking of the top-ranked
candidate terms
S3: The knowledge-poor approach provides satisfactory results with
minimal effort. Results are language and domain dependent.
- EN gives better results than DE: limits of the multilingual framework
- ES gives better results in the mobile domain than in wind energy
- The KR tool and the RTL handle MWT of 2-3 words, whereas KP extracts longer
terms such as "small scale domestic wind turbine system"
[Figure: S3: F-measure based on specificity, knowledge-rich vs. knowledge-poor]
Future Work
- Evaluate a method using a POS tagger but no hand-crafted rules
The research leading to these results has received funding from the European Community's FP7/2007-2013 under
grant agreement nº 248005 for the TTC project (Terminology Extraction, Translation Tools and Comparable Corpora)