Identification of Translationese: A Machine Learning Approach
1. Introduction
Methodology
Evaluation
Conclusions
Iustina Ilisei1, Diana Inkpen2, Gloria Corpas3 and
Ruslan Mitkov1
1University of Wolverhampton, United Kingdom
2University of Ottawa, Canada
3University of Malaga, Spain
CICLing 2010, Iasi, Romania
Iustina Ilisei, Diana Inkpen, Gloria Corpas, Ruslan Mitkov Identification of Translationese: A Machine Learning Approach
Outline
1 Introduction
Introduction in Translation Studies
Universals of Translation
Related Studies
Corpus-based Approach
Machine-Learning Approach
2 Methodology
Objective
Resources
Data Representation
3 Evaluation
Classification
Results Analysis
4 Conclusions
Introduction
Translationese Effect
Translations exhibit their own unnatural language, their
own peculiar lexico-grammatical and syntactic
characteristics (Gellerstam, 1986).
Translational language cannot avoid the effect of
translationese (Baker, 1993; Laviosa, 1997; McEnery &
Xiao, 2002, 2007).
Intrigue
Because two languages cannot be perfectly mapped onto each
other, a translated text and its original cannot be perfectly
matched.
Language Universals in Translation
Mona Baker
“it will be necessary to develop tools that will enable us to
identify universal features of translation, that is features which
typically occur in translated text rather than original utterances
and which are not the result of interference from specific
linguistic systems” (Baker, 1993: 243).
Practical Perspective
a (self)assessment tool for translators
multilingual plagiarism detection
direction of translation detection can improve SMT
performance
Translation Universals
According to Baker (1993, 1996)
Simplification
Translations tend to be simpler, easier-to-follow
texts
Explicitation
Translations tend to spell things out rather than leave them
implicit
Convergence
Translations tend to be more similar to one another than
non-translations are
Normalisation
Translations conform to patterns typical of the target
language, even to the point of exaggerating them
Related studies
Corpus-Based approach
S. Laviosa (2008)
In translations: low proportion of lexical words to function words, high
proportion of high-frequency words compared to low-frequency words,
relatively greater repetition of the most frequent words, and less variety
in the most frequently used words
G. Corpas (2008)
Simplification confirmed for lexical richness, and contradicted in terms
of complex sentences, information load, sentence length, depth of
trees, senses per word.
G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008)
Translations exhibit lower lexical density and richness, seem to be more
readable, have a smaller proportion of simple sentences, and use fewer
discourse markers.
Related studies: Machine-Learning Approach
Supervised Learning Approach
Baroni & Bernardini (2006) “A new approach to the study of
translationese: Machine Learning the difference between
original and translated texts”
SVM classifier distinguishes professional translations from
original texts with accuracy above the chance level
Depends heavily on lexical cues, the distribution of
n-grams of function words, morpho-syntactic categories,
personal pronouns and adverbs in general
Human accuracy is much lower than the accuracy of the
system
Aim of the Study
Objective
Language-independent learning system able to distinguish
between translated and non-translated texts.
To investigate the validity of the simplification
hypothesis.
To explore characteristic features which most influence the
translational language.
Translational Corpora
Resources
Comparable corpora: translated texts vs. non-translated
texts
Spanish Monolingual Comparable Corpora
Medical Translations by professionals (MTP) vs.
Comparable Original Medical texts by professionals (MTPC)
Medical Translations by translation students (MTS) vs.
Comparable Original Medical texts by translation students
(MTSC)
Technical Translations by professionals (TT) vs. Comparable
Original Technical texts by professionals (TTC)
Datasets: training and testing
Training set
450 instances (156 translation class, 294 non-translation class)
Testing set
148 instances (52 translation class, 96 non-translation class)
Set pair one: MTP-MTPC (2 + 2 translation vs non-translation)
Set pair two: MTS-MTSC (36 + 66 translation vs non-translation)
Set pair three: TT-TTC (14 + 28 translation vs non-translation)
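The class counts above determine how well a trivial classifier would do. The following is a minimal sketch, assuming the baseline is a majority-class predictor (the slides do not state how the baseline was defined); the function name `majority_baseline` is illustrative only.

```python
# Class counts as stated on the slide: (translation, non-translation).
DATASETS = {
    "training set": (156, 294),
    "test set": (52, 96),
    "MTP-MTPC (test)": (2, 2),
    "MTS-MTSC (test)": (36, 66),
    "TT-TTC (test)": (14, 28),
}


def majority_baseline(n_translation: int, n_non_translation: int) -> float:
    """Accuracy (in %) of always predicting the larger class (assumed baseline)."""
    total = n_translation + n_non_translation
    return 100.0 * max(n_translation, n_non_translation) / total


for name, (n_tr, n_non) in DATASETS.items():
    print(f"{name}: {majority_baseline(n_tr, n_non):.2f}%")
```

Under this assumption the training and test sets give 65.33% and 64.86%, and the three test set pairs add up to the stated 52 translated and 96 non-translated instances.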
Data Representation
Data Representation without Simplification Features (DR - SF)
Proportion in each text of: grammatical words, nouns, finite
verbs, auxiliary verbs, adjectives, adverbs, numerals,
pronouns, prepositions, determiners, conjunctions;
plus the grammatical words/lexical words ratio
Data Representation with Simplification Features (DR + SF)
All of the above (DR - SF) + simplification features
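A rough sketch of how such a representation could be computed from part-of-speech tagged text. The tag names, the split into lexical and grammatical tags, and the feature names are all assumptions made for illustration; the slides do not specify the tagger or the tagset used.

```python
from collections import Counter

# Hypothetical coarse tag names; a real tagger's tagset would need mapping onto these.
LEXICAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}
GRAMMATICAL_TAGS = {"PRON", "PREP", "DET", "CONJ", "AUX"}


def pos_proportions(tagged_tokens):
    """Return the share of each tag plus the grammatical/lexical ratio.

    tagged_tokens: list of (token, tag) pairs for one text.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("empty text")
    features = {f"ratio_{tag}": n / total for tag, n in counts.items()}
    n_gram = sum(counts[t] for t in GRAMMATICAL_TAGS)
    n_lex = sum(counts[t] for t in LEXICAL_TAGS)
    features["grammatical_words"] = n_gram / total
    features["gramms_per_lexics"] = n_gram / n_lex if n_lex else 0.0
    return features


# Toy input, invented for the example.
tagged = [("el", "DET"), ("paciente", "NOUN"), ("recibe", "VERB"),
          ("un", "DET"), ("tratamiento", "NOUN")]
print(pos_proportions(tagged))
```

Each text becomes one fixed-length vector of ratios, so no feature depends on specific words of the language.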
Simplification Features
Proposed features to capture simplicity in texts
Sentence Length: average number of words per sentence
Sentence Depth: average of the maximum parse tree depth
per sentence in texts
Types of sentence: proportion of sentences without finite
verbs / simple sentences / complex sentences in texts
Ambiguity: average number of senses per word in texts
Word Length: average number of syllables per word in texts
Lexical Richness: ratio of lemma types to tokens in texts
Information Load: ratio of lexical words to tokens in texts
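Three of these features need only lemmatised, tagged text. The sketch below assumes each sentence arrives as a list of (lemma, is_lexical) pairs; that input format and the function name are hypothetical, and the remaining features (parse depth, senses per word, syllables) would need a parser, a sense inventory and a syllabifier that are not shown.

```python
def simplification_features(sentences):
    """Compute three simplification features for one text.

    sentences: list of sentences, each a list of (lemma, is_lexical) pairs.
    """
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        raise ValueError("empty text")
    lemma_types = {lemma for lemma, _ in tokens}
    n_lexical = sum(1 for _, is_lex in tokens if is_lex)
    return {
        # average number of words per sentence
        "sentence_length": len(tokens) / len(sentences),
        # lemma types divided by tokens
        "lexical_richness": len(lemma_types) / len(tokens),
        # lexical words divided by tokens
        "information_load": n_lexical / len(tokens),
    }


# Toy text of two sentences, invented for the example.
text = [
    [("el", False), ("paciente", True), ("dormir", True)],
    [("el", False), ("paciente", True), ("comer", True), ("pan", True)],
]
print(simplification_features(text))
```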
Classification Experiments
                   Including Simplification Features    Excluding Simplification Features
Classifier         10-fold cross-validation  Test set   10-fold cross-validation  Test set
Baseline           65.33%                    64.86%     65.33%                    64.86%
Naive Bayes        *76.67%                   79.05%     69.33%                    75.00%
BayesNet           78.67%                    79.73%     75.11%                    77.03%
Jrip               79.56%                    83.11%     73.33%                    77.03%
Decision Tree      78.22%                    81.76%     78.22%                    81.76%
Simple Logistic    *77.33%                   83.11%     71.11%                    80.41%
SVM                *79.11%                   *81.76%    69.33%                    73.65%
Meta-classifier    *80.00%                   87.16%     73.33%                    85.81%
Table: Classification results: accuracies for several classifiers
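To read the effect of the simplification features off the table, one can subtract the test-set accuracy without them from the accuracy with them. This is a small sketch using only the figures from the table; the dictionary and function names are illustrative.

```python
# Test-set accuracies (%) copied from the table: (with SF, without SF).
TEST_ACCURACY = {
    "Naive Bayes": (79.05, 75.00),
    "BayesNet": (79.73, 77.03),
    "Jrip": (83.11, 77.03),
    "Decision Tree": (81.76, 81.76),
    "Simple Logistic": (83.11, 80.41),
    "SVM": (81.76, 73.65),
    "Meta-classifier": (87.16, 85.81),
}


def gain(with_sf: float, without_sf: float) -> float:
    """Percentage-point difference, rounded to two decimals."""
    return round(with_sf - without_sf, 2)


gains = {name: gain(*acc) for name, acc in TEST_ACCURACY.items()}
for name, g in sorted(gains.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {g:+.2f} points")
```

On the test set the largest difference is for SVM and the decision tree is unchanged.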
Classification Experiments
                   Including Simplification Features    Excluding Simplification Features
Classifier         MTS-MTSC        TT-TTC               MTS-MTSC        TT-TTC
Baseline           64.71%          66.67%               64.71%          66.67%
Naive Bayes        71.57%          95.24%               71.57%          80.95%
BayesNet           73.53%          97.62%               71.57%          92.86%
Jrip               79.42%          95.24%               72.55%          92.86%
Decision Tree      77.45%          92.86%               75.49%          95.24%
Simple Logistic    77.45%          97.62%               79.41%          83.33%
SVM                75.49%          *97.62%              74.51%          69.05%
Meta-classifier    82.35%          97.62%               78.43%          92.86%
Table: Classification accuracy results on the medical and technical
test datasets.
Decision Tree
Features exploited in the categorisation task:
First level
Lexical Richness
Second level
Sentence Length (words/sentence)
Grammatical words/Lexical words proportion
Third level
Pronoun proportion in texts
Conjunction proportion in texts
JRip Classifier Rules
Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08)
=> class=translation
Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46)
and (sentenceLength >= 20.7) and (ratioNouns >= 0.33)
=> class=translation
Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13)
=> class=translation
Rule 4: => class=non-translation
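The rule list reads as an ordered decision procedure: the first rule whose conditions hold decides the class, and Rule 4 is the default. A minimal sketch follows, assuming the features arrive as a dictionary keyed by the names used in the rules; the example feature values are invented.

```python
def classify(f):
    """Apply the four rules in order; the first rule that fires wins."""
    # Rule 1
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"
    # Rule 2
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"
    # Rule 3
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"
    # Rule 4: default class
    return "non-translation"


# Hypothetical feature vector for one text.
example = {
    "lexicalRichness": 0.15, "ratioFiniteVerbs": 0.07, "simpleSentences": 0.2,
    "wordLength": 2.6, "sentenceLength": 18.0, "ratioNouns": 0.30,
    "ratioPreps": 0.15,
}
print(classify(example))
```

With these invented values Rule 1 fires, so the text is labelled as a translation.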
Attributes Ranking Filters
Information Gain     Chi squared
lexicalRichness      lexicalRichness
grammsPerLexics      grammsPerLexics
ratioFiniteVerbs     ratioFiniteVerbs
ratioNumerals        ratioNumerals
ratioAdjectives      ratioAdjectives
sentenceLength       sentenceLength
ratioProns           ratioProns
simpleSentences      wordLength
wordLength           simpleSentences
grammaticalWords     zeroSentences
zeroSentences        ratioNouns
ratioNouns           lexicalWords
.....                .....
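For readers unfamiliar with information gain, the sketch below scores one numeric attribute by the entropy reduction of a single threshold split. How the attributes were actually discretised for the ranking is not stated in the slides, so the single threshold is an assumption, and the data values are invented.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    remainder = sum(len(part) / n * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Invented lexical richness values for six texts, three per class.
richness = [0.12, 0.14, 0.15, 0.25, 0.30, 0.28]
classes = ["translation"] * 3 + ["non-translation"] * 3
print(information_gain(richness, classes, threshold=0.16))
```

An attribute whose split separates the classes cleanly scores close to the entropy of the class distribution; an uninformative one scores close to zero.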
Conclusions
Summary
A learning system able to distinguish between translated
and non-translated text in Spanish.
On the technical dataset, accuracy reaches up to 97.62%.
Adding the features related to simplification increases
the accuracy of the classifiers; SVM reports a
statistically significant improvement.
The results may be considered an argument for the
existence of the Simplification Universal.