Identification of Translationese: A Machine Learning Approach

Introduction
Methodology
Evaluation
Conclusions
Iustina Ilisei (1), Diana Inkpen (2), Gloria Corpas (3) and
Ruslan Mitkov (1)
(1) University of Wolverhampton, United Kingdom
(2) University of Ottawa, Canada
(3) University of Malaga, Spain
CICLing 2010, Iasi, Romania

Outline
1 Introduction
Introduction in Translation Studies
Universals of Translation
Related Studies
Corpus-based Approach
Machine-Learning Approach
2 Methodology
Objective
Resources
Data Representation
3 Evaluation
Classification
Results Analysis
4 Conclusions

Introduction
Translationese Effect
Translations exhibit their own unnatural language, with their
own peculiar lexico-grammatical and syntactic
characteristics (Gellerstam, 1986).
Translational language cannot avoid the effect of
translationese (Baker, 1993; Laviosa, 1997; McEnery &
Xiao, 2002, 2007).
Intrigue
Since two languages cannot be perfectly mapped onto each
other, a translated text and its original cannot be perfectly
matched.

Language Universals in Translation
Mona Baker
“it will be necessary to develop tools that will enable us to
identify universal features of translation, that is features which
typically occur in translated text rather than original utterances
and which are not the result of interference from specific
linguistic systems” (Baker, 1993:243)
Practical Perspective
a (self-)assessment tool for translators
multilingual plagiarism detection
detecting the direction of translation can improve SMT
performance

Translation Universals
According to Baker (1993, 1996)
Simplification
Translations tend to be simpler and easier-to-follow
texts
Explicitation
Translations tend to spell things out rather than leave them
implicit
Convergence
Translations tend to be more similar to one another than
non-translations are
Normalisation
Translations conform to patterns typical of the target
language, even to the point of exaggerating them

Related studies
Corpus-Based approach
S. Laviosa (2008)
In translations: a low proportion of lexical words to function words, a high
proportion of high-frequency words compared to low-frequency words,
relatively greater repetition of the most frequent words, and less variety
in the most frequently used words.
G. Corpas (2008)
Simplification confirmed for lexical richness, but contradicted in terms
of complex sentences, information load, sentence length, depth of
trees, and senses per word.
G. Corpas, R. Mitkov, N. Afzal, V. Pekar (2008)
Translations exhibit lower lexical density and richness, seem to be more
readable, have a smaller proportion of simple sentences, and use fewer
discourse markers.

Related studies: Machine-Learning Approach
Supervised Learning Approach
Baroni & Bernardini (2006) “A new approach to the study of
translationese: Machine Learning the difference between
original and translated texts”
An SVM classifier distinguishes professional translations from
original texts with accuracy above the chance level
It depends heavily on lexical cues, the distribution of
n-grams of function words, morpho-syntactic categories,
personal pronouns and adverbs in general
Human accuracy is much lower than the accuracy of the
system

Aim of the Study
Objective
To build a language-independent learning system able to distinguish
between translated and non-translated texts.
To investigate the validity of the simplification
hypothesis.
To explore the characteristic features which most influence
translational language.

Methodology
Our assumption
if (the addition of the simplification features
improves learning accuracy)
then this is an argument for the existence
of the Simplification Universal
else further research is required

Translational Corpora
Resources
Comparable corpora: translated texts vs. non-translated
texts
Spanish Monolingual Comparable Corpora
Medical Translations by professionals (MTP) vs.
Comparable Original Medical texts by professionals (MTPC)
Medical Translations by translation students (MTS) vs.
Comparable Original Medical texts by translation students
(MTSC)
Technical Translations by professionals (TT) vs. Comparable
Original Technical texts by professionals (TTC)

Datasets: training and testing
Training set
450 instances (156 translation class, 294 non-translation class)
Testing set
148 instances (52 translation class, 96 non-translation class)
Set pair one: MTP-MTPC (2 + 2 translation vs non-translation)
Set pair two: MTS-MTSC (36 + 66 translation vs non-translation)
Set pair three: TT-TTC (14 + 28 translation vs non-translation)
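A quick arithmetic check on these counts. If the Baseline row in the results tables is a majority-class predictor (an assumption; the slides do not say how it was computed), its accuracy follows directly from the class sizes. The function name `majority_baseline` is illustrative only.

```python
def majority_baseline(n_translation, n_non_translation):
    """Accuracy (%) of always predicting the larger class."""
    total = n_translation + n_non_translation
    return 100.0 * max(n_translation, n_non_translation) / total


# (translation, non-translation) instance counts taken from the slide above.
splits = {
    "training": (156, 294),
    "test": (52, 96),
    "MTS-MTSC": (36, 66),
    "TT-TTC": (14, 28),
}

for name, (n_trans, n_non) in splits.items():
    print(f"{name}: {majority_baseline(n_trans, n_non):.2f}%")
```

The four values (65.33%, 64.86%, 64.71%, 66.67%) agree with the Baseline figures reported in the evaluation tables, which is consistent with the majority-class reading.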

Data Representation
Data Representation without Simplification Features (DR - SF)
Proportion in each text of: grammatical words, nouns, finite
verbs, auxiliary verbs, adjectives, adverbs, numerals,
pronouns, prepositions, determiners, conjunctions,
and the grammatical words/lexical words ratio
Data Representation with Simplification Features (DR + SF)
All of the above (DR - SF) + simplification features

Simplification Features
Proposed features to capture simplicity in texts
Sentence Length: average number of words per sentence
Parse Tree Depth: the average of the maximum parse tree depth
per sentence in texts
Types of sentence: proportion of sentences without finite
verbs / simple sentences / complex sentences in texts
Ambiguity: average number of senses per word in texts
Word Length: average number of syllables per word in texts
Lexical Richness: proportion of lemma types per token in texts
Information Load: proportion of lexical words per token in texts
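Three of the features listed above can be approximated with nothing more than tokenisation, as in the rough sketch below. It is not the authors' pipeline: it counts surface word forms rather than lemmas, splits sentences on punctuation, and uses a tiny hypothetical list of Spanish function words (`FUNCTION_WORDS`) in place of a part-of-speech tagger. Parse tree depth, senses per word and syllable counts need external tools and are left out.

```python
import re

# Hypothetical, deliberately tiny stand-in for a real list of function words.
FUNCTION_WORDS = {"el", "la", "los", "las", "un", "una",
                  "de", "en", "a", "con", "por", "y", "o", "que"}


def simplification_features(text):
    """Approximate three simplification features for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"\w+", text.lower())
    if not sentences or not tokens:
        raise ValueError("text contains no words")
    lexical = [t for t in tokens if t not in FUNCTION_WORDS]
    return {
        # average number of words per sentence
        "sentenceLength": len(tokens) / len(sentences),
        # distinct word forms per token (surface forms, not lemmas)
        "lexicalRichness": len(set(tokens)) / len(tokens),
        # lexical (non-function) words per token
        "informationLoad": len(lexical) / len(tokens),
    }


features = simplification_features("El paciente toma la dosis. La dosis es baja.")
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

With a lemmatiser and a tagger in place of the word-form and stop-list shortcuts, the same three ratios would match the slide's definitions more closely.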

Classification Experiments
Experiments
Trained/tested on the entire dataset
Trained on the entire dataset and tested on separate test
datasets
Set MTS-MTSC (medical texts)
Set TT-TTC (technical texts)

                  Including Simplification Features     Excluding Simplification Features
Classifier        10-fold cross-validation  Test set    10-fold cross-validation  Test set
Baseline          65.33%                    64.86%      65.33%                    64.86%
Naive Bayes       *76.67%                   79.05%      69.33%                    75.00%
BayesNet          78.67%                    79.73%      75.11%                    77.03%
JRip              79.56%                    83.11%      73.33%                    77.03%
Decision Tree     78.22%                    81.76%      78.22%                    81.76%
Simple Logistic   *77.33%                   83.11%      71.11%                    80.41%
SVM               *79.11%                   *81.76%     69.33%                    73.65%
Meta-classifier   *80.00%                   87.16%      73.33%                    85.81%
Table: Classification results: accuracies for several classifiers
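The comparison behind the study's working assumption (do the simplification features raise accuracy?) can be read off the cross-validation columns. The sketch below copies those figures from the table and computes the gain per classifier; the dictionary and function names are illustrative, and a positive gain on its own says nothing about statistical significance.

```python
# 10-fold cross-validation accuracies (%) copied from the table above.
WITH_SF = {
    "Naive Bayes": 76.67, "BayesNet": 78.67, "JRip": 79.56,
    "Decision Tree": 78.22, "Simple Logistic": 77.33,
    "SVM": 79.11, "Meta-classifier": 80.00,
}
WITHOUT_SF = {
    "Naive Bayes": 69.33, "BayesNet": 75.11, "JRip": 73.33,
    "Decision Tree": 78.22, "Simple Logistic": 71.11,
    "SVM": 69.33, "Meta-classifier": 73.33,
}


def accuracy_gains(with_sf, without_sf):
    """Percentage-point gain from adding the simplification features."""
    return {name: round(with_sf[name] - without_sf[name], 2) for name in with_sf}


gains = accuracy_gains(WITH_SF, WITHOUT_SF)
improved = [name for name, gain in gains.items() if gain > 0]

for name, gain in gains.items():
    print(f"{name:16s} {gain:+.2f}")
print(f"{len(improved)} of {len(gains)} classifiers improve")
```

Six of the seven classifiers gain between 3.56 and 9.78 points; the decision tree is unchanged.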

                  Including Simplification Features    Excluding Simplification Features
Classifier        MTS-MTSC    TT-TTC                   MTS-MTSC    TT-TTC
Baseline          64.71%      66.67%                   64.71%      66.67%
Naive Bayes       71.57%      95.24%                   71.57%      80.95%
BayesNet          73.53%      97.62%                   71.57%      92.86%
JRip              79.42%      95.24%                   72.55%      92.86%
Decision Tree     77.45%      92.86%                   75.49%      95.24%
Simple Logistic   77.45%      97.62%                   79.41%      83.33%
SVM               75.49%      *97.62%                  74.51%      69.05%
Meta-classifier   82.35%      97.62%                   78.43%      92.86%
Table: Classification accuracy results on the medical and technical
test datasets.

Decision Tree
Features exploited in the categorisation task:
First level
Lexical Richness
Second level
Sentence Length (words/sentence)
Grammatical words/Lexical words proportion
Third level
Pronoun proportion in texts
Conjunction proportion in texts

JRip Classifier Rules
Rule 1: (lexicalRichness <= 0.16) and (ratioFiniteVerbs <= 0.08) => class=translation
Rule 2: (simpleSentences >= 0.3) and (wordLength <= 2.46) and (sentenceLength >= 20.7) and (ratioNouns >= 0.33) => class=translation
Rule 3: (ratioFiniteVerbs <= 0.09) and (ratioPreps <= 0.13) => class=translation
Rule 4: => class=non-translation
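Read as an ordered rule list, where the first rule that fires decides the class and Rule 4 is the default, the four rules translate directly into code. The thresholds below are the ones on the slide; the function name `jrip_classify` and the two example feature vectors are made up for illustration.

```python
def jrip_classify(f):
    """Apply the four rules in order; the first matching rule decides."""
    if f["lexicalRichness"] <= 0.16 and f["ratioFiniteVerbs"] <= 0.08:
        return "translation"          # Rule 1
    if (f["simpleSentences"] >= 0.3 and f["wordLength"] <= 2.46
            and f["sentenceLength"] >= 20.7 and f["ratioNouns"] >= 0.33):
        return "translation"          # Rule 2
    if f["ratioFiniteVerbs"] <= 0.09 and f["ratioPreps"] <= 0.13:
        return "translation"          # Rule 3
    return "non-translation"          # Rule 4 (default)


# Hypothetical feature vectors, not taken from the corpora.
doc_a = {"lexicalRichness": 0.15, "ratioFiniteVerbs": 0.07,
         "simpleSentences": 0.20, "wordLength": 2.60,
         "sentenceLength": 18.0, "ratioNouns": 0.28, "ratioPreps": 0.15}
doc_b = dict(doc_a, lexicalRichness=0.30, ratioFiniteVerbs=0.12)

print(jrip_classify(doc_a))  # Rule 1 fires
print(jrip_classify(doc_b))  # no rule fires, default class
```

Low lexical richness and a low share of finite verbs are enough on their own to trigger the translation class, which matches the prominence of those two attributes in the feature rankings.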

Attributes Ranking Filters
Information Gain    Chi squared
lexicalRichness     lexicalRichness
grammsPerLexics     grammsPerLexics
ratioFiniteVerbs    ratioFiniteVerbs
ratioNumerals       ratioNumerals
ratioAdjectives     ratioAdjectives
sentenceLength      sentenceLength
ratioProns          ratioProns
simpleSentences     wordLength
wordLength          simpleSentences
grammaticalWords    zeroSentences
zeroSentences       ratioNouns
ratioNouns          lexicalWords
.....               .....
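For readers unfamiliar with the first ranking criterion, here is a minimal sketch of information gain for one numeric attribute. The slides do not say how numeric attributes were discretised, so this version assumes a single, hand-picked threshold; the data values are invented.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """Entropy reduction from splitting on value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (left, right) if part)
    return entropy(labels) - remainder


# Invented toy data: two translations (T) and two non-translations (N).
labels = ["T", "T", "N", "N"]
lexical_richness = [0.10, 0.12, 0.30, 0.40]   # separates the classes
ratio_nouns = [0.10, 0.30, 0.10, 0.30]        # carries no class information

print(information_gain(lexical_richness, labels, 0.16))
print(information_gain(ratio_nouns, labels, 0.20))
```

An attribute that splits the classes cleanly recovers the full label entropy, while an uninformative one scores zero; ranking attributes by this score gives an ordering of the kind shown in the left-hand column.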

Conclusions
Summary
A learning system able to distinguish between translated
and non-translated text for Spanish.
On the technical dataset, accuracy reaches 97.62%.
Adding the features related to simplification increases
the accuracy of the classifiers; for SVM the improvement
is statistically significant.
The results may be considered an argument for the
existence of the Simplification Universal.

Thank you for your attention!
