Russian Learner Translator Corpus: design, research potential and applications
17th International Conference on Text, Speech and Dialogue Brno, Czech Republic, September 8–12 2014
1. RUSSIAN LEARNER TRANSLATOR CORPUS:
design, research potential and
applications
Andrey Kutuzov
National Research University Higher School of Economics
Maria Kunilovskaya
Tyumen State University
17th International Conference on Text, Speech and Dialogue
Brno, Czech Republic, September 8–12 2014
2. General description
• inspired by MeLLANGE
• online and downloadable http://rus-ltc.org
• 1.3 mln tokens
• translations from 10 universities
• 11 source text genres (inc. essays, educational,
informational)
• multiple: 263 sources, 1952 translations
• bi-directional:
approx. 200 English ST(≈300K tokens) with their 1300
Russian translations (≈700 thousand tokens), and
over 40 Russian ST and approx. 600 English translations
• 10 types of linguistic and extralinguistic meta data
• Lexical and POS query interface (Freeling-based linguistic
mark-up) RusLTC at TSD-2014 2
3. Corpus design
1) Txt-archive structured by file-naming conventions
RU_1_23.txt and EN_1_23_9.txt
RU_1_23.head.txt and EN_1_23_9.head.txt
2) TMX file
• pair-wise alignment with LF aligner batch mode
• manual correction (Olifant /Heartsome tmx-editors)
• merging TUVs with identical source segments + adding XML tags to
link segments to head files (a homegrown script)
3) Error-tagged subcorpus
• a collection of 265 annotated translations (for 33 sources);
• stand-off machine readable annotation
• pre-defined error classification
• 6,471 error tags
• online tag-editor based of brat http://brat.nlplab.org/index.html
RusLTC at TSD-2014 3
6. Application and Research
RusLTC is a general purpose data source for translation studies
and translation education research, inc. study of
1. variation and choice in translation;
2. ’translationese’ and the translator interlanguage;
3. interdependence between the translation characteristics
and various meta data (direction and conditions of
translation, source text genre);
4. translation-related “problem areas” or rich points in source
texts;
5. translation quality and translation quality assessment (TQA)
Direct use
• in the curriculum and materials design
• as a teaching and learning aid.
RusLTC at TSD-2014 6
7. RusLTC research: gender asymmetry
in translated texts
1) The same gender asymmetry in male and
female translations as in Russian original
(based on lexical variety)
2) Sentence length figures for female
translations contradict similar statistics for
originals
RusLTC at TSD-2014 7
8. Research based on RusLTC: splitting in
EN-RU translation
1) types of syntactic structures that undergo
splitting in English-Russian translation:
– coordination with “, and”
– non-restrictive relative clauses
2) most frequent mistakes associated with splitting:
– loss or misinterpretation of semantic relations
between propositions,
– issues with anaphora resolution and
– greater communicative value acquired by upgraded
sentences.
RusLTC at TSD-2014 8
9. Error-tagged part: inter-rater reliability
AIM: to gauge reliability of mark-up results based on
error classification proposed and establish the areas of
disagreement
RusLTC at TSD-2014 9
23
38
112
130
30
114
30
30
112
130
38
93
α=0.734 versus α=0.569
10. Error statistics analysis to inform translation
didactics
Hypothesis 1: The better one knows L1 the better she
understands the source/the better the transfer skills.
Hypothesis 2: Final year students make less mistakes than 4th
year students
Hypothesis 3: Test translations show better results than routine
translations because students are more motivated to
perform better
Hypothesis 4: The quantitative results of the error annotation
depend on the order of translations in the set (“order
effect”)
RusLTC at TSD-2014 10
11. Use in the classroom
1) Students have online access to:
• their own error-tagged and commented translations;
• peer translations;
• mistakes statistics which reflects their individual
progress and difficulties.
RusLTC at TSD-2014 11
12. 2) Students’ rating based on the
quality of final translation
RusLTC at TSD-2014 12
Quality parameters used for consecutive ranking to arrive at relative evaluation:
1. number of critical errors,
2. number of content errors and
3. total number of mistakes.
13. 3) Follow students’ individual
progress over the year
(based on the total number of mistakes normalized by the text
size)
RusLTC at TSD-2014 13
14. 4) Think of remedial activities
RusLTC at TSD-2014 14
The top ten mistakes in the sample
15. 1) Theory-based exercises utilizing multiple
concordances
• discussing translation strategies, identifying translation problems
and comparing/evaluating solutions
• developing skills to overcome known transfer issues in English-
Russian translation which are due to interlingual typological
differences
2) Corpus-driven exercises to prevent most
common mistakes
• developing L1 competence through building up corpus-querying
and documentary research skills;
• extending the scope of world knowledge through information
search and developing text analysis and text comprehension
aptitude.
5) Design materials and teaching aids
RusLTC at TSD-2014 15
16. Summary
1) Russian Learner Translator Corpus is an available and
extensive source of data for translation studies and
translator education research (http://www.rus-ltc.org/);
2) The error-tagged subcorpus (http://dev.rus-
ltc.org/brat/#/rusltc/) is a method to provide students
extensive feedback on their translations
3) and a means of accumulating research data on TQA;
4) RusLTC content is used in designing teaching materials.
Thank you!
RusLTC at TSD-2014 16