13. Constantin Orasan (UoW) Natural Language Processing for Translation

Natural Language Processing
for Translation
Constantin Orasan
University of Wolverhampton, UK

Structure

1. How NLP can help machine translation
2. How information retrieval can help machine translation
3. View from the industry

MT is difficult

 Language is ambiguous at all levels:
 Lexical: bank, file, chair
 Syntactic: John saw the man with the telescope.
 Semantic: The rabbit is ready for lunch.
 Discourse: John hid Bill’s car keys. He was drunk.
 Pragmatics: You owe me £20.
 It gets even more difficult when we start working in
multilingual settings

Steps of processing








Tokenisation
Morphology
Syntax
Lexical Semantics
Discourse
Pragmatics

NLP complexity increases from tokenisation towards pragmatics
(and, in general, the accuracy of methods decreases)

General NLP frameworks
 GATE (General Architecture for Text Engineering) http://gate.ac.uk
framework written in Java:
 First designed for information extraction tasks
 Developed into a robust framework that offers processing at different
linguistic levels in many languages
 Provides wrappers for other tools such as Weka, LingPipe, etc.
 NLTK: set of NLP modules written in Python with emphasis on teaching
 LingPipe (http://alias-i.com/lingpipe/) Java toolkit for language processing
 OpenNLP (http://opennlp.apache.org/) ML-based toolkit for language
processing
 … the list can go on

Vauquois triangle
Levels of the triangle (figure), from the base to the apex:
 Direct translation: source text to target text
 Shallow (syntactic) transfer: source syntax to target syntax
 Deep (semantic) transfer: source semantics to target semantics
 Interlingua

NLP in SMT

 Can benefit from linguistic information, but many of the
existing models are largely data driven and do not incorporate
much linguistic information
 See the lecture on
Statistical MT: Word, Phrase and Tree Based Models
(overview)
Khalil Sima'an (UvA) and Trevor Cohn (USFD)

NLP in EBMT

 In many cases it requires some kind of linguistic information
Example Based Machine Translation
Joseph van Genabith (DCU) and Khalil Sima'an (UvA)

NLP in TM

 The existing TM solutions do not rely on much linguistic
information
 Second- and third-generation TM systems rely on
linguistic input
Translation Memories
Ruslan Mitkov (UoW), Manuel Arcedillo (Hermes) and Juanjo
Arevalillo (Hermes)

But there are many other ways in which we
could improve the results of translation
engines by incorporating linguistic information

Improve tokenisation

 For European languages tokenisation is considered more or
less a simple problem.
 In non-segmented languages (such as many East Asian ones),
identification of tokens is extremely complex
 Tokens do not have explicit boundaries (written directly adjacent to
one another with no whitespace between them).
 Practically all the characters can be one-character words in
themselves, but they can also be joined together to form multicharacter words.

 Even in segmented languages like English, identification of
tokens can be difficult.

Tokenisation may not be so easy

 Even in segmented languages like English, where tokens are
usually separated by whitespace and punctuation, there are problems:


Abbreviations: when full stops follow abbreviations, they should be merged with the
abbreviation to form one token (e.g. etc., yrs., Mr.)



Multiple strings separated by white space can in fact form one token (e.g. numerical
expressions in French: 1 200 000)



Hyphenation can be ambiguous:
 Sometimes the hyphen is part of the token, e.g. self-assessment, F-16, forty-two
 Sometimes it is not, e.g. London-based



Additional challenges:
 Numerical, special expressions (dates, measures, email addresses)
 Language-specific rules for contracted words and phrases (e.g. can’t and won’t
contain multiple tokens with no white space between them, whereas O'Brien is a single token)
 Ambiguous punctuation (e.g. “.” in yrs., 05.11.08 )
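
The cases listed above can be illustrated with a small rule-based tokeniser. This is only a minimal sketch, not a production tokeniser: the abbreviation list, the set of suffixes that are split off after a hyphen, and the Penn Treebank-style splitting of "n't" are all assumptions made for the example.

```python
import re

# Hypothetical resources; a real system would use much larger lexicons.
ABBREVIATIONS = {"etc.", "yrs.", "Mr."}
SPLIT_SUFFIXES = {"based"}  # hyphenated suffixes treated as separate tokens

TOKEN_RE = re.compile(
    r"""
    \d{2}\.\d{2}\.\d{2,4}      # dates such as 05.11.08
  | \d{1,3}(?:\ \d{3})+        # space-grouped numbers such as 1 200 000
  | \w+(?:[-'’]\w+)*\.?        # words with internal hyphens or apostrophes
  | \S                         # any other non-space character
    """,
    re.VERBOSE,
)


def tokenise(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tok = match.group()
        # Known abbreviations keep their full stop.
        if tok in ABBREVIATIONS:
            tokens.append(tok)
            continue
        # Any other trailing full stop becomes a token of its own.
        trailing_stop = len(tok) > 1 and tok.endswith(".")
        if trailing_stop:
            tok = tok[:-1]
        head, hyphen, tail = tok.rpartition("-")
        if len(tok) > 3 and tok[-3:] in ("n't", "n’t"):
            # Contractions: can't -> ca + n't
            tokens.extend([tok[:-3], tok[-3:]])
        elif hyphen and tail in SPLIT_SUFFIXES:
            # London-based -> London + - + based
            tokens.extend([head, "-", tail])
        else:
            tokens.append(tok)
        if trailing_stop:
            tokens.append(".")
    return tokens


print(tokenise("Mr. O'Brien can't sell 1 200 000 London-based F-16 jets etc. on 05.11.08."))
```

Each rule is easy to state, but the rules interact and depend on language-specific lists, which is why tokenisation is harder than it first appears.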

Why is tokenisation important in
MT?
 It will influence any task that requires a dictionary/gazetteer
lookup
 Can influence how words are aligned
 Handling abbreviations was shown to help SMT (Li and Yarowsky,
2008)
 Translation of named entities (n.b. NER is seen as part of
tokenisation)

Abbreviations

 Unseen abbreviations are treated as unknown words and left
untranslated
 Modern Chinese is a highly abbreviated language and 20% of
sentences in a newspaper article contain an abbreviation
 The way abbreviations are formed follows much more
complex rules than in English
 Li and Yarowsky (2008) propose an unsupervised method for
extracting relations between full-form phrases and their
abbreviations

Li and Yarowsky (2008)

 Step 1: Identification of English entities
 Step 2: Translate the entities into Chinese using a baseline
translator
 Step 3: Relations between full forms and abbreviations are extracted on the basis
of co-occurrence in a Chinese monolingual corpus
 Step 4: Translation induction for Chinese abbreviations
 Step 5: Integration with the baseline translation system
Evaluation shows that BLEU scores improve
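
The idea behind Step 3 can be sketched as follows. This is an illustrative simplification and not the authors' implementation: it assumes that candidate abbreviations are character subsequences of the full form and simply counts the documents in which a candidate appears alongside the full form. The function names and the toy corpus are made up for the example.

```python
from collections import Counter
from itertools import combinations


def candidate_abbreviations(full_form, min_len=2):
    """All order-preserving character subsequences shorter than the full form."""
    n = len(full_form)
    candidates = set()
    for k in range(min_len, n):
        for indices in combinations(range(n), k):
            candidates.add("".join(full_form[i] for i in indices))
    return candidates


def cooccurrence_counts(full_form, documents):
    """Count documents in which a candidate occurs together with the full form."""
    counts = Counter()
    candidates = candidate_abbreviations(full_form)
    for doc in documents:
        if full_form not in doc:
            continue
        # Mask the full form so that its own substrings are not counted.
        rest = doc.replace(full_form, " ")
        for cand in candidates:
            if cand in rest:
                counts[cand] += 1
    return counts


# Toy monolingual corpus (invented for illustration).
corpus = [
    "北京大学今天宣布新计划。北大校长出席了会议。",
    "北大的学生来自全国各地。",
    "北京大学成立于1898年,北大是中国著名的高校。",
]
print(cooccurrence_counts("北京大学", corpus).most_common(1))
```

On real data the counts would need to be normalised and filtered, since short candidates can co-occur with the full form by chance.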

Translation of named entities

 Incorrect NE translation can seriously harm the quality of
translation
 There are 2 main sources of problems:
 Ambiguity: NEs are normally composed of words which can
also be translated in isolation
 Sparsity: many named entities occur only rarely in the training data
 Integration of NEs into the translation model leads to mixed
results, ranging from significant improvements to small
improvements and even negative impact

NE in SMT
 The main approach in SMT is to determine the NEs in a text and
translate them using an external model.
 Then they are:
 Used as the default translation (Li et al., 2009)
 Added dynamically to compete with other translations (Turchi
et al., 2012; Bouamor et al., 2012),
 Not used, with the original NE left untranslated (Tinsley et al.,
2012)
 Nikoulina et al. (2012) propose replacing NEs with placeholders
in order to reduce sparsity, in this way learning a better model

Nikoulina et al. (2012)

1. The Named Entities are detected and replaced with
placeholders to produce reduced sentences
2. A reduced translation model is used to translate the reduced
sentences
3. An external NE translator is employed
4. The translated NEs are reinserted in the reduced translations
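
The four steps can be sketched in a few lines. This is a minimal illustration under strong assumptions, not the system described by Nikoulina et al.: the placeholder format `__NE0__` is invented, the entities are assumed to be already detected, and both translators are passed in as plain functions (here replaced by toy stand-ins).

```python
def reduce_sentence(sentence, entities):
    """Step 1: replace each detected NE with a numbered placeholder."""
    mapping = {}
    reduced = sentence
    for i, entity in enumerate(entities):
        placeholder = f"__NE{i}__"  # hypothetical placeholder format
        reduced = reduced.replace(entity, placeholder, 1)
        mapping[placeholder] = entity
    return reduced, mapping


def translate_with_placeholders(sentence, entities, translate_reduced, translate_ne):
    reduced, mapping = reduce_sentence(sentence, entities)
    # Step 2: translate the reduced sentence with the reduced model.
    output = translate_reduced(reduced)
    # Steps 3 and 4: translate each NE externally and reinsert it.
    for placeholder, entity in mapping.items():
        output = output.replace(placeholder, translate_ne(entity))
    return output


# Toy stand-ins for the reduced translation model and the NE translator.
def toy_reduced_model(text):
    return text.replace("works in", "travaille à")


def toy_ne_translator(entity):
    return {"London": "Londres"}.get(entity, entity)


print(translate_with_placeholders(
    "Constantin Orasan works in London",
    ["Constantin Orasan", "London"],
    toy_reduced_model,
    toy_ne_translator,
))
```

Because the reduced model only ever sees placeholders, it never has to learn translations for individual rare names, which is how the approach reduces sparsity.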

The disadvantage of the approach is that the framework is only loosely
dependent on the SMT task. As a result, a post-processing step is applied to
the output of the NER, together with a prediction model that determines which NEs
can be safely translated

MT and semantics
 Noted as a problem for Machine Translation back in the late 1940s (Weaver,
1949)
A word can often only be translated if you know the specific sense intended
 Bar-Hillel (1960) posed the following problem:
Little John was looking for his toy box. Finally, he found it. The box was in the
pen. John was very happy.
 Is “pen” a writing instrument or an enclosure where children play or an
enclosure for livestock?


…declared it unsolvable, and left the field of MT…

Lexical Divergence: many-to-many

Lexical Divergence: solution?

 Domain specific dictionaries can improve the quality of
translation in post-editing environments

 Word sense disambiguation (WSD) is the ability to identify the
meaning of words in context in a computational manner
 WSD is seen as a more general solution to lexical
divergence, but
 WSD is an AI-complete problem
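
To make the task concrete, here is a minimal sketch of one classic WSD heuristic, a simplified Lesk-style gloss overlap. It is swapped in purely for illustration and is not the method used in the papers discussed next; the sense inventory, glosses and stopword list are invented for the example.

```python
import re

STOPWORDS = {"a", "an", "the", "of", "and", "or", "that", "on", "at", "he", "she"}

# Hypothetical sense inventory with hand-written glosses.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits of money and makes loans",
        "river side": "the sloping land beside a river or lake",
    }
}


def content_words(text):
    """Lower-cased word set with stopwords removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}


def simplified_lesk(word, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in SENSES[word].items():
        overlap = len(context_words & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


print(simplified_lesk("bank", "He sat on the bank of the river"))
print(simplified_lesk("bank", "She deposits money at the bank"))
```

The heuristic only works when the context happens to share words with a gloss, which hints at why general WSD is so hard.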

Carpuat and Wu (2005)
 SMT models only rely on local context to choose among lexical
translation candidates
 The assumption is that a dedicated WSD module can help the
translation process
 Use a baseline Chinese to English translation engine and a state-of-the-art Chinese WSD system
 WSD incorporated in the decoder
 WSD incorporated in a post-processor
 Translation obtained using the English gloss of HowNet
 WSD does not help a typical SMT system, but this is mainly due to the fact that
SMT systems at the time (2005) cannot take advantage of the sense
information

Chan et al. (2007)

 Successfully integrate WSD in Hiero, a state-of-the-art Chinese
to English hierarchical phrase-based MT system
 Introduce two additional features in the MT model at the
decoding stage that take into consideration that some words
were chosen by the WSD system
 The improvement noticed is modest, but statistically significant
 Carpuat and Wu (2007) find similar results, but instead of WSD
they perform fully phrasal multi-word disambiguation and their
disambiguation system is tightly integrated in the SMT engine

NLP in evaluation of MT
 Evaluation metrics like BLEU treat any divergence from the
reference translation as a mistake
 Several alternative metrics were proposed to address this
problem:
 METEOR (Denkowski and Lavie, 2010) accounts for
synonyms and paraphrases
 Calculating meaning equivalence using bidirectional textual
entailment (Pado et al., 2009)
 Using semantic role labels (Gimenez and Marquez, 2007)
 TINE (Rios et al., 2011) measures the similarity between
sentences using shallow semantic representation
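
The difference between exact matching and synonym-aware matching can be shown with a toy unigram precision. This is only a sketch of the general idea: it is neither BLEU nor METEOR, and the synonym table is a hypothetical stand-in for a real lexical resource.

```python
from collections import Counter

# Hypothetical synonym table (a real metric would use a lexical resource).
SYNONYMS = {"car": {"automobile"}, "automobile": {"car"}}


def unigram_precision(hypothesis, reference, synonyms=None):
    """Share of hypothesis words matched in the reference, each reference word used once."""
    hyp = hypothesis.lower().split()
    ref_counts = Counter(reference.lower().split())
    matched = 0
    for word in hyp:
        # Try the exact word first, then any listed synonyms.
        candidates = [word]
        if synonyms:
            candidates += sorted(synonyms.get(word, ()))
        for cand in candidates:
            if ref_counts[cand] > 0:
                ref_counts[cand] -= 1
                matched += 1
                break
    return matched / len(hyp) if hyp else 0.0


reference = "the man bought a car"
hypothesis = "the man bought an automobile"
print(unigram_precision(hypothesis, reference))            # exact matches only
print(unigram_precision(hypothesis, reference, SYNONYMS))  # synonyms allowed
```

With exact matching, "automobile" counts as an error even though it means the same as "car"; allowing synonyms removes that penalty.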

Other NLP applications which
could be useful





Automatic terminology extraction
Automatic extraction of ontologies
Automatic compilation of (parallel/comparable) corpora
Use of parallel corpora to train various systems

References
 Bouamor, D., Semmar, N., and Zweigenbaum, P. (2012). Identifying multiword expressions in statistical machine translation. In Proceedings of LREC
2012.
 Carpuat, M., & Wu, D. (2007). Improving Statistical Machine Translation
Using Word Sense Disambiguation. In Proceedings of the 2007 Joint
Conference on Empirical Methods in Natural Language Processing and
Computational Natural Language Learning (pp. 61–72). Prague, Czech
Republic. Retrieved from http://acl.ldc.upenn.edu/D/D07/D07-1007.pdf
 Carpuat, M., & Wu, D. (2005). Word Sense Disambiguation vs. Statistical
Machine Translation. Proceedings of the 43rd Annual Meeting of the ACL,
(June), 387–394. Retrieved from http://acl.ldc.upenn.edu/P/P07/P071005.pdf
 Chan, Y. S., Ng, H. T., & Chiang, D. (2007). Word Sense Disambiguation
Improves Statistical Machine Translation. In Proceedings of the 45th Annual
Meeting of the Association for Computational Linguistics (pp. 33–40).

 Li, M., Zhang, J., Zhou, Y., and Chengqing, Z. (2009). The CASIA statistical
machine translation system for IWSLT 2009. In Proceedings of IWSLT 2009
 Li, Z., & Yarowsky, D. (2008). Unsupervised Translation Induction for Chinese
Abbreviations using Monolingual Corpora. In Proceedings of ACL-08 (pp. 425–433).
Columbus, Ohio, USA. Retrieved from
http://aclweb.org/anthology//P/P08/P08-1049.pdf
 Navigli, R. (2009). Word sense disambiguation. ACM Computing Surveys,
41(2), 1–69. doi:10.1145/1459352.1459355
 Nikoulina, V., Sandor, A., & Dymetman, M. (2012). Hybrid Adaptation of
Named Entity Recognition for Statistical Machine Translation. In Second
ML4HMT Workshop (pp. 1–16).
 Pado, S., Galley, M., Jurafsky, D., & Manning, C. (2009). Robust Machine
Translation Evaluation with Entailment Features. In Proceedings of the 47th
Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 297–305).

 Rios, M., Aziz, W., & Specia, L. (2011). TINE: A Metric to Assess MT Adequacy.
In Proceedings of the 6th Workshop on Statistical Machine Translation
(WMT-2011), July, Edinburgh, UK
 Tinsley, J., Ceausu, A., and Zhang, J. (2012). PLUTO: automated solutions
for patent translation. In EACL Joint Workshop on Exploiting Synergies
between Information Retrieval and Machine Translation (ESIRMT) and Hybrid
Approaches to Machine Translation (HyTra): Proceedings of the workshop,
EACL 2012.
 Turchi, M., Atkinson, M., Wilcox, A., Crawley, B., Bucci, S., Steinberger, R.,
and Van der Goot, E. (2012). ONTS: "Optima" news translation system. In
Proceedings of the Demonstrations at the 13th Conference of the European
Chapter of the Association for Computational Linguistics.
