Natural Language Processing
for Translation
Constantin Orasan
University of Wolverhampton, UK
Structure

1. How NLP can help machine translation
2. How information retrieval can help machine translation
3. View from the industry
MT is difficult

 Language is ambiguous at all levels:
  Lexical: bank, file, chair
  Syntactic: John saw the man with the telescope.
  Semantic: The rabbit is ready for lunch.
  Discourse: John hid Bill’s car keys. He was drunk.
  Pragmatic: You owe me £20.
 It gets even more difficult when we start working in multilingual settings
Steps of processing

Tokenisation
Morphology
Syntax
Lexical Semantics
Discourse
Pragmatics

NLP complexity increases from tokenisation towards pragmatics (and in general the accuracy of methods decreases)
General NLP frameworks
 GATE (General Architecture for Text Engineering, http://gate.ac.uk), a framework written in Java:
  First designed for information extraction tasks
  Developed into a robust framework that offers processing at different linguistic levels in many languages
  Provides wrappers for other tools such as Weka, LingPipe, etc.
 NLTK: a set of NLP modules written in Python, with an emphasis on teaching
 LingPipe (http://alias-i.com/lingpipe/): a Java toolkit for language processing
 OpenNLP (http://opennlp.apache.org/): an ML-based toolkit for language processing
 … the list can go on
Vauquois triangle

Direct translation: source text to target text
Shallow (syntactic) transfer: source syntax to target syntax
Deep (semantic) transfer: source semantics to target semantics
Interlingua
NLP in SMT

 Can benefit from linguistic information, but many of the
existing models are largely data driven and do not incorporate
much linguistic information
 See the lecture on
Statistical MT: Word, Phrase and Tree Based Models
(overview)
Khalil Sima'an (UvA) and Trevor Cohn (USFD)
NLP in EBMT

 In many cases it requires some kind of linguistic information
 See the lecture on
Example Based Machine Translation
Josef van Genabith (DCU) and Khalil Sima'an (UvA)
NLP in TM

 Existing TM solutions do not rely on much linguistic information
 Second- and third-generation TM systems rely on linguistic input
 See the lecture on
Translation Memories
Ruslan Mitkov (UoW), Manuel Arcedillo (Hermes) and Juanjo Arevalillo (Hermes)
But there are many other ways in which we
could improve the results of translation
engines by incorporating linguistic information
Improve tokenisation

 For European languages, tokenisation is considered a more or less simple problem.
 In non-segmented languages (such as many East Asian ones), identification of tokens is extremely complex:
  Tokens do not have explicit boundaries (they are written directly adjacent to one another, with no whitespace between them).
  Practically all characters can be one-character words in themselves, but they can also be joined together to form multi-character words.
 Even in segmented languages like English, identification of tokens can be difficult.
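The segmentation problem for non-segmented text can be illustrated with a minimal sketch of greedy forward maximum matching. This is a classic dictionary-based baseline and not a method taken from the slides; the lexicon and the example string are assumptions chosen for illustration.

```python
def max_match(text, lexicon):
    """Greedy forward maximum matching over an unsegmented string."""
    longest = max(len(word) for word in lexicon)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a lexicon hit.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if candidate in lexicon or size == 1:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy lexicon; a real system would use a large dictionary.
LEXICON = {"北京", "大学", "北京大学", "大学生", "生"}

print(max_match("北京大学生", LEXICON))
```

With this lexicon the greedy strategy picks 北京大学 followed by 生, although 北京 followed by 大学生 is an equally valid split of the same characters. That is exactly the kind of ambiguity that makes token identification hard.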
Tokenisation may not be so easy

 Even in segmented languages like English, where tokens are usually separated by whitespace and punctuation, there are problems:
  Abbreviations: when a full stop follows an abbreviation, it should be merged with the abbreviation to form one token (e.g. etc., yrs., Mr.)
  Multiple strings separated by whitespace can in fact form one token (e.g. numerical expressions in French: 1 200 000)
  Hyphenation can be ambiguous:
   Sometimes the hyphen is part of the word, e.g. self-assessment, F-16, forty-two
   Sometimes not, e.g. London-based
 Additional challenges:
  Numerical and special expressions (dates, measures, email addresses)
  Language-specific rules for contracting words and phrases (e.g. can’t and won’t contain multiple tokens with no whitespace between them, whereas O'Brien is a single token)
  Ambiguous punctuation (e.g. “.” in yrs. and 05.11.08)
Why is tokenisation important in MT?
 It influences any task that requires a dictionary/gazetteer lookup
 It can influence how words are aligned
 Handling abbreviations was shown to help SMT (Li and Yarowsky, 2008)
 It affects the translation of named entities (n.b. NER is seen here as part of tokenisation)
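Several of the English tokenisation pitfalls mentioned earlier (abbreviations, space-grouped numbers, hyphens, contractions, ambiguous full stops) can be made concrete with a small rule-based sketch. The abbreviation list, the regular expression and the choice to split can't into ca + n't are all assumptions made for illustration, not a tokeniser described in the slides. The sketch only handles straight apostrophes and always keeps hyphenated strings together.

```python
import re

# Hypothetical abbreviation list; a real tokeniser needs a larger, language-specific one.
ABBREVIATIONS = {"etc.", "yrs.", "Mr."}

TOKEN_RE = re.compile(r"""
    \d{1,3}(?:\ \d{3})+                   # space-grouped numbers such as 1 200 000
  | \d+(?:[.,/]\d+)*                      # plain numbers and dates such as 05.11.08
  | [A-Za-z]+(?=n't\b)                    # stem of a negative contraction (ca + n't)
  | n't\b                                 # the contracted negation itself
  | [A-Za-z]+(?:['-][A-Za-z0-9]+)*\.?     # words, O'Brien, F-16, optional final full stop
  | \S                                    # any other single non-space character
""", re.VERBOSE)


def tokenise(text):
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        # Detach a trailing full stop unless the token is a known abbreviation.
        if len(token) > 1 and token.endswith(".") and token not in ABBREVIATIONS:
            tokens.extend([token[:-1], "."])
        else:
            tokens.append(token)
    return tokens


print(tokenise("Mr. O'Brien can't pay 1 200 000 euros."))
print(tokenise("The F-16 flew on 05.11.08 etc."))
```

Because every hyphenated string stays in one piece, London-based is not split. Deciding that case correctly needs more than a surface pattern.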
Abbreviations

 Unseen abbreviations are treated as unknown words and left untranslated
 Modern Chinese is a highly abbreviated language: 20% of sentences in a newspaper article contain an abbreviation
 The way abbreviations are formed follows much more complex rules than in English
 Li and Yarowsky (2008) propose an unsupervised method for extracting relations between full-form phrases and their abbreviations
Li and Yarowsky (2008)

 Step 1: Identification of English entities
 Step 2: Translation of the entities into Chinese using a baseline translator
 Step 3: Extraction of full-form/abbreviation relations on the basis of co-occurrence in a Chinese monolingual corpus
 Step 4: Translation induction for Chinese abbreviations
 Step 5: Integration with the baseline translation system
Evaluation shows that BLEU scores improve
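The co-occurrence idea behind Step 3 can be sketched roughly as follows. This is a heavily simplified toy, not the authors' procedure. It assumes documents are already segmented into tokens, and it treats any shorter token whose characters appear in the same order in the full form as a candidate abbreviation. The function names, the threshold and the data are hypothetical.

```python
from collections import Counter


def is_subsequence(short, full):
    """True if the characters of `short` appear in `full` in the same order."""
    chars = iter(full)
    return all(ch in chars for ch in short)


def candidate_abbreviations(full_form, documents, min_count=2):
    """Count candidate abbreviations that co-occur with the full form in a document."""
    counts = Counter()
    for tokens in documents:
        if full_form not in tokens:
            continue  # only documents that mention the full form count as evidence
        for token in set(tokens):
            if 1 < len(token) < len(full_form) and is_subsequence(token, full_form):
                counts[token] += 1
    return [(abbr, n) for abbr, n in counts.most_common() if n >= min_count]


# Hypothetical pre-segmented monolingual documents.
DOCS = [
    ["北京大学", "校长", "访问", "北大", "校园"],
    ["北大", "学生", "来自", "北京大学"],
    ["北京", "天气", "很", "好"],
]

print(candidate_abbreviations("北京大学", DOCS))
```

The third document is ignored because it never mentions the full form, so 北京 is not proposed even though its characters would match.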
Translation of named entities

 Incorrect NE translation can seriously harm the quality of translation
 There are 2 main sources of problems:
  Ambiguity: NEs are normally composed of words which can also be translated in isolation
  Sparsity: some named entities are very sparse
 Integration of NEs into the translation model leads to mixed results, ranging from significant improvements, through small improvements, to a negative impact
NE in SMT
 The main approach in SMT is to determine the NEs in a text and translate them using an external model.
 The external translations are then:
  Used as the default translation (Li et al., 2009)
  Added dynamically to compete with other translations (Turchi et al., 2012; Bouamor et al., 2012)
  Not used, and the original NE is left untranslated (Tinsley et al., 2012)
 Nikoulina et al. (2012) propose replacing NEs with placeholders in order to reduce sparsity, in this way learning a better model
Nikoulina et al. (2012)

1. The named entities are detected and replaced with placeholders to produce reduced sentences
2. A reduced translation model is used to translate the reduced sentences
3. An external NE translator is employed
4. The translated NEs are reinserted in the reduced translations

The disadvantage of the approach is that the framework is loosely dependent on the SMT task, so a post-processing step is applied to the output of the NER, together with a prediction model to determine which NEs can be safely translated
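The four steps of the placeholder approach can be sketched as a toy pipeline. Every component below is a hypothetical stand-in: a lexicon lookup plays the role of the NER and of the external NE translator, and a word-for-word table plays the role of the reduced translation model. The sketch shows only the data flow, not the actual system of Nikoulina et al.

```python
# Hypothetical NE lexicon: source entity -> translation from the external NE translator.
NE_LEXICON = {"Wolverhampton": "Wolverhampton", "United Nations": "Nations Unies"}


def detect_entities(sentence):
    """Toy NER: return lexicon entries found in the sentence, longest first."""
    found = [ne for ne in NE_LEXICON if ne in sentence]
    return sorted(found, key=len, reverse=True)


def reduce_sentence(sentence):
    """Step 1: replace each detected NE with an indexed placeholder."""
    mapping = {}
    for index, entity in enumerate(detect_entities(sentence)):
        placeholder = f"__NE{index}__"
        sentence = sentence.replace(entity, placeholder)
        mapping[placeholder] = entity
    return sentence, mapping


def translate_reduced(sentence):
    """Step 2: stand-in for the reduced translation model (word-for-word toy)."""
    toy_table = {"The": "Les", "met": "se sont réunies", "in": "à"}
    return " ".join(toy_table.get(word, word) for word in sentence.split())


def translate_entity(entity):
    """Step 3: stand-in for the external NE translator."""
    return NE_LEXICON[entity]


def reinsert(translation, mapping):
    """Step 4: put the translated NEs back in place of the placeholders."""
    for placeholder, entity in mapping.items():
        translation = translation.replace(placeholder, translate_entity(entity))
    return translation


reduced, mapping = reduce_sentence("The United Nations met in Wolverhampton")
print(reduced)
print(reinsert(translate_reduced(reduced), mapping))
```

Because the translation model only ever sees the placeholders, rare entities no longer fragment its statistics. That is the sparsity argument made on the slide.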
MT and semantics
 Noted as a problem for machine translation as early as the late 1940s (Weaver, 1949):
A word can often only be translated if you know the specific sense intended
 Bar-Hillel (1960) posed the following problem:
Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
 Is “pen” a writing instrument, an enclosure where children play, or an enclosure for livestock?

…declared it unsolvable, and left the field of MT…
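A toy gloss-overlap disambiguator shows why the intended sense of “pen” cannot be read off the sentence alone. The sense inventory, the glosses and the stopword list below are written for this example and are assumptions. The heuristic is a simplified Lesk-style overlap count and is not proposed in the slides. Bar-Hillel's point was that the example needs world knowledge, which a word-overlap heuristic does not capture in general.

```python
# Hypothetical sense inventory for "pen"; the glosses are written for this example.
SENSES = {
    "writing instrument": "a tool used for writing or drawing with ink",
    "playpen": "a small enclosure in which a young child can play with a toy",
    "livestock enclosure": "a fenced enclosure for farm animals such as sheep",
}
STOPWORDS = {"a", "the", "in", "for", "with", "was", "his", "he", "it",
             "or", "which", "can", "as", "such", "used"}


def content_words(text):
    """Lower-cased words with punctuation stripped and stopwords removed."""
    words = (w.strip(".,;:!?").lower() for w in text.split())
    return {w for w in words if w and w not in STOPWORDS}


def disambiguate(context):
    """Pick the sense whose gloss shares most content words with the context.

    Returns None when no gloss overlaps with the context at all.
    """
    context_words = content_words(context)
    scores = {sense: len(context_words & content_words(gloss))
              for sense, gloss in SENSES.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


print(disambiguate("The box was in the pen."))
print(disambiguate("Little John was looking for his toy box. The box was in the pen."))
```

The sentence on its own gives no evidence for any sense. Only the wider context, here the word “toy”, tips the balance, and only because the hand-written gloss happens to mention it.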
Lexical Divergence: many-to-many
Lexical Divergence: solution?

 Domain-specific dictionaries can improve the quality of translation in post-editing environments
 Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner
 WSD is seen as a more general solution to lexical divergence, but
 WSD is an AI-complete problem
Carpuat and Wu (2005)
 SMT models rely only on local context to choose among lexical translation candidates
 The assumption is that a dedicated WSD module can help the translation process
 Use a baseline Chinese-to-English translation engine and a state-of-the-art Chinese WSD system:
  WSD incorporated in the decoder
  WSD incorporated in a post-processor
  Translation obtained using the English gloss of HowNet
 WSD does not help a typical SMT system, but this is mainly because SMT systems of the time (2005) could not take advantage of the sense information
Chan et al. (2007)

 Successfully integrate WSD in Hiero, a state-of-the-art Chinese
to English hierarchical phrase-based MT system
 Introduce two additional features in the MT model at the
decoding stage that take into consideration that some words
were chosen by the WSD system
 The improvement noticed is modest, but statistically significant
 Carpuat and Wu (2007) find similar results, but instead of WSD
they perform fully phrasal multi-word disambiguation and their
disambiguation system is tightly integrated in the SMT engine
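The general idea of adding WSD-derived features to a log-linear translation model can be sketched as below. The two features shown, the log-probability from the WSD classifier and an indicator for its top choice, are illustrative assumptions and not necessarily the features defined by Chan et al. The weights, probabilities and candidate translations are invented.

```python
import math

# Hypothetical feature weights; in practice they would be tuned on held-out data.
BASE_WEIGHTS = {"tm": 1.0, "lm": 0.5}
WSD_WEIGHTS = {"wsd_logprob": 0.8, "wsd_chosen": 0.2}


def score(candidate, wsd_probs=None):
    """Log-linear score: weighted sum of feature values for one candidate."""
    features = {
        "tm": math.log(candidate["tm_prob"]),  # translation model
        "lm": math.log(candidate["lm_prob"]),  # language model
    }
    weights = dict(BASE_WEIGHTS)
    if wsd_probs is not None:
        best_sense = max(wsd_probs, key=wsd_probs.get)
        # Feature 1: log-probability the WSD classifier gives this translation.
        features["wsd_logprob"] = math.log(wsd_probs.get(candidate["target"], 1e-6))
        # Feature 2: indicator that this is the WSD system's top choice.
        features["wsd_chosen"] = 1.0 if candidate["target"] == best_sense else 0.0
        weights.update(WSD_WEIGHTS)
    return sum(weights[name] * value for name, value in features.items())


def best_translation(candidates, wsd_probs=None):
    return max(candidates, key=lambda c: score(c, wsd_probs))["target"]


# Two hypothetical translations of one ambiguous source word.
CANDIDATES = [
    {"target": "bank", "tm_prob": 0.6, "lm_prob": 0.3},
    {"target": "shore", "tm_prob": 0.4, "lm_prob": 0.3},
]
WSD_PROBS = {"bank": 0.1, "shore": 0.9}

print(best_translation(CANDIDATES))
print(best_translation(CANDIDATES, WSD_PROBS))
```

Without the WSD features, the candidate favoured by the translation model wins. With them, the context-aware prediction can override it, provided the feature weights allow it.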
NLP in evaluation of MT
 Evaluation metrics like BLEU treat any divergence from the reference translation as a mistake
 Several alternative metrics were proposed to address this problem:
  METEOR (Denkowski and Lavie, 2010) accounts for synonyms and paraphrases
  Meaning equivalence calculated using bidirectional textual entailment (Pado et al., 2009)
  Metrics using semantic role labels (Gimenez and Marquez, 2007)
  TINE (Rios et al., 2011) measures the similarity between sentences using shallow semantic representations
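The difference between exact matching and synonym-aware matching can be shown with a clipped unigram precision. This is neither BLEU (which combines several n-gram orders with a brevity penalty) nor METEOR. It is a minimal sketch of the contrast only, with a hypothetical synonym table and invented sentences.

```python
from collections import Counter

# Hypothetical synonym table mapping a canonical word to its variants.
SYNONYMS = {"car": {"automobile"}, "quick": {"fast"}}


def canonical(word, synonyms):
    """Map a word to the canonical member of its synonym set, if any."""
    for head, variants in synonyms.items():
        if word == head or word in variants:
            return head
    return word


def unigram_precision(hypothesis, reference, synonyms=None):
    """Share of hypothesis words that also occur in the reference (clipped)."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if synonyms:
        hyp = [canonical(w, synonyms) for w in hyp]
        ref = [canonical(w, synonyms) for w in ref]
    ref_counts = Counter(ref)
    # Clip each hypothesis word count by its count in the reference.
    matched = sum(min(count, ref_counts[word]) for word, count in Counter(hyp).items())
    return matched / len(hyp) if hyp else 0.0


reference = "the fast car stopped"
hypothesis = "the quick automobile stopped"
print(unigram_precision(hypothesis, reference))
print(unigram_precision(hypothesis, reference, SYNONYMS))
```

Exact matching penalises two perfectly acceptable word choices, while the synonym-aware variant gives the hypothesis full credit.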
Other NLP applications which could be useful

 Automatic terminology extraction
 Automatic extraction of ontologies
 Automatic compilation of (parallel/comparable) corpora
 Use of parallel corpora to train various systems
References
 Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Identifying multi-word expressions in statistical machine translation. In Proceedings of LREC 2012.
 Carpuat, M., & Wu, D. (2005). Word Sense Disambiguation vs. Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the ACL (pp. 387–394). Retrieved from http://acl.ldc.upenn.edu/P/P07/P07-1005.pdf
 Carpuat, M., & Wu, D. (2007). Improving Statistical Machine Translation Using Word Sense Disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 61–72). Prague, Czech Republic. Retrieved from http://acl.ldc.upenn.edu/D/D07/D07-1007.pdf
 Chan, Y. S., Ng, H. T., & Chiang, D. (2007). Word Sense Disambiguation Improves Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 33–40).
 Li, M., Zhang, J., Zhou, Y., & Zong, C. (2009). The CASIA statistical machine translation system for IWSLT 2009. In Proceedings of IWSLT 2009.
 Li, Z., & Yarowsky, D. (2008). Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora. In Proceedings of ACL-08 (pp. 425–433). Columbus, Ohio, USA. Retrieved from http://aclweb.org/anthology//P/P08/P08-1049.pdf
 Navigli, R. (2009). Word sense disambiguation. ACM Computing Surveys, 41(2), 1–69. doi:10.1145/1459352.1459355
 Nikoulina, V., Sandor, A., & Dymetman, M. (2012). Hybrid Adaptation of Named Entity Recognition for Statistical Machine Translation. In Second ML4HMT Workshop (pp. 1–16).
 Pado, S., Galley, M., Jurafsky, D., & Manning, C. (2009). Robust Machine Translation Evaluation with Entailment Features. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 297–305).
 Rios, M., Aziz, W., & Specia, L. (2011). TINE: A Metric to Assess MT Adequacy. In Proceedings of the 6th Workshop on Statistical Machine Translation (WMT-2011), July, Edinburgh, UK.
 Tinsley, J., Ceausu, A., & Zhang, J. (2012). PLUTO: automated solutions for patent translation. In EACL Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra): Proceedings of the workshop, EACL 2012.
 Turchi, M., Atkinson, M., Wilcox, A., Crawley, B., Bucci, S., Steinberger, R., & Van der Goot, E. (2012). ONTS: "Optima" news translation system. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics.

Contenu connexe

Tendances

Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Rajnish Raj
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translationMarcis Pinnis
 
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
S URVEY  O N M ACHINE  T RANSLITERATION A ND  M ACHINE L EARNING M ODELSS URVEY  O N M ACHINE  T RANSLITERATION A ND  M ACHINE L EARNING M ODELS
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELSijnlc
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to HindiRajat Jain
 
Parts of Speect Tagging
Parts of Speect TaggingParts of Speect Tagging
Parts of Speect Taggingtheyaseen51
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine TranslationJaganadh Gopinadhan
 
Machine translation with statistical approach
Machine translation with statistical approachMachine translation with statistical approach
Machine translation with statistical approachvini89
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translationkhyati gupta
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsParisa Niksefat
 
CBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERCBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERijnlc
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationLifeng (Aaron) Han
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding Systeminscit2006
 
Language translation english to hindi
Language translation english to hindiLanguage translation english to hindi
Language translation english to hindiRAJENDRA VERMA
 
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningError Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningCITE
 

Tendances (20)

Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...Natural Language processing Parts of speech tagging, its classes, and how to ...
Natural Language processing Parts of speech tagging, its classes, and how to ...
 
Moses
MosesMoses
Moses
 
SMT3
SMT3SMT3
SMT3
 
NLP pipeline in machine translation
NLP pipeline in machine translationNLP pipeline in machine translation
NLP pipeline in machine translation
 
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
S URVEY  O N M ACHINE  T RANSLITERATION A ND  M ACHINE L EARNING M ODELSS URVEY  O N M ACHINE  T RANSLITERATION A ND  M ACHINE L EARNING M ODELS
S URVEY O N M ACHINE T RANSLITERATION A ND M ACHINE L EARNING M ODELS
 
Machine translation from English to Hindi
Machine translation from English to HindiMachine translation from English to Hindi
Machine translation from English to Hindi
 
Parts of Speect Tagging
Parts of Speect TaggingParts of Speect Tagging
Parts of Speect Tagging
 
Pxc3898474
Pxc3898474Pxc3898474
Pxc3898474
 
A tutorial on Machine Translation
A tutorial on Machine TranslationA tutorial on Machine Translation
A tutorial on Machine Translation
 
Machine translation with statistical approach
Machine translation with statistical approachMachine translation with statistical approach
Machine translation with statistical approach
 
Arabic MT Project
Arabic MT ProjectArabic MT Project
Arabic MT Project
 
Experiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine TranslationExperiments with Different Models of Statistcial Machine Translation
Experiments with Different Models of Statistcial Machine Translation
 
Error Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation OutputsError Analysis of Rule-based Machine Translation Outputs
Error Analysis of Rule-based Machine Translation Outputs
 
CBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMERCBAS: CONTEXT BASED ARABIC STEMMER
CBAS: CONTEXT BASED ARABIC STEMMER
 
A deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine TranslationA deep analysis of Multi-word Expression and Machine Translation
A deep analysis of Multi-word Expression and Machine Translation
 
Intro to NLP. Lecture 2
Intro to NLP.  Lecture 2Intro to NLP.  Lecture 2
Intro to NLP. Lecture 2
 
An Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding SystemAn Intuitive Natural Language Understanding System
An Intuitive Natural Language Understanding System
 
NLP_KASHK:Text Normalization
NLP_KASHK:Text NormalizationNLP_KASHK:Text Normalization
NLP_KASHK:Text Normalization
 
Language translation english to hindi
Language translation english to hindiLanguage translation english to hindi
Language translation english to hindi
 
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language LearningError Detection and Feedback with OT-LFG for Computer-assisted Language Learning
Error Detection and Feedback with OT-LFG for Computer-assisted Language Learning
 

Similaire à 13. Constantin Orasan (UoW) Natural Language Processing for Translation

Machine Translation Approaches and Design Aspects
Machine Translation Approaches and Design AspectsMachine Translation Approaches and Design Aspects
Machine Translation Approaches and Design AspectsIOSR Journals
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsTae Hwan Jung
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
Developing an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyDeveloping an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyAlexander Decker
 
Customizable Segmentation of
Customizable Segmentation ofCustomizable Segmentation of
Customizable Segmentation ofAndi Wu
 
Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...IOSR Journals
 
Tutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemTutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemIJERA Editor
 
Using ontology based context in the
Using ontology based context in theUsing ontology based context in the
Using ontology based context in theijaia
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingScott Faria
 
SECOND LANGUAGE RESEARCH.pptx
SECOND LANGUAGE RESEARCH.pptxSECOND LANGUAGE RESEARCH.pptx
SECOND LANGUAGE RESEARCH.pptxssuser1ac0fa
 
Semantic Rules Representation in Controlled Natural Language in FluentEditor
Semantic Rules Representation in Controlled Natural Language in FluentEditorSemantic Rules Representation in Controlled Natural Language in FluentEditor
Semantic Rules Representation in Controlled Natural Language in FluentEditorCognitum
 
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...ijnlc
 
Natural Language Processing: State of The Art, Current Trends and Challenges
Natural Language Processing: State of The Art, Current Trends and ChallengesNatural Language Processing: State of The Art, Current Trends and Challenges
Natural Language Processing: State of The Art, Current Trends and Challengesantonellarose
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
 

Similaire à 13. Constantin Orasan (UoW) Natural Language Processing for Translation (20)

Machine Translation Approaches and Design Aspects
Machine Translation Approaches and Design AspectsMachine Translation Approaches and Design Aspects
Machine Translation Approaches and Design Aspects
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword units
 
REPORT.doc
REPORT.docREPORT.doc
REPORT.doc
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
Developing an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyDeveloping an architecture for translation engine using ontology
Developing an architecture for translation engine using ontology
 
Customizable Segmentation of
Customizable Segmentation ofCustomizable Segmentation of
Customizable Segmentation of
 
Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...Role of Machine Translation and Word Sense Disambiguation in Natural Language...
Role of Machine Translation and Word Sense Disambiguation in Natural Language...
 
Tutorial - Speech Synthesis System
Tutorial - Speech Synthesis SystemTutorial - Speech Synthesis System
Tutorial - Speech Synthesis System
 
Using ontology based context in the
Using ontology based context in theUsing ontology based context in the
Using ontology based context in the
 
An Overview Of Natural Language Processing
An Overview Of Natural Language ProcessingAn Overview Of Natural Language Processing
An Overview Of Natural Language Processing
 
srinu.pptx
srinu.pptxsrinu.pptx
srinu.pptx
 
SECOND LANGUAGE RESEARCH.pptx
SECOND LANGUAGE RESEARCH.pptxSECOND LANGUAGE RESEARCH.pptx
SECOND LANGUAGE RESEARCH.pptx
 
Semantic Rules Representation in Controlled Natural Language in FluentEditor
Semantic Rules Representation in Controlled Natural Language in FluentEditorSemantic Rules Representation in Controlled Natural Language in FluentEditor
Semantic Rules Representation in Controlled Natural Language in FluentEditor
 
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
International Journal on Natural Language Computing (IJNLC) Vol. 4, No.2,Apri...
 
Jq3616701679
Jq3616701679Jq3616701679
Jq3616701679
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
**JUNK** (no subject)
**JUNK** (no subject)**JUNK** (no subject)
**JUNK** (no subject)
 
Natural Language Processing: State of The Art, Current Trends and Challenges
Natural Language Processing: State of The Art, Current Trends and ChallengesNatural Language Processing: State of The Art, Current Trends and Challenges
Natural Language Processing: State of The Art, Current Trends and Challenges
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
arttt.pdf
arttt.pdfarttt.pdf
arttt.pdf
 

Plus de RIILP

Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD RIILP
 
Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic RIILP
 
Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones RIILP
 
Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones RIILP
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO RIILP
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic RIILP
 
Tony O'Dowd - KantanMT
Tony O'Dowd -  KantanMT Tony O'Dowd -  KantanMT
Tony O'Dowd - KantanMT RIILP
 
Santanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARSantanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARRIILP
 
Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU RIILP
 
Anna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMAAnna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMARIILP
 
Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD  Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD RIILP
 
Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW RIILP
 
Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA RIILP
 
Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU RIILP
 
Liling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARLiling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARRIILP
 
Sandra de luca - Acclaro
Sandra de luca - AcclaroSandra de luca - Acclaro
Sandra de luca - AcclaroRIILP
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015RIILP
 
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015RIILP
 
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015RIILP
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015RIILP
 

Plus de RIILP (20)

Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD Gabriella Gonzalez - eTRAD
Gabriella Gonzalez - eTRAD
 
Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic Manuel Herranz - Pangeanic
Manuel Herranz - Pangeanic
 
Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones Carla Parra Escartin - ER2 Hermes Traducciones
Carla Parra Escartin - ER2 Hermes Traducciones
 
Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones Juanjo Arevelillo - Hermes Traducciones
Juanjo Arevelillo - Hermes Traducciones
 
Gianluca Giulinin - FAO
Gianluca Giulinin - FAO Gianluca Giulinin - FAO
Gianluca Giulinin - FAO
 
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
Lianet Sepulveda & Alexander Raginsky - ER 3a & ER 3b Pangeanic
 
Tony O'Dowd - KantanMT
Tony O'Dowd -  KantanMT Tony O'Dowd -  KantanMT
Tony O'Dowd - KantanMT
 
Santanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAARSantanu Pal - ESR 2 USAAR
Santanu Pal - ESR 2 USAAR
 
Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU Chris Hokamp - ESR 9 DCU
Chris Hokamp - ESR 9 DCU
 
Anna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMAAnna Zaretskaya - ESR 1 UMA
Anna Zaretskaya - ESR 1 UMA
 
Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD  Carolina Scarton - ESR 7 - USFD
Carolina Scarton - ESR 7 - USFD
 
Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW Rohit Gupta - ESR 4 - UoW
Rohit Gupta - ESR 4 - UoW
 
Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA Hernani Costa - ESR 3 - UMA
Hernani Costa - ESR 3 - UMA
 
Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU Liangyou Li - ESR 8 - DCU
Liangyou Li - ESR 8 - DCU
 
Liling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAARLiling Tan - ESR 5 USAAR
Liling Tan - ESR 5 USAAR
 
Sandra de luca - Acclaro
Sandra de luca - AcclaroSandra de luca - Acclaro
Sandra de luca - Acclaro
 
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
ER1 Eduard Barbu - EXPERT Summer School - Malaga 2015
 
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
ESR1 Anna Zaretskaya - EXPERT Summer School - Malaga 2015
 
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
ESR2 Santanu Pal - EXPERT Summer School - Malaga 2015
 
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
ESR3 Hernani Costa - EXPERT Summer School - Malaga 2015
 

13. Constantin Orasan (UoW) Natural Language Processing for Translation

  • 1. Natural Language Processing for Translation Constantin Orasan University of Wolverhampton, UK
  • 2. Structure 1. How NLP can help machine translation 2. How information retrieval can help machine translation 3. View from the industry
  • 3. MT is difficult  Language is ambiguous at all levels:  Lexical: bank, file, chair  Syntactic: John saw the man with the telescope.  Semantic: The rabbit is ready for lunch.  Discourse: John hid Bill’s car keys. He was drunk.  Pragmatics: You owe me £20.  It gets even more difficult when we start working in multilingual settings
  • 4. Steps of processing: Tokenisation, Morphology, Syntax, Lexical Semantics, Discourse, Pragmatics. NLP complexity increases in this order (and in general the accuracy of methods decreases).
  • 5. General NLP frameworks: GATE (General Architecture for Text Engineering, http://gate.ac.uk), a framework written in Java: first designed for information extraction tasks; developed into a robust framework that offers processing at different linguistic levels in many languages; provides wrappers for other tools such as Weka, LingPipe, etc. NLTK: a set of NLP modules written in Python, with an emphasis on teaching. LingPipe (http://alias-i.com/lingpipe/): a Java toolkit for language processing. OpenNLP (http://opennlp.apache.org/): an ML-based toolkit for language processing. … the list can go on.
  • 6. Vauquois triangle (figure): direct translation (source text to target text); shallow (syntactic) transfer (source syntax to target syntax); deep (semantic) transfer (source semantics to target semantics); interlingua at the apex.
  • 7. NLP in SMT  Can benefit from linguistic information, but many of the existing models are largely data driven and do not incorporate much linguistic information  See the lecture on Statistical MT: Word, Phrase and Tree Based Models (overview) Khalil Sima'an (UvA) and Trevor Cohn (USFD)
  • 8. NLP in EBMT: In many cases it requires some kind of linguistic information. See the lecture on Example Based Machine Translation by Josef van Genabith (DCU) and Khalil Sima'an (UvA).
  • 9. NLP in TM: The existing TM solutions do not rely on much linguistic information. Second- and third-generation TMs rely on linguistic input. See the lecture on Translation Memories by Ruslan Mitkov (UoW), Manuel Arcedillo (Hermes) and Juanjo Arevalillo (Hermes).
  • 10. But there are many other ways in which we could improve the results of translation engines by incorporating linguistic information
  • 11. Improve tokenisation  For European languages tokenisation is considered more or less a simple problem.  In non-segmented languages (such as many oriental ones), identification of tokens is extremely complex  Tokens do not have explicit boundaries (written directly adjacent to one another with no whitespace between them).  Practically all the characters can be one-character words in themselves, but they can also be joined together to form multicharacter words.  Even in segmented languages like English, identification of tokens can be difficult.
  • 12. Tokenisation may not be so easy. Even in segmented languages like English, where tokens are usually separated by whitespace and punctuation, there are problems. Abbreviations: when full stops follow abbreviations, they should be merged with the abbreviation to form one token (e.g. etc., yrs., Mr.). Multiple strings separated by white space can in fact form one token (e.g. numerical expressions in French: 1 200 000). Hyphenation can be ambiguous: sometimes the hyphen is part of the word segment (e.g. self-assessment, F-16, forty-two), sometimes not (e.g. London-based). Additional challenges: numerical and special expressions (dates, measures, email addresses); language-specific rules for contracting words and phrases (e.g. can’t and won’t, unlike O'Brien, contain multiple tokens with no white space between them); ambiguous punctuation (e.g. “.” in yrs., 05.11.08).
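The tokenisation problems listed on this slide can be illustrated with a minimal rule-based sketch. It is not taken from the lecture: the abbreviation list, the regular expression and the function name `tokenise` are illustrative assumptions, and a real tokeniser would need far larger resources. The sketch keeps known abbreviations together with their full stop, treats space-grouped numbers and dotted dates as single tokens, and always keeps hyphenated words whole, so it does not resolve the hyphenation ambiguity mentioned above.

```python
import re

# Hypothetical, hand-picked abbreviation list; a real system would use a lexicon.
ABBREVIATIONS = {"etc.", "yrs.", "Mr."}

# Alternatives are tried left to right, so the more specific patterns come first.
TOKEN_PATTERN = re.compile(
    r"""
    \d{1,3}(?:\ \d{3})+       # numbers grouped with spaces, e.g. 1 200 000
  | \d{2}\.\d{2}\.\d{2,4}     # dotted dates such as 05.11.08
  | \w+(?:[-'’]\w+)*\.?       # words with internal hyphens/apostrophes, optional final full stop
  | [^\w\s]                   # any other single punctuation mark
    """,
    re.VERBOSE,
)


def tokenise(text):
    """Split text into tokens, keeping the full stop only on known abbreviations."""
    tokens = []
    for match in TOKEN_PATTERN.finditer(text):
        token = match.group()
        if len(token) > 1 and token.endswith(".") and token not in ABBREVIATIONS:
            # Not an abbreviation: the full stop is a separate token.
            tokens.extend([token[:-1], "."])
        else:
            tokens.append(token)
    return tokens


print(tokenise("Mr. Smith paid 1 200 000 euros on 05.11.08, etc."))
print(tokenise("It is self-assessment."))
```

The number rule is deliberately naive: it would also merge two unrelated numbers that happen to be adjacent, which is exactly the kind of language-specific decision the slide warns about.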
  • 13. Why is tokenisation important in MT? It will influence any task that requires a dictionary/gazetteer lookup. It can influence how words are aligned. Abbreviations were shown to help SMT (Li and Yarowsky, 2008). Translation of named entities (n.b. NER is seen as part of tokenisation).
  • 14. Abbreviations: Unseen abbreviations are treated as unknown words and left untranslated. Modern Chinese is a highly abbreviated language and 20% of sentences in a newspaper article contain an abbreviation. The way abbreviations are formed follows much more complex rules than in English. Li and Yarowsky (2008) propose an unsupervised method for extracting relations between full-form phrases and their abbreviations.
  • 15. Li and Yarowsky (2008): Step 1: Identification of English entities. Step 2: Translate the entities into Chinese using a baseline translator. Step 3: Full-form/abbreviation relations are extracted on the basis of co-occurrence in a Chinese monolingual corpus. Step 4: Translation induction for Chinese abbreviations. Step 5: Integration with the baseline translation system. Evaluation shows that BLEU scores improve.
  • 16. Translation of named entities: Incorrect NE translation can seriously harm the quality of translation. There are 2 main sources of problems. Ambiguity: NEs are normally composed of words which can be translated in isolation. Sparsity: some named entities are very sparse. Integration of NEs into the translation model leads to various results, ranging from significant improvements to small improvements and even negative impact.
  • 17. NE in SMT: The main approach in SMT is to determine the NEs in a text and translate them using an external model. Then they are: used as the default translation (Li et al., 2009); added dynamically to compete with other translations (Turchi et al., 2012; Bouamor et al., 2012); or not used, leaving the original NE untranslated (Tinsley et al., 2012). Nikoulina et al. (2012) propose replacing NEs with placeholders in order to reduce sparsity, in this way learning a better model.
  • 18. Nikoulina et al. (2012): 1. The named entities are detected and replaced with placeholders to produce reduced sentences. 2. A reduced translation model is used to translate the reduced sentences. 3. An external NE translator is employed. 4. The translated NEs are reinserted in the reduced translations. The disadvantage of the approach is that the framework is loosely dependent on the SMT task; a post-processing step is applied to the output of the NER, together with a prediction model to determine which NEs can be safely translated.
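The four steps on this slide can be sketched as a toy pipeline. This is only an illustration of the placeholder idea, not the authors' system: the gazetteer standing in for the NER, the word-for-word dictionary standing in for the reduced translation model, the NE dictionary, and all function names are hypothetical. The post-processing step and the prediction model mentioned on the slide are left out.

```python
# Hypothetical resources standing in for real components (English to French).
GAZETTEER = {"John Smith": "PER", "London": "LOC"}   # toy NER: gazetteer lookup
REDUCED_MODEL = {"lives": "habite", "in": "à"}       # toy reduced translation model
NE_DICTIONARY = {"London": "Londres"}                # toy external NE translator


def reduce_sentence(sentence):
    """Step 1: replace each detected NE with a typed placeholder."""
    mapping = {}
    reduced = sentence
    # Longest entries first, so multi-word NEs are replaced as a whole.
    for entity in sorted(GAZETTEER, key=len, reverse=True):
        if entity in reduced:
            placeholder = f"__{GAZETTEER[entity]}_{len(mapping)}__"
            reduced = reduced.replace(entity, placeholder)
            mapping[placeholder] = entity
    return reduced, mapping


def translate_reduced(reduced):
    """Step 2: translate the reduced sentence; placeholders pass through unchanged."""
    return " ".join(REDUCED_MODEL.get(token, token) for token in reduced.split())


def translate_ne(entity):
    """Step 3: external NE translation, copying the NE when no entry exists."""
    return NE_DICTIONARY.get(entity, entity)


def translate(sentence):
    reduced, mapping = reduce_sentence(sentence)
    output = translate_reduced(reduced)
    # Step 4: reinsert the translated NEs in place of the placeholders.
    for placeholder, entity in mapping.items():
        output = output.replace(placeholder, translate_ne(entity))
    return output


print(translate("John Smith lives in London"))
```

Because every person or location collapses to the same placeholder type during training, the reduced model sees fewer distinct word forms, which is the sparsity reduction the slide refers to.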
  • 19. MT and semantics: Noted as a problem for machine translation back in the late 1940s (Weaver, 1949): a word can often only be translated if you know the specific sense intended. Bar-Hillel (1960) posed the following problem: “Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.” Is “pen” a writing instrument, an enclosure where children play, or an enclosure for livestock? …declared it unsolvable, and left the field of MT…
  • 21. Lexical divergence: solution? Domain-specific dictionaries can improve the quality of translation in post-editing environments. Word sense disambiguation (WSD) is the ability to identify the meaning of words in context in a computational manner. WSD is seen as a more general solution to lexical divergence, but WSD is an AI-complete problem.
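To make the idea of WSD for lexical choice concrete, here is a minimal sketch using a simplified Lesk-style gloss overlap, a technique chosen for illustration rather than one named on the slide. The sense inventory, glosses, stopword list and French translations are hand-written assumptions; real systems rely on resources such as WordNet or HowNet and on much richer context models.

```python
# Hypothetical stopword list and sense inventory for the ambiguous word "bank".
STOPWORDS = {"a", "an", "the", "of", "on", "and", "that", "such", "as", "in", "to"}

SENSES = {
    "bank": {
        "financial": {
            "gloss": "an institution that accepts deposits and lends money",
            "fr": "banque",
        },
        "river": {
            "gloss": "sloping land beside a body of water such as a river",
            "fr": "rive",
        },
    }
}


def content_words(text):
    """Lower-case the words, strip punctuation and drop stopwords."""
    return {word.strip(".,;!?").lower() for word in text.split()} - STOPWORDS


def disambiguate(word, context):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    return max(
        SENSES[word],
        key=lambda sense: len(context_words & content_words(SENSES[word][sense]["gloss"])),
    )


def translate_word(word, context):
    """Choose the target-language word that belongs to the selected sense."""
    return SENSES[word][disambiguate(word, context)]["fr"]


print(translate_word("bank", "He sat on the bank of the river and watched the water."))
print(translate_word("bank", "She deposits money in the bank every month."))
```

The sketch only works when the context happens to share words with a gloss; Bar-Hillel's "pen" example is hard precisely because the deciding knowledge is not in the surrounding words.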
  • 22. Carpuat and Wu (2005): SMT models only rely on local context to choose among lexical translation candidates. The assumption is that a dedicated WSD module can help the translation process. Use a baseline Chinese-to-English translation engine and a state-of-the-art Chinese WSD system: WSD incorporated in the decoder; WSD incorporated in a post-processor; translation obtained using the English gloss of HowNet. WSD does not help a typical SMT system, but this is mainly due to the fact that current SMT systems (2005) cannot take advantage of the sense information.
  • 23. Chan et al. (2007)  Successfully integrate WSD in Hiero, a state-of-the-art Chinese to English hierarchical phrase-based MT system  Introduce two additional features in the MT model at the decoding stage that take into consideration that some words were chosen by the WSD system  The improvement noticed is modest, but statistically significant  Carpuat and Wu (2007) find similar results, but instead of WSD they perform fully phrasal multi-word disambiguation and their disambiguation system is tightly integrated in the SMT engine
  • 24. NLP in evaluation of MT: Evaluation metrics like BLEU treat any divergence from the reference translation as a mistake. Several alternative metrics were proposed to address this problem: METEOR (Denkowski and Lavie, 2010) accounts for synonyms and paraphrases; meaning equivalence can be calculated using bidirectional textual entailment (Pado et al., 2009); semantic role labels can be used (Gimenez and Marquez, 2007); TINE (Rios et al., 2011) measures the similarity between sentences using shallow semantic representation.
  • 25. Other NLP applications which could be useful: automatic terminology extraction; automatic extraction of ontologies; automatic compilation of (parallel/comparable) corpora; use of parallel corpora to train various systems.
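The point about exact matching can be illustrated with a small sketch. It computes clipped unigram precision, which is only one ingredient of BLEU, and then repeats the computation with a hand-written synonym table, loosely in the spirit of synonym-aware metrics. It implements neither BLEU nor METEOR; the synonym table, sentences and function name are assumptions made for illustration.

```python
from collections import Counter


def unigram_precision(hypothesis, reference, synonyms=None):
    """Clipped unigram precision; optionally accept listed synonyms as matches."""
    synonyms = synonyms or {}
    remaining = Counter(reference.lower().split())
    hyp_tokens = hypothesis.lower().split()
    matched = 0
    for token in hyp_tokens:
        # Try the exact word first, then any of its synonyms.
        for candidate in [token] + sorted(synonyms.get(token, ())):
            if remaining[candidate] > 0:
                remaining[candidate] -= 1   # each reference word can be matched once
                matched += 1
                break
    return matched / len(hyp_tokens) if hyp_tokens else 0.0


reference = "the boy bought a car"
hypothesis = "the lad purchased a car"
# Hypothetical synonym table; a real metric would use a lexical resource.
synonyms = {"lad": {"boy"}, "purchased": {"bought"}}

print(unigram_precision(hypothesis, reference))            # exact matching only
print(unigram_precision(hypothesis, reference, synonyms))  # synonym-aware matching
```

With exact matching the acceptable paraphrase is penalised for two of its five words, while the synonym-aware variant gives it full credit.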
  • 26. References: Bouamor, D., Semmar, N., & Zweigenbaum, P. (2012). Identifying multiword expressions in statistical machine translation. In Proceedings of LREC 2012. Carpuat, M., & Wu, D. (2007). Improving Statistical Machine Translation Using Word Sense Disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (pp. 61–72). Prague, Czech Republic. Retrieved from http://acl.ldc.upenn.edu/D/D07/D07-1007.pdf Carpuat, M., & Wu, D. (2005). Word Sense Disambiguation vs. Statistical Machine Translation. In Proceedings of the 43rd Annual Meeting of the ACL (pp. 387–394). Retrieved from http://acl.ldc.upenn.edu/P/P07/P071005.pdf Chan, Y. S., Ng, H. T., & Chiang, D. (2007). Word Sense Disambiguation Improves Statistical Machine Translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (pp. 33–40).
  • 27. References: Li, M., Zhang, J., Zhou, Y., & Chengqing, Z. (2009). The CASIA statistical machine translation system for IWSLT 2009. In Proceedings of IWSLT 2009. Li, Z., & Yarowsky, D. (2008). Unsupervised Translation Induction for Chinese Abbreviations using Monolingual Corpora. In Proceedings of ACL-08 (pp. 425–433). Columbus, Ohio, USA. Retrieved from http://aclweb.org/anthology//P/P08/P08-1049.pdf Navigli, R. (2009). Word sense disambiguation. ACM Computing Surveys, 41(2), 1–69. doi:10.1145/1459352.1459355 Nikoulina, V., Sandor, A., & Dymetman, M. (2012). Hybrid Adaptation of Named Entity Recognition for Statistical Machine Translation. In Second ML4HMT Workshop (pp. 1–16). Pado, S., Galley, M., Jurafsky, D., & Manning, C. (2009). Robust Machine Translation Evaluation with Entailment Features. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP (pp. 297–305).
  • 28. References: Rios, M., Aziz, W., & Specia, L. (2011). TINE: A Metric to Assess MT Adequacy. In Proceedings of the 6th Workshop on Statistical Machine Translation (WMT-2011), July, Edinburgh, UK. Tinsley, J., Ceausu, A., & Zhang, J. (2012). PLUTO: automated solutions for patent translation. In EACL Joint Workshop on Exploiting Synergies between Information Retrieval and Machine Translation (ESIRMT) and Hybrid Approaches to Machine Translation (HyTra): Proceedings of the workshop, EACL 2012. Turchi, M., Atkinson, M., Wilcox, A., Crawley, B., Bucci, S., Steinberger, R., & Van der Goot, E. (2012). ONTS: "Optima" news translation system. In Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics.