The document proposes two approaches to better exploit source-side monolingual data in neural machine translation when parallel corpora are scarce: (1) a self-learning method that generates synthetic parallel data to augment the training set, and (2) a multi-task learning framework that jointly trains two NMT models, one predicting the translation and the other predicting the reordered source sentence. Experiments on Chinese-English translation show significant BLEU improvements over a strong attention-based NMT baseline, with gains growing as more monolingual data is added.
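The self-learning step can be sketched as a simple data-augmentation loop: a baseline model translates the monolingual source sentences, and the resulting synthetic pairs are mixed with the real parallel corpus before retraining. In this minimal sketch, `baseline_translate` is a hypothetical stand-in for a trained NMT decoder, not an API from the paper:

```python
def self_learning_augment(parallel, monolingual, translate):
    """Return the real parallel pairs plus synthetic pairs whose
    target sides come from the baseline model's translations."""
    synthetic = [(src, translate(src)) for src in monolingual]
    return parallel + synthetic

# Hypothetical placeholder for a trained NMT model's decode step.
def baseline_translate(src):
    return "<hyp> " + src

parallel = [("zh sentence 1", "en sentence 1")]
monolingual = ["zh sentence 2", "zh sentence 3"]
augmented = self_learning_augment(parallel, monolingual, baseline_translate)
# The augmented corpus now holds one real and two synthetic pairs.
```

In practice the synthetic pairs are typically down-weighted or the synthetic target side is frozen during training, since the machine-generated translations are noisier than the human references.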