A brief introduction to the attention mechanism and its application in neural machine translation (NMT), in particular the Transformer, where attention is used to remove RNNs from NMT entirely.
2. Machine Translation task in NLP
1. Definition: translating text from one language to another
2. Evaluation datasets: public datasets consisting of sentence pairs in two languages (source
language and target language)
a. Most common in research: English-French, English-German
b. Vietnamese: English-Vietnamese, 133k sentence pairs
PHAM QUANG KHANG 2
[Figure: French-to-English translations from newstest2014 (Artetxe et al., 2018)]
Example from the English-to-Vietnamese dataset in IWSLT'15 (133K sentence pairs):
Source: And of course, we all share the same adaptive imperatives.
Reference: Và tất nhiên, tất cả chúng ta đều trải qua quá trình phát triển và thích nghi như nhau.
3. BLEU score: standard evaluation for MT
1. Def: the BLEU score compares n-grams of the candidate with n-grams of the reference
translation and counts the number of matches (Papineni et al. 2002)
2. Calculation:
Example:
Candidate: the the the the the the the
Reference: the cat is on the mat
Count_clip(the) = 2, Count(the) = 7
Modified unigram precision = 2/7
Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation
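The clipped counting in this example can be sketched in a few lines of Python (function and variable names are ours, not from the paper):

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Modified n-gram precision from the BLEU paper: each candidate
    n-gram count is clipped by its maximum count in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand = ngrams(candidate, n)
    ref = ngrams(reference, n)
    # Clip each candidate n-gram's count by its count in the reference.
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

# The slide's example: every candidate word appears in the reference,
# but clipping caps "the" at its reference count of 2.
cand = "the the the the the the the".split()
ref = "The cat is on the mat".lower().split()
print(modified_precision(cand, ref))  # 2/7 ≈ 0.2857
```

Full BLEU additionally combines the precisions for n = 1..4 and multiplies by a brevity penalty; the clipped precision above is the core idea.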
4. NMT as a hot topic for research
The number of published papers on NMT spiked last year
Key players:
Facebook: tackling low-resource languages (Turkish, Vietnamese, …)
Amazon: improving efficiency
Google: improving the quality of NMT output
Business demand is higher than ever, since automatic translation can save global firms massive costs
[Figure: number of NMT papers in the last few years]
Source: https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/
Papers counted from arXiv
5. Main approaches for NMT recently
1. Recurrent Neural Networks (RNNs):
Using LSTMs, GRUs, and bidirectional RNNs for the encoder and decoder
2. Attention with RNNs: use attention between the encoder and decoder
Using the attention mechanism while decoding improves the ability to capture long-term
dependencies
3. Attention only: the Transformer
Uses only attention, both to capture long-term dependencies and to reduce computation cost
6. RNNs: processing sequential data
Recurrent Neural Network (RNN): a neural network with a hidden state h; at each time
step t, h_t is computed from the input at t and the previous hidden state h_(t-1)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Luong et al. 2015
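The update rule above can be sketched with NumPy; the weight names (W_xh, W_hh, b) and sizes are illustrative, not taken from any particular paper:

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """One vanilla RNN step: h_t = tanh(W_xh @ x_t + W_hh @ h_prev + b)."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Unroll over a toy input sequence; the hidden state carries context forward.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
W_xh = rng.normal(size=(d_h, d_in)) * 0.1
W_hh = rng.normal(size=(d_h, d_h)) * 0.1
b = np.zeros(d_h)

h = np.zeros(d_h)
for x_t in rng.normal(size=(5, d_in)):  # sequence of 5 time steps
    h = rnn_step(x_t, h, W_xh, W_hh, b)
print(h.shape)  # (3,)
```

LSTMs and GRUs replace this single tanh update with gated updates, but the recurrence over time steps is the same.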
7. RNNs: processing sequential data
RNNs have shown promising results: RNN-based models achieved performance close to the
state of the art of conventional phrase-based machine translation on the English-to-French task.
Luong et al. 2015
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
[Figure: BLEU scores of several architectures on the English-German dataset (Luong et al. 2015)]
8. Attention: aligning while translating
Intuition: each time the proposed model generates a word in a translation, it searches for a
set of positions in the source sentence where the most relevant information is concentrated
(Bahdanau et al. 2015)
Advantages over pure RNNs:
1. Does not encode the whole input into a single vector => no information is lost
2. Allows the model to adaptively select which parts of the source it should attend to
Bahdanau et al. 2015
https://github.com/tensorflow/nmt
9. Attention mechanism
When decoding to predict an output word, compute a score between the decoder's current
hidden state and every hidden state of the input sentence
https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=TNfHIF71ulLu
Luong et al. 2015
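A minimal NumPy sketch of this scoring step, using the simple dot-product score from Luong et al. 2015 (the names and sizes are illustrative; Bahdanau et al. use a small feed-forward score instead):

```python
import numpy as np

def attention_context(dec_state, enc_states):
    """Dot-product attention: score each encoder hidden state against the
    current decoder state, normalize the scores with a softmax, and return
    the weighted sum of encoder states as the context vector."""
    scores = enc_states @ dec_state              # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax over source positions
    context = weights @ enc_states               # weighted sum of encoder states
    return context, weights

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 8))   # 6 source positions, hidden size 8
dec = rng.normal(size=8)        # current decoder hidden state
ctx, w = attention_context(dec, enc)
print(w.sum())  # 1.0 (the weights form a distribution over source words)
```

The context vector is then combined with the decoder state to predict the next output word.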
10. Attention for long sentence translation
[Figure: Bahdanau et al.: comparison with a pure RNN decoder on the same dataset, tested on the combined WMT'12 and WMT'13 sets]
[Figure: Luong et al.: tested on the WMT'14 English-German test set]
Attention proved to be better than a plain RNN decoder for long-sentence translation
11. Google NMT system
[Figure: attention between the encoder and decoder]
Wu et al., Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation
12. Self-attention for representation learning
The essence of the encoder is to create a representation of the input
Self-attention connects words within one sentence while shortening the path between
words, which clearly differentiates it from RNNs and CNNs
Input and output: the queries, keys and values all come from the same source
http://deeplearning.hatenablog.com/entry/transformer
Lin et al. A Structured Self-Attentive Sentence Embedding
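The "same source" point can be made concrete with a single-head sketch: queries, keys and values are all the input X itself, so any two positions interact in one step (path length 1, versus up to n steps in an RNN). This is an illustration of the idea, not code from any of the cited papers:

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention: Q = K = V = X, so every position
    attends directly to every other position in the same sentence."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # all pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                               # new representation per word

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))      # 4 words, embedding size 8
out = self_attention(X)
print(out.shape)  # (4, 8): one updated vector per input word
```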
13. Transformer: Attention is all you need
1. Uses self-attention instead of RNNs or CNNs
a. Multi-head Attention for self-attention and source-target
attention
b. Position-wise Feed-Forward layers after Attention
c. Masked Multi-head Attention to prevent target words from
attending to "future" words
d. Word embedding + Positional Encoding
2. Reduces total computational complexity per layer
3. Increases the amount of computation that can be parallelized
4. Enhances the ability to learn long-range dependencies
An architecture based solely on attention mechanisms
Attention Is All You Need
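Of the components above, the positional encoding (point 1.d) is easy to sketch directly; a minimal NumPy version of the paper's sinusoidal encoding:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need':
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))
    It is added to the word embeddings so that the (order-agnostic)
    attention layers can still use token positions."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dims get sine
    pe[:, 1::2] = np.cos(angles)                 # odd dims get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
```

Each position gets a unique pattern of wavelengths, and relative offsets correspond to linear functions of the encoding, which the paper argues makes relative positions easy to attend to.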
14. Multi-head attention
Attention here is just a scaled dot product between the queries and keys
Each (V, K, Q) is projected into multiple sets of (v, k, q) so that different heads can learn different relations (multi-head)
The head outputs are concatenated and then linearly transformed
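The three steps above can be sketched as follows; random matrices stand in for the learned projection parameters, and the sizes are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, n_heads, rng):
    """Multi-head attention sketch: project (Q, K, V) into n_heads smaller
    sets, apply scaled dot-product attention per head, then concatenate
    the heads and apply a final linear map."""
    seq_len, d_model = Q.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) * 0.1 for _ in range(3))
        q, k, v = Q @ Wq, K @ Wk, V @ Wv
        attn = softmax(q @ k.T / np.sqrt(d_k))   # scaled dot-product attention
        heads.append(attn @ v)
    concat = np.concatenate(heads, axis=-1)      # back to (seq_len, d_model)
    Wo = rng.normal(size=(d_model, d_model)) * 0.1
    return concat @ Wo                           # final linear transform

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 16))                     # self-attention: Q = K = V = X
out = multi_head_attention(X, X, X, n_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Splitting d_model across heads keeps the total cost similar to a single full-width head while letting each head attend to different relations.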
15. State-of-the-art in MT
Outperforms the best models reported at the time of the paper by more than 2.0 BLEU, at a
training cost far below theirs
Attention Is All You Need
16. Visualization of self-attention in transformer
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The model recognized what "it" refers to (left-hand side: the animal; right-hand side: the street)
17. Potential of MT and Transformer
1. Customer-oriented applications:
a. Text-to-text translation: Google Translate, Facebook's translation features, …
=> Need better engines for low-resource and difficult languages such as Japanese, Vietnamese, Arabic, …
b. Speech-to-text, speech-to-speech, …
2. B2B applications:
a. Full-text translation (company documentation, …)
b. Domain-specific real-time translation: for meetings, for workflow automation, …
3. Scientific:
a. Attention and self-attention as an approach to other tasks such as language modeling and question
answering (BERT from Google: AI outperforms humans in question answering)
18. References
1. Vaswani et al. Attention Is All You Need, NIPS 2017
2. Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate
3. Lin et al. A Structured Self-Attentive Sentence Embedding
4. Goodfellow et al. Deep Learning, 2016
5. IWSLT 2015 dataset for English-Vietnamese: https://wit3.fbk.eu/mt.php?release=2015-01
6. Wu et al. Google's Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation
7. https://machinelearningmastery.com/introduction-neural-machine-translation/
8. http://deeplearning.hatenablog.com/entry/transformer
9. https://medium.com/the-new-nlp/ai-outperforms-humans-in-question-answering-70554f51136b