Attention mechanism
in Neural Machine Translation
PHAM QUANG KHANG
Machine Translation task in NLP
1. Definition: translating text in one language into another language
2. Evaluation datasets: public datasets consisting of sentence pairs in two languages (a source
language and a target language)
a. Main pairs in research: Eng-Fra, Eng-Ger
b. Vietnamese: Eng-Vi, 133k sentence pairs
PHAM QUANG KHANG 2
Figure: French-to-English translations from newstest2014 (Artetxe et al., 2018)
Example pair from the Eng-to-Vi dataset in IWSLT 15 (133K sentence pairs):
Source: And of course, we all share the same adaptive imperatives.
Reference: Và tất nhiên, tất cả chúng ta đều trải qua quá trình phát triển và thích nghi như nhau.
BLEU score: standard evaluation for MT
1. Definition: BLEU compares the n-grams of the candidate translation with the n-grams of the reference
translation and counts the number of matches (Papineni et al. 2002)
2. Calculation: modified n-gram precision
p_n = Σ_C Σ_{n-gram ∈ C} Count_clip(n-gram) / Σ_C Σ_{n-gram ∈ C} Count(n-gram)
BLEU = BP · exp( Σ_{n=1..N} w_n · log p_n ), where BP is the brevity penalty
PHAM QUANG KHANG 3
Example:
Candidate: the the the the the the the
Reference: The cat is on the mat
Count_clip(the) = 2 (the reference contains "the" only twice), Count(the) = 7
Modified unigram precision = 2/7
Papineni et al., BLEU: a Method for Automatic Evaluation of Machine Translation
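A minimal sketch of this modified unigram precision in Python (a pure-Python illustration, not the official BLEU script; the function name modified_precision is only for this example):

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Modified n-gram precision (Papineni et al. 2002): each candidate
    n-gram is credited at most as many times as it occurs in the reference."""
    cand = Counter(zip(*[candidate[i:] for i in range(n)]))
    ref = Counter(zip(*[reference[i:] for i in range(n)]))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())  # Count_clip
    return clipped / sum(cand.values())

candidate = "the the the the the the the".split()
reference = "The cat is on the mat".lower().split()
print(modified_precision(candidate, reference))  # 2/7 ≈ 0.286
```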
NMT as a hot topic for research
 The number of published papers on NMT has spiked since last year
 Key players:
 Facebook: tackling low-resource languages (Turkish, Vietnamese …)
 Amazon: improving efficiency
 Google: improving NMT output quality
 Business need is higher than ever, since automatic translation can save massive costs for global firms
PHAM QUANG KHANG 4
Number of NMT papers in the last few years
Source: https://slator.com/technology/google-facebook-amazon-neural-machine-translation-just-had-its-busiest-month-ever/
Papers counted from arXiv
Main recent approaches for NMT
1. Recurrent Neural Networks (RNNs):
 Use LSTMs, GRUs, and bidirectional RNNs for the encoder and decoder
2. Attention with RNNs: use attention between encoder and decoder
 Use an attention mechanism while decoding to improve the ability to capture long-term dependencies
3. Attention only: Transformer
 Use only attention, both to capture long-term dependencies and to reduce computation cost
PHAM QUANG KHANG 5
RNNs: processing sequential data
 Recurrent Neural Network (RNN): a neural network that maintains a hidden state h; at each time
step t, the hidden state h_t is computed from the input x_t and the previous hidden state h_t-1 (see the sketch below)
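A minimal NumPy sketch of that update, here as h_t = tanh(W_x x_t + W_h h_t-1 + b) (tanh is one common choice of activation; the weights and toy dimensions are random values purely for illustration):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # h_t depends only on the current input x_t and the previous hidden state h_{t-1}
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 3                    # toy sizes
W_x = rng.normal(size=(hidden_dim, input_dim))
W_h = rng.normal(size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)                        # initial hidden state h_0
for x_t in rng.normal(size=(5, input_dim)):     # a sequence of 5 input vectors
    h = rnn_step(x_t, h, W_x, W_h, b)           # the hidden state carries context forward
print(h)
```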
PHAM QUANG KHANG 6
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Luong et al. 2015
RNNs: processing sequential data
 Recurrent Neural Network (RNN): a neural network that maintains a hidden state h; at each time
step t, the hidden state h_t is computed from the input x_t and the previous hidden state h_t-1
RNNs have shown promising results: RNN-based systems achieved performance close to that of
state-of-the-art conventional phrase-based machine translation on the English-to-French task.
PHAM QUANG KHANG 7
Luong et al. 2015
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
BLEU scores of several architectures on the Eng-Ger dataset (Luong et al. 2015)
Attention: aligning while translating
 Intuition: each time the model generates a word in the translation, it searches for a set of
positions in the source sentence where the most relevant information is concentrated
(Bahdanau et al. 2015)
 Advantages over pure RNN encoder-decoders:
1. The whole input is not encoded into a single vector => less information is lost
2. The model can adaptively select which source positions to attend to
PHAM QUANG KHANG 8
Bahdanau et al. 2015
https://github.com/tensorflow/nmt
Attention mechanism
 When decoding to predict an output word, calculate a score between the hidden vector of the current
decoder state and all hidden vectors of the input sentence (a sketch follows below)
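A minimal NumPy sketch of this scoring step using the dot-product score in the style of Luong et al. (variable names are illustrative, not taken from any particular codebase):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dot_attention(decoder_state, encoder_states):
    """decoder_state: (d,) current decoder hidden vector h_t.
    encoder_states: (src_len, d) hidden vectors of all source positions."""
    scores = encoder_states @ decoder_state   # score(h_t, h_s) = h_t . h_s
    weights = softmax(scores)                 # alignment weights a_t(s)
    context = weights @ encoder_states        # context vector c_t = sum_s a_t(s) h_s
    return weights, context

rng = np.random.default_rng(0)
enc = rng.normal(size=(6, 8))                 # 6 source words, hidden size 8
dec = rng.normal(size=8)                      # current decoder hidden state
w, c = dot_attention(dec, enc)
print(w.round(3), c.shape)                    # weights sum to 1; context has shape (8,)
```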
PHAM QUANG KHANG 9
https://colab.research.google.com/github/tensorflow/tensorflow/blob/master/tensorflow/contrib/eager/python/examples/nmt_with_attention/nmt_with_attention.ipynb#scrollTo=TNfHIF71ulLu
Luong et al. 2015
Attention for long sentence translation
Bahdanau et al.: compared to a pure RNN decoder on the same dataset, tested on the combined WMT'12 and WMT'13 test sets
Thang Luong et al.: tested on the WMT'14 Eng-Ger test set
PHAM QUANG KHANG 10
Attention proved better than a plain RNN decoder for long-sentence translation
Google NMT system
Attention between encoder and decoder
PHAM QUANG KHANG 11
Wu et al., Google’s Neural Machine Translation System: Bridging the Gap
between Human and Machine Translation
Self-attention for representation learning
 The essence of the encoder is to create a representation of the input
 Self-attention connects words within one sentence while shortening the path between them,
which is what really differentiates it from RNNs and CNNs
 Input and output: the queries, keys, and values all come from the same source (see the sketch below)
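A minimal NumPy sketch of single-head self-attention, where queries, keys, and values are all projections of the same input sentence (the projection matrices here are random, purely for illustration):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d) representations of ONE sentence.
    Q, K and V are all computed from the same X - that is what makes it self-attention."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # every word scores every other word
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # softmax over positions
    return w @ V                                         # new, context-aware word representations

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                             # 5 words, d_model = 16
W_q, W_k, W_v = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)            # (5, 16): any two words are one step apart
```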
PHAM QUANG KHANG 12
http://deeplearning.hatenablog.com/entry/transformer
Lin et al. A Structured Self-Attentive Sentence Embedding
Transformer: Attention is all you need
1. Uses self-attention instead of RNNs or CNNs
a. Multi-head attention for self-attention and source-target attention
b. Position-wise feed-forward network after attention
c. Masked multi-head attention to prevent target words from attending to "future" words
d. Word embedding + positional encoding (sketch after this list)
2. Reduces the total computational complexity per layer
3. Increases the amount of computation that can be parallelized
4. Enhances the ability to learn long-range dependencies
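As an illustration of item 1d, a minimal NumPy sketch of the sinusoidal positional encoding from the paper (toy dimensions, assuming an even d_model):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE(pos, 2i) = sin(pos / 10000^(2i/d_model)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)).
    The result is added to the word embeddings so the model knows token order."""
    pos = np.arange(max_len)[:, None]                 # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even dimensions
    pe[:, 1::2] = np.cos(angles)                      # odd dimensions
    return pe

print(positional_encoding(max_len=50, d_model=16).shape)  # (50, 16)
```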
PHAM QUANG KHANG 13
An architecture based solely on attention mechanisms
Attention Is All You Need
Multi-head attention
 Attention itself is only a scaled dot product between the queries and the keys
 Each (V, K, Q) is projected into multiple sets of (v, k, q) so that different heads can learn different relations (multi-head)
 The head outputs are concatenated and then linearly transformed (sketch below)
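A minimal NumPy sketch of the project / split-into-heads / attend / concatenate / linear-transform flow (random weights and toy dimensions purely for illustration; a real implementation learns these projections):

```python
import numpy as np

def scaled_dot_attention(Q, K, V):
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(K.shape[-1])   # scaled dot product
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                       # softmax over source positions
    return w @ V

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o, n_heads):
    """Project Q, K, V, split each projection into n_heads smaller (q, k, v) sets,
    attend in every head independently, then concatenate and apply the final linear map W_o."""
    def split(X, W):
        return (X @ W).reshape(X.shape[0], n_heads, -1).swapaxes(0, 1)  # (heads, seq, d_head)
    heads = scaled_dot_attention(split(Q, W_q), split(K, W_k), split(V, W_v))
    concat = heads.swapaxes(0, 1).reshape(Q.shape[0], -1)               # concatenate the heads
    return concat @ W_o                                                 # final linear transform

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
X = rng.normal(size=(5, d_model))                                       # 5 words
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
print(multi_head_attention(X, X, X, W_q, W_k, W_v, W_o, n_heads).shape) # (5, 16)
```

Calling it with Q = K = V = X (as above) gives encoder self-attention; with Q taken from the decoder and K = V from the encoder output it becomes the source-target attention.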
PHAM QUANG KHANG 14
State-of-the-art in MT
 Outperforms the best reported models (at the time of the paper) by more than 2.0 BLEU, at a
training cost far lower than those models
PHAM QUANG KHANG 15
Attention Is All You Need
Visualization of self-attention in transformer
PHAM QUANG KHANG 16
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
The model recognizes what "it" refers to (left: the animal, right: the street)
Potential of MT and Transformer
1. Customer-oriented applications:
a. Text-to-text translation: Google Translate, Facebook translation…
=> Need better engines for rare and difficult languages such as Japanese, Vietnamese, Arabic…
b. Speech-to-text, speech-to-speech …
2. B2B applications:
a. Full-text translation (company documentation ….)
b. Domain-specific real-time translation: for meetings, for workflow automation…
3. Scientific:
a. Attention and self-attention as an approach for other tasks such as language modeling and question
answering (BERT from Google: AI outperforms humans in question answering)
PHAM QUANG KHANG 17
References
1. Vaswani et al., Attention Is All You Need, NIPS 2017
2. Bahdanau et al., Neural Machine Translation by Jointly Learning to Align and Translate
3. Lin et al., A Structured Self-Attentive Sentence Embedding
4. Ian Goodfellow et al., Deep Learning, 2016
5. IWSLT 2015 dataset for English-Vietnamese: https://wit3.fbk.eu/mt.php?release=2015-01
6. Yonghui Wu et al., Google's Neural Machine Translation System: Bridging the Gap between
Human and Machine Translation
7. https://machinelearningmastery.com/introduction-neural-machine-translation/
8. http://deeplearning.hatenablog.com/entry/transformer
9. https://medium.com/the-new-nlp/ai-outperforms-humans-in-question-answering-70554f51136b
PHAM QUANG KHANG 18