DA 5330 – Advanced Machine Learning
Applications
Lecture 10 – Transformers
Maninda Edirisooriya
manindaw@uom.lk
Limitations of RNN Models
• Slow computation for longer sequences, as the computation cannot be done in parallel due to the dependencies between timesteps
• As there is a significant number of timesteps, the backpropagation depth increases, which worsens the Vanishing Gradient and Exploding Gradient problems
• As information from the history is passed as a hidden state vector, the amount of information is limited by the size of that vector
• As the information passed from the history gets updated at each time step, the history is forgotten after a number of time steps
Attention-based Models
• Instead of processing all time steps with the same weight, models that give certain time steps an exponentially higher weight while processing each time step performed well; these are known as Attention Models
• Though Attention Models were significantly better, their processing requirement (complexity) was quadratic (i.e. proportional to the square of the number of time steps), which was an extra slowdown
• However, the paper “Attention Is All You Need” by Vaswani et al. (2017) proposed that RNN units can be replaced with a higher-performance mechanism that keeps only the “Attention”
• This model is known as a Transformer Model
Transformer Model Architecture
(Figure: the Transformer architecture, with the Encoder on the left and the Decoder on the right)
Transformer Model
• The original paper defined this model (with both the Encoder and the Decoder) for the application of Natural Language Translation
• However, the Encoder and the Decoder were each used independently in some later models for different tasks
Source: https://pub.aimind.so/unraveling-the-power-of-language-models-understanding-llms-and-transformer-variants-71bfc42e0b21
Encoder Only (Autoencoding) Models
• Only the Encoder of the Transformer is used
• Pre-Trained with the Masked Language Modeling objective
• Some random tokens of the input sequence are masked
• Try to predict the missing (masked) tokens to reconstruct the original
sequence
• This process learns the Bidirectional Context of the tokens in a sequence
(probabilities of being around certain tokens in both right and left)
• Used in applications like Sentence Classification for Sentiment
Analysis and token level operations like Named Entity Recognition
• BERT and RoBERTa are some examples
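A minimal sketch of masked-token prediction with an Encoder-only model, assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint (both are assumptions, not part of the lecture):

from transformers import pipeline

# Encoder-only (autoencoding) model: predicts a masked token from both
# its left and right context (bidirectional context)
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The capital of France is [MASK]."):
    print(prediction["token_str"], round(prediction["score"], 3))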
Decoder Only (Autoregressive) Models
• Only the Decoder of the Transformer is used
• Pre-Trained with the Causal Language Modeling objective
• The last token of the input sequence is masked
• Try to predict that last token to reconstruct the original sequence
• Also known as a Full Language Model
• This process learns the Unidirectional Context of the tokens in a sequence
(probabilities of being the next token given the tokens at the left)
• Used in applications like Text Generation
• GPT and BLOOM are some examples
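A minimal sketch of text generation with a Decoder-only model, again assuming the Hugging Face transformers library and the public gpt2 checkpoint:

from transformers import pipeline

# Decoder-only (autoregressive) model: each new token is predicted only
# from the tokens to its left (unidirectional context)
generator = pipeline("text-generation", model="gpt2")

result = generator("The Transformer architecture", max_new_tokens=20)
print(result[0]["generated_text"])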
Encoder Decoder (Sequence-to-Sequence)
Models
• Use both the Encoder and the Decoder of the Transformer
• The Pre-Training objective may depend on the requirement. In the T5 model,
• In the Encoder, some random spans of tokens in the input sequence are masked with unique placeholder tokens, added to the vocabulary, known as Sentinel tokens
• This process is known as Span Corruption
• The Decoder tries to predict, with auto-regression, the missing (masked) tokens that each Sentinel token replaces, reconstructing the original sequence
• Used in applications like Translation, Summarization and Question-
answering
• T5 and BART are some examples
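An illustrative sketch of the T5-style Span Corruption format (the example sentence is the one used in the T5 paper; this only shows the input/target construction, not a full pre-training pipeline):

# Original text
original = "Thank you for inviting me to your party last week"

# Encoder input: masked spans are replaced by Sentinel tokens <extra_id_0>, <extra_id_1>, ...
encoder_input = "Thank you <extra_id_0> me to your party <extra_id_1> week"

# Decoder target: each Sentinel token followed by the span it replaced,
# terminated by a final Sentinel token
decoder_target = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"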
Encoder – Input and Embedding
• The input is a sequence of tokens (words in the case of Natural Language Processing (NLP))
• Each input token is converted to a vector using an Input Embedding (a Word Embedding in the case of NLP)
Encoder – Input and Embedding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
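A minimal sketch of the Input Embedding step, assuming PyTorch; the vocabulary size is illustrative, and d_model = 512 as in the original paper (which also scales the embeddings by √d_model):

import torch
import torch.nn as nn

vocab_size, d_model = 10_000, 512              # vocabulary size is hypothetical
embedding = nn.Embedding(vocab_size, d_model)  # one learned vector per token id

token_ids = torch.tensor([[5, 42, 7, 999]])    # one sequence of 4 token ids
x = embedding(token_ids) * d_model ** 0.5      # scale by sqrt(d_model) as in the paper
print(x.shape)                                 # torch.Size([1, 4, 512])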
Encoder – Positional Encoding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
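A minimal sketch of the sinusoidal Positional Encoding from the original paper, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), assuming PyTorch:

import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angle = pos / (10000 ** (i / d_model))                          # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(angle)   # cosine on odd dimensions
    return pe                        # added element-wise to the input embeddings

print(positional_encoding(seq_len=4, d_model=512).shape)   # torch.Size([4, 512])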
Encoder – Input and Embedding
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
Encoder – Multi-Head Attention
• Multi-Head Attention applies, in parallel, multiple copies of a similar operation known as Single-Head Attention, or simply Attention
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
• The type of attention used here is known as Self Attention, where each token computes attention against all the tokens in the input sequence
• For the Encoder we take Q = K = V = X
Self Attention
Source: https://jalammar.github.io/illustrated-transformer/
• The Self Attention formula is inspired by querying a data store, where Q is the query, which is matched against the keys K, and V is the actual stored value
• QKᵀ is a measure of the similarity between Q and K
• √dₖ normalizes the scores by dividing them by the square root of the dimensionality of K
• Softmax is applied so that the attention concentrates on the largest similarities
• Finally, the normalized similarities are used to weight V, resulting in the Attention output
Self Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
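A minimal sketch of scaled dot-product Self Attention, Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V, assuming PyTorch:

import math
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # similarity between Q and K
    weights = F.softmax(scores, dim=-1)                 # normalized attention weights
    return weights @ V                                  # weighted sum of the values V

# Self Attention on the Encoder side: Q = K = V = X
X = torch.randn(1, 4, 512)          # (batch, sequence length, d_model)
print(attention(X, X, X).shape)     # torch.Size([1, 4, 512])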
Encoder – Multi-Head Attention
• When Single-Head Attention is defined as,
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
• Each Multi-Head Attention head is defined as,
head_i(Q, K, V) = Attention(Q W_i^Q, K W_i^K, V W_i^V)
• i.e. we can have an arbitrary number of heads, where parameter weight matrices have to be defined for Q, K, and V in every head
• Multi-Head Attention is then defined as,
MultiHead(Q, K, V) = Concat(head_1, head_2, …, head_h) W^O
• i.e. MultiHead is the concatenation of all the heads, multiplied by another parameter matrix W^O
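A minimal sketch of Multi-Head Attention as defined above, assuming PyTorch with d_model = 512 and h = 8 heads; the per-head projections W_i^Q, W_i^K, W_i^V are implemented here as one large linear layer each, and W^O is the final projection:

import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads, self.d_k = num_heads, d_model // num_heads
        # One linear layer per Q/K/V covers all heads' W_i matrices at once
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)   # the W^O projection

    def forward(self, Q, K, V):
        B, T, _ = Q.shape
        def split(x):   # (B, T, d_model) -> (B, heads, T, d_k)
            return x.view(B, -1, self.num_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.W_q(Q)), split(self.W_k(K)), split(self.W_v(V))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # per-head attention scores
        heads = scores.softmax(dim=-1) @ v                   # head_1 ... head_h
        concat = heads.transpose(1, 2).contiguous().view(B, T, -1)
        return self.W_o(concat)                              # Concat(head_1..head_h) W^O

X = torch.randn(1, 4, 512)
print(MultiHeadAttention()(X, X, X).shape)   # torch.Size([1, 4, 512])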
Encoder – Add & Normalization
• The input given to the Multi-Head Attention is added to its output as a Residual connection (remember ResNet?)
• Then the result is Layer Normalized
• Similar to Batch Norm, but instead of normalizing over the items in the batch (or minibatch), normalization happens over the values in the layer, i.e. the features of each position
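A minimal sketch of the Add & Norm step, assuming PyTorch; nn.LayerNorm normalizes over the feature dimension of each position, unlike Batch Norm which normalizes over the batch:

import torch
import torch.nn as nn

d_model = 512
layer_norm = nn.LayerNorm(d_model)            # normalizes over the last (feature) dimension

x = torch.randn(1, 4, d_model)                # input to the sub-layer (e.g. multi-head attention)
sublayer_out = torch.randn(1, 4, d_model)     # stand-in for the sub-layer's output

out = layer_norm(x + sublayer_out)            # Add (residual connection) & Norm
print(out.shape)                              # torch.Size([1, 4, 512])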
Decoder – Masked Multi-Head Attention
• Multi-Head Attention for the Decoder is the same as for the Encoder
• However, only the query Q is received from the previous layer
• K and V are received from the Encoder output
• Here, K and V contain the context-related information that is required to process Q, which is generated only from the input to the Decoder
Masking the Multi-Head Attention
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
• The model must not see the tokens to the right of the current position in the sequence
• Therefore, the softmax output related to those positions should be zero
• For that, all the values to the right of the diagonal are replaced with minus infinity before the Softmax is applied
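A minimal sketch of the causal mask, assuming PyTorch: positions to the right of the diagonal are set to minus infinity before the Softmax, so their attention weights become exactly zero:

import torch
import torch.nn.functional as F

T = 4
scores = torch.randn(T, T)                      # raw Q Kᵀ / sqrt(d_k) scores
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)   # True above the diagonal
weights = F.softmax(scores.masked_fill(mask, float("-inf")), dim=-1)
print(weights)                                  # entries above the diagonal are 0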
Training a Transformer
Source: https://www.youtube.com/watch?v=bCz4OMemCcA
• The vocabulary has special tokens,
• <SOS> for the Start of the Sentence
• <EOS> for the End of the Sentence
• The Encoder output is given to the Decoder (as K and V) to translate its input to Italian
• A Linear layer maps the Decoder output to the vocabulary size
• A Softmax layer outputs the probability of each vocabulary token at every position, for the whole sequence in a single timestep
• Cross Entropy loss is used
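A minimal sketch of the training objective, assuming PyTorch: the Decoder predicts every target position in parallel (teacher forcing) and the Cross Entropy loss compares the vocabulary logits against the shifted target token ids; the sizes here are illustrative:

import torch
import torch.nn.functional as F

vocab_size, seq_len = 10_000, 6
logits = torch.randn(1, seq_len, vocab_size)            # Decoder output after the Linear layer
targets = torch.randint(0, vocab_size, (1, seq_len))    # target sentence shifted by one position

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())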
Making Inferences with a Transformer
• Unlike during training, while making inferences a transformer needs one timestep to generate each single token
• The reason is that each generated token has to be fed back to generate the next token
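A minimal sketch of greedy autoregressive decoding at inference time; model, its encode/decode methods and the sos_id/eos_id token ids are hypothetical placeholders, not a specific library API:

import torch

def greedy_decode(model, src, sos_id, eos_id, max_len=50):
    memory = model.encode(src)                       # the Encoder runs only once
    ys = torch.tensor([[sos_id]])                    # Decoder input starts with <SOS>
    for _ in range(max_len):
        logits = model.decode(ys, memory)            # one timestep per generated token
        next_token = logits[:, -1].argmax(dim=-1, keepdim=True)
        ys = torch.cat([ys, next_token], dim=1)      # feed the token back for the next step
        if next_token.item() == eos_id:              # stop at <EOS>
            break
    return ys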
Questions?
