Transformer Seq2Seq Models:
Concepts, Trends & Limitations
Patrick von Platen - Hugging Face Inc.
Hugging Face: Democratizing NLP
Popular open source NLP platform
▪ 1,000+ research paper mentions
▪ Used in production by 1,000+ companies
▪ 5,000+ pretrained models in the model hub
▪ 40,000+ daily pip installs
▪ 40,000+ stars
Hugging Face: Democratizing NLP
● Knowledge sharing
○ NAACL 2019 / EMNLP 2020 Tutorial (Transfer Learning / Neural Language Generation)
○ Workshop NeuralGen 2019 (Language Generation with Neural Networks)
○ Workshop SustaiNLP 2020 (Environmentally/computationally friendly NLP)
○ EurNLP Summit (European NLP summit)
● Code & model sharing: Open-sourcing the “right way”
○ Two extremes: 1,000-command research code ⟺ 1-command production code
■ To target the widest community – our goal is to be 👆 right in the middle
○ Breaking barriers
■ Researchers / Practitioners
■ PyTorch / TensorFlow
○ Speeding up and fueling research in Natural Language Processing
■ Make people stand on the shoulders of giants
Transformer Seq2Seq Models:
Concepts, Trends & Limitations
Today’s Menu
● Intro: Seq2Seq: RNN vs. Transformer & Transfer Learning
● Architectures
● Pretraining Objectives
● Current Trends
● Current Limitations
Intro
Previously: Recurrent Neural Networks
● First end-to-end model to successfully tackle seq2seq tasks (Sutskever et al. (2014))
● Can model input-dependent sequence length
● Most prominent application: Google Translate (see O’Reilly article)
=> BUT
● Difficult to parallelize
● Ineffective at modeling long-term dependencies
● Only a single state to encode all information of the input
Now: Transformer Networks
● Can also model input-dependent sequence length
● Highly parallelizable
● Effective at maintaining long-term dependencies
● Less information loss by encoding the input as a sequence rather than a single state
NLP took a turn in 2018
Large Text Datasets
Compute Power
Transfer Learning &
Pre-training
Transformer Networks
Sequential Transfer Learning: Pre-Training
Diagram: a randomly initialized base model is pretrained on a very large corpus ($$$ in compute, days of training) to obtain a pretrained language model (word2vec, ELMo, GPT, BERT).
Sequential Transfer Learning: Fine-Tuning
Small dataset
Training can be done on a single GPU
Easily reproducible
Diagram: the pretrained language model (word2vec, ELMo, GPT, BERT) is fine-tuned on a task dataset (text classification, word labeling, question answering, ...) to obtain a fine-tuned language model; a minimal code sketch follows below.
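Below is a minimal fine-tuning sketch in the spirit of this slide, assuming the 🤗 Transformers and PyTorch libraries; the checkpoint, the two-sentence toy dataset and the hyperparameters are illustrative placeholders, not taken from the talk.

```python
# Hedged sketch: fine-tune a pretrained encoder on a tiny classification "dataset".
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # randomly initialized classification head
)

texts = ["a delightful movie", "a complete waste of time"]   # placeholder "small dataset"
labels = torch.tensor([1, 0])
batch = tokenizer(texts, padding=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for _ in range(3):                  # a few passes easily fit on a single GPU (or CPU)
    loss = model(**batch, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
print(float(loss))
```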
Architectures
Types of Transformer Models
● Autoencoding Models
● Autoregressive Models
● Prefix-LM Models
● Seq2Seq Models
Autoencoding Models
● Mathematical Model: P( class | in_seq_0:n )
● Tasks: Natural Language Understanding, e.g. sentiment classification, named entity recognition, ...
● Prominent Models: BERT, ALBERT, DistilBERT
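As a quick, hedged illustration of P( class | in_seq_0:n ) in code (not part of the original slides): a sentiment-classification call with the 🤗 Transformers pipeline API; the example sentence is made up and the default checkpoint choice is left to the library.

```python
# Hedged sketch: an autoencoding model used for NLU (sentiment classification).
from transformers import pipeline

classifier = pipeline("sentiment-analysis")  # downloads a default fine-tuned checkpoint
print(classifier("Transformer Seq2Seq models are surprisingly easy to use."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```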
Autoregressive Models
● Mathematical Model: P( out_seq_i | out_seq_0:i-1 )
● Tasks: Natural Language Generation, especially open-domain generation
● Prominent Models: GPT1, GPT2, GPT3
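The unconditional factorization P( out_seq_i | out_seq_0:i-1 ) in code, again as a hedged sketch that is not part of the slides: open-ended generation with GPT2 through the pipeline API; prompt and length are arbitrary choices.

```python
# Hedged sketch: an autoregressive model used for open-domain generation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Sequence-to-sequence learning is", max_length=30)[0]["generated_text"])
```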
Prefix Language Models
● Mathematical Model: P( out_seq_i | out_seq_0:i-1, in_seq_0:n )
● Tasks: Natural Language Understanding & Generation
● Prominent Models: UniLM, XLNet
Sequence-to-Sequence Models
● Mathematical Model: P( out_seq_i | out_seq_0:i-1, in_seq_0:n )
● Tasks: Natural Language Generation, especially conditioned natural language generation (Seq2Seq)
● Prominent Models: T5, BART, Pegasus
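And the conditioned case P( out_seq_i | out_seq_0:i-1, in_seq_0:n ): a hedged summarization sketch with a pretrained Seq2Seq checkpoint; facebook/bart-large-cnn is just one illustrative choice and the input text is a placeholder.

```python
# Hedged sketch: a Seq2Seq model conditions its output on an input sequence.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

article = "Hugging Face hosts thousands of pretrained models ..."  # placeholder input
inputs = tokenizer(article, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, num_beams=4, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```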
What architecture to choose for Seq2Seq?
● Using autoregressive language models for seq2seq forces the model’s representation of the input sequence to be unnecessarily limited (see p. 17 of Raffel et al. (2019))
● The Prefix LM can effectively be applied to both NLG and NLU tasks (Dong et al. (2019)), but the architecture is inherently more restricted than the Seq2Seq architecture
● Seq2Seq models usually have more parameters, which however does not necessarily mean that the Seq2Seq model has higher “memory” or “computational” complexity
=> The Seq2Seq architecture is usually the preferred architecture for Seq2Seq tasks.
Table 2 in “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel et al. https://arxiv.org/abs/1910.10683
Self-supervised Pretraining
Pretraining objectives for Encoder-Only
● BERT (Devlin et al. (2019)) and GPT (Radford et al. (2019)) were among the first papers to show the clear superiority of transfer learning compared to no transfer learning
● BERT uses a masked language modeling objective, GPT a (causal) language modeling objective
● Joshi et al. (2019) showed that better results can be achieved when masking whole spans of words
Taken from the blog “The Illustrated GPT-2 (Visualizing Transformer Language
Models)” by Jay Alammar http://jalammar.github.io/illustrated-gpt2/
Figure 2 in “SpanBERT: Improving Pre-training by Representing
and Predicting Spans” by Mandar Joshi et al.
https://arxiv.org/abs/1907.10529
Taken from the blog “The Illustrated BERT, ELMo, and co.
(How NLP Cracked Transfer Learning)” by Jay Alammar
http://jalammar.github.io/illustrated-bert/
Pretraining objectives for Seq2Seq
● T5 (Raffel et al. (2019)) and BART (Lewis et al. (2019)) were the first to do massive pretraining of Seq2Seq models
● In the Seq2Seq denoising objective, the inputs to the model need not be aligned with the model outputs => weaker inductive bias given by the architecture, so a wider range of denoising schemes can be used
● BERT-like span masking (Joshi et al. (2019)) is extended to generalized span masking (Raffel et al. (2019)); a toy example follows below
Figure 2 in “Exploring the Limits of Transfer Learning with a Unified
Text-to-Text Transformer” by Colin Raffel et al. https://arxiv.org/abs/1910.10683
Figure 2 in “BART: Denoising Sequence-to-Sequence Pre-training for Natural
Language Generation, Translation, and Comprehension” by Mike Lewis et al.
https://arxiv.org/abs/1910.13461
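A toy, library-free sketch of generalized span masking in the T5 style; the sentinel-token naming (<extra_id_i>) follows T5’s convention and the sentence is the example used in Figure 2 of Raffel et al. (2019).

```python
# Hedged sketch: T5-style span corruption. Contiguous input spans are replaced by
# sentinel tokens; the target reconstructs only the dropped spans.
def span_corrupt(tokens, spans):
    """spans: sorted, non-overlapping (start, end) index pairs to mask."""
    corrupted, target, cursor = [], [], 0
    for i, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{i}>"
        corrupted += tokens[cursor:start] + [sentinel]
        target += [sentinel] + tokens[start:end]
        cursor = end
    corrupted += tokens[cursor:]
    target.append(f"<extra_id_{len(spans)}>")  # final sentinel terminates the target
    return " ".join(corrupted), " ".join(target)

tokens = "Thank you for inviting me to your party last week".split()
print(span_corrupt(tokens, [(2, 4), (8, 9)]))
# ('Thank you <extra_id_0> me to your party <extra_id_1> week',
#  '<extra_id_0> for inviting <extra_id_1> last <extra_id_2>')
```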
Current Trends
Current Trends (I)
❏ Recent trends
❏ Going big on model size: over 1 billion parameters has become the norm for SOTA (e.g. GPT3 with 175B parameters, Google GShard with 600B)
Current Trends (II)
● Sharing encoder and decoder weights can yield improved performance when the distributions of model input and model output are similar and little training data is available, cf. (Rothe et al. (2020)) and (Raffel et al. (2019)).
● Task-specific unsupervised pretraining can give significant improvements over general denoising objectives: e.g. Zhang et al. (2020) show that pretraining with “Extracted Gap-sentences” (PEGASUS) significantly outperforms other Seq2Seq models on summarization tasks.
Figure 1 in “PEGASUS: Pre-training with Extracted Gap-sentences for
Abstractive Summarization” by Jingqing Zhang et al. https://arxiv.org/abs/1912.08777
Table 1 in “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks” by Rothe et al.
https://arxiv.org/abs/1907.12461
Current Trends (III)
● Long-range sequence modeling for Seq2Seq models was recently proposed in (Zaheer et al. (2020)) and (Beltagy et al. (2020)), especially for long-document summarization. Sparse attention is usually only necessary for the encoder, since the output sequence is typically short compared to the input sequence.
● Warm-starting Seq2Seq models from pretrained BERT-like or GPT2-like checkpoints (Rothe et al. (2020)) can yield performance competitive with state-of-the-art Seq2Seq models at a fraction of the training cost (see the sketch below).
Figure 2 in “Longformer: The Long-Document Transformer” by Iz Beltagy et al.
https://arxiv.org/abs/2004.05150
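A hedged sketch of warm-starting a BERT2BERT Seq2Seq model with 🤗 Transformers’ EncoderDecoderModel, optionally sharing encoder and decoder weights; the tie_encoder_decoder flag is assumed from the library’s warm-starting blog post and may differ across versions.

```python
# Hedged sketch: warm-start an encoder-decoder model from two BERT checkpoints.
from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Encoder and decoder weights come from BERT; the cross-attention layers are randomly
# initialized and have to be trained on the downstream Seq2Seq task.
bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# Weight sharing between encoder and decoder ("shared" setup in Rothe et al. (2020));
# the keyword argument below is an assumed API detail.
shared_bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased", tie_encoder_decoder=True
)

bert2bert.config.decoder_start_token_id = tokenizer.cls_token_id
bert2bert.config.pad_token_id = tokenizer.pad_token_id
```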
Current Limitations / Shortcomings
Current Limitations / Shortcomings (I)
Need for grounded representations.
● Limits of distributional hypothesis—difficult to learn certain types of information from raw text
○ Human reporting bias: not stating the obvious (Gordon and Van Durme, 2013)
○ Common sense isn’t written down
○ No relation with other modalities (image, audio…)
○ Continuous Training / Catastrophic Forgetting
● Possible solutions:
○ Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al. 2018)
○ Incorporate structured knowledge (e.g. databases - ERNIE: Zhang et al. 2019)
○ Multimodal learning (e.g. visual representations - VideoBERT: Sun et al. 2019)
○ Retrieval augmented models (e.g. Lewis et al. 2020)
Current Limitations / Shortcomings (II)
● Auto-regressive generation is expensive. At inference time only one encoder forward pass is made, whereas N_target_length decoder forward passes are made (see the sketch below).
○ Google Translate uses a Transformer encoder and an RNN decoder, cf. the Google AI blog
○ Asymmetric encoder-decoder architectures, e.g. Shazeer (2019) or Shleifer et al. (2020)
○ Self-attention can be expressed as a linear dot-product of kernel feature maps for fast autoregressive Transformers (Katharopoulos et al. (2020))
Figure 1 in “PRE-TRAINED SUMMARIZATION DISTILLATION” by
Sam Shleifer et al. https://arxiv.org/abs/2010.13002
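To make the asymmetry concrete, a hedged greedy-decoding sketch that runs the encoder once and the decoder once per generated token; checkpoint and input are placeholders, and in practice model.generate() with its key/value cache would be used instead.

```python
# Hedged sketch: 1 encoder forward pass vs. up to N_target_length decoder forward passes.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn").eval()

inputs = tokenizer("A long source document ...", return_tensors="pt")
with torch.no_grad():
    encoder_outputs = model.get_encoder()(**inputs)              # encoder runs once

    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    for _ in range(30):                                          # decoder runs per token
        logits = model(encoder_outputs=encoder_outputs,
                       decoder_input_ids=decoder_input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
        if next_token.item() == model.config.eos_token_id:
            break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```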
Current Limitations / Shortcomings (III)
● “Transformer foregoes RNNs’ inductive bias towards learning iterative or recursive transformations” (Dehghani et al. (2019))
○ The Transformer does not generalize well to lengths not encountered during training. Universal Transformers (Dehghani et al. (2019)) replace the fixed stack of transformer layers with a dynamic number of layers => one transformer layer recursively processes the hidden states
○ Replacing self-attention by a kernel feature map lets the Transformer behave like an RNN (Katharopoulos et al. (2020))
Figure from “Moving Beyond Translation with the Universal Transformer” at Google AI Blog:
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Resources
What architecture to choose for Seq2Seq (Extended)
Pan and Yang (2010)
● Using autoregressive language models for seq2seq forces the model’s representation of the input sequence X_1:n to be unnecessarily limited (see p. 17 of Raffel et al. (2019))
● The Prefix LM can effectively be applied to both NLG and NLU tasks (Dong et al. (2019)), but the architecture is inherently more restricted than Seq2Seq models
○ Parameters between encoding & decoding are always shared
○ The denoising objective is limited to BERT/GPT2-like masking; it cannot be extended to “auto-regressive” span masking
● Seq2Seq models usually have more parameters, which however does not necessarily mean that the Seq2Seq model has higher “memory” or “computational” complexity
○ The encoder is run only once, the same way the prefix is forwarded only once in a Prefix-LM or autoregressive LM
○ The N^2 = (N_enc + N_dec)^2 memory complexity is divided into N_enc^2 + N_enc*N_dec + N_dec^2 complexity
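A quick back-of-the-envelope check of that split, assuming N = N_enc + N_dec with illustrative lengths of 1024 input and 128 output tokens.

```python
# Hedged sketch: attention memory terms for full attention over the concatenation
# (Prefix-LM) vs. the encoder-decoder split; lengths are illustrative only.
n_enc, n_dec = 1024, 128
prefix_lm = (n_enc + n_dec) ** 2                   # N^2 with N = N_enc + N_dec
seq2seq = n_enc**2 + n_enc * n_dec + n_dec**2      # encoder self- + cross- + decoder self-attention
print(prefix_lm, seq2seq, round(seq2seq / prefix_lm, 2))   # 1327104 1196032 0.9
```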
Current Trends (Extended)
● Sharing encoder and decoder weights can yield improved performance when the distributions of model input and model output are similar and little training data is available, cf. (Rothe et al. (2020)) and (Raffel et al. (2019)).
● Task-specific unsupervised pretraining can give significant improvements over general denoising objectives: e.g. Zhang et al. (2020) show that pretraining with “Extracted Gap-sentences” (PEGASUS) significantly outperforms other Seq2Seq models on summarization tasks.
● Long-range sequence modeling for Seq2Seq models was recently proposed in (Zaheer et al. (2020)) and (Beltagy et al. (2020)), especially for long-document summarization. Sparse attention is usually only necessary for the encoder, since the output sequence is typically short compared to the input sequence.
● Warm-starting Seq2Seq models from pretrained BERT-like or GPT2-like checkpoints (Rothe et al. (2020)) can yield performance competitive with state-of-the-art Seq2Seq models at a fraction of the training cost.
Current Limitations / Shortcomings (Extended)
● Auto-regressive generation is expensive. At inference time only one encoder forward pass is made, whereas N_target_length decoder forward passes are made.
○ Google Translate uses a Transformer encoder and an RNN decoder, cf. the Google AI blog.
○ Asymmetric encoder-decoder architectures, e.g. Shazeer (2019) or Shleifer et al. (2020).
○ Self-attention can be expressed as a linear dot-product of kernel feature maps for fast autoregressive Transformers (Katharopoulos et al. (2020)).
○ Beam search is expensive. Meister et al. (2020) propose a more efficient beam search.
● “Transformer foregoes RNNs’ inductive bias towards learning iterative or recursive transformations” (Dehghani et al. (2019))
○ The Transformer does not generalize well to lengths not encountered during training. Universal Transformers (Dehghani et al. (2019)) replace the fixed stack of transformer layers with a dynamic number of layers => one transformer layer recursively processes the hidden states
○ Replacing self-attention by a kernel feature map lets the Transformer behave like an RNN (Katharopoulos et al. (2020)).
● Long-range sequence modeling techniques do not fit well with the encoder-decoder model design.
○ Reformer’s LSH self-attention (Kitaev et al. (2020)) cannot be used in a cross-attention layer
○ Linformer (Wang et al. (2020)) cannot do cached auto-regressive generation
○ Sparse attention is not trivial for the cross-attention layer => (Zaheer et al. (2020)) and (Beltagy et al. (2020)) simply apply full attention in the cross-attention layer.
Links to interesting Reads
- 🤗 Hugging Face Blog: Transformers-based Encoder-Decoder Models
- 🤗 Hugging Face Blog: Leveraging Pre-trained Language Model
Checkpoints for Encoder-Decoder Models
- Paper: Exploring the Limits of Transfer Learning with a Unified
Text-to-Text Transformer by Colin Raffel et al.
- Paper: Leveraging Pre-trained Checkpoints for Sequence Generation
Tasks
- 🤗 Newsletter: https://huggingface.curated.co/
Questions?
