3. Hugging Face: Democratizing NLP
Popular open source NLP platform
▪ 1,000+ research paper mentions
▪ Used in production by 1,000+ companies
▪ 5,000+ pretrained models in the model hub
▪ 40,000+ daily pip installs
▪ 40,000+ stars
4. Hugging Face: Democratizing NLP
● Knowledge sharing
○ NAACL 2019 / EMNLP 2020 tutorials (Transfer Learning / Neural Language Generation)
○ Workshop NeuralGen 2019 (Language Generation with Neural Networks)
○ Workshop SustaiNLP 2020 (Environmental/computational friendly NLP)
○ EurNLP Summit (European NLP summit)
● Code & model sharing: Open-sourcing the “right way”
○ Two extremes: 1,000-command research code ⟺ 1-command production code
■ To reach the widest community, our goal is to sit right in the middle (see the minimal example below)
○ Breaking barriers
■ Researchers / Practitioners
■ PyTorch / TensorFlow
○ Speeding up and fueling research in Natural Language Processing
■ Make people stand on the shoulders of giants
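As a rough illustration of that middle ground, here is a minimal sketch of how little code a practitioner needs to run a pretrained model with the transformers library (the default sentiment-analysis pipeline is used purely as an example; any task or model from the hub would do):

```python
# Minimal sketch: one import, one object, one call (assumes `pip install transformers`).
from transformers import pipeline

# Downloads a default pretrained model from the model hub on first use.
classifier = pipeline("sentiment-analysis")

# Returns a label and a confidence score, e.g. [{'label': 'POSITIVE', 'score': 0.99}]
print(classifier("Transfer learning makes NLP much more accessible."))
```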
6. Today’s Menu
● Intro: Seq2Seq: RNN vs. Transformer & Transfer Learning
● Architectures
● Pretraining Objectives
● Current Trends
● Current Limitations
8. Previously: Recurrent Neural Networks
● First end-to-end model to successfully tackle seq2seq tasks (Sutskever et al. (2014))
● Can model input-dependent sequence length
● Most prominent application: Google Translate (see O’Reilly article)
=> BUT
● Difficult to parallelize
● Ineffective at modeling long-term dependencies
● Only a single state to encode all information of the input
9. Now: Transformer Networks
● Can also model input-dependent sequence length
● Highly parallelizable
● Effective at maintaining long-term dependencies
● Less information loss: the input is encoded as a sequence, not as a single state
10. NLP took a turn in 2018
Large Text Datasets
Compute Power
Transfer Learning & Pre-training
Transformer Networks
11. Sequential Transfer Learning: Pre-Training
A randomly initialized base model is pretrained on a very large corpus ($$$ in compute, days of training) to obtain a pre-trained language model.
Examples: word2vec, ELMo, GPT, BERT
12. Sequential Transfer Learning: Fine-Tuning
The pre-trained language model is fine-tuned on a small dataset to obtain a fine-tuned language model for a downstream task: text classification, word labeling, question answering, ... (see the code sketch below)
Training can be done on a single GPU and is easily reproducible.
Examples: word2vec, ELMo, GPT, BERT
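A hedged sketch of this fine-tuning step with the transformers and datasets libraries; the checkpoint (bert-base-uncased), the IMDb dataset, and the hyperparameters are illustrative choices, not prescribed by the slides:

```python
# Fine-tuning a pretrained encoder for text classification on a single GPU (sketch).
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    # Truncate/pad reviews to the model's maximum input length.
    return tokenizer(batch["text"], truncation=True, padding="max_length")

encoded = dataset.map(tokenize, batched=True)

# Reuse the pretrained weights; only the classification head is newly initialized.
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="bert-imdb", per_device_train_batch_size=16,
                         num_train_epochs=1)

# A small subset keeps the run short; full fine-tuning works the same way.
trainer = Trainer(model=model, args=args,
                  train_dataset=encoded["train"].shuffle(seed=42).select(range(2000)))
trainer.train()
```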
19. What architecture to choose for Seq2Seq?
● Using autoregressive language models for seq2seq forces the model’s representation of the sequence input to be unnecessarily limited (see p. 17 of Raffel et al. (2019))
● A prefix LM can effectively be applied to both NLG and NLU tasks (Dong et al. (2019)), but the architecture is inherently more restricted than the Seq2Seq architecture
● Seq2Seq models usually have more parameters; however, this does not necessarily mean that a Seq2Seq model has higher “memory” or “computational” complexity
=> Seq2Seq architecture is usually the preferred architecture for Seq2Seq tasks.
Table 2 in “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel et al. https://arxiv.org/abs/1910.10683
21. Pretraining objectives for Encoder-Only
● BERT (Devlin et al. (2019)) and GPT (Radford et al. (2019)) were among the first papers to show the clear superiority of transfer learning compared to no transfer learning
● BERT uses a masked language modeling objective; GPT uses a (causal) language modeling objective (the two are contrasted in the sketch below)
● Joshi et al. (2019) showed that better results can be achieved by masking whole spans of words
Taken from the blog “The Illustrated GPT-2 (Visualizing Transformer Language Models)” by Jay Alammar: http://jalammar.github.io/illustrated-gpt2/
Figure 2 in “SpanBERT: Improving Pre-training by Representing and Predicting Spans” by Mandar Joshi et al.: https://arxiv.org/abs/1907.10529
Taken from the blog “The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)” by Jay Alammar: http://jalammar.github.io/illustrated-bert/
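A small hedged sketch contrasting the two pretraining objectives through the transformers pipelines (the model names are illustrative defaults, not the exact checkpoints on the slide):

```python
from transformers import pipeline

# BERT-style masked language modeling: predict a masked token from both left and right context.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The masked [MASK] is predicted from its full context."))

# GPT-style (causal) language modeling: predict the next token from left context only.
generator = pipeline("text-generation", model="gpt2")
print(generator("Transfer learning in NLP", max_new_tokens=15))
```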
22. Pretraining objectives for Seq2Seq
● T5 (Raffel et al. (2019)) and BART (Lewis et al. (2019)) were the first to do massive pretraining of Seq2Seq models
● In a Seq2Seq denoising objective, the model inputs need not be aligned with the model outputs => the architecture imposes a weaker inductive bias, so a wider range of denoising schemes can be used
● BERT-like span masking (Joshi et al. (2019)) is extended to generalized span masking (Raffel et al. (2019)); the input/target format is sketched below
Figure 2 in “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” by Colin Raffel et al.: https://arxiv.org/abs/1910.10683
Figure 2 in “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension” by Mike Lewis et al.: https://arxiv.org/abs/1910.13461
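A hedged sketch of T5’s generalized span masking (the sentence is the one used in the cited T5 figure; the checkpoint name is illustrative): each dropped span in the input is replaced by a unique sentinel token, and the target lists the dropped spans delimited by the same sentinels.

```python
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Original sentence: "Thank you for inviting me to your party last week."
input_text = "Thank you <extra_id_0> me to your party <extra_id_1> week."   # corrupted input
target_text = "<extra_id_0> for inviting <extra_id_1> last <extra_id_2>"    # the dropped spans

inputs = tokenizer(input_text, return_tensors="pt")
labels = tokenizer(target_text, return_tensors="pt").input_ids

# The pretraining loss is ordinary cross-entropy of the decoder over the target spans.
loss = model(**inputs, labels=labels).loss
print(float(loss))
```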
24. Current Trends (I)
❏ Recent trends
❏ Going big on model sizes: over 1 billion parameters has become the norm for SOTA
❏ Examples: GPT-3 (175B parameters), Google GShard (600B parameters)
25. Current Trends (II)
● Sharing encoder and decoder weights can yield improved performance when the distributions of model input and model output are similar and less training data is available, cf. Rothe et al. (2020) and Raffel et al. (2019) (see the warm-starting sketch below).
● Task-specific unsupervised pretraining can give a significant improvement over general denoising objectives; e.g., Zhang et al. (2020) show that pretraining with “Extracted Gap-sentences” significantly outperforms other Seq2Seq models on summarization tasks.
Figure 1 in “PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization” by Jingqing Zhang et al.: https://arxiv.org/abs/1912.08777
Table 1 in “Leveraging Pre-trained Checkpoints for Sequence Generation Tasks” by Rothe et al.: https://arxiv.org/abs/1907.12461
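A hedged sketch of such weight sharing when warm-starting a BERT2BERT model with transformers’ EncoderDecoderModel (following the Rothe et al. (2020) setup; the checkpoint name is illustrative, and if the keyword is unavailable in your version the same flag can be set on the config instead):

```python
from transformers import EncoderDecoderModel

# Warm-start both encoder and decoder from the same BERT checkpoint and tie
# their parameters, roughly halving the number of weights. Useful when the
# input and output distributions are similar (e.g. summarization).
shared_bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased", tie_encoder_decoder=True
)
print(sum(p.numel() for p in shared_bert2bert.parameters()))  # fewer parameters than the untied variant
```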
26. Current Trends (III)
● Long-range sequence modeling for Seq2Seq models was recently proposed in Zaheer et al. (2020) and Beltagy et al. (2020), especially for long-document summarization. Sparse attention is usually only necessary in the encoder, since the output sequence is typically short compared to the input sequence (see the sketch below).
● Warm-starting Seq2Seq models from pretrained BERT-like or GPT2-like checkpoints (Rothe et al. (2020)) can yield performance competitive with state-of-the-art Seq2Seq models at a fraction of the training cost.
Figure 2 in “Longformer: The Long-Document Transformer” by Iz Beltagy et al.: https://arxiv.org/abs/2004.05150
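A hedged sketch of this encoder-only sparse attention using the Longformer Encoder-Decoder (LED) as exposed in transformers; the checkpoint name, document, and lengths are illustrative:

```python
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

# The encoder uses windowed (sparse) self-attention and accepts up to 16k tokens ...
long_document = " ".join(["A very long article sentence."] * 2000)
inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=16384)

# ... while the decoder and cross-attention use full attention over the much shorter summary.
summary_ids = model.generate(**inputs, max_length=256)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```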
28. Current Limitations / Shortcomings (I)
Need for grounded representations.
● Limits of distributional hypothesis—difficult to learn certain types of information from raw text
○ Human reporting bias: not stating the obvious (Gordon and Van Durme, 2013)
○ Common sense isn’t written down
○ No relation with other modalities (image, audio…)
○ Continuous Training / Catastrophic Forgetting
● Possible solutions:
○ Interactive/human-in-the-loop approaches (e.g. dialog: Hancock et al. 2018)
○ Incorporate structured knowledge (e.g. databases - ERNIE: Zhang et al 2019)
○ Multimodal learning (e.g. visual representations - VideoBERT: Sun et al. 2019)
○ Retrieval augmented models (e.g. Lewis et al. 2020)
29. Current Limitations / Shortcomings (II)
● Auto-regressive generation is expensive. At inference time only one encoder forward pass is made, whereas N_target_length decoder forward passes are made (see the decoding sketch below).
○ Google Translate uses a Transformer encoder and an RNN decoder, cf. the Google AI blog
○ Asymmetric encoder-decoder architectures, e.g., Shazeer (2019) or Shleifer et al. (2020)
○ Self-attention can be expressed as a linear dot-product of kernel feature maps for fast autoregressive Transformers (Katharopoulos et al. (2020))
Figure 1 in “Pre-trained Summarization Distillation” by Sam Shleifer et al.: https://arxiv.org/abs/2010.13002
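A hedged greedy-decoding sketch that makes the cost asymmetry explicit: the encoder output is computed once and reused, while the decoder is called once per generated token (t5-small and the translation prompt are illustrative):

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.", return_tensors="pt")

# Exactly one encoder forward pass; its output is reused at every decoding step.
encoder_outputs = model.get_encoder()(**inputs)

decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):  # up to N_target_length decoder forward passes
    logits = model(encoder_outputs=encoder_outputs,
                   attention_mask=inputs.attention_mask,
                   decoder_input_ids=decoder_input_ids).logits
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy choice
    decoder_input_ids = torch.cat([decoder_input_ids, next_id], dim=-1)
    if next_id.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```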
30. Current Limitations / Shortcomings (III)
● “Transformer foregoes RNNs’ inductive bias towards learning iterative or recursive transformations” (Dehghani et al. (2019))
○ Transformers do not generalize well to lengths not encountered during training. Universal Transformers (Dehghani et al. (2019)) replace the fixed stack of transformer layers with a dynamic number of layers => the transformer layer recursively processes the hidden states
○ Replace self-attention with a kernel feature map so that the Transformer behaves like an RNN (Katharopoulos et al. (2020))
Figure from “Moving Beyond Translation with the Universal Transformer” on the Google AI Blog: https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
32. What architecture to choose for Seq2Seq (Extended)
Pan and Yang (2010)
● Using autoregressive language models for seq2seq forces the model’s representation of the sequence input X_1:n to be unnecessarily limited (see p. 17 of Raffel et al. (2019))
● A prefix LM can effectively be applied to both NLG and NLU tasks (Dong et al. (2019)), but the architecture is inherently more restricted than Seq2Seq models
○ Parameters between encoding & decoding are always shared
○ The denoising objective is limited to BERT/GPT2-like masking; it cannot be extended to “auto-regressive” span masking
● Seq2Seq models usually have more parameters; however, this does not necessarily mean that a Seq2Seq model has higher “memory” or “computational” complexity
○ The encoder is run only once, the same way the prefix is forwarded only once in a Prefix-LM or autoregressive LM
○ The N^2 memory complexity is divided into N_enc^2 + N_enc*N_dec + N_dec^2 (see the sanity check below)
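A tiny hedged sanity check of that decomposition with purely illustrative sequence lengths (constants from causal masking are ignored; the point is the split across encoder self-attention, cross-attention, and decoder self-attention):

```python
# Attention memory, up to constants, for illustrative lengths.
n_enc, n_dec = 512, 64

# Decoder-only LM attending over the concatenated input + output sequence.
decoder_only = (n_enc + n_dec) ** 2                    # 331,776 entries

# Encoder-decoder: encoder self-attention + cross-attention + decoder self-attention.
encoder_decoder = n_enc**2 + n_enc * n_dec + n_dec**2  # 299,008 entries

print(decoder_only, encoder_decoder)
```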
33. Current Trends (Extended)
● Sharing encoder and decoder weights can yield improved performance when the distributions of model input and model output are similar and less training data is available, cf. Rothe et al. (2020) and Raffel et al. (2019).
● Task-specific unsupervised pretraining can give a significant improvement over general denoising objectives; e.g., Zhang et al. (2020) show that pretraining with “Extracted Gap-sentences” significantly outperforms other Seq2Seq models on summarization tasks.
● Long-range sequence modeling for Seq2Seq models was recently proposed in Zaheer et al. (2020) and Beltagy et al. (2020), especially for long-document summarization. Sparse attention is usually only necessary in the encoder, since the output sequence is typically short compared to the input sequence.
● Warm-starting Seq2Seq models from pretrained BERT-like or GPT2-like checkpoints (Rothe et al. (2020)) can yield performance competitive with state-of-the-art Seq2Seq models at a fraction of the training cost.
34. Current Limitations / Shortcomings (Extended)
● Auto-regressive generation is expensive. At inference time only one encoder forward pass is made, whereas N_target_length decoder forward passes are made.
○ Google Translate uses a Transformer encoder and an RNN decoder, cf. the Google AI blog.
○ Asymmetric encoder-decoder architectures, e.g., Shazeer (2019) or Shleifer et al. (2020).
○ Self-attention can be expressed as a linear dot-product of kernel feature maps for fast autoregressive Transformers (Katharopoulos et al. (2020)).
○ Beam search is expensive. Meister et al. (2020) propose more efficient beam search.
● “Transformer foregoes RNNs’ inductive bias towards learning iterative or recursive transformations” (Dehghani et al. (2019))
○ Transformers do not generalize well to lengths not encountered during training. Universal Transformers (Dehghani et al. (2019)) replace the fixed stack of transformer layers with a dynamic number of layers => the transformer layer recursively processes the hidden states.
○ Replace self-attention with a kernel feature map so that the Transformer behaves like an RNN (Katharopoulos et al. (2020)).
● Long-range sequence modeling techniques do not fit well with Encoder-Decoder model design.
○ Kitaev et al. (2020)’s LSH self-attention cannot be used in a cross-attention layer
○ Wang et al. (2020)’s Linformer cannot do cached auto-regressive generation
○ Sparse attention is not trivial for the cross-attention layer => Zaheer et al. (2020) and Beltagy et al. (2020) simply apply full attention in the cross-attention layer.
35. Links to interesting Reads
- 🤗 Hugging Face Blog: Transformers-based Encoder-Decoder Models
- 🤗 Hugging Face Blog: Leveraging Pre-trained Language Model Checkpoints for Encoder-Decoder Models
- Paper: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer by Colin Raffel et al.
- Paper: Leveraging Pre-trained Checkpoints for Sequence Generation Tasks
- 🤗 Newsletter: https://huggingface.curated.co/