Transformer Zoo

Transformer Zoo
Grigory Sapunov
DEVPARTY
27.06.2020
gs@inten.to

● Recap: Types of neural networks (FFN, CNN, RNN)
● Recap: Attention & Self-Attention
● Transformer architecture
● Transformer “language models” (GPT*, BERT)
● Transformer modifications (including transformers for images, sound and
other non-NLP tasks)
Plan

Recap: Types of neural networks
(FFN, CNN, RNN)

“Classic” types of neural networks
FFN CNN
RNN (LSTM, GRU, …)

“Classic” of seq2seq: encoder-decoder
https://www.quora.com/What-is-an-Encoder-Decoder-in-Deep-Learning

Modern seq2seq architectures
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures,
https://arxiv.org/abs/1808.08946

Encoder-Decoder shortcomings
Encoder-Decoder can be applied to N-to-M sequence, yet an Encoder reads and
encodes a source sentence into a fixed-length vector. Is one hidden state really
enough? A neural network needs to be able to compress all the necessary
information of a source sentence into a fixed-length vector.

Encoder-Decoder with Attention
https://hackernoon.com/attention-mechanism-in-neural-network-30aaf5e39512
Attention Mechanism allows the decoder to attend to different parts of the source
sentence at each step of the output generation.
Instead of encoding the input sequence into a single fixed context vector, we let
the model learn how to generate a context vector for each output time step. That
is we let the model learn what to attend based on the input sentence and what it
has produced so far.

Encoder-Decoder with Attention
https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Attention Mechanism allows the decoder to attend to different parts of the source
sentence at each step of the output generation.

Visualizing RNN attention weights αij on MT
Neural Machine Translation by Jointly Learning to Align and Translate, https://arxiv.org/abs/1409.0473

Visualizing RNN attention heat maps on QA
Teaching Machines to Read and Comprehend, https://arxiv.org/abs/1506.03340

CNN+RNN with Attention
http://kelvinxu.github.io/projects/capgen.html

Self-attention (Intra-Attention)
Each element in the sentence attends to other elements. It gives context sensitive
encodings.
Long Short-Term Memory-Networks for Machine Reading, https://arxiv.org/abs/1601.06733

Self-Attention Neural Networks (SAN):
Transformer Architecture

Attention Is All You Need, https://arxiv.org/abs/1706.03762

Transformer
A new simple network architecture,
the Transformer:
● Is a Encoder-Decoder architecture
● Based solely on attention mechanisms
(no RNN/CNN)
● The major component in the transformer is
the unit of multi-head self-attention
mechanism.
● Fast: only matrix multiplications
● Strong results on standard WMT datasets

Working pipeline
http://jalammar.github.io/illustrated-transformer/

Input embeddings

Encoder

Multi-head self-attention mechanism
Essentially, the Multi-Head Attention is just
several attention layers stacked together with
different linear transformations of the same
input.

The transformer adopts the scaled dot-product
attention: the output is a weighted sum of the
values, where the weight assigned to each value
is determined by the dot-product of the query
with all the keys:
The input consists of queries and keys of
dimension dk, and values of dimension dv.
Scaled dot-product attention

Decoder

The Final Linear and Softmax Layer

Multi-head self-attention example (2 heads shown)

Applying the Transformer to machine translation
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html

Resources
● The Annotated Transformer
http://nlp.seas.harvard.edu/2018/04/03/attention.html
● Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
● The Illustrated Transformer
● Paper Dissected: “Attention is All You Need” Explained
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
● The Transformer – Attention is all you need.
https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/
● When Recurrent Models Don't Need to be Recurrent
https://bair.berkeley.edu/blog/2018/08/06/recurrent/
● Self-Attention Mechanisms in Natural Language Processing,
https://www.alibabacloud.com/blog/self-attention-mechanisms-in-natural-language-
processing_593968

Code
● https://github.com/huggingface/transformers
● https://github.com/ThilinaRajapakse/simpletransformers
● https://github.com/pytorch/fairseq
● https://www.tensorflow.org/tutorials/text/transformer
● https://github.com/tensorflow/models/tree/master/official/transformer
Tensor2Tensor library (the original code)
● https://github.com/tensorflow/tensor2tensor
● Running the Transformer with Tensor2Tensor
https://cloud.google.com/tpu/docs/tutorials/transformer
● https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html

BERT
Bidirectional Encoder Representations from Transformers, or BERT.
BERT is designed to pre-train deep bidirectional representations by jointly
conditioning on both left and right context in all layers. As a result, the pre-trained
BERT representations can be fine-tuned with just one additional output layer to
create state-of-the-art models for a wide range of tasks, such as question
answering and language inference, without substantial task-specific architecture
modifications.
BERT uses only the encoder part of the Transformer.
Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing,
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks
https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks-
4a2a189bc155

BERT
Bidirectional Encoder Representations from Transformers, or BERT
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,

Pre-training tasks:
● Masked Language Model: predict random words from within the sequence,
not the next word for a sequence of words.
● Next Sentence Prediction: give the model two sentences and ask it to
predict if the second sentence follows the first in a corpus or not.
Input =
[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
BERT

BERT: masked language model
https://jalammar.github.io/illustrated-bert/

BERT: next sentence prediction

BERT
How to use:
● Fine-tuning approach: pre-train some model architecture on a LM objective
before fine-tuning that same model for a supervised downstream task.
○ Our task specific models are formed by incorporating BERT with one additional output layer,
so a minimal number of parameters need to be learned from scratch.
● Feature-based approach: learned representations are typically used as
features in a downstream model.
○ Not all NLP tasks can be easily be represented by a Transformer encoder architecture, and
therefore require a task-specific model architecture to be added.
○ There are major computational benefits to being able to pre-compute an expensive
representation of the training data once and then run many experiments with less expensive
models on top of this representation

BERT: using fine-tuning approach

Example: BioBERT
https://github.com/dmis-lab/biobert
BioBERT: a pre-trained biomedical language representation model for biomedical text mining

Example: VideoBERT
https://ai.googleblog.com/2019/09/learning-cross-modal-temporal.html
VideoBERT: A Joint Model for Video and Language Representation Learning
Combine visual tokens (produced with the help of CNN) with text tokens (obtained
with ASR). Can use for video captioning, video to video or text to video prediction.

Example: VideoBERT
Text-to-video prediction can be used to automatically generate a set of
instructions (such as a recipe) from video, yielding video segments (tokens) that
reflect what is described at each step.

RoBERTa: A Robustly Optimized BERT
https://blog.inten.to/papers-roberta-a-robustly-optimized-bert-pretraining-approach-7449bc5423e7
BERT was significantly undertrained.
Improvements:
● Take more data, train longer
● Next sentence prediction objective is obsolete
● Longer sentences
● Larger batches
● Dynamically changing the masking pattern
(BERT uses a single static mask)
Result: state-of-the-art on 4/9 GLUE tasks.

DistilBERT, a distilled version of BERT
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html

ALBERT: A Lite BERT
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://blog.inten.to/speeding-up-bert-5528e18bb4ea

Other BERT’s are constantly coming

GPT-2
https://openai.com/blog/better-language-models/
https://github.com/openai/gpt-2
http://jalammar.github.io/illustrated-gpt2/
Language model based on the transformer decoder.
It can generate continuations of the text. It was so good,
so OpenAI treat it as a dangerous thing that can be misused.

You can play with GPT (and other models) here: https://transformer.huggingface.co/

GPT-2

GPT-2 / BERT / Transformer-XL

GPT-3
https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9
● The GPT-3 family of models is a recent upgrade of the well-known GPT-2
model, with the largest of them (175B parameters), the “GPT-3” is 100x times
larger than the largest (1.5B parameters) GPT-2.

GPT-3
https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9
● The GPT-3 architecture is mostly the same as GPT-2 one (there are minor
differences, e.g. sparse attention).
● No, you can’t download the model 😎
● And you probably can’t even train it from scratch unless you have a very
powerful infrastructure.

BART: “classic” seq2seq
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and
Comprehension, https://arxiv.org/abs/1910.13461
BERT encoder
+
GPT decoder

Language Model Zoo
● ELMo
● ULMFiT
● GPT
● BERT (BioBERT,
ClinicalBERT, …)
● ERNIE
● XLNet
● RoBERTa
● KERMIT
● ERNIE 2.0
● GPT-2
● ALBERT
● GPT-3
● …

Resources
● Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language
Processing
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
● Dissecting BERT Part 1: Understanding the Transformer
https://medium.com/@mromerocalvo/dissecting-bert-part1-6dcf5360b07f
● Understanding BERT Part 2: BERT Specifics
https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
● Dissecting BERT Appendix: The Decoder
https://medium.com/dissecting-bert/dissecting-bert-appendix-the-decoder-3b86f66b0e5f
● The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
● Speeding Up BERT https://blog.inten.to/speeding-up-bert-5528e18bb4ea
● Interesting papers in our Telegram channel: https://t.me/gonzo_ML

Code
● TensorFlow code and pre-trained models for BERT
https://github.com/google-research/bert
● State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
https://github.com/huggingface/transformers
● GPT-2 https://github.com/openai/gpt-2
● DeepPavlov: An open source library for deep learning end-to-end dialog
systems and chatbots
https://github.com/deepmipt/DeepPavlov
● Transformers made simple
https://github.com/ThilinaRajapakse/simpletransformers
https://medium.com/swlh/simple-transformers-multi-class-text-classification-
with-bert-roberta-xlnet-xlm-and-8b585000ce3a

Many other transformers
● Image Transformer
● Music Transformer
● Universal Transformer
● Transformer-XL
● Sparse Transformer
● Star-Transformer
● R-Transformer
● Reformer
● Compressive Transformer
● Longformer
● Extended Transformer
Construction (ETC)
● Levenstein Transformer, Insertion Transformer, Imputer, KERMIT, …
● ...

Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule
● It has O(N2) computational
complexity attention mechanism
→ scales poorly
● It has limited context span
(mostly due to the complexity),
typically 512 tokens
→ can’t process long sequences.
● May need different implicit bias
for other types of data (e.g. image,
sound, etc)

Transformer with added recurrence: it can see the previous segment
representations, so can process longer sentences.
Transformer-XL

The Compressive Transformer keeps a fine-grained memory of past activations,
which are then compressed into coarser compressed memories.
Compressive Transformer
Compressive Transformers for Long-Range Sequence Modelling

Reformer is an optimizer transformer:
● Using less memory
● Calculating attention using LSH
(Locality-sensitive hashing)
○ O(L2) → O(L*logL)
● => can process larger sequences!
64K Sequences on One GPU!
Reformer
Reformer: The Efficient Transformer

https://twitter.com/huggingface/status/1263850138595987457

Local + Global attention. Scales linearly!
Longformer
Longformer: The Long-Document Transformer

● Another local + global attention.
● Can incorporate structured data into the model!
Extended Transformer Construction (ETC)
ETC: Encoding Long and Structured Data in Transformers

Idea:
● Apply ACT to Transformers
● Apply a variable number of repetitions for calculating each position: a
Universal Transformer (UT)
● Use dynamic attention span: Adaptive Attention Span in Transformers
Adaptive Computation Time in Transformers
Adaptive Computation Time (ACT) in Neural Networks [3/3]
https://medium.com/@moocaholic/adaptive-computation-time-act-in-neural-networks-3-3-99452b2eff18

● Two flavors of UT in the paper:
○ UT with a fixed number of repetitions.
○ UT with dynamic halting.
● The UT repeatedly refines a series of vector representations for each position
of the sequence in parallel, by combining information from different positions
using self-attention and applying a recurrent transition function across all time
steps.
○ The number of time steps, T, is arbitrary but fixed (no ACT here, fixed
number of repetitions).
○ The number of time steps, T, is dymanic (a dynamic ACT halting
mechanism to each position in the input sequence)
Universal Transformer (UT): Implementation
“Universal Transformers”,

UT with a fixed number of repetitions
“Moving Beyond Translation with the Universal Transformer”,
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html

Adaptive UT with dynamic halting
https://mostafadehghani.com/2019/05/05/universal-transformers/

● Universal Transformer is a recurrent function (not in time, but in depth) that
evolves per-symbol hidden states in parallel, based at each step on the
sequence of previous hidden states.
○ In that sense, UT is similar to architectures such as the Neural GPU
and the Neural Turing Machine.
● When running for a fixed number of steps, the Universal Transformer is
equivalent to a multi-layer Transformer with tied parameters across its layers.
● Adaptive UT: as the recurrent transition function can be applied any number
of times, this implies that adaptive UTs can have variable depth (number of
per-symbol processing steps).
● Universal Transformer can be shown to be Turing-complete (or
“computationally universal”)
Universal Transformer (UT): Notes

● The problem with the vanilla transformer is its fixed context size (or attention
span).
● It cannot be very large because of the computation cost of the attention
mechanism (it requires O(n²) computations).
● Let the layer (or even the attention head) decide the required context size on
its own.
● There are two options:
○ Learnable (the adaptive attention span): let each attention head learn it’s
own attention span independently from the other heads. It is learnable,
but still fixed after the training is done.
○ ACT-like (the dynamic attention span): changes the span dynamically
depending on the current input.
Adaptive Attention Span: Idea & Implementation
“Adaptive Attention Span in Transformers”,

The models are smaller, the performance is better.
Adaptive Attention Span: Performance

Adaptive spans (in log-scale) of every attention heads in a 12-layer model with
span limit S = 4096. Few attention heads require long attention spans
Adaptive spans are learned larger when needed

Example of average dynamic attention span as a function of the input sequence.
The span is averaged over the layers and heads.
Dynamic spans adapt to the input sequence

Image Transformer
● Local self-attention
Image Transformer, https://arxiv.org/abs/1802.05751

Sparse factorizations of the attention matrix reduces complexity to O(N*sqrt(N)).
Can generate sounds and images.
Sparse Transformer
Generating Long Sequences with Sparse Transformers
https://openai.com/blog/sparse-transformer/

Image GPT (iGPT)
Just GPT-2 trained on images unrolled into long sequences of pixels!
Waiting for GPT-3 (uses sparse attention) trained on images.
https://openai.com/blog/image-gpt/

Axial Transformer
Transformer for images and other data organized as high dimensional tensors
Axial Attention in Multidimensional Transformers

Self-attention for Image Recognition
Self-attention can even outperform convolutions for image recognition!
Exploring Self-attention for Image Recognition
https://github.com/hszhao/SAN

New algorithm for relative self-attention with dramatically reduced memory footprint.
Music Transformer
Music Transformer
https://magenta.tensorflow.org/music-transformer

Basically GPT-2 + Sparse Transformer trained on music (MIDI files).
MuseNet
https://openai.com/blog/musenet/

● Transformers are cool and produce great results!
● There are many modifications, it’s kind of LEGO, you can combine it.
● More good source code and libraries are available (Huggingface, Colab
notebooks, etc)
● Definitely more transformers to come!
● GET INVOLVED!
You CAN move things forward!
Wrap up

https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!
(yes, we’re hiring!
python/asyncio/backend dev)

Transformer Zoo

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

Similaire à Transformer Zoo

Similaire à Transformer Zoo (20)

Plus de Grigory Sapunov

Plus de Grigory Sapunov (20)

Dernier

Dernier (20)

Transformer Zoo