Transformer Zoo
Grigory Sapunov
DEVPARTY
27.06.2020
gs@inten.to
● Recap: Types of neural networks (FFN, CNN, RNN)
● Recap: Attention & Self-Attention
● Transformer architecture
● Transformer “language models” (GPT*, BERT)
● Transformer modifications (including transformers for images, sound and
other non-NLP tasks)
Plan
Recap: Types of neural networks
(FFN, CNN, RNN)
“Classic” types of neural networks
FFN CNN
RNN (LSTM, GRU, …)
The “classic” seq2seq architecture: encoder-decoder
https://www.quora.com/What-is-an-Encoder-Decoder-in-Deep-Learning
Modern seq2seq architectures
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures,
https://arxiv.org/abs/1808.08946
Attention & Self-Attention
Encoder-Decoder shortcomings
Encoder-Decoder can be applied to N-to-M sequence mapping, yet the Encoder reads and
encodes the whole source sentence into a single fixed-length vector. Is one hidden state really
enough? The network has to compress all the necessary information of a source
sentence into that fixed-length vector.
Encoder-Decoder with Attention
https://hackernoon.com/attention-mechanism-in-neural-network-30aaf5e39512
Attention Mechanism allows the decoder to attend to different parts of the source
sentence at each step of the output generation.
Instead of encoding the input sequence into a single fixed context vector, we let
the model learn how to generate a context vector for each output time step. That
is, we let the model learn what to attend to, based on the input sentence and what it
has produced so far.
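For reference, a standard (Bahdanau-style) formulation of this context vector, written out here from the general definition rather than taken from the slides: the context for output step t is a weighted sum of encoder states, with weights given by a softmax over alignment scores.

c_t = \sum_i \alpha_{t,i} h_i, \qquad
\alpha_{t,i} = \frac{\exp(\mathrm{score}(s_{t-1}, h_i))}{\sum_j \exp(\mathrm{score}(s_{t-1}, h_j))}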
Encoder-Decoder with Attention
https://research.googleblog.com/2016/09/a-neural-network-for-machine.html
Attention Mechanism allows the decoder to attend to different parts of the source
sentence at each step of the output generation.
Visualizing RNN attention weights αij on MT
Neural Machine Translation by Jointly Learning to Align and Translate, https://arxiv.org/abs/1409.0473
Visualizing RNN attention heat maps on QA
Teaching Machines to Read and Comprehend, https://arxiv.org/abs/1506.03340
CNN+RNN with Attention
http://kelvinxu.github.io/projects/capgen.html
Self-attention (Intra-Attention)
Each element in the sentence attends to the other elements, which gives
context-sensitive encodings.
Long Short-Term Memory-Networks for Machine Reading, https://arxiv.org/abs/1601.06733
Self-Attention Neural Networks (SAN):
Transformer Architecture
Attention Is All You Need, https://arxiv.org/abs/1706.03762
Transformer
A new simple network architecture,
the Transformer:
● Is an Encoder-Decoder architecture
● Based solely on attention mechanisms
(no RNN/CNN)
● The major component of the Transformer is
the multi-head self-attention
mechanism.
● Fast: only matrix multiplications
● Strong results on standard WMT datasets
Working pipeline
http://jalammar.github.io/illustrated-transformer/
Input embeddings
http://jalammar.github.io/illustrated-transformer/
Encoder
http://jalammar.github.io/illustrated-transformer/
Encoder
http://jalammar.github.io/illustrated-transformer/
Multi-head self-attention mechanism
Essentially, Multi-Head Attention is just
several attention layers run in parallel on
different linear transformations of the same
input, with their outputs concatenated.
The transformer adopts the scaled dot-product
attention: the output is a weighted sum of the
values, where the weight assigned to each value
is determined by the dot-product of the query
with all the keys:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V

The input consists of queries and keys of
dimension d_k, and values of dimension d_v.
Scaled dot-product attention
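A minimal NumPy sketch of scaled dot-product attention (illustrative only, not the original Tensor2Tensor implementation); multi-head attention then simply runs several such attentions over different linear projections of the same input and concatenates the results.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # dot-products of each query with all keys
    weights = softmax(scores, axis=-1)           # attention weights sum to 1 over the keys
    return weights @ V                           # weighted sum of the values: (n_q, d_v)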
Decoder
http://jalammar.github.io/illustrated-transformer/
Decoder
http://jalammar.github.io/illustrated-transformer/
The Final Linear and Softmax Layer
http://jalammar.github.io/illustrated-transformer/
Multi-head self-attention example (2 heads shown)
http://jalammar.github.io/illustrated-transformer/
Attention visualization
Applying the Transformer to machine translation
https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
Resources
● The Annotated Transformer
http://nlp.seas.harvard.edu/2018/04/03/attention.html
● Attention? Attention!
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
● The Illustrated Transformer
http://jalammar.github.io/illustrated-transformer/
● Paper Dissected: “Attention is All You Need” Explained
http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/
● The Transformer – Attention is all you need.
https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/
● When Recurrent Models Don't Need to be Recurrent
https://bair.berkeley.edu/blog/2018/08/06/recurrent/
● Self-Attention Mechanisms in Natural Language Processing,
https://www.alibabacloud.com/blog/self-attention-mechanisms-in-natural-language-
processing_593968
Code
● https://github.com/huggingface/transformers
● https://github.com/ThilinaRajapakse/simpletransformers
● https://github.com/pytorch/fairseq
● https://www.tensorflow.org/tutorials/text/transformer
● https://github.com/tensorflow/models/tree/master/official/transformer
Tensor2Tensor library (the original code)
● https://github.com/tensorflow/tensor2tensor
● Running the Transformer with Tensor2Tensor
https://cloud.google.com/tpu/docs/tutorials/transformer
● https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html
BERT & Co
BERT
Bidirectional Encoder Representations from Transformers, or BERT.
BERT is designed to pre-train deep bidirectional representations by jointly
conditioning on both left and right context in all layers. As a result, the pre-trained
BERT representations can be fine-tuned with just one additional output layer to
create state-of-the-art models for a wide range of tasks, such as question
answering and language inference, without substantial task-specific architecture
modifications.
BERT uses only the encoder part of the Transformer.
Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing,
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks
https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks-
4a2a189bc155
BERT
Bidirectional Encoder Representations from Transformers, or BERT
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,
https://arxiv.org/abs/1810.04805
Pre-training tasks:
● Masked Language Model: predict randomly masked words within the sequence,
not the next word of a left-to-right sequence.
● Next Sentence Prediction: give the model two sentences and ask it to
predict if the second sentence follows the first in a corpus or not.
Input =
[CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP]
BERT
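A toy sketch of how such a masked input can be constructed (a hypothetical helper, not the real WordPiece pipeline; the 80/10/10 replacement rule follows the BERT paper):

import random

def mask_tokens(tokens, mask_prob=0.15):
    masked, labels = [], []
    for tok in tokens:
        if tok in ("[CLS]", "[SEP]") or random.random() > mask_prob:
            masked.append(tok); labels.append(None)              # not predicted
        else:
            labels.append(tok)                                   # predict the original token
            r = random.random()
            if r < 0.8:   masked.append("[MASK]")                # 80%: replace with [MASK]
            elif r < 0.9: masked.append(random.choice(tokens))   # 10%: random token
            else:         masked.append(tok)                     # 10%: keep unchanged
    return masked, labels

sent_a = "the man went to the store".split()
sent_b = "he bought a gallon of milk".split()   # a true "next sentence" for NSP
tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
print(mask_tokens(tokens))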
BERT: masked language model
https://jalammar.github.io/illustrated-bert/
BERT: next sentence prediction
https://jalammar.github.io/illustrated-bert/
BERT
How to use:
● Fine-tuning approach: pre-train some model architecture on an LM objective
before fine-tuning that same model for a supervised downstream task.
○ Our task specific models are formed by incorporating BERT with one additional output layer,
so a minimal number of parameters need to be learned from scratch.
● Feature-based approach: learned representations are typically used as
features in a downstream model.
○ Not all NLP tasks can be easily represented by a Transformer encoder architecture, and
some therefore require a task-specific model architecture to be added.
○ There are major computational benefits to being able to pre-compute an expensive
representation of the training data once and then run many experiments with less expensive
models on top of this representation.
BERT: using fine-tuning approach
BERT: using fine-tuning approach
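A minimal fine-tuning sketch using the Hugging Face transformers library (API roughly as of 2020; the checkpoint name, example text and single training step are only illustrative, and the optimizer/loop are omitted):

import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer.encode_plus("a gallon of milk was bought", return_tensors="pt")
labels = torch.tensor([1])                     # toy label for a 2-class task
outputs = model(**inputs, labels=labels)       # only the added classification head is new
loss = outputs[0]                              # (loss, logits, ...) tuple in that API version
loss.backward()                                # one fine-tuning step (optimizer step omitted)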
Example: BioBERT
https://arxiv.org/abs/1901.08746
https://github.com/dmis-lab/biobert
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Example: BioBERT
https://arxiv.org/abs/1901.08746
https://github.com/dmis-lab/biobert
BioBERT: a pre-trained biomedical language representation model for biomedical text mining
Example: VideoBERT
https://arxiv.org/abs/1904.01766
https://ai.googleblog.com/2019/09/learning-cross-modal-temporal.html
VideoBERT: A Joint Model for Video and Language Representation Learning
Combine visual tokens (produced with the help of a CNN) with text tokens (obtained
with ASR). It can be used for video captioning, video-to-video or text-to-video prediction.
Example: VideoBERT
Text-to-video prediction can be used to automatically generate a set of
instructions (such as a recipe) from video, yielding video segments (tokens) that
reflect what is described at each step.
RoBERTa: A Robustly Optimized BERT
https://arxiv.org/abs/1907.11692
https://blog.inten.to/papers-roberta-a-robustly-optimized-bert-pretraining-approach-7449bc5423e7
BERT was significantly undertrained.
Improvements:
● Take more data, train longer
● Next sentence prediction objective is obsolete
● Longer sentences
● Larger batches
● Dynamically changing the masking pattern
(BERT uses a single static mask)
Result: state-of-the-art on 4/9 GLUE tasks.
DistilBERT, a distilled version of BERT
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
https://arxiv.org/abs/1910.01108
https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html
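A generic knowledge-distillation loss sketch of the kind DistilBERT builds on (the actual DistilBERT objective combines a soft-target distillation term, the masked-LM loss and a cosine embedding loss; the temperature and weights below are arbitrary):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                      # match the teacher's softened distribution
    hard = F.cross_entropy(student_logits, labels)   # usual hard-label loss
    return alpha * soft + (1 - alpha) * hard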
ALBERT: A Lite BERT
ALBERT: A Lite BERT for Self-supervised Learning of Language Representations
https://arxiv.org/abs/1909.11942
https://blog.inten.to/speeding-up-bert-5528e18bb4ea
Other BERTs are constantly coming
GPT-2
https://openai.com/blog/better-language-models/
https://github.com/openai/gpt-2
http://jalammar.github.io/illustrated-gpt2/
A language model based on the Transformer decoder.
It can generate continuations of a given text. It was so good
that OpenAI treated it as a dangerous thing that could be misused.
You can play with GPT (and other models) here: https://transformer.huggingface.co/
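A minimal continuation sketch with the Hugging Face transformers library (the prompt and sampling parameters are arbitrary; API roughly as of 2020):

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

input_ids = tokenizer.encode("The Transformer architecture", return_tensors="pt")
output = model.generate(input_ids, max_length=40, do_sample=True, top_k=50, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))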
GPT-2
http://jalammar.github.io/illustrated-gpt2/
GPT-2
http://jalammar.github.io/illustrated-gpt2/
GPT-2 / BERT / Transformer-XL
http://jalammar.github.io/illustrated-gpt2/
GPT-3
https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9
https://arxiv.org/abs/2005.14165
● The GPT-3 family of models is a recent upgrade of the well-known GPT-2
model. The largest of them (175B parameters), “GPT-3” proper, is more than 100x
larger than the largest GPT-2 (1.5B parameters).
GPT-3
https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9
https://arxiv.org/abs/2005.14165
● The GPT-3 architecture is mostly the same as GPT-2’s (there are minor
differences, e.g. sparse attention).
● No, you can’t download the model 😎
● And you probably can’t even train it from scratch unless you have a very
powerful infrastructure.
GPT-3
is 10 screens
taller!!!
BART: “classic” seq2seq
BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and
Comprehension, https://arxiv.org/abs/1910.13461
BERT encoder
+
GPT decoder
Language Model Zoo
● ELMo
● ULMFiT
● GPT
● BERT (BioBERT,
ClinicalBERT, …)
● ERNIE
● XLNet
● RoBERTa
● KERMIT
● ERNIE 2.0
● GPT-2
● ALBERT
● GPT-3
● …
Resources
● Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language
Processing
https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
● Dissecting BERT Part 1: Understanding the Transformer
https://medium.com/@mromerocalvo/dissecting-bert-part1-6dcf5360b07f
● Understanding BERT Part 2: BERT Specifics
https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73
● Dissecting BERT Appendix: The Decoder
https://medium.com/dissecting-bert/dissecting-bert-appendix-the-decoder-3b86f66b0e5f
● The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
https://jalammar.github.io/illustrated-bert/
● Speeding Up BERT https://blog.inten.to/speeding-up-bert-5528e18bb4ea
● Interesting papers in our Telegram channel: https://t.me/gonzo_ML
Code
● TensorFlow code and pre-trained models for BERT
https://github.com/google-research/bert
● State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch.
https://github.com/huggingface/transformers
● GPT-2 https://github.com/openai/gpt-2
● DeepPavlov: An open source library for deep learning end-to-end dialog
systems and chatbots
https://github.com/deepmipt/DeepPavlov
● Transformers made simple
https://github.com/ThilinaRajapakse/simpletransformers
https://medium.com/swlh/simple-transformers-multi-class-text-classification-
with-bert-roberta-xlnet-xlm-and-8b585000ce3a
Transformer modifications
Many other transformers
● Image Transformer
● Music Transformer
● Universal Transformer
● Transformer-XL
● Sparse Transformer
● Star-Transformer
● R-Transformer
● Reformer
● Compressive Transformer
● Longformer
● Extended Transformer
Construction (ETC)
● Levenstein Transformer, Insertion Transformer, Imputer, KERMIT, …
● ...
Problems with vanilla transformers
● It’s a pretty heavy model
→ hard to train, tricky training
schedule
● Its attention mechanism has O(N²)
computational complexity
→ scales poorly (see the back-of-envelope
estimate after this list)
● It has limited context span
(mostly due to the complexity),
typically 512 tokens
→ can’t process long sequences.
● May need a different inductive bias
for other types of data (e.g. images,
sound, etc.)
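A back-of-envelope estimate of why the quadratic attention matrix limits the context span (fp32, a single attention matrix of a single head in a single layer; the numbers are purely illustrative):

for n in (512, 2048, 65536):
    attn_bytes = n * n * 4                          # one n x n attention matrix in fp32
    print(f"{n:>6} tokens -> {attn_bytes / 2**20:8,.0f} MiB per head per layer")

# 512 tokens need ~1 MiB, but 64K tokens already need ~16 GiB for a single attention matrix.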
Transformer with added recurrence: it can see the previous segment
representations, so it can process longer sequences.
Transformer-XL
https://arxiv.org/abs/1901.02860
The Compressive Transformer keeps a fine-grained memory of past activations,
which are then compressed into coarser compressed memories.
Compressive Transformer
Compressive Transformers for Long-Range Sequence Modelling
https://arxiv.org/abs/1911.05507
Reformer is an optimized Transformer:
● Uses less memory
● Calculates attention using LSH
(Locality-Sensitive Hashing)
○ O(L²) → O(L·log L)
● ⇒ can process longer sequences!
64K-token sequences on one GPU!
Reformer
Reformer: The Efficient Transformer
https://arxiv.org/abs/2001.04451
https://twitter.com/huggingface/status/1263850138595987457
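A rough sketch of the angular-LSH bucketing idea the Reformer relies on (a toy illustration only; the real model also shares queries and keys, sorts tokens by bucket and attends within chunks):

import numpy as np

def lsh_buckets(x, n_buckets, seed=0):
    # x: (n_tokens, d) query/key vectors; n_buckets must be even
    rng = np.random.default_rng(seed)
    R = rng.standard_normal((x.shape[-1], n_buckets // 2))        # random rotation
    h = x @ R
    return np.argmax(np.concatenate([h, -h], axis=-1), axis=-1)   # nearby vectors tend to collide

tokens = np.random.randn(16, 64)
print(lsh_buckets(tokens, n_buckets=8))   # each token only attends within its bucket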
Local + Global attention. Scales linearly!
Longformer
Longformer: The Long-Document Transformer
https://arxiv.org/abs/2004.05150
● Another local + global attention.
● Can incorporate structured data into the model!
Extended Transformer Construction (ETC)
ETC: Encoding Long and Structured Data in Transformers
https://arxiv.org/abs/2004.08483
Idea:
● Apply ACT to Transformers
● Apply a variable number of repetitions for calculating each position: a
Universal Transformer (UT)
● Use dynamic attention span: Adaptive Attention Span in Transformers
Adaptive Computation Time in Transformers
Adaptive Computation Time (ACT) in Neural Networks [3/3]
https://medium.com/@moocaholic/adaptive-computation-time-act-in-neural-networks-3-3-99452b2eff18
● Two flavors of UT in the paper:
○ UT with a fixed number of repetitions.
○ UT with dynamic halting.
● The UT repeatedly refines a series of vector representations for each position
of the sequence in parallel, by combining information from different positions
using self-attention and applying a recurrent transition function across all time
steps.
○ The number of time steps, T, is arbitrary but fixed (no ACT here, fixed
number of repetitions).
○ The number of time steps, T, is dynamic (a dynamic ACT halting
mechanism for each position in the input sequence).
Universal Transformer (UT): Implementation
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
UT with a fixed number of repetitions
“Moving Beyond Translation with the Universal Transformer”,
https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
Adaptive UT with dynamic halting
“Universal Transformers”,
https://mostafadehghani.com/2019/05/05/universal-transformers/
● Universal Transformer is a recurrent function (not in time, but in depth) that
evolves per-symbol hidden states in parallel, based at each step on the
sequence of previous hidden states.
○ In that sense, UT is similar to architectures such as the Neural GPU
and the Neural Turing Machine.
● When running for a fixed number of steps, the Universal Transformer is
equivalent to a multi-layer Transformer with tied parameters across its layers.
● Adaptive UT: as the recurrent transition function can be applied any number
of times, this implies that adaptive UTs can have variable depth (number of
per-symbol processing steps).
● Universal Transformer can be shown to be Turing-complete (or
“computationally universal”)
Universal Transformer (UT): Notes
“Universal Transformers”,
https://arxiv.org/abs/1807.03819
● The problem with the vanilla transformer is its fixed context size (or attention
span).
● It cannot be very large because of the computation cost of the attention
mechanism (it requires O(n²) computations).
● Let the layer (or even the attention head) decide the required context size on
its own.
● There are two options:
○ Learnable (the adaptive attention span): let each attention head learn its
own attention span independently from the other heads. It is learnable,
but still fixed after the training is done.
○ ACT-like (the dynamic attention span): changes the span dynamically
depending on the current input.
Adaptive Attention Span: Idea & Implementation
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
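The learnable variant uses a soft masking ramp over token distance, roughly as in the paper (z is the learned span, R a ramp-length hyperparameter, x the distance to the attended position); attention weights are multiplied by this mask before normalization:

m_z(x) = \min\big[\max\big[\tfrac{1}{R}(R + z - x),\ 0\big],\ 1\big]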
The models are smaller, the performance is better.
Adaptive Attention Span: Performance
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
Adaptive spans (in log scale) of all attention heads in a 12-layer model with
span limit S = 4096. Only a few attention heads require long attention spans.
Adaptive spans are learned larger when needed
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
Example of average dynamic attention span as a function of the input sequence.
The span is averaged over the layers and heads.
Dynamic spans adapt to the input sequence
“Adaptive Attention Span in Transformers”,
https://arxiv.org/abs/1905.07799
Not only texts...
Image Transformer
● Local self-attention
Image Transformer, https://arxiv.org/abs/1802.05751
Sparse factorizations of the attention matrix reduce the complexity to O(N·√N).
It can generate sound and images.
Sparse Transformer
Generating Long Sequences with Sparse Transformers
https://arxiv.org/abs/1904.10509
https://openai.com/blog/sparse-transformer/
Image GPT (iGPT)
Just GPT-2 trained on images unrolled into long sequences of pixels!
Waiting for GPT-3 (uses sparse attention) trained on images.
https://openai.com/blog/image-gpt/
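A toy sketch of turning an image into a long pixel sequence for such an autoregressive model (iGPT actually clusters RGB values into a small color vocabulary; this just flattens a grayscale version in raster order with a made-up quantization):

import numpy as np

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in for a 32x32 RGB image
gray = img.mean(axis=-1)                        # collapse channels for simplicity
tokens = (gray // 16).astype(int).reshape(-1)   # coarse 16-level quantization, raster order
print(tokens.shape)                             # (1024,) -> a "sentence" of 1024 pixel tokens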
Axial Transformer
A Transformer for images and other data organized as high-dimensional tensors.
Axial Attention in Multidimensional Transformers
https://arxiv.org/abs/1912.12180
Self-attention for Image Recognition
Self-attention can even outperform convolutions for image recognition!
Exploring Self-attention for Image Recognition
https://arxiv.org/abs/2004.13621
https://github.com/hszhao/SAN
New algorithm for relative self-attention with dramatically reduced memory footprint.
Music Transformer
Music Transformer
https://arxiv.org/abs/1809.04281
https://magenta.tensorflow.org/music-transformer
Basically GPT-2 + Sparse Transformer trained on music (MIDI files).
MuseNet
https://openai.com/blog/musenet/
Wrap up
● Transformers are cool and produce great results!
● There are many modifications; it’s a kind of LEGO, you can combine them.
● More good source code and libraries are available (Huggingface, Colab
notebooks, etc)
● Definitely more transformers to come!
● GET INVOLVED!
You CAN move things forward!
Wrap up
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!
(yes, we’re hiring!
python/asyncio/backend dev)

Contenu connexe

Tendances

Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Yuta Niki
 
Stable Diffusion path
Stable Diffusion pathStable Diffusion path
Stable Diffusion pathVitaly Bondar
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and TransformerArvind Devaraj
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers Arvind Devaraj
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Deep Learning Italia
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaAlexey Grigorev
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...Edge AI and Vision Alliance
 
Self-Attention with Linear Complexity
Self-Attention with Linear ComplexitySelf-Attention with Linear Complexity
Self-Attention with Linear ComplexitySangwoo Mo
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...Simplilearn
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersLiangqun Lu
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingMinh Pham
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)WarNik Chow
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfDavid Rostcheck
 
The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021Steve Omohundro
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with TransformersJulien SIMON
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 

Tendances (20)

Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)Transformer Introduction (Seminar Material)
Transformer Introduction (Seminar Material)
 
Stable Diffusion path
Stable Diffusion pathStable Diffusion path
Stable Diffusion path
 
Deep learning for NLP and Transformer
 Deep learning for NLP  and Transformer Deep learning for NLP  and Transformer
Deep learning for NLP and Transformer
 
BERT introduction
BERT introductionBERT introduction
BERT introduction
 
NLP using transformers
NLP using transformers NLP using transformers
NLP using transformers
 
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
Transformer Seq2Sqe Models: Concepts, Trends & Limitations (DLI)
 
Introduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga PetrovaIntroduction to Transformers for NLP - Olga Petrova
Introduction to Transformers for NLP - Olga Petrova
 
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...“How Transformers are Changing the Direction of Deep Learning Architectures,”...
“How Transformers are Changing the Direction of Deep Learning Architectures,”...
 
[Paper review] BERT
[Paper review] BERT[Paper review] BERT
[Paper review] BERT
 
Self-Attention with Linear Complexity
Self-Attention with Linear ComplexitySelf-Attention with Linear Complexity
Self-Attention with Linear Complexity
 
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
TensorFlow Tutorial | Deep Learning With TensorFlow | TensorFlow Tutorial For...
 
BERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from TransformersBERT: Bidirectional Encoder Representations from Transformers
BERT: Bidirectional Encoder Representations from Transformers
 
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language UnderstandingBERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
 
1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)1909 BERT: why-and-how (CODE SEMINAR)
1909 BERT: why-and-how (CODE SEMINAR)
 
Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 
The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021
 
Building NLP applications with Transformers
Building NLP applications with TransformersBuilding NLP applications with Transformers
Building NLP applications with Transformers
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
Bert
BertBert
Bert
 
BERT
BERTBERT
BERT
 

Similaire à Transformer Zoo

Transformer Zoo (a deeper dive)
Transformer Zoo (a deeper dive)Transformer Zoo (a deeper dive)
Transformer Zoo (a deeper dive)Grigory Sapunov
 
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Vimukthi Wickramasinghe
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesDr. Fabio Baruffa
 
EclipseCon Eu 2015 - Breathe life into your Designer!
EclipseCon Eu 2015 - Breathe life into your Designer!EclipseCon Eu 2015 - Breathe life into your Designer!
EclipseCon Eu 2015 - Breathe life into your Designer!melbats
 
IRJET - Speech to Speech Translation using Encoder Decoder Architecture
IRJET -  	  Speech to Speech Translation using Encoder Decoder ArchitectureIRJET -  	  Speech to Speech Translation using Encoder Decoder Architecture
IRJET - Speech to Speech Translation using Encoder Decoder ArchitectureIRJET Journal
 
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using KurentoFIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using KurentoFIWARE
 
Oh the compilers you'll build
Oh the compilers you'll buildOh the compilers you'll build
Oh the compilers you'll buildMark Stoodley
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Universitat Politècnica de Catalunya
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by Videoguy
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by Videoguy
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by Videoguy
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by Videoguy
 
BERT - Part 2 Learning Notes
BERT - Part 2 Learning NotesBERT - Part 2 Learning Notes
BERT - Part 2 Learning NotesSenthil Kumar M
 
Tensorflow Lite and ARM Compute Library
Tensorflow Lite and ARM Compute LibraryTensorflow Lite and ARM Compute Library
Tensorflow Lite and ARM Compute LibraryKobe Yu
 
Transformer Models_ BERT vs. GPT.pdf
Transformer Models_ BERT vs. GPT.pdfTransformer Models_ BERT vs. GPT.pdf
Transformer Models_ BERT vs. GPT.pdfhelloworld28847
 
Sepformer&DPTNet.pdf
Sepformer&DPTNet.pdfSepformer&DPTNet.pdf
Sepformer&DPTNet.pdfssuser849b73
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ijnlc
 

Similaire à Transformer Zoo (20)

Transformer Zoo (a deeper dive)
Transformer Zoo (a deeper dive)Transformer Zoo (a deeper dive)
Transformer Zoo (a deeper dive)
 
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
Learning New Semi-Supervised Deep Auto-encoder Features for Statistical Machi...
 
Bert.pptx
Bert.pptxBert.pptx
Bert.pptx
 
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core ArchitecturesPerformance Optimization of SPH Algorithms for Multi/Many-Core Architectures
Performance Optimization of SPH Algorithms for Multi/Many-Core Architectures
 
EclipseCon Eu 2015 - Breathe life into your Designer!
EclipseCon Eu 2015 - Breathe life into your Designer!EclipseCon Eu 2015 - Breathe life into your Designer!
EclipseCon Eu 2015 - Breathe life into your Designer!
 
IRJET - Speech to Speech Translation using Encoder Decoder Architecture
IRJET -  	  Speech to Speech Translation using Encoder Decoder ArchitectureIRJET -  	  Speech to Speech Translation using Encoder Decoder Architecture
IRJET - Speech to Speech Translation using Encoder Decoder Architecture
 
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using KurentoFIWARE Global Summit - Real-time Media Stream Processing Using Kurento
FIWARE Global Summit - Real-time Media Stream Processing Using Kurento
 
Birendra_resume
Birendra_resumeBirendra_resume
Birendra_resume
 
Oh the compilers you'll build
Oh the compilers you'll buildOh the compilers you'll build
Oh the compilers you'll build
 
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
Advanced Neural Machine Translation (D4L2 Deep Learning for Speech and Langua...
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by
 
Video coding technology proposal by
Video coding technology proposal by Video coding technology proposal by
Video coding technology proposal by
 
Birendra_resume
Birendra_resumeBirendra_resume
Birendra_resume
 
BERT - Part 2 Learning Notes
BERT - Part 2 Learning NotesBERT - Part 2 Learning Notes
BERT - Part 2 Learning Notes
 
Tensorflow Lite and ARM Compute Library
Tensorflow Lite and ARM Compute LibraryTensorflow Lite and ARM Compute Library
Tensorflow Lite and ARM Compute Library
 
Transformer Models_ BERT vs. GPT.pdf
Transformer Models_ BERT vs. GPT.pdfTransformer Models_ BERT vs. GPT.pdf
Transformer Models_ BERT vs. GPT.pdf
 
Sepformer&DPTNet.pdf
Sepformer&DPTNet.pdfSepformer&DPTNet.pdf
Sepformer&DPTNet.pdf
 
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
ANALYZING ARCHITECTURES FOR NEURAL MACHINE TRANSLATION USING LOW COMPUTATIONA...
 

Plus de Grigory Sapunov

AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021Grigory Sapunov
 
What's new in AI in 2020 (very short)
What's new in AI in 2020 (very short)What's new in AI in 2020 (very short)
What's new in AI in 2020 (very short)Grigory Sapunov
 
Artificial Intelligence (lecture for schoolchildren) [rus]
Artificial Intelligence (lecture for schoolchildren) [rus]Artificial Intelligence (lecture for schoolchildren) [rus]
Artificial Intelligence (lecture for schoolchildren) [rus]Grigory Sapunov
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware LandscapeGrigory Sapunov
 
Modern neural net architectures - Year 2019 version
Modern neural net architectures - Year 2019 versionModern neural net architectures - Year 2019 version
Modern neural net architectures - Year 2019 versionGrigory Sapunov
 
AI - Last Year Progress (2018-2019)
AI - Last Year Progress (2018-2019)AI - Last Year Progress (2018-2019)
AI - Last Year Progress (2018-2019)Grigory Sapunov
 
Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​Grigory Sapunov
 
Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Grigory Sapunov
 
Sequence learning and modern RNNs
Sequence learning and modern RNNsSequence learning and modern RNNs
Sequence learning and modern RNNsGrigory Sapunov
 
Введение в Deep Learning
Введение в Deep LearningВведение в Deep Learning
Введение в Deep LearningGrigory Sapunov
 
Введение в машинное обучение
Введение в машинное обучениеВведение в машинное обучение
Введение в машинное обучениеGrigory Sapunov
 
Введение в архитектуры нейронных сетей / HighLoad++ 2016
Введение в архитектуры нейронных сетей / HighLoad++ 2016Введение в архитектуры нейронных сетей / HighLoad++ 2016
Введение в архитектуры нейронных сетей / HighLoad++ 2016Grigory Sapunov
 
Artificial Intelligence - Past, Present and Future
Artificial Intelligence - Past, Present and FutureArtificial Intelligence - Past, Present and Future
Artificial Intelligence - Past, Present and FutureGrigory Sapunov
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Grigory Sapunov
 
Deep Learning Cases: Text and Image Processing
Deep Learning Cases: Text and Image ProcessingDeep Learning Cases: Text and Image Processing
Deep Learning Cases: Text and Image ProcessingGrigory Sapunov
 
Computer Vision and Deep Learning
Computer Vision and Deep LearningComputer Vision and Deep Learning
Computer Vision and Deep LearningGrigory Sapunov
 

Plus de Grigory Sapunov (20)

AI Hardware Landscape 2021
AI Hardware Landscape 2021AI Hardware Landscape 2021
AI Hardware Landscape 2021
 
NLP in 2020
NLP in 2020NLP in 2020
NLP in 2020
 
What's new in AI in 2020 (very short)
What's new in AI in 2020 (very short)What's new in AI in 2020 (very short)
What's new in AI in 2020 (very short)
 
Artificial Intelligence (lecture for schoolchildren) [rus]
Artificial Intelligence (lecture for schoolchildren) [rus]Artificial Intelligence (lecture for schoolchildren) [rus]
Artificial Intelligence (lecture for schoolchildren) [rus]
 
BERTology meets Biology
BERTology meets BiologyBERTology meets Biology
BERTology meets Biology
 
Deep learning: Hardware Landscape
Deep learning: Hardware LandscapeDeep learning: Hardware Landscape
Deep learning: Hardware Landscape
 
Modern neural net architectures - Year 2019 version
Modern neural net architectures - Year 2019 versionModern neural net architectures - Year 2019 version
Modern neural net architectures - Year 2019 version
 
AI - Last Year Progress (2018-2019)
AI - Last Year Progress (2018-2019)AI - Last Year Progress (2018-2019)
AI - Last Year Progress (2018-2019)
 
Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​Практический подход к выбору доменно-адаптивного NMT​
Практический подход к выбору доменно-адаптивного NMT​
 
Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018Deep Learning: Application Landscape - March 2018
Deep Learning: Application Landscape - March 2018
 
Sequence learning and modern RNNs
Sequence learning and modern RNNsSequence learning and modern RNNs
Sequence learning and modern RNNs
 
Введение в Deep Learning
Введение в Deep LearningВведение в Deep Learning
Введение в Deep Learning
 
Введение в машинное обучение
Введение в машинное обучениеВведение в машинное обучение
Введение в машинное обучение
 
Введение в архитектуры нейронных сетей / HighLoad++ 2016
Введение в архитектуры нейронных сетей / HighLoad++ 2016Введение в архитектуры нейронных сетей / HighLoad++ 2016
Введение в архитектуры нейронных сетей / HighLoad++ 2016
 
Artificial Intelligence - Past, Present and Future
Artificial Intelligence - Past, Present and FutureArtificial Intelligence - Past, Present and Future
Artificial Intelligence - Past, Present and Future
 
Multidimensional RNN
Multidimensional RNNMultidimensional RNN
Multidimensional RNN
 
Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016Deep Learning and the state of AI / 2016
Deep Learning and the state of AI / 2016
 
Deep Learning Cases: Text and Image Processing
Deep Learning Cases: Text and Image ProcessingDeep Learning Cases: Text and Image Processing
Deep Learning Cases: Text and Image Processing
 
Computer Vision and Deep Learning
Computer Vision and Deep LearningComputer Vision and Deep Learning
Computer Vision and Deep Learning
 
Apache Spark & MLlib
Apache Spark & MLlibApache Spark & MLlib
Apache Spark & MLlib
 

Dernier

Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)Areesha Ahmad
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsSérgio Sacani
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​kaibalyasahoo82800
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSSLeenakshiTyagi
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPirithiRaju
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfSumit Kumar yadav
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...ssifa0344
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)Areesha Ahmad
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 

Dernier (20)

Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroidsHubble Asteroid Hunter III. Physical properties of newly found asteroids
Hubble Asteroid Hunter III. Physical properties of newly found asteroids
 
Nanoparticles synthesis and characterization​ ​
Nanoparticles synthesis and characterization​  ​Nanoparticles synthesis and characterization​  ​
Nanoparticles synthesis and characterization​ ​
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 

Transformer Zoo

  • 2. ● Recap: Types of neural networks (FFN, CNN, RNN) ● Recap: Attention & Self-Attention ● Transformer architecture ● Transformer “language models” (GPT*, BERT) ● Transformer modifications (including transformers for images, sound and other non-NLP tasks) Plan
  • 3. Recap: Types of neural networks (FFN, CNN, RNN)
  • 4. “Classic” types of neural networks FFN CNN RNN (LSTM, GRU, …)
  • 5. “Classic” of seq2seq: encoder-decoder https://www.quora.com/What-is-an-Encoder-Decoder-in-Deep-Learning
  • 6. Modern seq2seq architectures Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures, https://arxiv.org/abs/1808.08946
  • 8. Encoder-Decoder shortcomings Encoder-Decoder can be applied to N-to-M sequence, yet an Encoder reads and encodes a source sentence into a fixed-length vector. Is one hidden state really enough? A neural network needs to be able to compress all the necessary information of a source sentence into a fixed-length vector.
  • 9. Encoder-Decoder with Attention https://hackernoon.com/attention-mechanism-in-neural-network-30aaf5e39512 Attention Mechanism allows the decoder to attend to different parts of the source sentence at each step of the output generation. Instead of encoding the input sequence into a single fixed context vector, we let the model learn how to generate a context vector for each output time step. That is we let the model learn what to attend based on the input sentence and what it has produced so far.
  • 10. Encoder-Decoder with Attention https://research.googleblog.com/2016/09/a-neural-network-for-machine.html Attention Mechanism allows the decoder to attend to different parts of the source sentence at each step of the output generation.
  • 11. Visualizing RNN attention weights αij on MT Neural Machine Translation by Jointly Learning to Align and Translate, https://arxiv.org/abs/1409.0473
  • 12. Visualizing RNN attention heat maps on QA Teaching Machines to Read and Comprehend, https://arxiv.org/abs/1506.03340
  • 14. Self-attention (Intra-Attention) Each element in the sentence attends to other elements. It gives context sensitive encodings. Long Short-Term Memory-Networks for Machine Reading, https://arxiv.org/abs/1601.06733
  • 15. Self-Attention Neural Networks (SAN): Transformer Architecture
  • 16. Attention Is All You Need, https://arxiv.org/abs/1706.03762
  • 17. Transformer A new simple network architecture, the Transformer: ● Is a Encoder-Decoder architecture ● Based solely on attention mechanisms (no RNN/CNN) ● The major component in the transformer is the unit of multi-head self-attention mechanism. ● Fast: only matrix multiplications ● Strong results on standard WMT datasets
  • 18. Transformer A new simple network architecture, the Transformer: ● Is a Encoder-Decoder architecture ● Based solely on attention mechanisms (no RNN/CNN) ● The major component in the transformer is the unit of multi-head self-attention mechanism. ● Fast: only matrix multiplications ● Strong results on standard WMT datasets
  • 23.
  • 24. Multi-head self-attention mechanism Essentially, the Multi-Head Attention is just several attention layers stacked together with different linear transformations of the same input.
  • 25. The transformer adopts the scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot-product of the query with all the keys: The input consists of queries and keys of dimension dk, and values of dimension dv. Scaled dot-product attention
  • 28. The Final Linear and Softmax Layer http://jalammar.github.io/illustrated-transformer/
  • 29. Multi-head self-attention example (2 heads shown) http://jalammar.github.io/illustrated-transformer/
  • 31. Applying the Transformer to machine translation https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html
  • 32. Resources ● The Annotated Transformer http://nlp.seas.harvard.edu/2018/04/03/attention.html ● Attention? Attention! https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html ● The Illustrated Transformer http://jalammar.github.io/illustrated-transformer/ ● Paper Dissected: “Attention is All You Need” Explained http://mlexplained.com/2017/12/29/attention-is-all-you-need-explained/ ● The Transformer – Attention is all you need. https://mchromiak.github.io/articles/2017/Sep/12/Transformer-Attention-is-all-you-need/ ● When Recurrent Models Don't Need to be Recurrent https://bair.berkeley.edu/blog/2018/08/06/recurrent/ ● Self-Attention Mechanisms in Natural Language Processing, https://www.alibabacloud.com/blog/self-attention-mechanisms-in-natural-language- processing_593968
  • 33. Code ● https://github.com/huggingface/transformers ● https://github.com/ThilinaRajapakse/simpletransformers ● https://github.com/pytorch/fairseq ● https://www.tensorflow.org/tutorials/text/transformer ● https://github.com/tensorflow/models/tree/master/official/transformer Tensor2Tensor library (the original code) ● https://github.com/tensorflow/tensor2tensor ● Running the Transformer with Tensor2Tensor https://cloud.google.com/tpu/docs/tutorials/transformer ● https://ai.googleblog.com/2017/06/accelerating-deep-learning-research.html
  • 35. BERT Bidirectional Encoder Representations from Transformers, or BERT. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT representations can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT uses only the encoder part of the Transformer. Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing, https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html Best NLP Model Ever? Google BERT Sets New Standards in 11 Language Tasks https://medium.com/syncedreview/best-nlp-model-ever-google-bert-sets-new-standards-in-11-language-tasks- 4a2a189bc155
  • 36. BERT Bidirectional Encoder Representations from Transformers, or BERT BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, https://arxiv.org/abs/1810.04805
  • 37. Pre-training tasks: ● Masked Language Model: predict random words from within the sequence, not the next word for a sequence of words. ● Next Sentence Prediction: give the model two sentences and ask it to predict if the second sentence follows the first in a corpus or not. Input = [CLS] the man went to [MASK] store [SEP] he bought a gallon [MASK] milk [SEP] BERT
  • 38. BERT: masked language model https://jalammar.github.io/illustrated-bert/
  • 39. BERT: next sentence prediction https://jalammar.github.io/illustrated-bert/
  • 40. BERT How to use: ● Fine-tuning approach: pre-train some model architecture on a LM objective before fine-tuning that same model for a supervised downstream task. ○ Our task specific models are formed by incorporating BERT with one additional output layer, so a minimal number of parameters need to be learned from scratch. ● Feature-based approach: learned representations are typically used as features in a downstream model. ○ Not all NLP tasks can be easily be represented by a Transformer encoder architecture, and therefore require a task-specific model architecture to be added. ○ There are major computational benefits to being able to pre-compute an expensive representation of the training data once and then run many experiments with less expensive models on top of this representation
  • 43. Example: BioBERT https://arxiv.org/abs/1901.08746 https://github.com/dmis-lab/biobert BioBERT: a pre-trained biomedical language representation model for biomedical text mining
  • 44. Example: BioBERT https://arxiv.org/abs/1901.08746 https://github.com/dmis-lab/biobert BioBERT: a pre-trained biomedical language representation model for biomedical text mining
• 45. Example: VideoBERT https://arxiv.org/abs/1904.01766 https://ai.googleblog.com/2019/09/learning-cross-modal-temporal.html VideoBERT: A Joint Model for Video and Language Representation Learning Combines visual tokens (produced with the help of a CNN) with text tokens (obtained with ASR). Can be used for video captioning, video-to-video, or text-to-video prediction.
  • 46. Example: VideoBERT Text-to-video prediction can be used to automatically generate a set of instructions (such as a recipe) from video, yielding video segments (tokens) that reflect what is described at each step.
  • 47. RoBERTa: A Robustly Optimized BERT https://arxiv.org/abs/1907.11692 https://blog.inten.to/papers-roberta-a-robustly-optimized-bert-pretraining-approach-7449bc5423e7 BERT was significantly undertrained. Improvements: ● Take more data, train longer ● Next sentence prediction objective is obsolete ● Longer sentences ● Larger batches ● Dynamically changing the masking pattern (BERT uses a single static mask) Result: state-of-the-art on 4/9 GLUE tasks.
• 48. DistilBERT, a distilled version of BERT DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter https://arxiv.org/abs/1910.01108
• 49. ALBERT: A Lite BERT ALBERT: A Lite BERT for Self-supervised Learning of Language Representations https://arxiv.org/abs/1909.11942 https://ai.googleblog.com/2019/12/albert-lite-bert-for-self-supervised.html https://blog.inten.to/speeding-up-bert-5528e18bb4ea
• 50. Other BERTs are constantly coming out
• 51. GPT-2 https://openai.com/blog/better-language-models/ https://github.com/openai/gpt-2 http://jalammar.github.io/illustrated-gpt2/ A language model based on the transformer decoder. It can generate continuations of a given text. It was so good that OpenAI initially treated it as a dangerous thing that could be misused.
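A small sketch of sampling a continuation from the publicly released GPT-2 weights via the transformers library; the prompt is made up and the generation settings are arbitrary.

# Sketch: text continuation with the released GPT-2 weights.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("The Transformer architecture changed NLP because",
                max_length=40, num_return_sequences=1))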
  • 52. You can play with GPT (and other models) here: https://transformer.huggingface.co/
  • 55. GPT-2 / BERT / Transformer-XL http://jalammar.github.io/illustrated-gpt2/
• 56. GPT-3 https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9 https://arxiv.org/abs/2005.14165 ● The GPT-3 family of models is a recent upgrade of the well-known GPT-2 model; the largest of them (175B parameters), called “GPT-3”, is more than 100x larger than the largest GPT-2 (1.5B parameters).
• 57. GPT-3 https://blog.inten.to/gpt-3-language-models-are-few-shot-learners-a13d1ae8b1f9 https://arxiv.org/abs/2005.14165 ● The GPT-3 architecture is mostly the same as the GPT-2 one (there are minor differences, e.g. sparse attention). ● No, you can’t download the model 😎 ● And you probably can’t even train it from scratch unless you have a very powerful infrastructure.
  • 59. BART: “classic” seq2seq BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension, https://arxiv.org/abs/1910.13461 BERT encoder + GPT decoder
  • 60. Language Model Zoo ● ELMo ● ULMFiT ● GPT ● BERT (BioBERT, ClinicalBERT, …) ● ERNIE ● XLNet ● RoBERTa ● KERMIT ● ERNIE 2.0 ● GPT-2 ● ALBERT ● GPT-3 ● …
  • 61. Resources ● Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html ● Dissecting BERT Part 1: Understanding the Transformer https://medium.com/@mromerocalvo/dissecting-bert-part1-6dcf5360b07f ● Understanding BERT Part 2: BERT Specifics https://medium.com/dissecting-bert/dissecting-bert-part2-335ff2ed9c73 ● Dissecting BERT Appendix: The Decoder https://medium.com/dissecting-bert/dissecting-bert-appendix-the-decoder-3b86f66b0e5f ● The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) https://jalammar.github.io/illustrated-bert/ ● Speeding Up BERT https://blog.inten.to/speeding-up-bert-5528e18bb4ea ● Interesting papers in our Telegram channel: https://t.me/gonzo_ML
• 62. Code ● TensorFlow code and pre-trained models for BERT https://github.com/google-research/bert ● State-of-the-art Natural Language Processing for TensorFlow 2.0 and PyTorch. https://github.com/huggingface/transformers ● GPT-2 https://github.com/openai/gpt-2 ● DeepPavlov: An open source library for deep learning end-to-end dialog systems and chatbots https://github.com/deepmipt/DeepPavlov ● Transformers made simple https://github.com/ThilinaRajapakse/simpletransformers https://medium.com/swlh/simple-transformers-multi-class-text-classification-with-bert-roberta-xlnet-xlm-and-8b585000ce3a
• 64. Many other transformers ● Image Transformer ● Music Transformer ● Universal Transformer ● Transformer-XL ● Sparse Transformer ● Star-Transformer ● R-Transformer ● Reformer ● Compressive Transformer ● Longformer ● Extended Transformer Construction (ETC) ● Levenshtein Transformer, Insertion Transformer, Imputer, KERMIT, … ● ...
• 65. Problems with vanilla transformers ● It’s a pretty heavy model → hard to train, tricky training schedule ● Its attention mechanism has O(N²) computational complexity → scales poorly (see the sketch below) ● It has a limited context span (mostly due to that complexity), typically 512 tokens → can’t process long sequences. ● May need a different inductive bias for other types of data (e.g. images, sound, etc.)
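The quadratic cost comes from the N×N matrix of pairwise attention scores. A bare-bones single-head self-attention sketch in NumPy (random weights, no masking or multi-head machinery) makes the bottleneck explicit.

# Sketch: why vanilla self-attention is O(N^2) in the sequence length N.
import numpy as np

def self_attention(x, wq, wk, wv):
    """x: (N, d) token embeddings; wq/wk/wv: (d, d) projection matrices."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (N, N): the quadratic part
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ v                               # (N, d) contextualized outputs

N, d = 512, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(N, d))
out = self_attention(x, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (512, 64); the intermediate (512, 512) matrix is the cost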
• 66. Transformer with added recurrence: it can see the previous segment’s representations, so it can process longer sequences. Transformer-XL https://arxiv.org/abs/1901.02860
  • 67. The Compressive Transformer keeps a fine-grained memory of past activations, which are then compressed into coarser compressed memories. Compressive Transformer Compressive Transformers for Long-Range Sequence Modelling https://arxiv.org/abs/1911.05507
• 68. Reformer is an optimized transformer: ● Uses less memory ● Calculates attention using LSH (locality-sensitive hashing) ○ O(L²) → O(L·log L) ● ⇒ can process much longer sequences: 64K tokens on one GPU! Reformer Reformer: The Efficient Transformer https://arxiv.org/abs/2001.04451
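A toy sketch of the general LSH idea (sign hashing with random hyperplanes), showing how similar query/key vectors can be grouped into buckets so attention is only computed within a bucket; the actual Reformer uses a different (angular) LSH scheme with shared query/key projections, which this does not reproduce.

# Toy sketch of LSH bucketing for attention (illustration only).
import numpy as np

def lsh_buckets(vectors, n_hashes=4, seed=0):
    """Hash vectors with random hyperplanes; vectors with the same
    sign pattern land in the same bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(vectors.shape[-1], n_hashes))
    signs = (vectors @ planes) > 0                         # (N, n_hashes) booleans
    return signs.astype(int) @ (1 << np.arange(n_hashes))  # bucket id per vector

qk = np.random.randn(1024, 64)
buckets = lsh_buckets(qk / np.linalg.norm(qk, axis=-1, keepdims=True))
# Attention is then restricted to tokens sharing a bucket (plus neighbors),
# dropping the cost from O(L^2) towards O(L log L).
print(np.bincount(buckets))  # rough bucket sizes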
  • 70. Local + Global attention. Scales linearly! Longformer Longformer: The Long-Document Transformer https://arxiv.org/abs/2004.05150
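A toy sketch of a “local window + a few global tokens” attention pattern, to show why the per-token cost stays constant; this is an illustration of the idea, not the Longformer implementation (which never materializes a full N×N mask).

# Sketch: local-window attention plus global tokens (illustration only).
import numpy as np

def local_global_mask(seq_len, window=2, global_positions=(0,)):
    allowed = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        allowed[i, lo:hi] = True           # each token sees a local window
    for g in global_positions:             # global tokens see and are seen by all
        allowed[g, :] = True
        allowed[:, g] = True
    return allowed

mask = local_global_mask(8, window=1, global_positions=(0,))
print(mask.astype(int))
# Per row, the number of allowed positions is O(window + #global tokens),
# so the total attention cost grows linearly with sequence length.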
  • 71. ● Another local + global attention. ● Can incorporate structured data into the model! Extended Transformer Construction (ETC) ETC: Encoding Long and Structured Data in Transformers https://arxiv.org/abs/2004.08483
  • 72. Idea: ● Apply ACT to Transformers ● Apply a variable number of repetitions for calculating each position: a Universal Transformer (UT) ● Use dynamic attention span: Adaptive Attention Span in Transformers Adaptive Computation Time in Transformers Adaptive Computation Time (ACT) in Neural Networks [3/3] https://medium.com/@moocaholic/adaptive-computation-time-act-in-neural-networks-3-3-99452b2eff18
• 73. ● Two flavors of UT in the paper: ○ UT with a fixed number of repetitions. ○ UT with dynamic halting. ● The UT repeatedly refines a series of vector representations for each position of the sequence in parallel, by combining information from different positions using self-attention and applying a recurrent transition function across all time steps (see the sketch below). ○ Either the number of time steps, T, is arbitrary but fixed (no ACT here, just a fixed number of repetitions). ○ Or the number of time steps, T, is dynamic (a dynamic ACT halting mechanism applied to each position in the input sequence). Universal Transformer (UT): Implementation “Universal Transformers”, https://arxiv.org/abs/1807.03819
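A minimal sketch of the fixed-repetition flavor in PyTorch: one Transformer encoder layer with shared weights applied T times in depth. The real UT also adds position/timestep embeddings at every step and uses its own transition function; those details are omitted here.

# Sketch: a Universal-Transformer-style block with a fixed number of
# repetitions, i.e. the same layer (tied weights) applied T times in depth.
import torch
import torch.nn as nn

class FixedStepUT(nn.Module):
    def __init__(self, d_model=64, n_heads=4, steps=6):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model, n_heads)  # shared weights
        self.steps = steps

    def forward(self, x):                 # x: (seq_len, batch, d_model)
        for _ in range(self.steps):       # recurrence in depth, not in time
            x = self.layer(x)             # same parameters at every step
        return x

x = torch.randn(10, 2, 64)
print(FixedStepUT()(x).shape)   # torch.Size([10, 2, 64])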
  • 74. UT with a fixed number of repetitions “Moving Beyond Translation with the Universal Transformer”, https://ai.googleblog.com/2018/08/moving-beyond-translation-with.html
  • 75. Adaptive UT with dynamic halting “Universal Transformers”, https://mostafadehghani.com/2019/05/05/universal-transformers/
  • 76. ● Universal Transformer is a recurrent function (not in time, but in depth) that evolves per-symbol hidden states in parallel, based at each step on the sequence of previous hidden states. ○ In that sense, UT is similar to architectures such as the Neural GPU and the Neural Turing Machine. ● When running for a fixed number of steps, the Universal Transformer is equivalent to a multi-layer Transformer with tied parameters across its layers. ● Adaptive UT: as the recurrent transition function can be applied any number of times, this implies that adaptive UTs can have variable depth (number of per-symbol processing steps). ● Universal Transformer can be shown to be Turing-complete (or “computationally universal”) Universal Transformer (UT): Notes “Universal Transformers”, https://arxiv.org/abs/1807.03819
• 77. ● The problem with the vanilla transformer is its fixed context size (or attention span). ● It cannot be very large because of the computation cost of the attention mechanism (it requires O(n²) computations). ● Let the layer (or even the attention head) decide the required context size on its own. ● There are two options: ○ Learnable (the adaptive attention span): let each attention head learn its own attention span independently of the other heads. It is learnable, but still fixed after the training is done. ○ ACT-like (the dynamic attention span): changes the span dynamically depending on the current input. Adaptive Attention Span: Idea & Implementation “Adaptive Attention Span in Transformers”, https://arxiv.org/abs/1905.07799
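The learnable flavor can be sketched as a soft mask over the distance between query and key positions, roughly following the masking function from the paper, clamp((R + z − x) / R, 0, 1), where z is the learned span and R controls the softness of the ramp; the numbers below are purely illustrative.

# Soft masking function behind the learnable adaptive attention span.
import numpy as np

def span_mask(distances, z, ramp=32):
    # attention at distance x is scaled by clamp((R + z - x) / R, 0, 1)
    return np.clip((ramp + z - distances) / ramp, 0.0, 1.0)

distances = np.arange(256)
for z in (16, 64, 200):                     # small vs large learned spans
    m = span_mask(distances, z)
    print(f"z={z}: nonzero mask over {int((m > 0).sum())} positions")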
  • 78. The models are smaller, the performance is better. Adaptive Attention Span: Performance “Adaptive Attention Span in Transformers”, https://arxiv.org/abs/1905.07799
• 79. Adaptive spans (in log scale) of every attention head in a 12-layer model with span limit S = 4096. Only a few attention heads require long attention spans; adaptive spans are learned to be larger where needed. “Adaptive Attention Span in Transformers”, https://arxiv.org/abs/1905.07799
  • 80. Example of average dynamic attention span as a function of the input sequence. The span is averaged over the layers and heads. Dynamic spans adapt to the input sequence “Adaptive Attention Span in Transformers”, https://arxiv.org/abs/1905.07799
  • 82. Image Transformer ● Local self-attention Image Transformer, https://arxiv.org/abs/1802.05751
• 83. Sparse factorizations of the attention matrix reduce the complexity to O(N·sqrt(N)). Can generate sounds and images. Sparse Transformer Generating Long Sequences with Sparse Transformers https://arxiv.org/abs/1904.10509 https://openai.com/blog/sparse-transformer/
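A toy sketch in the spirit of the factorized (strided) pattern: each position attends to a local block of roughly sqrt(N) previous positions plus a strided “column”, so the number of attended pairs grows as O(N·sqrt(N)) instead of O(N²). This is an illustration of the idea, not the paper's exact patterns or kernels.

# Sketch of a strided sparse attention pattern (illustration only).
import numpy as np

def strided_pattern(n):
    stride = int(np.sqrt(n))
    allowed = np.zeros((n, n), dtype=bool)
    for i in range(n):
        allowed[i, max(0, i - stride):i + 1] = True              # local block
        allowed[i, np.arange(i % stride, i + 1, stride)] = True  # strided "column"
    return allowed

mask = strided_pattern(64)
print(mask.sum(), "allowed pairs instead of", 64 * 64)  # ~O(N*sqrt(N)) vs O(N^2)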
• 84. Image GPT (iGPT) Just GPT-2 trained on images unrolled into long sequences of pixels! Now waiting for a GPT-3 (which uses sparse attention) trained on images. https://openai.com/blog/image-gpt/
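A sketch of the input preparation only: an image is quantized to a small color palette and unrolled in raster-scan order into a 1-D token sequence, which a GPT-2-style model then predicts autoregressively. The crude per-channel quantization below is a stand-in for iGPT's learned 9-bit color palette.

# Sketch: turning an image into a pixel-token sequence for autoregressive modeling.
import numpy as np

image = np.random.randint(0, 256, size=(32, 32, 3))           # toy 32x32 RGB image
# crude 512-color quantization (3 bits per channel), standing in for iGPT's palette
tokens = (image // 32).reshape(-1, 3) @ np.array([64, 8, 1])  # one token in 0..511 per pixel
sequence = tokens                                             # raster-scan order: 1024 tokens
print(sequence.shape, int(sequence.min()), int(sequence.max()))
# the model is trained to predict token t from tokens < t, like a language model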
  • 85. Axial Transformer Transformer for images and other data organized as high dimensional tensors Axial Attention in Multidimensional Transformers https://arxiv.org/abs/1912.12180
  • 86. Self-attention for Image Recognition Self-attention can even outperform convolutions for image recognition! Exploring Self-attention for Image Recognition https://arxiv.org/abs/2004.13621 https://github.com/hszhao/SAN
  • 87. New algorithm for relative self-attention with dramatically reduced memory footprint. Music Transformer Music Transformer https://arxiv.org/abs/1809.04281 https://magenta.tensorflow.org/music-transformer
  • 88. Basically GPT-2 + Sparse Transformer trained on music (MIDI files). MuseNet https://openai.com/blog/musenet/
• 90. ● Transformers are cool and produce great results! ● There are many modifications; it’s a kind of LEGO, and you can combine the pieces. ● More good source code and libraries are available (Hugging Face, Colab notebooks, etc.) ● Definitely more transformers to come! ● GET INVOLVED! You CAN move things forward! Wrap up