Transformers in 2021
Grigory Sapunov
DataFest Yerevan 2021
10.09.2021
gs@inten.to
Who am I?
● MD in CS (2002), PhD in AI (2006)
● ex-Yandex News Dev. Team Leader (2007-2012)
● CTO & co-founder of Intento (2016+) and
Berkeley SkyDeck alumni (Spring 2019)
● Member of Scientific Advisory Board
at Atlas Biomed
● Google Developer Expert in
Machine Learning
Prerequisites
● Transformer architecture understanding
○ Original paper: https://arxiv.org/abs/1706.03762
○ Great visual explanation: http://jalammar.github.io/illustrated-transformer
○ Lecture #12 from my DL course: https://github.com/che-shr-cat/deep-learning-for-biology-hse-2019-course
● This talk is, in some sense, a follow-up to these two talks:
○ https://www.youtube.com/watch?v=KZ9NXYcXVBY (GDG DevParty)
○ https://www.youtube.com/watch?v=7e4LxIVENZA (GDG DevFest)
● Sidenote: many modern transformers are described and discussed in our Telegram channel & chat on ML research papers: https://t.me/gonzo_ML
Recap: Transformer Architecture
Transformer
A new, simple network architecture, the Transformer:
● Is an encoder-decoder architecture
● Based solely on attention mechanisms (no RNN/CNN)
● Its major component is the multi-head self-attention mechanism
● Fast: only matrix multiplications
● Strong results on standard WMT datasets
Multi-head self-attention mechanism
Essentially, Multi-Head Attention is just several attention layers stacked together, each applying a different linear transformation to the same input.
Scaled dot-product attention
The transformer adopts scaled dot-product attention: the output is a weighted sum of the values, where the weight assigned to each value is determined by the dot product of the query with all the keys. The input consists of queries and keys of dimension d_k, and values of dimension d_v.
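In formula form (from the original paper): Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Below is a minimal PyTorch sketch of scaled dot-product and multi-head self-attention, written as an illustration only (no masking, dropout, or padding handling; names and shapes are mine, not the reference implementation):

```python
import math
import torch
import torch.nn as nn

def scaled_dot_product_attention(q, k, v):
    # q, k: (batch, heads, seq_len, d_k); v: (batch, heads, seq_len, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (batch, heads, N, N)
    weights = scores.softmax(dim=-1)                     # attention weights per query
    return weights @ v                                   # weighted sum of the values

class MultiHeadSelfAttention(nn.Module):
    """Several attention heads over different linear projections of the same input."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # Q, K, V projections in one matmul
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):                            # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split into heads: (batch, heads, seq_len, d_head)
        q, k, v = (t.view(b, n, self.n_heads, self.d_head).transpose(1, 2) for t in (q, k, v))
        out = scaled_dot_product_attention(q, k, v)
        out = out.transpose(1, 2).reshape(b, n, -1)  # concatenate the heads back
        return self.out(out)

x = torch.randn(2, 16, 512)
print(MultiHeadSelfAttention()(x).shape)  # torch.Size([2, 16, 512])
```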
Quadratic attention
Efficient Transformers: A Survey
https://arxiv.org/abs/2009.06732
Problems with vanilla transformers
● It’s a pretty heavy model → hard to train, tricky training schedule (warm-ups, cyclic learning rates, etc.)
● O(N²) computational complexity of the attention mechanism → scales poorly (see the sketch after this list)
● Limited context span (mostly due to the complexity), typically 512 tokens → can’t process long sequences
● May need a different inductive bias for other types of data (e.g. images, sound, etc.)
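A back-of-the-envelope illustration of the quadratic term (my own numbers, not from the slides): the attention-weight matrix alone has N² entries per head, so its memory grows quadratically with the context length.

```python
# Rough fp32 size of the N x N attention-weight matrices for one layer with 12 heads.
# Illustrative only; real training also stores activations, gradients, K/V tensors, etc.
heads = 12
for n in (512, 2048, 8192, 32768):
    entries = heads * n * n
    print(f"N={n:6d}: {entries * 4 / 2**30:8.2f} GiB for the attention weights")
```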
Year 2021 directions
Directions in 2021
● (Still) Large transformers
● (Still) Efficient transformers
● New modalities:
○ more image transformers
○ audio transformers
○ transformers in biology and other domains (graphs)
● Multimodality: CLIP, DALL·E, Perceiver + IO, …
● Artistic applications: CLIPDraw etc
1. Large Transformers
Large models
http://faculty.washington.edu/ebender/papers/Stochastic_Parrots.pdf
Large models in 2021
● (English) GPT-Neo (2.7B), GPT-J (6B),
Jurassic-1 (7.5B/178B)
● (Russian) ruGPT-3 (13B)
● (Chinese) CPM-2 (11B/198B* - MoE),
M6 (10B/100B), Wu Dao 2.0 (1.75T*),
PanGu-α (2.6B/13B/207B)
● (Korean) HyperCLOVA (204B)
● (Code) OpenAI Codex (12B),
Google’s (up to 137B)
● ByT5 (up to 12.9B)
● XLM-R XL/XXL (3.5B/10.7B)
● DeBERTa (1.5B)
● Switch Transformer (1.6T*)
● ERNIE 3.0 (10B)
● DALL·E (12B)
● Vision MoE (14.7B*)
Scaling laws
“Scaling Laws for Neural Language Models”
https://arxiv.org/abs/2001.08361
SuperGLUE
https://super.gluebenchmark.com/leaderboard
1*. Problems of Large Models
Costs
Large model training costs
“The Cost of Training NLP Models: A Concise Overview”
https://arxiv.org/abs/2004.08900
CO₂ emissions
“Energy and Policy Considerations for Deep Learning in NLP”
https://arxiv.org/abs/1906.02243
Training Data Extraction
“Extracting Training Data from Large Language Models”
https://arxiv.org/abs/2012.07805
https://dl.acm.org/doi/10.1145/3442188.3445922
● Size Doesn’t Guarantee Diversity
○ Internet data overrepresents younger users and those from developed countries.
○ Training data is sourced by scraping only specific sites (e.g. Reddit).
○ There are structural factors including moderation practices.
○ The current practice of filtering datasets can further attenuate specific voices.
● Static Data/Changing Social Views
○ The risk of ‘value-lock’, where the LM-reliant technology reifies older, less-inclusive
understandings.
○ Movements with no significant media attention will not be captured at all.
○ Given the compute costs it likely isn’t feasible to fully retrain LMs frequently enough.
● Encoding Bias
○ Large LMs exhibit various kinds of bias, including stereotypical associations or
negative sentiment towards specific groups.
○ Issues with training data: unreliable news sites, banned subreddits, etc.
○ Model auditing using automated systems that are not reliable themselves.
● Documentation debt
○ Datasets are both undocumented and too large to document post hoc.
“An LM is a system for haphazardly stitching together
sequences of linguistic forms it has observed in its vast
training data, according to probabilistic information
about how they combine, but without any reference to
meaning: a stochastic parrot.”
https://dl.acm.org/doi/10.1145/3442188.3445922
https://crfm.stanford.edu/
In recent years, a new successful paradigm for building AI systems has
emerged: Train one model on a huge amount of data and adapt it to
many applications. We call such a model a foundation model.
Foundation models (e.g., GPT-3) have demonstrated impressive behavior,
but can fail unexpectedly, harbor biases, and are poorly understood.
Nonetheless, they are being deployed at scale.
The Center for Research on Foundation Models (CRFM) is an
interdisciplinary initiative born out of the Stanford Institute for
Human-Centered Artificial Intelligence (HAI) that aims to make
fundamental advances in the study, development, and deployment of
foundation models.
https://arxiv.org/abs/2108.07258
2. Efficient Transformers
“Efficient Transformers: A Survey”
https://arxiv.org/abs/2009.06732
Some recent architectural innovations
Switch Transformers:
Mixture of Experts (MoE)
architecture with only a single
expert per feed-forward layer.
Scales well with more experts.
Adds a new dimension of
scaling: ‘expert-parallelism’ in
addition to data- and
model-parallelism.
“Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”
https://arxiv.org/abs/2101.03961
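A minimal sketch of the Switch idea, top-1 routing of each token to a single expert feed-forward network. This is an illustration under simplifying assumptions: the real model also uses capacity limits, a load-balancing auxiliary loss, and expert parallelism.

```python
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    """Toy Switch-style layer: route each token to exactly one expert FFN (top-1)."""
    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        probs = self.router(x).softmax(dim=-1)
        gate, expert_idx = probs.max(dim=-1)   # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                # scale by the gate value so the router also receives gradients
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(SwitchFFN()(tokens).shape)  # torch.Size([10, 64])
```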
Some recent architectural innovations
Balanced assignment of experts
(BASE) layer:
A new kind of sparse expert model (similar
to MoE transformer or Switch transformer)
that algorithmically balances the
token-to-expert assignments (without any
new hyperparameters or auxiliary losses).
Distributes well across many GPUs (say,
128).
“BASE Layers: Simplifying Training of Large, Sparse Models”
https://arxiv.org/abs/2103.16716
Some recent architectural innovations
A simple yet highly accurate
approximation for vanilla attention:
● its memory usage is linear in the
input size, similar to linear attention
variants, such as Performer and RFA
● it is a drop-in replacement for vanilla
attention that does not require any
corrective pre-training
● it can also lead to significant memory
savings in the feed-forward layers after
casting them into the familiar
query-key-value framework.
“Memory-efficient Transformers via Top-k Attention”
https://arxiv.org/abs/2106.06899
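A rough sketch of the core trick as I read it, simplified: compute the query–key scores, keep only the k largest per query, and mask the rest before the softmax, so each query attends to at most k keys. Note this naive version still materializes the full score matrix; the paper's memory savings come from a chunked implementation.

```python
import math
import torch

def top_k_attention(q, k, v, topk=32):
    # q, k: (batch, N, d_k); v: (batch, N, d_v). Keep only the top-k scores per query.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, N, N)
    top_vals, top_idx = scores.topk(min(topk, scores.size(-1)), dim=-1)
    masked = torch.full_like(scores, float("-inf"))
    masked.scatter_(-1, top_idx, top_vals)        # everything outside the top-k stays -inf
    weights = masked.softmax(dim=-1)              # zero weight outside the top-k keys
    return weights @ v

q = k = v = torch.randn(2, 128, 64)
print(top_k_attention(q, k, v).shape)  # torch.Size([2, 128, 64])
```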
Some recent architectural innovations
Expire-Span Transformer:
● learns to retain the most important
information and expire the irrelevant
information
● scales to attend over tens of
thousands of previous timesteps
efficiently, as not all states from
previous timesteps are preserved
“Not All Memories are Created Equal: Learning to Forget by Expiring”
https://arxiv.org/abs/2105.06548
3. New Modalities
Image Transformers
There were many transformers for images already:
● Image Transformer (https://arxiv.org/abs/1802.05751)
● Sparse Transformer
(https://arxiv.org/abs/1904.10509)
● Image GPT (iGPT): just a GPT-2 trained on images
unrolled into long sequences of pixels
(https://openai.com/blog/image-gpt/)
● Axial Transformer: for images and other data
organized as high dim tensors
(https://arxiv.org/abs/1912.12180).
Image Transformers
Many more emerged in 2020-2021:
● Vision Transformer (ViT)
● Data-efficient image
Transformer (DeiT)
● Bottleneck Transformers (BoTNet)
● Vision MoE (V-MoE)
● Image Processing Transformer (IPT)
● Detection Transformer (DETR)
● TransGAN
● ...
“Transformers in Vision: A Survey”
https://arxiv.org/abs/2101.01169
Some New Transformers for Images
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
Vision Transformer (ViT)
● The image is split into patches (e.g. 16x16), flattened into a 1D sequence, and fed into a transformer encoder (similar to BERT).
“An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale”
https://arxiv.org/abs/2010.11929
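A minimal sketch of the patchify step, assuming a 224x224 RGB image and 16x16 patches (the standard ViT setup); the "split + flatten + linear projection" is commonly implemented as a strided convolution, as done here.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches and project each patch to a d_model vector."""
    def __init__(self, img_size=224, patch=16, d_model=768, in_ch=3):
        super().__init__()
        self.n_patches = (img_size // patch) ** 2
        # A conv with kernel = stride = patch size equals flatten-each-patch + linear layer.
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.n_patches + 1, d_model))

    def forward(self, img):                          # img: (batch, 3, 224, 224)
        x = self.proj(img)                           # (batch, d_model, 14, 14)
        x = x.flatten(2).transpose(1, 2)             # (batch, 196, d_model): a 1D token sequence
        cls = self.cls_token.expand(x.size(0), -1, -1)
        return torch.cat([cls, x], dim=1) + self.pos_embed  # ready for a BERT-like encoder

print(PatchEmbed()(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 197, 768])
```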
Data-efficient image Transformer (DeiT)
The architecture is identical to ViT; the only differences are the training strategies and the distillation token.
“Training data-efficient image transformers & distillation through attention”
https://arxiv.org/abs/2012.12877
Bottleneck Transformers (BoTNet)
● A hybrid model: ResNet + Transformer
● Replaces the internal 3x3 convolutions in the last three ResNet bottleneck blocks with multi-head self-attention.
● The resulting architecture, BoTNet, scales pretty well.
“Bottleneck Transformers for Visual Recognition”
https://arxiv.org/abs/2101.11605
Vision MoE (V-MoE)
● A sparse variant of the recent Vision Transformer (ViT) architecture for image
classification.
● The V-MoE replaces a subset of the dense feedforward layers in ViT with
sparse MoE layers, where each image patch is “routed” to a subset of
“experts” (MLPs).
● Scales to model sizes of 15B parameters, the largest vision models to date.
“Scaling Vision with Sparse Mixture of Experts”
https://arxiv.org/abs/2106.05974
Speech and Sound Transformers
There were many transformers for sound as well:
● Speech-Transformer (https://ieeexplore.ieee.org/document/8462506)
● Conformer (https://arxiv.org/abs/2005.08100)
● Transformer-Transducer (https://arxiv.org/abs/1910.12977)
● Transformer-Transducer (https://arxiv.org/abs/2002.02562)
● Conv-Transformer Transducer (https://arxiv.org/abs/2008.05750)
● Speech-XLNet (https://arxiv.org/abs/1910.10387)
● Audio ALBERT (https://arxiv.org/abs/2005.08575)
● Emformer (https://arxiv.org/abs/2010.10759)
● wav2vec 2.0 (https://arxiv.org/abs/2006.11477)
● ...
AST: Audio Spectrogram Transformer
“AST: Audio Spectrogram Transformer”
https://arxiv.org/abs/2104.01778
A convolution-free, purely attention-based
model for audio classification.
Very close to ViT, but AST can process
variable-length audio inputs.
ACT: Audio Captioning Transformer
“Audio Captioning Transformer”
https://arxiv.org/abs/2107.09817
Another convolution-free Transformer
based on an encoder-decoder
architecture.
Multi-channel Transformer for ASR
“End-to-End Multi-Channel Transformer for Speech Recognition”
https://arxiv.org/abs/2102.03951
Transformers in Biology
Finally, transformers came to biology!
● ESM-1b protein language model
(https://www.pnas.org/content/118/15/e2016239118)
● MSA Transformer for multiple sequence alignment
(https://www.biorxiv.org/content/10.1101/2021.02.12.430858v1)
● RoseTTAFold for predicting protein structures (includes graph
transformers)
(https://www.science.org/doi/abs/10.1126/science.abj8754)
● AlphaFold2 for predicting protein structures
(https://www.nature.com/articles/s41586-021-03819-2)
ESM-1b
“Biological structure and function emerge from scaling unsupervised learning to 250 million protein
sequences”, https://www.pnas.org/content/118/15/e2016239118
RoseTTAFold
“Accurate prediction of protein structures and interactions using a 3-track network”
https://www.science.org/doi/abs/10.1126/science.abj8754
AlphaFold 2
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
AlphaFold 2: Evoformer block
“Highly accurate protein structure prediction with AlphaFold”
https://www.nature.com/articles/s41586-021-03819-2
4. Multi-Modal Transformers
https://arxiv.org/abs/2101.01169
DALL·E (OpenAI)
“Zero-Shot Text-to-Image Generation”
https://arxiv.org/abs/2102.12092
A model trained on paired images and text descriptions.
Autoregressively generates image tokens
based on previous text and (optionally)
image tokens.
Technically a transformer decoder.
Image tokens are obtained with a
pretrained dVAE.
Candidates are ranked using CLIP.
CLIP (OpenAI)
“Learning Transferable Visual Models From Natural Language Supervision”
https://arxiv.org/abs/2103.00020
Uses contrastive pre-training to predict which caption goes with which image.
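A sketch of the symmetric contrastive objective as I understand it from the paper: embed images and texts, normalize, compute all pairwise similarities within a batch, and classify which caption belongs to which image (and vice versa). The encoder outputs below are stand-in random tensors, not the real CLIP models.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # image_emb, text_emb: (batch, d) embeddings from the image and text encoders
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature        # (batch, batch) similarity matrix
    targets = torch.arange(len(logits))                    # matching pairs sit on the diagonal
    loss_i = F.cross_entropy(logits, targets)              # image -> correct caption
    loss_t = F.cross_entropy(logits.t(), targets)          # caption -> correct image
    return (loss_i + loss_t) / 2

# Stand-ins for encoder outputs (the real model uses a ViT/ResNet and a text transformer).
img, txt = torch.randn(8, 512), torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```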
ALIGN (Google)
https://ai.googleblog.com/2021/05/align-scaling-up-visual-and-vision.html
“Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision”
https://arxiv.org/abs/2102.05918
Train EfficientNet-L2 (image encoder) and BERT-large (text encoder) with a
contrastive loss on a huge noisy dataset (1.8B image-text pairs).
CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
You can optimize the image to better match a text description (remember
DeepDream?).
CLIPDraw
“CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders”
https://arxiv.org/abs/2106.14843
The image is rendered from a set of Bézier curves.
https://twitter.com/RiversHaveWings/status/1410020043178446848
“a beautiful epic wondrous fantasy painting of the ocean”
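A toy sketch of the optimization idea (DeepDream-style gradient ascent on the drawing parameters). The real CLIPDraw optimizes Bézier-curve parameters through a differentiable vector renderer against the real CLIP encoders; here a frozen random projection stands in for CLIP and raw pixels stand in for the curves, just to show the loop.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
encoder = torch.nn.Linear(3 * 64 * 64, 512)            # stand-in for CLIP's image encoder (frozen)
for p in encoder.parameters():
    p.requires_grad_(False)

text_emb = F.normalize(torch.randn(1, 512), dim=-1)    # stand-in for the encoded text prompt
image = torch.rand(1, 3, 64, 64, requires_grad=True)   # in CLIPDraw: Bézier curve parameters

opt = torch.optim.Adam([image], lr=0.05)
for step in range(200):
    img_emb = F.normalize(encoder(image.flatten(1)), dim=-1)
    loss = -(img_emb * text_emb).sum()                  # maximize cosine similarity to the prompt
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 50 == 0:
        print(step, loss.item())
```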
CLIP + PixelDraw
https://www.reddit.com/r/MediaSynthesis/comments/pf7ru8/set_of_asianthemed_graphics_generated_with_clipit/
Perceiver (Google)
“Perceiver: General Perception with Iterative Attention”
https://arxiv.org/abs/2103.03206
Perceiver IO (Google)
“Perceiver IO: A General Architecture for Structured Inputs & Outputs”
https://arxiv.org/abs/2107.14795
https://ru.linkedin.com/in/grigorysapunov
gs@inten.to
Thanks!