1. GPT-3: Language Models are Few-Shot Learners
ALMA MATER STUDIORUM UNIVERSITY OF BOLOGNA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING – DISI
Luca Ragazzi, Giacomo Frisoni, Lorenzo Valgimigli
PhD Students, XXXVI Cycle
Department of Computer Science and Engineering – DISI
University of Bologna, Cesena, Italy
l.ragazzi@unibo.it, giacomo.frisoni@unibo.it, lorenzo.valgimigli@unibo.it
"Neural architectures: from the McCulloch-Pitts model to GPT-3" Presentation
October 29th, 2021
2. GPT-3: Language Models are Few-Shot Learners 2
Overview of GPT-3
• Generative Pre-trained Transformer – 3
• Developed by OpenAI in May 2020
• Largest neural network ever created (at the time)
• Philosophy: the bigger, the better
What is the motivation behind it?
3. GPT-3: Language Models are Few-Shot Learners 3
Pre-trained language models
• Current state-of-the-art models in NLP
• Trained with self-supervised learning on large corpora
• Both X and Y are extracted from the raw text, without requiring a prior labeled dataset (see the toy sketch at the end of this slide)
• Acquire strong capabilities for modeling natural language (with a task-agnostic architecture)
• Limitation (i): need for downstream task-specific datasets and fine-tuning
– Difficult to collect large supervised training datasets for every new task
• Limitation (ii): mismatch with how humans learn
– Humans do not require large supervised datasets to learn most language tasks; brief
directives or a handful of demonstrations are enough
– So, why give models a large dataset of labeled examples for every new task?
Why not try to create NLP systems with the same fluidity and
generality as humans?
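A minimal sketch of the self-supervised setup mentioned above, assuming a toy sentence of my own choosing: both the inputs X and the targets Y come from the raw text itself, with no annotation, exactly as in next-word prediction.

```python
# Toy illustration (not OpenAI's code): building (X, y) pairs for
# next-word prediction directly from raw, unlabeled text.
text = "language models are few shot learners".split()

pairs = [(text[:i], text[i]) for i in range(1, len(text))]
for x, y in pairs:
    print("X =", x, "-> y =", repr(y))
# X = ['language'] -> y = 'models'
# X = ['language', 'models'] -> y = 'are'
# ... every label comes for free from the corpus itself
```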
4. GPT-3: Language Models are Few-Shot Learners 4
Solution: more parameters and in-context learning
• Let models develop a broad set of skills and pattern recognition abilities during
pre-training and use them at inference time to adapt to the desired task rapidly
• Since in-context learning involves absorbing many skills and tasks within the
model's parameters, it is plausible that learning abilities correlate with model size.
OpenAI created GPT-3 to show that very large unsupervised
language models trained on a lot of data can multitask at
the level of fine-tuned state-of-the-art models
5. GPT-3: Language Models are Few-Shot Learners 5
Model Architecture and Training Process
• The GPT-3 model architecture is the same as its GPT-2 predecessor
– Transformer-based, built using only decoder blocks (the opposite of BERT, which uses only encoder blocks)
– Stronger at natural language generation (NLG) rather than at producing contextual embeddings
• An auto-regressive language model
– GPT-3 is trained using next-word prediction, outputting one token (sub-word unit) at a time
– Unlike bidirectional models such as BERT, the prediction at each step is conditioned only
on the left context (masked self-attention; see the sketch below)
• From an architecture perspective, GPT-3 is not actually very novel!
– … So, what makes it so special and magical? It’s really big
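A minimal sketch of the masked (causal) self-attention idea mentioned above, with toy sizes and a single head; it is illustrative only, not the actual GPT-3 implementation.

```python
# Minimal sketch (toy sizes, single head) of causal "masked" self-attention:
# each position attends only to itself and the tokens on its left.
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    # x: (seq_len, d_model); w_q / w_k / w_v: (d_model, d_head) projections
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)            # (seq_len, seq_len)
    future = torch.triu(torch.ones_like(scores), diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))   # hide the right context
    return torch.softmax(scores, dim=-1) @ v             # (seq_len, d_head)

x = torch.randn(5, 16)                                    # 5 toy token embeddings
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)             # one vector per position
```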
6. GPT-3: Language Models are Few-Shot Learners 6
Trained Models
• More layers, wider layers, and more data to train on
– GPT-3 comes in eight sizes, ranging from 125M to 175B parameters
– GPT-3 175B (referenced by default) → ~500x BERT-Large (345M), ~117x GPT-2 (1.5B),
and ~10x the previous record holder, Turing-NLG (17B)
– The largest model ever created (at the time of writing), with 96 attention layers, each with
96 heads of dimension 128, and a 3.2M-token batch size 😱
• “With great powers sizes comes great responsibilities costs” 🦸💰
– A single training run costs over $4.6M using Tesla V100 cloud instances (3.14E23 required
FLOPS at 28 TFLOPS of HW capacity ≈ 355 GPU-years; see the back-of-the-envelope sketch below)
– Time is not the only enemy: GPT-3 needs 700GB of memory to store its FP32 parameters (4 bytes
each), while the maximum memory of a single GPU is 48GB (Quadro RTX 8000)
– OpenAI used model parallelism on a high-bandwidth cluster (with V100 GPUs) provided by Microsoft
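A back-of-the-envelope check of the figures above, using my own arithmetic on the numbers quoted in the slide (not OpenAI's accounting):

```python
# Reproducing the rough numbers quoted above.
params = 175e9                           # GPT-3 175B parameters
print(params * 4 / 1e9, "GB of FP32 weights")        # 4 bytes each -> 700.0 GB

total_flops = 3.14e23                    # training compute reported by OpenAI
throughput = 28e12                       # assumed sustained FLOPS on one V100
gpu_seconds = total_flops / throughput
print(gpu_seconds / (3600 * 24 * 365), "GPU-years")  # ~355 GPU-years
```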
7. GPT-3: Language Models are Few-Shot Learners 7
Training Datasets
• Extensive training on massive unlabeled text datasets (300B tokens in total)
– Since neural networks are a compressed/compiled version of their training data, the size of the
dataset should scale accordingly with the size of the model
– The authors mainly use Common Crawl, a crawl of over 50B web pages (filtered down for quality)
• GPT-3 has a lower data compression ratio than GPT-2
– 300/175 ≈ 1.71 tokens per parameter (GPT-3) vs 10/1.5 ≈ 6.67 (GPT-2)… This raises the question: “Is it only a big memory?”
570GB of compressed plaintext (45TB before filtering)
Note: GPT-2 was trained on 40GB of Internet text (10B tokens)
Note: a bug in the filtering caused some overlaps between training data and dev/test sets to be ignored, but costs made re-training unfeasible
8. GPT-3: Language Models are Few-Shot Learners 8
Zero-, one-, few-shot vs fine-tuning
• GPT-3 can perform specific tasks without any special tuning 🧙 🔮
– Most other pre-trained language models require elaborate fine-tuning (sometimes with
architectural changes) on thousands of samples to perform well on downstream tasks
– GPT-3 doesn’t need a fine-tuning step and directly uses a single pre-trained model for
all downstream tasks (plug-and-play 🔌), demonstrating even superior performance
• Three different evaluation settings focused on task-agnostic performance,
which allow zero, one, or a few examples to be prefixed to the model input (see the prompt sketch at the end of this slide)
• Fine-tuning (repeated gradient updates using a large corpus of example tasks) → postponed
• (i) Zero-shot, (ii) One-shot, (iii) Few-shot → a context that better informs the model about what it is expected to do
• Underlying intuition: "If I were to see this text somewhere on the Web, what would be the most likely next word?"
9. GPT-3: Language Models are Few-Shot Learners 9
Results - i
• Different sizes of GPT-3 were tested on several benchmarks across various tasks (e.g.,
question answering, translation, summarization) to study the generalization ability of such
large models. In particular, each model was evaluated in three settings:
• zero-shot learning
• one-shot learning
• few-shot learning
• In every case, increasing the number of parameters made stronger generalization
capabilities emerge
10. GPT-3: Language Models are Few-Shot Learners 10
Results - ii
• GPT-3, in its largest version, was compared to the SOTA solutions on different
datasets
• LAMBADA, StoryCloze, HellaSwag, TriviaQA, translation benchmarks (BLEU), ...
• In many cases, it matches or outperforms the previous SOTA, i.e., neural models fine-tuned
on each dataset
11. GPT-3: Language Models are Few-Shot Learners 11
Limits
• Despite the great improvements of GPT-3, it still has notable weaknesses: it has
difficulty with common-sense physics and with long text generation
• Loss of coherence
• Contradiction
• Useless semantic repetition
• Large language models are not grounded in other domains of experience, such as video
or real-world physical interaction, and therefore lack a large amount of context
• Even though GPT-3 works well after pre-training, it is still far from human level
• Humans show strong zero-shot capabilities
• For now, it is impossible to say how GPT-3 learns during training
• It is even harder to understand what it learns at inference time
• Does it learn the new task from scratch? Does it reshape similar tasks it
already learned?
• Finally, it shares some limitations common to most deep learning models
• The learned knowledge is not interpretable
• It requires large resources and a long time to train
• It is strongly affected by biases in the data
12. GPT-3: Language Models are Few-Shot Learners 12
Ethical Concerns - i
• GPT-3 can be misused in dangerous ways
• fake news generation, phishing,
fraudulent academic essays
• It is affected by biases in data on
different topics
• Gender
• Race
• Religion
13. GPT-3: Language Models are Few-Shot Learners 13
Ethical Concerns - ii
• The energy consumption of such a large model is a problem that needs to be
highlighted
– GPT-3 consumed several thousand petaflop/s-days during pre-training, while
GPT-2 consumed tens of petaflop/s-days
• It is important to consider how the
resources are amortized during the
lifecycle of the model
• It consumes significant resources during
training, but it is surprisingly efficient once
trained
• GPT-3, in its full version, can generate 100
pages of content from a trained model at
the cost of only about 0.4 kWh of energy