GPT-3: Language Models are Few-Shot Learners
ALMA MATER STUDIORUM UNIVERSITY OF BOLOGNA
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING – DISI
Luca Ragazzi, Giacomo Frisoni, Lorenzo Valgimigli
PhD Students, XXXVI Cycle
Department of Computer Science and Engineering – DISI
University of Bologna, Cesena, Italy
l.ragazzi@unibo.it, giacomo.frisoni@unibo.it, lorenzo.valgimigli@unibo.it
"Neural architectures: from the McCulloch-Pitts model to GPT-3" Presentation
October 29th, 2021
GPT-3: Language Models are Few-Shot Learners 2
Overview of GPT-3
• Generative Pre-trained Transformer – 3
• Developed by OpenAI in May 2020
• Largest neural network ever created (at the time of its release)
• Philosophy: the bigger, the better
What is the motivation around it?
GPT-3: Language Models are Few-Shot Learners 3
Pre-trained language models
• Current state-of-the-art models in NLP
• Trained in a semi-supervised fashion on large corpora
• Both the inputs X and the targets Y are extracted from the text itself, without a pre-existing labeled dataset (see the sketch at the end of this slide)
• Acquire a strong capability for modeling natural language (with a task-agnostic architecture)
• Limitation (i): need for downstream task-specific datasets and fine-tuning
– Difficult to collect large supervised training datasets for every new task
• Limitation (ii): mismatch with how humans learn
– Humans do not require large supervised datasets to learn most language tasks; brief
directives or a handful of demonstrations are enough
– So, why give models a large dataset of labeled examples for every new task?
Why not try to create NLP systems with the same fluidity and
generality as humans?
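To make the self-supervision concrete, here is a minimal, hypothetical Python sketch of how input/target pairs (X, Y) can be derived from raw text for next-word prediction. The whitespace tokenizer and the toy sentence are illustrative assumptions; GPT-style models actually operate on BPE subword tokens.

```python
# Minimal sketch: deriving (X, Y) pairs from raw text for next-word prediction.
# Whitespace tokenization is a simplification; GPT-3 actually uses BPE subwords.

text = "the cat sat on the mat"                 # toy example, not real training data
tokens = text.split()

pairs = []
for i in range(1, len(tokens)):
    x = tokens[:i]                              # input: everything to the left
    y = tokens[i]                               # target: the next token
    pairs.append((x, y))

for x, y in pairs:
    print(f"X = {' '.join(x):<22} -> Y = {y}")
```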
GPT-3: Language Models are Few-Shot Learners 4
Solution: more parameters and in-context learning
• Let models develop a broad set of skills and pattern recognition abilities during
pre-training and use them at inference time to adapt to the desired task rapidly
• Since in-context learning involves absorbing many skills and tasks within the
model's parameters, it is plausible that learning abilities correlate with model size.
OpenAI created GPT-3 to show that very large unsupervised
language models trained on a lot of data can multitask at
the level of fine-tuned state-of-the-art models
GPT-3: Language Models are Few-Shot Learners 5
Model Architecture and Training Process
• The GPT-3 model architecture is essentially the same as that of its GPT-2 predecessor
– Transformer-based, built using only decoder blocks (the opposite of BERT, which uses only encoders)
– Stronger at natural language generation (NLG) than at producing contextual embeddings
• An auto-regressive language model
– GPT-3 is trained with next-word prediction, outputting one (BPE subword) token at a time
– Unlike bidirectional models such as BERT, the prediction at each step is conditioned only
on the left context (masked self-attention; see the sketch at the end of this slide)
• From an architecture perspective, GPT-3 is not actually very novel!
– … So, what makes it so special and magical? It’s really big
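Below is a minimal single-head sketch of the masked (causal) self-attention mentioned above, where each position may attend only to itself and to positions on its left. The toy dimensions and random weights are illustrative assumptions; the real model stacks 96 multi-head layers.

```python
import numpy as np

def causal_self_attention(x, wq, wk, wv):
    """Single-head masked self-attention: position t attends only to positions <= t."""
    q, k, v = x @ wq, x @ wk, x @ wv                   # queries, keys, values
    scores = q @ k.T / np.sqrt(q.shape[-1])            # scaled dot-product scores
    future = np.triu(np.ones_like(scores), k=1) == 1   # positions to the right of t
    scores = np.where(future, -1e9, scores)            # mask out the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over allowed positions
    return weights @ v

rng = np.random.default_rng(0)                          # toy sizes: 5 tokens, dim 8
x = rng.normal(size=(5, 8))
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_self_attention(x, wq, wk, wv).shape)       # -> (5, 8)
```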
GPT-3: Language Models are Few-Shot Learners 6
Trained Models
• More layers, wider layers, and more data to train on
– GPT-3 comes in eight sizes, ranging from 125M to 175B parameters
– GPT-3 175B (referenced by default) → over 500x BERT-Large (345M), 117x GPT-2 (1.5B),
and 10x the previous record holder, Turing-NLG (17B)
– The largest model ever created (at the time of writing), with 96 attention layers, each with
96 heads of dimension 128, and a batch size of 3.2M tokens 😱
• “With great sizes come great costs” (a play on “with great power comes great responsibility”) 🦸💰
– A single training run costs over $4.6M on Tesla V100 cloud instances (3.14E23 required
FLOPs at 28 TFLOPS per GPU ≈ 355 GPU-years; see the calculation at the end of this slide)
– Time is not the only enemy: GPT-3 needs 700GB of memory to store its FP32 parameters (4 bytes
each), whereas the maximum memory of a single GPU is 48GB (Quadro RTX 8000)
– OpenAI used model parallelism on a high-bandwidth cluster of V100 GPUs provided by Microsoft
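A back-of-the-envelope check of the figures quoted above, using only the numbers on this slide (175B FP32 parameters, 3.14E23 FLOPs, 28 TFLOPS per V100); these are the slide's assumptions rather than measured values.

```python
# Memory footprint: 175B parameters * 4 bytes (FP32)
params, bytes_per_param = 175e9, 4
print(f"FP32 parameters: {params * bytes_per_param / 1e9:.0f} GB")   # ~700 GB >> 48 GB per GPU

# Compute: 3.14E23 FLOPs at an assumed 28 TFLOPS sustained per V100
total_flops, gpu_flops = 3.14e23, 28e12
gpu_years = total_flops / gpu_flops / (3600 * 24 * 365)
print(f"Single-GPU training time: ~{gpu_years:.0f} GPU-years")        # ~355 GPU-years
```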
GPT-3: Language Models are Few-Shot Learners 7
Training Datasets
• Extensive training on massive unlabeled text datasets (300B tokens in total)
– Since neural networks are a compressed/compiled version of their training data, the size of the
dataset should scale with the size of the model
– The authors mainly use Common Crawl, a crawl of over 50B web pages (filtered down for quality)
• GPT-3 has a lower data compression ratio than GPT-2
– 300B/175B ≈ 1.71 tokens per parameter (GPT-3) vs 10B/1.5B ≈ 6.67 (GPT-2); see the quick check at the end of this slide… This raises the question: “Is it only a big memory?”
570GB of compressed plaintext (45TB before filtering)
Note: GPT-2 was trained on 40GB of Internet text (10B tokens)
Note: a bug in the filtering left some overlaps between the training data and the benchmark dev/test sets, but training costs made re-training unfeasible
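As a quick sanity check of the compression-ratio claim above, the tokens-per-parameter ratios can be computed directly from the counts on this slide.

```python
# Tokens seen during training per model parameter (numbers from this slide)
ratios = {
    "GPT-3": 300e9 / 175e9,   # ~1.71 tokens per parameter
    "GPT-2": 10e9 / 1.5e9,    # ~6.67 tokens per parameter
}
for model, r in ratios.items():
    print(f"{model}: {r:.2f} tokens per parameter")
```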
GPT-3: Language Models are Few-Shot Learners 8
Zero-, one-, few-shot vs fine-tuning
• GPT-3 can perform specific tasks without any special tuning 🧙 🔮
– Most other pre-trained language models require elaborate fine-tuning (sometimes with
architectural changes) on thousands of samples to perform well on downstream tasks
– GPT-3 doesn’t need a fine-tuning step and directly uses a single pre-trained model for
all downstream tasks (plug-and-play 🔌), sometimes even demonstrating superior performance
• Three evaluation settings focused on task-agnostic performance, which allow zero, one,
or a few examples to be prefixed to the model input (example prompts at the end of this slide)
Fine-tuning (repeated gradient updates using a large corpus of example tasks) → postponed
(i) Zero-shot
(ii) One-shot
(iii) Few-shot
Context to better inform the model about what it is expected to do
"If I were to see this
text somewhere on
the Web, what will
be the most likely
next word?"
GPT-3: Language Models are Few-Shot Learners 9
Results - i
• Different sizes of GPT-3 were tested on
different benchmarks covering various tasks (e.g.,
question answering, translation, summarization)
to study the generalization ability of such
large models. In particular, each model was
evaluated in three settings:
• zero-shot learning
• one-shot learning
• few-shot learning
• In every setting, increasing the number of parameters made generalization capabilities emerge
GPT-3: Language Models are Few-Shot Learners 10
Results - ii
• GPT-3, in its largest version, was compared to the SOTA solutions on different
datasets
• LAMBADA, StoryCloze, HellaSwag, TriviaQA, translation benchmarks (BLEU), ...
• In many cases, it matches or outperforms the previous SOTA, which are
neural models fine-tuned on each dataset.
GPT-3: Language Models are Few-Shot Learners 11
Limits
• Despite its great improvements, GPT-3 still has notable weaknesses: it has
difficulty with common-sense physics and with long text generation
• Loss of coherence
• Contradictions
• Pointless semantic repetition
• Large language models are not grounded in other domains of experience, such as video
or real-world physical interaction, so they lack a large amount of context
• GPT-3 works well only after extensive pre-training; it is still far from human level
• Humans show strong zero-shot capabilities
• For now, it is impossible to say how GPT-3 learns during training
• It is even harder to understand what it learns at inference time
• Does it learn the new task from scratch, or does it reshape similar tasks it has
already learned?
• Finally, it shares some limitations common to most deep learning models
• The knowledge it learns is not interpretable
• It requires large amounts of resources and time to train
• It is strongly affected by biases in the data
GPT-3: Language Models are Few-Shot Learners 12
Ethical Concerns - i
• GPT-3 can be misused in dangerous ways
• fake news generation, phishing, fraudulent academic essays
• It is affected by biases in the data on different topics
• Gender
• Race
• Religion
GPT-3: Language Models are Few-Shot Learners 13
Ethical Concerns - ii
• The energy consumption of such a large model is a problem that needs to be
highlighted
– GPT-3 consumed several thousand petaflop/s-days during pre-training, while
GPT-2 consumed only tens of petaflop/s-days (see the conversion at the end of this slide)
• It is important to consider how the
resources are amortized during the
lifecycle of the model
• It consumes significant resources during training, but it is surprisingly efficient once trained
• GPT-3, in its full version, can generate 100 pages of content from the trained model at a cost of about 0.4 kWh of energy
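A back-of-the-envelope conversion linking the training compute quoted on the “Trained Models” slide (3.14E23 FLOPs) to the “several thousand petaflop/s-days” figure above; the unit conversion is standard, and the FLOP count comes from the slides.

```python
# 1 petaflop/s-day = 1e15 FLOP/s sustained for one day
pfs_day_in_flops = 1e15 * 24 * 3600            # ~8.64e19 FLOPs
gpt3_training_flops = 3.14e23                  # figure from the "Trained Models" slide
print(f"GPT-3 pre-training ≈ {gpt3_training_flops / pfs_day_in_flops:,.0f} petaflop/s-days")
# ≈ 3,634 petaflop/s-days -> "several thousand", versus tens for GPT-2
```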
14
GPT-3: Language Models are Few-Shot Learners
Thanks for the attention
(is all you need)