PaLM Scaling Language Modeling with Pathways - 230219 (1).pdf

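Large language models perform remarkably well with only a small amount of task-specific training data, which makes them highly useful across a wide range of natural language processing tasks. To deepen the understanding of this behavior, Google developed PaLM, a language model with 540 billion parameters that achieves state-of-the-art performance on a variety of natural language understanding and generation tasks. The model was trained on 6144 TPU v4 chips using Pathways, a new ML system. PaLM performs strongly across many tasks and does especially well on multi-step reasoning, where it reports the best results and exceeds average human performance. It also shows strong performance on multilingual tasks and source code generation. The work further provides a comprehensive analysis of bias and toxicity and a study of how training-data memorization varies with model scale. Finally, it discusses ethical considerations for large language models and strategies for mitigating them.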

PaLM: Scaling Language Modeling with Pathways
Chowdhery, Aakanksha, et al. arXiv preprint arXiv:2204.02311
2023. 02. 19
허정원, 조해창, 박산희

1. Introduction
Recent large language models and their parameter counts:
• GPT-3: 175B
• GLaM: 1.2T
• LaMDA: 137B
• Gopher: 280B
• MT-NLG: 530B

• PaLM: 540B parameters, trained on 780B tokens
• Achieved through the use of Pathways

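• SwiGLU Activation: $\mathrm{SwiGLU}(x) = \mathrm{Swish}_\beta(xW) \otimes xV$, where $\mathrm{Swish}_\beta(z) = z \cdot \mathrm{sigmoid}(\beta z)$ and $\otimes$ is the element-wise product
Improves quality in compute-equivalent experiments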
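The SwiGLU formula above can be traced on a tiny input. The following is a minimal pure-Python sketch, not PaLM's implementation: the helper names (`swish`, `matvec`, `swiglu`), the weight values, and the choice of β = 1 are all illustrative assumptions.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def swish(z: float, beta: float = 1.0) -> float:
    """Swish_beta(z) = z * sigmoid(beta * z)."""
    return z * sigmoid(beta * z)

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def swiglu(x, W, V, beta: float = 1.0):
    """SwiGLU(x) = Swish_beta(xW) * xV, multiplied element-wise."""
    gate = [swish(z, beta) for z in matvec(x, W)]   # gated branch
    value = matvec(x, V)                            # linear branch
    return [g * v for g, v in zip(gate, value)]

# Hypothetical toy weights: d_in = 2, d_hidden = 3.
x = [1.0, -2.0]
W = [[0.5, 0.0, 1.0], [0.0, 0.5, 1.0]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]
out = swiglu(x, W, V)
```

Unlike ReLU or GeLU feed-forward layers, the gated form needs two input projections (`W` and `V`) instead of one, so the hidden width is usually adjusted when comparing at equal compute.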

• Parallel layers: the parallel formulation of each Transformer block results in roughly 15% faster training speed at large scales, since the MLP and attention input matrix multiplications can be fused.
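The difference between the two block formulations is easiest to see as data flow. The sketch below follows the formulas as given in the PaLM paper (the exact residual placement should be checked against the source). The sublayers `toy_attention` and `toy_mlp` are hypothetical stand-ins, and `layer_norm` omits learned scale and bias. It shows only why both branches can share one normalized input, not the fused matrix multiplication itself.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned scale)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def serialized_block(x, attention, mlp):
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x))))
    inner = add(x, attention(layer_norm(x)))
    return add(x, mlp(layer_norm(inner)))

def parallel_block(x, attention, mlp):
    # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x))
    normed = layer_norm(x)  # computed once and shared by both branches
    return add(x, mlp(normed), attention(normed))

# Hypothetical stand-ins for the real sublayers.
def toy_attention(h):
    return [0.5 * v for v in h]

def toy_mlp(h):
    return [max(0.0, v) for v in h]

x = [1.0, 2.0, 3.0]
y_serial = serialized_block(x, toy_attention, toy_mlp)
y_parallel = parallel_block(x, toy_attention, toy_mlp)
```

In the serialized form the MLP must wait for the attention output. In the parallel form both branches read the same `normed` vector, which is what allows their input projections to be combined into a single larger matrix multiplication.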

• RoPE Embeddings
$$f_q(x_m) := W_q x_m$$
$$f_k(x_n, n) := W_k (x_n + \tilde{p}_r^k)$$
$$f_v(x_n, n) := W_v (x_n + \tilde{p}_r^v)$$
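Rotary position embeddings (RoPE), as described in the RoFormer paper, encode position by rotating each pair of query and key dimensions through an angle proportional to the token position, rather than by adding a position vector. The sketch below is a minimal pure-Python illustration under assumed conventions (interleaved dimension pairs, base 10000). The function names are hypothetical and it is not PaLM's implementation. It checks the defining property: the query-key score depends only on the relative offset between positions.

```python
import math

def rotate_pairs(x, position, base=10000.0):
    """Rotate each (x[i], x[i+1]) pair by position * base**(-i/d)."""
    d = len(x)
    assert d % 2 == 0, "vector length must be even"
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # per-pair rotation frequency
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = x[i], x[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (2) at two different absolute positions.
score_a = dot(rotate_pairs(q, 5), rotate_pairs(k, 3))
score_b = dot(rotate_pairs(q, 12), rotate_pairs(k, 10))
```

Because rotations preserve dot products up to the angle difference, `score_a` and `score_b` agree to floating-point precision even though the absolute positions differ.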

• Vocabulary
A SentencePiece vocabulary with 256k tokens, which was chosen to support the large number of languages in the training corpus without excess tokenization. The vocabulary is completely lossless and reversible.

2. Model Architecture
• cost savings

6.6 Multilingual Natural Language Generation

2. Contents

5. The key takeaways

6. Model Architecture

7. 2. Model Architecture

11. Multi-Query Attention

15. 2.1 Model Scale Hyperparameters

16. Model Architecture

17. Training

18. 3 Training Dataset

19. 4 Training Infrastructure

20. 4.1 Training Efficiency

21-25. 5 Training Setup