BERT: Pre-training of
Deep Bidirectional Transformers for
Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
Slides by Park JeeHyun
28 FEB 19
Contents
1. Motivation
2. Language Representations
3. Basic Idea
4. Model Architecture
5. How to use BERT
6. Results
7. Findings
1. Motivation
• Goal: Build a general, pre-trained language representation
model.
• Why: This model can be adapted to various NLP tasks easily,
we do not have to re-train a model from scratch every time.
• How: ?
2. Language Representations
1) Word Representations (Word embeddings)
• word2vec, GloVe
2) Contextual Representations
• Semi-Supervised Sequence Learning
• ELMo: Deep Contextual Word Embedding
• Generative Pre-Training
3) Problem with Previous Methods
2.1) Word Representation
Ref. [2]
2.1) Word Representation
2.2) Contextual Representations
2.2) Contextual Representations
• ELMo (Embeddings from Language Models)
2.2) Contextual Representations
• ELMo
• Deep Contextualized Word Representations
↘ Deep: neural network
↘ Contextualized: 𝑦 = 𝑓(𝑤𝑜𝑟𝑑, 𝑐𝑜𝑛𝑡𝑒𝑥𝑡)
↘ Word: words as the fundamental semantic unit
↘ Representations: embedding
Ref. [3]
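To make 𝑦 = 𝑓(𝑤𝑜𝑟𝑑, 𝑐𝑜𝑛𝑡𝑒𝑥𝑡) concrete, here is a tiny Python sketch (my own toy illustration, not the actual ELMo model) contrasting a static word2vec/GloVe-style lookup with a context-dependent embedding; the vectors and the mixing rule are made up.

```python
import numpy as np

# Toy 2-d "embeddings"; purely illustrative values
STATIC = {"bank": np.array([0.2, 0.7]),
          "river": np.array([0.9, 0.1]),
          "money": np.array([0.1, 0.8])}

def static_embedding(word):
    # word2vec/GloVe style: one fixed vector per word, regardless of context
    return STATIC[word]

def contextual_embedding(word, context):
    # ELMo in spirit only: y = f(word, context); here we simply mix the
    # word's static vector with the average of its context vectors
    ctx = np.mean([STATIC[w] for w in context if w in STATIC], axis=0)
    return 0.5 * STATIC[word] + 0.5 * ctx

print(static_embedding("bank"))                 # identical in every sentence
print(contextual_embedding("bank", ["river"]))  # shifts towards "river"
print(contextual_embedding("bank", ["money"]))  # shifts towards "money"
```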
2.2) Contextual Representations
• ELMo
Ref. [4]
2.2) Contextual Representations
• ELMo
2.2) Contextual Representations
• ELMo
2.2) Contextual Representations
• GPT (Generative Pre-Training)
Ref. [2]
2.2) Contextual Representations
• GPT
• Unsupervised pre-training
• Supervised fine-tuning
• Task-specific input transformations
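As a rough sketch of the first two ingredients (hypothetical shapes; the left-to-right Transformer itself is not implemented here): pre-training minimizes a next-token language-modeling loss, and fine-tuning adds a linear head on the final hidden state while keeping the LM loss as an auxiliary term, as described in Ref. [5].

```python
import numpy as np

def lm_nll(next_token_probs, token_ids):
    # Unsupervised pre-training objective: negative log-likelihood of each
    # token given its left context (the probabilities are assumed to come
    # from a left-to-right Transformer)
    return -np.mean([np.log(next_token_probs[i, t]) for i, t in enumerate(token_ids)])

def classification_loss(last_hidden, W, b, label, aux_lm_nll, aux_weight=0.5):
    # Supervised fine-tuning: linear classifier on the final hidden state,
    # plus the LM loss as an auxiliary objective
    logits = last_hidden @ W + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return -np.log(probs[label]) + aux_weight * aux_lm_nll

rng = np.random.default_rng(0)
fake_probs = rng.dirichlet(np.ones(50), size=4)   # fake next-token distributions
nll = lm_nll(fake_probs, [3, 17, 8, 41])
print(classification_loss(rng.normal(size=16), rng.normal(size=(16, 2)),
                          np.zeros(2), label=1, aux_lm_nll=nll))
```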
2.2) Contextual Representations
• GPT
Ref. [5]
2.3) Problem with Previous Methods
Ref. [2]
2.3) Problem with Previous Methods
Ref. [6]
2.3) Problem with Previous Methods
Ref. [2]
3. Basic Idea
1) Masked Language Model
2) Next Sentence Prediction
3) Input Representation
3.1) Masked Language Model
Ref. [2]
3.1) Masked Language Model
• Two downsides to the MLM approach
i. MLM creates a mismatch between pre-training and fine-tuning, since the [MASK] token is never seen during fine-tuning.
ii. MLM predicts only 15% of the tokens in each batch, which suggests that more pre-training steps may be required for the model to converge.
3.1) Masked Language Model
• Random replacement occurs for only 1.5% of all tokens (10% of the 15% selected), so it does not seem to harm the model’s language understanding capability.
• Keeping the selected word unchanged biases the representation towards the actual observed word.
Ref. [2]
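As a concrete illustration of the masking rule described in the paper (of the 15% of tokens selected for prediction, 80% become [MASK], 10% become a random token, and 10% stay unchanged), here is a minimal Python sketch; the toy vocabulary and whitespace tokenization are mine, not BERT's WordPiece pipeline.

```python
import random

VOCAB = ["the", "man", "went", "to", "a", "store", "bought", "milk"]

def mask_tokens(tokens, select_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # not selected for prediction
        targets[i] = tok                      # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = "[MASK]"              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(VOCAB)     # 10%: replace with a random token
        # else: 10%: keep the observed token unchanged
    return inputs, targets

inputs, targets = mask_tokens("the man went to the store to buy some milk".split())
print(inputs)
print(targets)
```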
3.1) Masked Language Model
3.1) Masked Language Model
Ref. [7]
3.2) Next Sentence Prediction
Ref. [2]
3.2) Next Sentence Prediction
Ref. [7]
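For reference, a minimal sketch (my own illustration, not the authors' data pipeline) of how NSP training pairs are built: half of the time sentence B is the actual next sentence (IsNext), and half of the time it is a random sentence from another document (NotNext).

```python
import random

def make_nsp_pair(document, other_documents, rng):
    # Pick a sentence and, with probability 0.5, pair it with its true
    # successor; otherwise pair it with a random sentence from another document.
    i = rng.randrange(len(document) - 1)
    sent_a = document[i]
    if rng.random() < 0.5:
        return sent_a, document[i + 1], "IsNext"
    other_doc = rng.choice(other_documents)
    return sent_a, rng.choice(other_doc), "NotNext"

doc1 = ["the man went to the store", "he bought a gallon of milk"]
doc2 = ["penguins are flightless birds", "they live in the southern hemisphere"]
rng = random.Random(0)
for _ in range(3):
    print(make_nsp_pair(doc1, [doc2], rng))
```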
3.3) Input Representation
Ref. [2]
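The slide from Ref. [2] shows BERT's input representation as the element-wise sum of token, segment, and position embeddings. The sketch below illustrates that sum with small random lookup tables; the sizes and token ids are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
H, VOCAB_SIZE, MAX_LEN = 8, 100, 16           # toy sizes, not BERT's real ones

token_emb    = rng.normal(size=(VOCAB_SIZE, H))
segment_emb  = rng.normal(size=(2, H))        # segment A = 0, segment B = 1
position_emb = rng.normal(size=(MAX_LEN, H))

token_ids   = np.array([1, 7, 42, 2, 9, 13])  # e.g. [CLS] tok tok [SEP] tok tok
segment_ids = np.array([0, 0, 0, 0, 1, 1])
positions   = np.arange(len(token_ids))

# Input representation: element-wise sum of the three embeddings
input_repr = token_emb[token_ids] + segment_emb[segment_ids] + position_emb[positions]
print(input_repr.shape)                        # (6, 8)
```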
4. Model Architecture
1) Transformer
2) GELUs
4.1) Transformer
Ref. [2]
4.1) Transformer
4.1) Transformer
4.2) GELUs
• Gaussian Error Linear Units
• An activation function that combines properties from dropout, zoneout, and ReLUs.
• ReLU
• deterministically multiplies the input by zero or one.
• dropout
• stochastically multiplies the input by zero.
• zoneout
• stochastically multiplies the input by one.
• To build a new activation function called GELU,
the authors merge these functionalities by multiplying the input by
zero or one, but the values of this zero-one mask are
stochastically determined while also dependent upon the input.
Ref. [8]
4.2) GELUs
• GELU’s zero-one mask
• multiply the neuron input 𝑥 by 𝑚 ~ Bernoulli(Φ(𝑥)), where Φ(𝑥) = 𝑃(𝑋 ≤ 𝑥), 𝑋 ~ 𝑁(0, 1), is the cumulative distribution function of the standard normal distribution.
Ref. [8]
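Taking the expectation of that zero-one mask gives GELU(x) = x · Φ(x). The sketch below computes the exact form via the error function, alongside the tanh approximation given in Ref. [8].

```python
import numpy as np
from scipy.special import erf

def gelu(x):
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF
    return 0.5 * x * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    # Tanh approximation from the GELU paper
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3, 3, 7)
print(gelu(x))
print(gelu_tanh(x))   # very close to the exact values
```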
5. How to use BERT
1) Fine Tuning
2) Task-Specific Models
5.1) Fine Tuning
• Requires only ONE additional output layer
Ref. [2]
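For a sentence-level task, that single additional layer is just a linear classifier over the final hidden state of the [CLS] token. The sketch below (plain NumPy, with a made-up encoder output standing in for BERT) shows the idea.

```python
import numpy as np

rng = np.random.default_rng(0)
H, NUM_LABELS = 768, 2                       # BERT-Base hidden size, binary task

# Pretend this came from the BERT encoder: (seq_len, H); row 0 is [CLS]
encoder_output = rng.normal(size=(128, H))

# The ONE task-specific output layer added for classification
W = rng.normal(scale=0.02, size=(H, NUM_LABELS))
b = np.zeros(NUM_LABELS)

logits = encoder_output[0] @ W + b           # use only the [CLS] representation
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)
```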
5.2) Task-Specific Models
6. Results
6. Results
Ref. [9]
7. Findings
1) Is masked language modeling really more effective than sequential language modeling?
2) Is the next sentence prediction task necessary?
3) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
4) Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to
achieve high fine-tuning accuracy?
5) Does masked language modeling converge more slowly than left-to-right language modeling pretraining
(since masked language modeling only predicts 15% of the input tokens whereas left-to-right language
modeling predicts all of the tokens)?
6) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature extractor?
Ref. [10]
7.1)
Q) Is masked language modeling really more effective than sequential language modeling?
Ans) yes.
The authors tried training the Transformer on a left-to-right (LTR) language modeling task
instead of the masked language modeling task. The results for this setup can be seen in
the third row of the table below (“LTR & No NSP”).
7.2)
Q) Is the next sentence prediction task necessary?
Ans) yes.
For natural language inference and question answering (the MNLI-m, QNLI, and SQuAD
datasets), next sentence prediction seems to help a lot. For paraphrase detection (MRPC),
the performance change is much smaller, and for sentiment analysis (SST-2) the results
are virtually the same.
7.3)
Q) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
Ans) yes.
7.4)
Q) Does BERT really need such a large amount of pre-training (128,000 words/batch *
1,000,000 steps) to achieve high fine-tuning accuracy?
Ans) yes.
BERT-Base achieves almost 1.0% additional accuracy on MNLI when trained for 1M steps compared to 500k steps.
7.5)
Q) Does masked language modeling converge more slowly than left-to-right language modeling
pretraining (since masked language modeling only predicts 15% of the input tokens whereas
left-to-right language modeling predicts all of the tokens)?
Ans) yes & no.
For the MNLI task, left-to-right language modeling does converge faster, but masked language modeling achieves much higher accuracy with the same number of steps.
7.6)
Q) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature
extractor?
Ans) not necessarily; BERT also works well as a fixed feature extractor.
The authors tested how a BiLSTM model that used fixed embeddings extracted from BERT would perform on the CoNLL-NER dataset; the results are shown in the accompanying table.
It turns out that using a concatenation of the hidden activations from the last four layers provides very strong performance, only 0.3 F1 behind fine-tuning the entire model. For those on a strict computational budget, this feature extraction approach is a good option.
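A minimal sketch of that feature-extraction setup (illustrative NumPy with made-up hidden states; the encoder and the downstream BiLSTM are not implemented): concatenate the hidden activations of the last four encoder layers to obtain fixed per-token features.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_LAYERS, SEQ_LEN, H = 12, 32, 768         # BERT-Base-like shapes, toy values

# Pretend these are the hidden states of every encoder layer for one sentence
all_layers = rng.normal(size=(NUM_LAYERS, SEQ_LEN, H))

# Fixed features for the tagger: concatenation of the last four layers
features = np.concatenate([all_layers[i] for i in range(-4, 0)], axis=-1)
print(features.shape)                        # (32, 3072) -> input to the BiLSTM
```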
References
[1] Pretrained Deep Bidirectional Transformers for Language Understanding (algorithm) | TDLS
(https://youtu.be/BhlOGGzC0Q0)
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
(https://nlp.stanford.edu/seminar/details/jdevlin.pdf)
[3] Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids
(http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)
[4] Word Embedding—ELMo
(https://medium.com/@online.rajib/word-embedding-elmo-7369c8f29bfc)
[5] Improving Language Understanding by Generative Pre-Training
(https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
[6] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
(https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
[7] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
(http://jalammar.github.io/illustrated-bert/)
[8] Gaussian Error Linear Units (GELUs)
(https://arxiv.org/abs/1606.08415)
[9] GLUE Benchmark
(https://gluebenchmark.com)
[10] Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
(http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)
Editor's Notes
1. Feature-based approaches
2. s = softmax-normalized weights; r = scalar parameter
3. Fine-tuning approaches
4. ELMo & GPT are unidirectional???
5. Incrementally??? Deep bidirectionality vs. ELMo-style shallow bidirectionality
6. Incrementally???
7. Random word → The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token. Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model’s language understanding capability. Keep same → The purpose of this is to bias the representation towards the actual observed word.
8. Embedding → element-wise adding
9. The Transformer learns features from all other words in the sequence.
10. Linear decay = why?
11. GLUE = The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. https://gluebenchmark.com/leaderboard
12. Yes & no???
13. CoNLL-NER (Named Entity Recognition): entities are annotated with LOC (location), ORG (organisation), PER (person), and MISC (miscellaneous).