[Paper review] BERT
1. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
Slides by Park JeeHyun
28 FEB 19
3. 1. Motivation
• Goal: Build a general, pre-trained language representation model.
• Why: This model can be adapted to various NLP tasks easily, so we do not have to re-train a model from scratch every time.
• How: ?
4. 2. Language Representations
1) Word Representations (Word embeddings)
• word2vec, GloVe
2) Contextual Representations
• Semi-Supervised Sequence Learning
• ELMo: Deep Contextual Word Embedding
• Generative Pre-Training (GPT)
3) Problem with Previous Methods
9. 2.2) Contextual Representations
• ELMo
• Deep Contextualized Word Representations
• Deep → a neural network
• Contextualized → $y = f(\text{word}, \text{context})$ (see the sketch after this slide)
• Word → words as the fundamental semantic unit
• Representations → embeddings
Ref. [3]
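To make $y = f(\text{word}, \text{context})$ concrete, here is a minimal sketch (not from the slides) using AllenNLP's older ElmoEmbedder interface; the class path and `embed_sentence` signature are assumptions based on pre-1.0 AllenNLP releases, and the sentences are toy examples.

```python
# Sketch only: assumes the ElmoEmbedder API shipped with older AllenNLP releases.
from allennlp.commands.elmo import ElmoEmbedder
import numpy as np

elmo = ElmoEmbedder()  # downloads the default pre-trained ELMo weights

s1 = "I deposited cash at the bank".split()
s2 = "We sat on the grassy river bank".split()

# embed_sentence returns an array of shape (3 layers, num_tokens, 1024);
# take the top layer's vector for the token "bank" in each sentence.
v1 = elmo.embed_sentence(s1)[-1][s1.index("bank")]
v2 = elmo.embed_sentence(s2)[-1][s2.index("bank")]

cos = v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(cos)  # < 1: the same word gets different vectors in different contexts
```

A static embedding like word2vec or GloVe would return the identical vector for "bank" in both sentences; a contextual model does not.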
21. 3.1) Masked Language Model
• Two downsides to MLM approach
i. MLM creates a mismatch between pre-training and fine-tuning,
since the [MASK] token is never seen during fine-tuning.
ii. MLM predicts only 15% of tokens in each batch, which suggests
that more pre-training steps may be required for the model to
converge.
22. 3.1) Masked Language Model
15% × 10% = 1.5%: random replacement occurs for only 1.5% of all tokens
(10% of the 15% of positions selected for prediction), so it does not seem
to harm the model's language understanding capability.
Keeping a selected token unchanged serves to bias the representation
towards the actual observed word.
Ref. [2]
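Those percentages come from the paper's masking rule: 15% of input positions are selected for prediction; of those, 80% become [MASK], 10% become a random token, and 10% keep the observed word. A minimal sketch of that rule (the toy vocabulary and function name are mine, not the paper's):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]  # toy vocabulary

def mask_tokens(tokens, select_prob=0.15):
    """BERT-style masking: select ~15% of positions for prediction;
    of those, 80% -> [MASK], 10% -> random token, 10% -> unchanged."""
    out, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < select_prob:
            targets[i] = tok                    # model must predict the original
            r = random.random()
            if r < 0.8:
                out[i] = MASK                   # 15% * 80% = 12% of all tokens
            elif r < 0.9:
                out[i] = random.choice(VOCAB)   # 15% * 10% = 1.5% of all tokens
            # else: keep the observed word      # 15% * 10% = 1.5% of all tokens
    return out, targets

print(mask_tokens("the cat sat on the mat".split()))
```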
32. 4.2) GELUs
• Gaussian Error Linear Units
• An activation function that combines properties of dropout, zoneout, and ReLU.
• ReLU
• deterministically multiplies the input by zero or one.
• dropout
• stochastically multiplies the input by zero.
• zoneout
• stochastically multiplies inputs by one.
• To build a new activation function called GELU,
the authors merge these functionalities by multiplying the input by
zero or one, but the values of this zero-one mask are
stochastically determined while also dependent upon the input.
Ref. [8]
33. 4.2) GELUs
• GELU’s zero-one mask
• multiply the neuron input $x$ by $m \sim \mathrm{Bernoulli}(\Phi(x))$, where $\Phi(x) = P(X \le x)$, $X \sim \mathcal{N}(0, 1)$, is the cumulative distribution function of the standard normal distribution.
• Taking the expectation of this stochastic mask yields the deterministic activation $\mathrm{GELU}(x) = x\,\Phi(x)$.
Ref. [8]
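A minimal NumPy sketch (mine, not the paper's) of the exact form $x\,\Phi(x)$ and of the tanh approximation given in [8]:

```python
import numpy as np
from scipy.stats import norm

def gelu(x):
    """Exact GELU: the expectation of x * m with m ~ Bernoulli(Phi(x)),
    which equals x * Phi(x)."""
    return x * norm.cdf(x)

def gelu_tanh(x):
    """The tanh approximation from the GELU paper [8]."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.linspace(-3.0, 3.0, 7)
print(gelu(x))
print(gelu_tanh(x))  # agrees with the exact form to roughly 1e-3
```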
34. 5. How to use BERT
1) Fine-Tuning (a minimal sketch follows below)
2) Task-Specific Models
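A minimal fine-tuning sketch using the Hugging Face transformers library (a later convenience wrapper, not what the authors used; the checkpoint id, toy data, and learning rate are assumptions, though 2e-5 is within the paper's recommended range):

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Load the pre-trained checkpoint and attach a fresh classification head.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)

batch = tokenizer(["the movie was great", "the movie was terrible"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One fine-tuning step: every BERT parameter is updated end-to-end,
# together with the new classification head.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```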
39. 7. Findings
1) Is masked language modeling really more effective than sequential language modeling?
2) Is the next sentence prediction task necessary?
3) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
4) Does BERT really need such a large amount of pre-training (128,000 words/batch * 1,000,000 steps) to
achieve high fine-tuning accuracy?
5) Does masked language modeling converge more slowly than left-to-right language modeling pretraining
(since masked language modeling only predicts 15% of the input tokens whereas left-to-right language
modeling predicts all of the tokens)?
6) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature extractor?
Ref. [10]
40. 7.1)
Q) Is masked language modeling really more effective than sequential language modeling?
Ans) yes.
The authors tried training the Transformer on a left-to-right (LTR) language modeling task
instead of the masked language modeling task. In the paper's ablation table, this
setup ("LTR & No NSP") performs markedly worse, especially on MRPC and SQuAD.
41. 7.2)
Q) Is the next sentence prediction task necessary?
Ans) yes.
For natural language inference and question answering (the MNLI-m, QNLI, and SQuAD
datasets), next sentence prediction seems to help a lot. For paraphrase detection (MRPC),
the performance change is much smaller, and for sentiment analysis (SST-2) the results
are virtually the same.
42. 7.3)
Q) Should I use a larger BERT model (a BERT model with more parameters) whenever possible?
Ans) yes.
In the paper's ablation, the larger model improves accuracy across all reported
tasks, even those with very little labeled training data (e.g., MRPC, with only
3,600 training examples).
43. 7.4)
Q) Does BERT really need such a large amount of pre-training (128,000 words/batch *
1,000,000 steps) to achieve high fine-tuning accuracy?
Ans) yes.
BERT-Base achieves almost 1.0% additional accuracy on MNLI when trained for 1M steps
compared to 500k steps.
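As a rough sanity check on those numbers (the 3.3B-word corpus size and the ~40-epoch figure come from the BERT paper, not this slide):

```latex
128{,}000\ \tfrac{\text{words}}{\text{batch}} \times 1{,}000{,}000\ \text{steps}
  = 1.28 \times 10^{11}\ \text{words}
  \approx 40\ \text{epochs over the } 3.3 \times 10^{9}\text{-word corpus}
```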
44. 7.5)
Q) Does masked language modeling converge more slowly than left-to-right language modeling
pretraining (since masked language modeling only predicts 15% of the input tokens whereas
left-to-right language modeling predicts all of the tokens)?
Ans) yes & no.
For the MNLI task, left-to-right language modeling does converge faster, but masked
language modeling achieves a much higher accuracy with the same number of steps.
45. 7.6)
Q) Do I have to fine-tune the entire BERT model? Can’t I just use BERT as a fixed feature
extractor?
Ans) no; BERT works well as a fixed feature extractor.
The authors tested how a BiLSTM model that used fixed embeddings extracted from
BERT would perform on the CoNLL-NER dataset. It turns out that using a
concatenation of the hidden activations from the last four layers provides very
strong performance, only 0.3 F1 behind fine-tuning the entire model. For those on a
strict computational budget, this feature extraction approach is a good option
(see the sketch below).
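A minimal sketch of that feature-extraction setup with the Hugging Face transformers library (again a later wrapper; the checkpoint id and example sentence are assumptions). It concatenates the last four layers' hidden activations per token, which a downstream BiLSTM tagger could then consume:

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased",
                                  output_hidden_states=True)
model.eval()  # BERT stays frozen; only the downstream tagger would be trained

batch = tokenizer("John lives in New York", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**batch).hidden_states  # embeddings + 12 layers

# Per-token features: concatenate the last four layers' activations.
features = torch.cat(hidden_states[-4:], dim=-1)  # (1, seq_len, 4 * 768)
print(features.shape)
```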
46. References
[1] Pretrained Deep Bidirectional Transformers for Language Understanding (algorithm) | TDLS
(https://youtu.be/BhlOGGzC0Q0)
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
(https://nlp.stanford.edu/seminar/details/jdevlin.pdf)
[3] Improving a Sentiment Analyzer using ELMo — Word Embeddings on Steroids
(http://www.realworldnlpbook.com/blog/improving-sentiment-analyzer-using-elmo.html)
[4] Word Embedding—ELMo
(https://medium.com/@online.rajib/word-embedding-elmo-7369c8f29bfc)
[5] Improving Language Understanding by Generative Pre-Training
(https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf)
[6] Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing
(https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html)
[7] The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning)
(http://jalammar.github.io/illustrated-bert/)
[8] Gaussian Error Linear Units (GELUs)
(https://arxiv.org/abs/1606.08415)
[9] GLUE Benchmark
(https://gluebenchmark.com)
[10] Paper Dissected: “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding” Explained
(http://mlexplained.com/2019/01/07/paper-dissected-bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding-explained/)
Editor's notes
Feature-based approaches (ELMo): the task representation is a weighted sum of the
biLM layers, $\mathrm{ELMo}_k = \gamma \sum_j s_j\, h_{k,j}$, where $s$ are the
softmax-normalized weights and $\gamma$ is a scalar parameter.
Fine-tuning approaches
ELMo & GPT are unidirectional (GPT strictly left-to-right; ELMo only concatenates two unidirectional LMs).
Deep bidirectionality vs. ELMo-style shallow bidirectionality
Incrementally???
Random word
The Transformer encoder does not know which words it will be asked to predict or which have been replaced by random words, so it is forced to keep a distributional contextual representation of every input token.
Additionally, because random replacement only occurs for 1.5% of all tokens (i.e., 10% of 15%), this does not seem to harm the model’s language understanding capability.
Keep same: the purpose of this is to bias the representation towards the actual observed word.
Embeddings: the token, segment, and position embeddings are combined by element-wise addition (a sketch follows below).
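A minimal sketch of that element-wise addition (the dimensions follow BERT-Base; the example ids are arbitrary):

```python
import torch

vocab_size, max_len, n_segments, d = 30522, 512, 2, 768  # BERT-Base sizes
tok_emb = torch.nn.Embedding(vocab_size, d)
pos_emb = torch.nn.Embedding(max_len, d)
seg_emb = torch.nn.Embedding(n_segments, d)

ids = torch.tensor([[101, 7592, 2088, 102]])        # e.g. [CLS] hello world [SEP]
positions = torch.arange(ids.size(1)).unsqueeze(0)  # 0, 1, 2, 3
segments = torch.zeros_like(ids)                    # all tokens in sentence A

# BERT's input representation: element-wise sum of the three embeddings
x = tok_emb(ids) + pos_emb(positions) + seg_emb(segments)  # (1, 4, 768)
print(x.shape)
```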
The Transformer learns features from all other words in the sequence.
Linear decay of the learning rate = why?
GLUE = The General Language Understanding Evaluation (GLUE) benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems.
https://gluebenchmark.com/leaderboard
Yes & no???
CoNLL-NER (Named Entity Recognition)
Entities are annotated with LOC (location), ORG (organisation), PER (person), and MISC (miscellaneous).