2. Concept
1. A pre-trained language representation model built on the Transformer architecture, pre-trained on 2 tasks
a. Randomly mask some words within a sequence and let the model predict the masked words (see the masking sketch below)
b. Predict whether the second sequence of a pair actually follows the first in a larger context: “next sentence prediction”
2. Can be used for transfer learning (similar to models pre-trained on ImageNet in computer vision)
a. Pre-train on a large corpus with unsupervised learning to learn the language representation
b. Fine-tune the model for specific tasks: text classification, named entity recognition, SQuAD
Devlin et al., 2018
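A minimal sketch of the masked-word task (a.), assuming a plain whitespace tokenizer, a 15% masking rate, and the 80/10/10 replacement rule described in the BERT paper; the function and vocabulary here are illustrative only:

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Randomly pick ~15% of tokens as prediction targets (BERT-style masking).

    Of the picked tokens: 80% are replaced by [MASK], 10% by a random token,
    and 10% are left unchanged. Returns the corrupted sequence plus labels
    (None marks positions the model does not have to predict).
    """
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                    # the model must recover this token
            r = random.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = random.choice(vocab)
            # else: keep the original token unchanged
    return corrupted, labels

# Example
vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = "the cat sat on the mat".split()
print(mask_tokens(tokens, vocab))
```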
3. Architecture
1. Encoder: the encoder stack from the Transformer
a. Base model: N=12, hidden dim=768, heads=12
b. Large model: N=24, hidden dim=1024, heads=16
2. Embedding: the input representation is the sum of token, segment (sentence A/B), and positional embeddings (see the sketch below)
[Figure: Transformer encoder block (Multi-Head Attention → Add & Norm → Feed Forward → Add & Norm), stacked N×, with Token + Segment + Positional Embeddings as input and a Linear + Softmax output layer]
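A minimal PyTorch sketch of this embedding scheme, assuming base-model sizes (hidden dim 768, a 30k-word vocabulary, 512 max positions); the class and argument names are illustrative, not the reference implementation:

```python
import torch
import torch.nn as nn

class BertStyleEmbedding(nn.Module):
    """Input representation = token + segment + positional embeddings, then LayerNorm."""

    def __init__(self, vocab_size=30000, hidden_dim=768, max_len=512, num_segments=2):
        super().__init__()
        self.token = nn.Embedding(vocab_size, hidden_dim)
        self.segment = nn.Embedding(num_segments, hidden_dim)   # sentence A vs sentence B
        self.position = nn.Embedding(max_len, hidden_dim)       # learned positions, not sinusoidal
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, token_ids, segment_ids):
        # token_ids, segment_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device).unsqueeze(0)
        x = self.token(token_ids) + self.segment(segment_ids) + self.position(positions)
        return self.norm(x)                                     # (batch, seq_len, hidden_dim)

# Example: one sequence of 6 tokens, all tagged as sentence A
emb = BertStyleEmbedding()
ids = torch.randint(0, 30000, (1, 6))
seg = torch.zeros(1, 6, dtype=torch.long)
print(emb(ids, seg).shape)   # torch.Size([1, 6, 768])
```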
5. Fine-tuning on SQuAD
Use the output hidden states to predict the start and end of the answer span
Apply one Linear layer (output=2) to the output hidden state vectors T’i
The outputs are predictions of the starting and ending positions of the answer within the input paragraph
The objective function is the log-likelihood of the correct start and end positions (see the sketch below)
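A minimal PyTorch sketch of this span-prediction head, assuming hidden dim 768 and already-computed encoder outputs; the names and stand-in tensors are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpanHead(nn.Module):
    """One Linear(hidden_dim -> 2) applied to every token's hidden state T'i:
    column 0 scores the token as the answer start, column 1 as the answer end."""

    def __init__(self, hidden_dim=768):
        super().__init__()
        self.qa_outputs = nn.Linear(hidden_dim, 2)

    def forward(self, hidden_states):                 # (batch, seq_len, hidden_dim)
        logits = self.qa_outputs(hidden_states)       # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)
        return start_logits, end_logits               # each (batch, seq_len)

# Objective: (negative) log-likelihood of the correct start and end positions
head = SpanHead()
hidden = torch.randn(2, 128, 768)                     # stand-in encoder output for 2 examples
start_logits, end_logits = head(hidden)
start_pos = torch.tensor([10, 25])                    # gold start token indices
end_pos = torch.tensor([12, 30])                      # gold end token indices
loss = F.cross_entropy(start_logits, start_pos) + F.cross_entropy(end_logits, end_pos)
print(loss.item())
```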
6. Results on SQuAD
SQuAD 1.1: new SOTA
SQuAD 2.0: used as the pre-trained base of the top leaderboard entries
https://rajpurkar.github.io/SQuAD-explorer/
7. Improving on BERT
RoBERTa
1. Train longer, bigger batches, more data
2. Remove next-sentence-prediction task
3. Longer sequences
4. Dynamic masking: the mask pattern is re-sampled each time a sequence is seen
ALBERT
1. Factorized embedding parameterization (see the parameter-count sketch below)
2. Cross-layer parameter sharing
3. Inter-sentence coherence loss (sentence-order prediction)
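A minimal sketch of the factorized embedding parameterization, comparing parameter counts under assumed sizes (vocab V=30k, embedding dim E=128, hidden dim H=768); the numbers and class name are illustrative:

```python
import torch.nn as nn

V, E, H = 30000, 128, 768            # vocab size, embedding dim, hidden dim (assumed)

bert_style = V * H                   # one V x H matrix: 23,040,000 parameters
albert_style = V * E + E * H         # V x E plus an E x H projection: 3,938,304 parameters
print(bert_style, albert_style)      # roughly 5.8x fewer embedding parameters

class FactorizedEmbedding(nn.Module):
    """Token ids -> small E-dim embedding -> projected up to the H-dim hidden size."""

    def __init__(self, vocab_size=V, embed_dim=E, hidden_dim=H):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.project = nn.Linear(embed_dim, hidden_dim, bias=False)

    def forward(self, token_ids):
        return self.project(self.embed(token_ids))
```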
8. References
1. Devlin et al., BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
2. https://colab.research.google.com/github/tensorflow/tpu/blob/master/tools/colab/bert_finetuning_with_cloud_tpus.ipynb
3. https://github.com/google-research/bert
4. PyTorch version: https://github.com/huggingface/pytorch-pretrained-BERT
5. Liu et al., RoBERTa: A Robustly Optimized BERT Pretraining Approach
6. Lan et al., ALBERT: A Lite BERT for Self-supervised Learning of Language Representations