SigOpt Machine Learning Engineer Meghana Ravikumar explains how she used knowledge distillation, optimized with SigOpt's Experiment Management functionality, to shrink a BERT natural-language model fine-tuned on the SQuAD 2.0 question-answering dataset while maintaining its performance.
Slide 3: Two main questions
Can we understand the trade-offs made during model compression?
Can we find a model architecture that fits our needs?
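These two questions map naturally onto a multimetric hyperparameter search. As a rough illustration, here is a minimal sketch of how such a trade-off experiment might be set up with SigOpt's Python client; the parameter names, bounds, metric names, and the train_and_evaluate helper are illustrative assumptions, not the configuration used in the talk.

```python
from sigopt import Connection

conn = Connection(client_token="YOUR_API_TOKEN")

# Multimetric experiment: maximize accuracy while minimizing model size.
# Parameters and bounds below are hypothetical architecture knobs.
experiment = conn.experiments().create(
    name="Distilled BERT architecture search",
    parameters=[
        dict(name="n_layers", type="int", bounds=dict(min=2, max=8)),
        dict(name="n_heads", type="int", bounds=dict(min=2, max=12)),
    ],
    metrics=[
        dict(name="f1", objective="maximize"),
        dict(name="model_size", objective="minimize"),
    ],
    observation_budget=50,
)

while experiment.progress.observation_count < experiment.observation_budget:
    suggestion = conn.experiments(experiment.id).suggestions().create()
    # train_and_evaluate is a hypothetical helper that trains a student
    # with the suggested architecture and reports both metrics.
    f1, size = train_and_evaluate(suggestion.assignments)
    conn.experiments(experiment.id).observations().create(
        suggestion=suggestion.id,
        values=[dict(name="f1", value=f1), dict(name="model_size", value=size)],
    )
    experiment = conn.experiments(experiment.id).fetch()
```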
Slide 5: Distilling BERT for Question Answering
[Diagram: BERT, pre-trained for language modeling, is fine-tuned on SQuAD 2.0 to serve as the teacher. The student model trains on SQuAD 2.0 with two losses: a soft-target loss against the teacher's outputs and a hard-target loss against the ground-truth labels, yielding the trained student model.]
For more on distillation: Hinton et al. (2015); DistilBERT (Sanh et al., 2019).
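To make the two losses concrete, here is a minimal PyTorch sketch of the combined distillation objective in the style of Hinton et al. (2015). It is shown for a generic classification head (for SQuAD the same loss is applied to the start- and end-position logits), and the temperature T and mixing weight alpha are illustrative hyperparameters, not values from the talk.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target loss: KL divergence between temperature-softened
    # student and teacher distributions (scaled by T^2, per Hinton et al.).
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target loss: ordinary cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    # Weighted combination of the two losses.
    return alpha * soft + (1.0 - alpha) * hard
```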
Slide 10: Establishing a Baseline
[Diagram: the same distillation pipeline, annotated with question marks marking the open choices that a baseline must pin down.]
Slide 11: Establishing a Baseline: Training from scratch
[Diagram: the distillation pipeline with DistilBERT as the student, trained from scratch on SQuAD 2.0 using the standard soft-target and hard-target losses against a teacher BERT fine-tuned on SQuAD 2.0; question marks remain over the still-open choices.]
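For this baseline the student is a DistilBERT whose weights are randomly initialized rather than loaded from pre-training. A minimal sketch of that setup, assuming the Hugging Face transformers library (a tooling assumption; the slides do not specify the implementation):

```python
from transformers import DistilBertConfig, DistilBertForQuestionAnswering

# "From scratch": build the student from a bare config, which gives
# randomly initialized weights with the default DistilBERT architecture.
config = DistilBertConfig()
student = DistilBertForQuestionAnswering(config)
```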
Slide 13: Establishing a Baseline: Warm starting the model
[Diagram: the same pipeline, but the DistilBERT student is warm-started from the pretrained weights of DistilBERT pre-trained for language modeling, then distilled on SQuAD 2.0 with the standard soft-target and hard-target losses against the fine-tuned BERT teacher.]
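In contrast, the warm-started baseline loads DistilBERT's pre-trained language-modeling weights into the student before distillation. A minimal sketch, again assuming the Hugging Face transformers library:

```python
from transformers import DistilBertForQuestionAnswering

# Warm start: load DistilBERT's pre-trained language-modeling weights;
# only the question-answering head on top is newly initialized.
student = DistilBertForQuestionAnswering.from_pretrained("distilbert-base-uncased")
```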
Slide 28: Learn more about SigOpt
Read our research and product blog.
Check out our YouTube channel to see more videos.
Sign up to try out SigOpt for free.
Join the Experiment Management beta.
Read the full work on Nvidia’s dev blog.