Recent advances in deep learning have increased the demand for neural models in real-world applications. In practice, these applications often need to be deployed with limited resources while maintaining high accuracy. This paper addresses the core of neural models in NLP, word embeddings, and presents a new embedding distillation framework that remarkably reduces the dimension of word embeddings without compromising accuracy. A novel distillation ensemble approach is also proposed that trains a highly efficient student model using multiple teacher models. In our approach, the teacher models play roles only during training, so the student model operates on its own without support from the teacher models during decoding, which makes it eighty times faster and lighter than typical ensemble methods. All models are evaluated on seven document classification datasets and show significant advantages over the teacher models in most cases. Our analysis depicts insightful transformations of word embeddings from distillation and suggests a future direction for ensemble approaches using neural models.
The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning
1. The Pupil Has Become the Master: Teacher-Student Model-Based Word Embedding Distillation with Ensemble Learning
Bonggun Shin, Hao Yang and Jinho D. Choi
IJCAI 2019
2. Typical NLP Models
• Input: a sequence of tokens
• Each token is transformed into a vector
• Once vectorized, any neural network can be attached
• Output: a class or a sequence (depending on the task)
[Figure: Input -> Embeddings -> CNN/LSTM -> Output]
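To make this pipeline concrete, here is a minimal sketch in PyTorch; the LSTM encoder and all layer sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextClassifier(nn.Module):
    """Token indices -> embedding lookup -> encoder -> class logits."""
    def __init__(self, vocab_size, emb_dim=400, hidden=100, num_classes=2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)   # one vector per token
        self.encoder = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, tokens):           # tokens: (batch, seq_len) int64
        x = self.emb(tokens)             # (batch, seq_len, emb_dim)
        _, (h, _) = self.encoder(x)      # h: (1, batch, hidden)
        return self.out(h[-1])           # (batch, num_classes)
```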
3. Why is Compressing an Embedding Beneficial?
• The majority of model space is occupied by word embeddings
• NLP model compression -> embedding compression
• Neural model compression methods
  • Weight pruning
  • Weight quantization
  • Lossless compression
  • Distillation
[Figure: Input -> Embeddings -> CNN/LSTM -> Output, with the embedding block dominating model size.]
4. Embedding Compression
• V: the size of the vocabulary
• D: the size of an original embedding vector
• D’: the size of a compressed embedding vector
[Figure: a V×D embedding matrix is compressed into a V×D’ matrix.]
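A quick back-of-the-envelope check of these sizes, using the 2.67M-word Amazon vocabulary and the 400-dimensional embeddings mentioned later in the deck (D’ = 50, giving the 8× setting, is an assumption):

```python
V, D, D_small = 2_670_000, 400, 50       # D' = 50 assumed for the 8x setting

bytes_per_float = 4                      # float32
large = V * D * bytes_per_float          # ~4.0 GiB
small = V * D_small * bytes_per_float    # ~0.5 GiB
print(f"original:   {large / 2**30:.1f} GiB")
print(f"compressed: {small / 2**30:.1f} GiB")
```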
5. Embedding Encoding
• Proposed by [Mou et al., 2016]: 8× compression, but with performance loss
• Not a teacher-student model; trains a single model with an encoding (projection) layer
• Encoding (projection) layer: pre-trained large embedding -> smaller embedding
• Discards the large embedding at model deployment
[Figure: embedding-encoding architecture. During training, a projection layer maps the large pre-trained embedding to a small embedding, followed by the network, logit, and softmax layers; at model deployment, only the small embedding and the layers above it are kept.]
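A minimal sketch of such an encoding layer in PyTorch, assuming a frozen pre-trained embedding and a tanh projection (the nonlinearity and sizes are assumptions, not Mou et al.'s exact setup):

```python
import torch
import torch.nn as nn

class EncodedEmbedding(nn.Module):
    """Looks up a large frozen embedding, then projects it to D' dimensions."""
    def __init__(self, pretrained, d_small=50):   # pretrained: (V, D) tensor
        super().__init__()
        self.large = nn.Embedding.from_pretrained(pretrained, freeze=True)
        self.proj = nn.Linear(pretrained.size(1), d_small)

    def forward(self, tokens):
        return torch.tanh(self.proj(self.large(tokens)))

# At deployment, precompute the small table once and drop the large one:
# small_table = torch.tanh(enc.proj(enc.large.weight)).detach()  # (V, D')
```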
6. We Propose
• Embedding Distillation
• Single teacher
• Better performance with an 8× smaller model size
• Distillation Ensemble
• Ten teachers
• Better performance with an 80× smaller model size
7. Background
Teacher-Student Models
• Teacher trains on (Dataset.X, Dataset.Y); student trains on (Dataset.X, Teacher.Output)
• Logit matching [Ba and Caruana, 2014]
• Noisy logit matching [Sau and Balasubramanian, 2016]
• Softmax tau matching [Hinton et al., 2014]
[Figure: three teacher-student configurations, each pairing the teacher's logit/softmax output with the student's. Logit matching uses the teacher's logits together with the gold labels; noisy logit matching adds noise to the teacher's logits; softmax tau matching uses the teacher's temperature-scaled softmax.]
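The three matching objectives can be sketched as follows (a minimal version; the noise scale sigma and temperature tau are assumed values, not the cited papers' tuned settings):

```python
import torch
import torch.nn.functional as F

def logit_matching(student_z, teacher_z):
    # [Ba and Caruana, 2014]: regress the student's logits onto the teacher's.
    return F.mse_loss(student_z, teacher_z)

def noisy_logit_matching(student_z, teacher_z, sigma=0.1):
    # [Sau and Balasubramanian, 2016]: perturb the teacher's logits with noise.
    noise = torch.randn_like(teacher_z) * sigma
    return F.mse_loss(student_z, teacher_z + noise)

def softmax_tau_matching(student_z, teacher_z, tau=2.0):
    # [Hinton et al., 2014]: match temperature-softened distributions.
    log_p = F.log_softmax(student_z / tau, dim=-1)
    q = F.softmax(teacher_z / tau, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * tau * tau
```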
8. Datasets
• Seven document classification tasks
• Pre-trained word embeddings (trained with Word2Vec [Mikolov et al., 2013])
• Sentiment analysis (SST-*, MR, and CR): Amazon review data (2.67M word vectors)
• Others: Wikipedia + New York Times
All models are evaluated on seven document classification datasets in Table 1. MR, SST-*, and CR are targeted at the task of sentiment analysis while Subj, TREC, and MPQA are targeted at the classifications of subjectivity, question types, and opinion polarity, respectively. About 10% of the training sets are split into development sets for SST-* and TREC, and about 10/20% of the provided resources are divided into development/evaluation sets for the other datasets, respectively.
Dataset                       C   TRN    DEV    TST
MR [Pang and Lee, 2005]       2   7,684  990    1,988
SST-1 [Socher et al., 2013]   5   8,544  1,101  2,210
SST-2 [Socher et al., 2013]   2   6,920  872    1,821
Subj [Pang and Lee, 2004]     2   7,199  907    1,894
TREC [Li and Roth, 2002]      6   4,952  500    500
CR [Hu and Liu, 2004]         2   2,718  340    717
MPQA [Wiebe et al., 2005]     2   7,636  955    2,015
Table 1: Seven datasets used for our experiments. C: number of classes; TRN/DEV/TST: number of instances in the training/development/evaluation sets.
5.2 Word Embeddings
For sentiment analysis, raw text from the Amazon Review dataset is used to train word embeddings, resulting in 2.67M word vectors. For the other tasks, combined text from Wikipedia and the New York Times Annotated Corpus is used, resulting in 1.96M word vectors.
9. Embedding Distillation Model
• Train a teacher using (X, Y)
• Distill to a student with a projection to smaller embeddings using (X, Teacher.Logit)
• Discard the large embedding at model deployment
[Figure: embedding distillation. The teacher model (large embedding, network, logit, softmax) produces logits that train the student, whose projection layer maps the large embedding to a small embedding during training; at model deployment, only the small-embedding student (small embedding, network, logit, softmax) remains.]
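A sketch of one distillation step under this scheme, assuming a trained `teacher` and a `student` whose first layer is the projection to small embeddings (hypothetical names; logit matching is shown, but any of the three objectives above fits):

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, tokens, optimizer):
    with torch.no_grad():                      # the teacher is frozen
        teacher_z = teacher(tokens)            # logits from the large-embedding model
    student_z = student(tokens)                # logits from the small-embedding model
    loss = F.mse_loss(student_z, teacher_z)    # logit matching against the teacher
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```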
10. Embedding Distillation Results
• Proposed models outperform the encoding baseline (ENC)
• Pre-training (PT) and a two-layered projection (2L) boost performance
• Differences among logit matching (LM), noisy logit matching (NLM), and softmax tau matching (STM) are marginal
11. Distillation Ensemble Model
[Figure: distillation ensemble. Multiple teacher models produce logits that are combined into a representing logit (R.LOGIT), which trains the student with a projection to small embeddings; at model deployment, only the small-embedding student remains.]
• Train multiple (for example, 10) teachers
• Compute a representing logit for each input in dataset X (see the routing sketch below)
• Distill to a student with a projection to smaller embeddings using (X, Representing.Logit)
• Discard the large embedding at model deployment
12. Representing Logit
[Figure: class-probability bar charts (C1-C3) for Teacher1, Teacher2, and Teacher3, with the resulting RAE and RDE representing logits.]
• Routing by Agreement Ensemble (RAE): put more weight on majority opinions
• Routing by Disagreement Ensemble (RDE): put more weight on minority opinions
13. Distillation Ensemble Results
• Teachers: 5 CNN-based and 5 LSTM-based models
• RDE significantly outperforms the teachers when the dataset is sufficiently large
• With enough data samples, the student has more chances to explore the different insights offered by minority opinions
[Figure: Teacher vs. RAE vs. RDE accuracy on MR, SST-1, SST-2, Subj, and MPQA.]
Representing-logit routing (reconstructed excerpt; z_1, ..., z_T are the T teachers' logit vectors, w holds the routing weights, and k = 1 for RAE, k = -1 for RDE):

while n iterations do
    c <- softmax(w)
    z_rep <- sum_{t=1..T} c_t * z_t
    s <- squash(z_rep)
    if not last iteration then
        for t in {1, ..., T} do
            w_t <- w_t + k * (z_t · s)
return z_rep
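A runnable PyTorch version of this routine (a sketch under the stated reading of the excerpt; the function names, the zero initialization of w, and the iteration count are assumptions):

```python
import torch
import torch.nn.functional as F

def squash(v, eps=1e-9):
    # Capsule-style squash: keeps the direction, maps the norm into [0, 1).
    n2 = (v * v).sum(-1, keepdim=True)
    return (n2 / (1.0 + n2)) * v / torch.sqrt(n2 + eps)

def representing_logit(Z, k=1, n_iters=3):
    # Z: (T, C) logits from T teachers; k = +1 for RAE, k = -1 for RDE.
    w = torch.zeros(Z.size(0))                   # routing weights, assumed init
    for i in range(n_iters):
        c = F.softmax(w, dim=0)                  # mixing weights over teachers
        z_rep = (c.unsqueeze(1) * Z).sum(dim=0)  # weighted sum of teacher logits
        if i < n_iters - 1:
            s = squash(z_rep)
            w = w + k * (Z @ s)                  # reward (dis)agreement with s
    return z_rep
```

With k = 1 (RAE), teachers whose logits align with the consensus gain routing weight; with k = -1 (RDE), weight shifts toward dissenting teachers, the variant that wins on the larger datasets.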
[Figure: Teacher vs. RAE vs. RDE accuracy on TREC and CR.]
14. Discussion
• Outperforms the previous distillation model
• Gives accuracy comparable to the teacher models with 8× and 80× smaller embeddings
• Our distillation ensemble approach consistently shows more robust results when the training data is sufficiently large