The Pupil Has Become the Master:
Teacher-Student Model-Based
Word Embedding Distillation with
Ensemble Learning
Bonggun Shin, Hao Yang and Jinho D. Choi

IJCAI 2019
Typical NLP Models
• Input: a sequence of tokens
• Each token is transformed into a vector
• Once vectorized, any neural network can be attached
• Output: a class or a sequence (depends on the task)
[Figure: Input → Embeddings → CNN/LSTM → Output]
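As a concrete illustration of this pipeline (not the authors' code), below is a minimal PyTorch-style sketch of a CNN text classifier over a pre-trained embedding matrix; the layer sizes and kernel widths are hypothetical.

```python
import torch
import torch.nn as nn

class CNNTextClassifier(nn.Module):
    """Embeddings -> CNN -> Output, mirroring the slide's pipeline (illustrative only)."""
    def __init__(self, pretrained_emb, num_classes, kernel_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        # pretrained_emb: a (V x D) float tensor of word vectors (e.g., from Word2Vec)
        self.embedding = nn.Embedding.from_pretrained(pretrained_emb, freeze=False)
        emb_dim = pretrained_emb.size(1)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, num_filters, k) for k in kernel_sizes])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, token_ids):                # (batch, seq_len) word indices
        x = self.embedding(token_ids)            # (batch, seq_len, D)
        x = x.transpose(1, 2)                    # (batch, D, seq_len) for Conv1d
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.fc(torch.cat(feats, dim=1))  # logits over the output classes
```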
Why is Compressing an Embedding Beneficial?
• The majority of model space is occupied by word embeddings
• NLP model compression -> embedding compression
• Neural model compression methods:
  • Weight pruning
  • Weight quantization
  • Lossless compression
  • Distillation
[Figure: Input → Embeddings → CNN/LSTM → Output, with the embedding layer dominating model size]
Embedding Compression
• V: the vocabulary size
• D: the size of an original embedding vector
• D’: the size of a compressed embedding vector
[Figure: a V × D embedding matrix compressed into a V × D’ matrix]
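To make the savings concrete, a quick back-of-the-envelope calculation: V below matches the 2.67M Amazon-review vocabulary mentioned later in the slides, while D and D’ are assumed values chosen to illustrate the 8-times ratio, not figures taken from the paper.

```python
# Rough parameter counts for an embedding matrix (hypothetical D and D').
V = 2_670_000         # vocabulary size (Amazon-review embeddings, per the slides)
D, D_small = 400, 50  # original vs. compressed vector size (assumed values)

original   = V * D        # parameters in the original embedding table
compressed = V * D_small  # parameters after compression
print(f"compression ratio: {original / compressed:.0f}x")  # -> 8x
```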
Embedding Encoding
• Proposed by [Mou et al., 2016] - 8-times compression, but with a performance loss
• Not a teacher-student model - trains a single model with an encoding (projection) layer
• Encoding (projection) layer: pre-trained large embedding -> smaller embedding
• Discards the large embedding at model deployment
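A minimal sketch of such an encoding (projection) layer, assuming a single linear projection with a tanh non-linearity from the pre-trained D-dimensional vectors down to D’ dimensions; this is illustrative, not the exact architecture (the paper also considers two-layered projections).

```python
import torch
import torch.nn as nn

class ProjectedEmbedding(nn.Module):
    """Large pre-trained embedding followed by a trainable projection to a small embedding."""
    def __init__(self, pretrained_emb, small_dim):
        super().__init__()
        self.large = nn.Embedding.from_pretrained(pretrained_emb, freeze=True)  # V x D
        self.proj = nn.Linear(pretrained_emb.size(1), small_dim)                # D -> D'

    def forward(self, token_ids):
        return torch.tanh(self.proj(self.large(token_ids)))  # (batch, seq_len, D')

    def export_small_embedding(self):
        # At deployment, precompute the V x D' table once; the large embedding
        # and the projection weights can then be discarded.
        with torch.no_grad():
            return torch.tanh(self.proj(self.large.weight))   # V x D'
```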
[Figure: the encoding model feeds word indices through the large embedding and a projection into a small embedding before the network; at model deployment, the large embedding and projection are dropped and only the small embedding, network, and softmax remain]
We Propose
• Embedding Distillation
  • Single teacher
  • Better performance with an 8-times smaller model size
• Distillation Ensemble
  • Ten teachers
  • Better performance with an 80-times smaller model size
Background
Teacher-Student Model
• Teacher trained on (Dataset.X, Dataset.Y); student trained on (Dataset.X, Teacher.Output)
• Logit matching [Ba and Caruana, 2014]
• Noisy logit matching [Sau and Balasubramanian, 2016]
• Softmax tau matching [Hinton et al., 2014]
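A minimal sketch of the three matching objectives listed above, in PyTorch; the noise scale and temperature are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def logit_matching_loss(student_logits, teacher_logits):
    # [Ba and Caruana, 2014]: regress the student's logits onto the teacher's logits.
    return F.mse_loss(student_logits, teacher_logits)

def noisy_logit_matching_loss(student_logits, teacher_logits, noise_std=0.1):
    # [Sau and Balasubramanian, 2016]: perturb the teacher's logits with Gaussian noise.
    noisy = teacher_logits + noise_std * torch.randn_like(teacher_logits)
    return F.mse_loss(student_logits, noisy)

def softmax_tau_matching_loss(student_logits, teacher_logits, tau=2.0):
    # [Hinton et al.]: match temperature-softened distributions via KL divergence.
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (tau ** 2)
```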
[Figure: teacher-student diagrams for logit matching (student also supervised by GOLD labels), noisy logit matching (NOISE added to the teacher's logits), and softmax tau matching (softened softmax outputs)]
Dataset
• Seven document classification tasks
• Pre-trained word embeddings (trained with Word2Vec [Mikolov et al., 2013])
  • Sentiment analysis (SST-*, MR, and CR) - Amazon review data (2.6M)
  • Others - Wikipedia + New York Times
All models are evaluated on seven document classification datasets in Table 1. MR, SST-*, and CR are targeted at the task of sentiment analysis while Subj, TREC, and MPQA are targeted at the classifications of subjectivity, question types, and opinion polarity, respectively. About 10% of the training sets are split into development sets for SST-* and TREC, and about 10/20% of the provided resources are divided into development/evaluation sets for the other datasets, respectively.

Dataset                        C    TRN    DEV    TST
MR    [Pang and Lee, 2005]     2   7,684    990  1,988
SST-1 [Socher et al., 2013]    5   8,544  1,101  2,210
SST-2 [Socher et al., 2013]    2   6,920    872  1,821
Subj  [Pang and Lee, 2004]     2   7,199    907  1,894
TREC  [Li and Roth, 2002]      6   4,952    500    500
CR    [Hu and Liu, 2004]       2   2,718    340    717
MPQA  [Wiebe et al., 2005]     2   7,636    955  2,015

Table 1: Seven datasets used for our experiments. C: number of classes; TRN/DEV/TST: number of instances in the training/development/evaluation sets.
5.2 Word Embeddings
For sentiment analysis, raw text from the Amazon Review dataset is used to train word embeddings, resulting in 2.67M word vectors. For the other tasks, combined text from Wikipedia and the New York Times Annotated corpus is used, resulting in 1.96M word vectors.
Embedding Distillation Model
• Train a teacher using (X, Y)
• Distill to a student with a projection to smaller embeddings using (X, Teacher.Logit)
• Discard the large embedding at model deployment
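A rough training-loop sketch of this procedure. It is illustrative only: it assumes logit matching as the distillation loss (any of the matching losses from the Background slide could be substituted), and the reference to the earlier ProjectedEmbedding sketch is a hypothetical wiring, not the authors' implementation.

```python
import torch

def distill_student(teacher, student, loader, epochs=5, lr=1e-3):
    """Train the student on (X, Teacher.Logit); the teacher stays frozen."""
    teacher.eval()
    optimizer = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for token_ids, _gold in loader:          # gold labels are not used here
            with torch.no_grad():
                target_logits = teacher(token_ids)
            student_logits = student(token_ids)
            loss = torch.nn.functional.mse_loss(student_logits, target_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return student

# After training, only the small (projected) embedding table is kept for deployment,
# e.g., via something like student.embedding.export_small_embedding() from the earlier
# ProjectedEmbedding sketch; the large embedding is discarded.
```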
[Figure: the teacher model (large embedding) produces logits that supervise a student whose word indices pass through the large embedding and a projection into a small embedding; at model deployment, the large embedding and projection are dropped and only the small embedding, network, and softmax remain]
Embedding Distillation Result
• Proposed models outperform ENC (the embedding-encoding baseline)
• Pre-training (PT) and a 2-layered projection (2L) boost the performance
• Differences among LM, NLM, and STM (logit, noisy logit, and softmax tau matching) are marginal
Distillation Ensemble Model
• Train multiple (for example, 10) teachers
• Calculate a representing logit for each input in dataset X
• Distill to a student with a projection to smaller embeddings using (X, Representing.Logit)
• Discard the large embedding at model deployment
[Figure: multiple teacher models produce logits that are combined into a representing logit ("R. LOGIT"), which supervises the student; at model deployment only the small embedding remains]
Representing Logit
[Figure: example class distributions (C1-C3) from Teacher1, Teacher2, and Teacher3, and the resulting RAE and RDE representing logits]
• Routing by Agreement Ensemble (RAE): put more weight on majority opinions
• Routing by Disagreement Ensemble (RDE): put more weight on minority opinions
Distillation Ensemble Result
• Teachers: 5 CNN-based and 5 LSTM-based models
• RDE significantly outperforms the teachers when the dataset is sufficiently large
• With enough data samples, there is more chance to explore the different insights offered by minority opinions
[Figure: test accuracy of Teacher vs. RAE vs. RDE on MR, SST-1, SST-2, Subj, MPQA, TREC, and CR]
Routing ensemble (excerpt of the paper's algorithm; T teachers with logits z_t, routing weights w, and k = +1 for RAE or -1 for RDE; the initialization lines, including the definition of x_t, are cut off in this excerpt):

    while n iterations do:
        c = softmax(w)
        z_rep = sum over t = 1..T of c_t · z_t
        s = squash(z_rep)
        if not the last iteration:
            for t in {1, ..., T}:
                w_t = w_t + k · (x_t · s)
    return z_rep
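A small NumPy sketch of this routing loop. Because the excerpt above is cut off, the squash function and the assumption that x_t is the squashed teacher logit follow the capsule-network routing this algorithm resembles; both are assumptions, and the number of iterations is arbitrary.

```python
import numpy as np

def squash(v, eps=1e-8):
    # Capsule-style squashing: keeps the direction, maps the norm into [0, 1).
    norm2 = np.sum(v ** 2)
    return (norm2 / (1.0 + norm2)) * v / (np.sqrt(norm2) + eps)

def representing_logit(teacher_logits, k=+1, n_iters=3):
    """Combine T teachers' logits into one representing logit.

    teacher_logits: array of shape (T, C); k = +1 -> RAE (favor majority),
    k = -1 -> RDE (favor minority). Sketch only, not the authors' code.
    """
    z = np.asarray(teacher_logits, dtype=float)   # (T, C)
    x = np.stack([squash(z_t) for z_t in z])      # assumed: x_t = squash(z_t)
    w = np.zeros(len(z))                          # routing weights
    for it in range(n_iters):
        c = np.exp(w) / np.exp(w).sum()           # c = softmax(w)
        z_rep = (c[:, None] * z).sum(axis=0)      # weighted sum of teacher logits
        s = squash(z_rep)
        if it < n_iters - 1:
            w = w + k * (x @ s)                   # raise/lower weight by agreement
    return z_rep

# Example: three teachers over three classes; RAE leans toward the majority view,
# RDE toward the dissenting one.
logits = [[2.0, 0.5, 0.1], [1.8, 0.6, 0.2], [0.2, 0.4, 2.1]]
print(representing_logit(logits, k=+1))  # RAE
print(representing_logit(logits, k=-1))  # RDE
```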
Discussion
• Outperforms the previous distillation model
• Gives comparable accuracy to the teacher models with 8x and 80x smaller embeddings
• Our distillation ensemble approach consistently shows more robust results when the training data is sufficiently large
