Human Interface Laboratory
Speech to text adaptation:
Towards an efficient cross-modal distillation
2020. 10. 26, @Interspeech
Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim
Contents
• Motivation
• Task and Dataset
• Related Work
• Method
• Result and Discussion
• Conclusion
1
Motivation
• Text and speech : the two main media of communication
• But, text resources >> speech resources
 Why?
• Difficult to control the generation
and storage of the recordings
2
“THIS IS A SPEECH”
Difference in search result with ‘English’ in ELRA catalog
Motivation
• Pretrained language models
 Mainly developed for text-based systems
• ELMo, BERT, GPTs …
 Based on huge amounts of raw corpora
• Trained with simple but non-task-specific objectives
• Pretrained speech models?
 Recently suggested
• SpeechBERT, Speech XLNet …
 Why not prevalent?
• Difficulties in problem setting
– What is the correspondence of the tokens?
• Requires much higher resources than text data
3
Motivation
• How to leverage pretrained LMs (or the inference thereof) in
speech processing?
 Direct use?
• Only if the ASR output is accurate
 Training LMs with erroneous speech transcriptions?
• Okay, but cannot cover all the possible cases, and requires scripts for various
scenarios
 Distillation?
4
(Hinton et al., 2015)
Task and Dataset
• Task: Spoken language understanding
 Literally – Understanding spoken language?
 In literature – Intent identification and slot filling
 Our hypothesis:
• In either case, abstracted speech data will meet the abstracted representation
of text, in semantic pathways
5
Lugosch et al. (2019)
Hemphill et al. (1990)
Allen (1980)
Task and Dataset
• Freely available benchmark!
 Fluent Speech Commands
• 30,043 single-channel 16 kHz audio files
• Each audio labeled with three slots: action / object / location
• 248 different phrases spoken by 97 speakers (77/10/10 train/valid/test)
• Multi-label classification problem
 Why Fluent Speech Commands? (suggested in Lugosch et al., 2019)
• Google Speech Commands:
– Only short keywords, thus not a full SLU task
• ATIS
– Not publicly available
• Grabo, Domotica, Patcor
– Free, but only a small number of speakers and phrases
• Snips audio
– Variety of phrases, but fewer audio samples
6
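The three-slot labeling above can be sketched in a few lines: each utterance maps to an (action, object, location) triple, which can be flattened into one joint class index for multi-label classification. The slot inventories and the phrase below are illustrative stand-ins, not the actual FSC label set.

```python
# Sketch of FSC-style three-slot intent labeling (values are illustrative).
SLOTS = {
    "action": ["activate", "deactivate", "increase", "decrease"],
    "object": ["lights", "music", "heat", "none"],
    "location": ["kitchen", "bedroom", "none"],
}

def triple_to_index(action, object_, location):
    """Flatten an (action, object, location) triple into one class id."""
    a = SLOTS["action"].index(action)
    o = SLOTS["object"].index(object_)
    l = SLOTS["location"].index(location)
    return (a * len(SLOTS["object"]) + o) * len(SLOTS["location"]) + l

# e.g., "Turn on the lights in the kitchen" -> ("activate", "lights", "kitchen")
idx = triple_to_index("activate", "lights", "kitchen")
```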
Related Work
• ASR-NLU pipelines
 Conventional approaches
 Best if an accurate ASR is guaranteed
 Easier to interpret the issue and enhance partial modules
• End-to-end SLU
 Less prone to ASR errors
 Non-textual information might be preserved as well
• Pretrained LMs
 Takes advantage of massive textual knowledge
 High performance, freely available modules
• Knowledge distillation
 Adaptive to various training schemes
 Cross-modal application is possible
7
Related Work
• End-to-end SLU
 Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken
Language Understanding." INTERSPEECH 2019.
9
Related Work
• End-to-end SLU
 Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End
Spoken Language Understanding." ICASSP 2020.
10
Related Work
11
• Pretrained LMs
 Transformer architectures
Related Work
• End-to-end speech processing + PLM
 Chuang, Yung-Sung, et al. "SpeechBERT: Cross-Modal Pre-Trained Language
Model for End-to-End Spoken Question Answering." INTERSPEECH 2020.
12
Related Work
• End-to-end speech processing + KD
 Liu, Yuchen, et al. "End-to-End
Speech Translation with Knowledge
Distillation." INTERSPEECH 2019.
13
Method
• End-to-end SLU+ PLM + Cross-modal KD
14
Method
• End-to-end SLU
 Backbone: Lugosch et al. (2019)
• Phoneme module (SincNet layer)
• Word module
– BiGRU-based, with dropout/pooling
• Intent module
– Sequential prediction of the three slots
– Also implemented with BiGRU
15
(Ravanelli and Bengio, 2018)
From previous ver. of Wang et al. (2020)
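The SincNet layer in the phoneme module replaces ordinary convolution kernels with parametrized band-pass filters, so only two cut-off frequencies are learned per filter (Ravanelli and Bengio, 2018). A minimal sketch of the filter shape; the cut-off values and filter length below are illustrative, not the paper's configuration:

```python
import math

def sinc(x):
    """sin(x)/x with the limit value at 0."""
    return 1.0 if x == 0 else math.sin(x) / x

def sincnet_filter(f1, f2, length):
    """Band-pass filter g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n).

    f1, f2 are the learnable low/high cut-offs (normalized frequencies,
    0 < f1 < f2 < 0.5); only these two scalars are trained per filter.
    """
    half = length // 2
    return [2 * f2 * sinc(2 * math.pi * f2 * n)
            - 2 * f1 * sinc(2 * math.pi * f1 * n)
            for n in range(-half, half + 1)]

taps = sincnet_filter(0.05, 0.15, length=101)
# At n = 0 the response is 2*(f2 - f1), i.e. the filter bandwidth.
```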
Method
• End-to-end SLU
16
Method
• PLM
 Fine-tuning the pretrained model
• BERT-Base (Devlin et al., 2018)
– Bidirectional encoder representations from Transformers (BERT)
• Hugging Face PyTorch wrapper
17
Method
• PLM
 Fine-tuning with FSC ground truth scripts!
18
Method
• Cross-modal KD
 Distillation as teacher-student learning
• Loss1 = f(answer, inference_s)
• Loss2 = g(inference_s, inference_t)
• Different input, same task?
– e.g., speech translation
19
Total Loss = Loss1 + Loss2
Distilled knowledge
(Liu et al., 2019)
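A minimal numeric sketch of the two-term objective: Loss1 as cross-entropy of the student against the ground-truth answer, Loss2 as a logit-matching term against the teacher (MAE here; MSE is analogous). Function names are illustrative, not from the paper's code:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(student_logits, answer):
    """Loss1 = f(answer, inference_s): CE of the student vs. ground truth."""
    return -math.log(softmax(student_logits)[answer])

def mae(student_logits, teacher_logits):
    """Loss2 = g(inference_s, inference_t): mean absolute error on logits."""
    return sum(abs(s - t) for s, t in zip(student_logits, teacher_logits)) \
        / len(student_logits)

def total_loss(student_logits, teacher_logits, answer):
    """Total Loss = Loss1 + Loss2."""
    return cross_entropy(student_logits, answer) \
        + mae(student_logits, teacher_logits)
```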
Method
• Cross-modal KD
 What determines the loss?
• Who teaches
• How the loss is calculated
– MAE, MSE
• How much the guidance
influences (scheduling)
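The scheduling factor (how much the teacher's guidance influences training over time) can be sketched as a weight lam(t) on the distillation term. The triangular and exponential shapes below are illustrative forms of warm-up-and-decay, not the paper's exact schedules:

```python
import math

def triangular(t, total):
    """Tri.: linear warm-up to the midpoint, then linear decay to zero."""
    half = total / 2
    return t / half if t <= half else max(0.0, 2 - t / half)

def exponential(t, total, peak=0.25):
    """Exp.: fast linear warm-up, then exponential decay after the peak.

    `peak` (fraction of training where teacher influence is strongest)
    is an illustrative parameter, not a value from the paper.
    """
    p = peak * total
    return t / p if t <= p else math.exp(-(t - p) / p)

# Weight the distillation term: total_loss = loss1 + lam * loss2
lam = exponential(t=10, total=100)
```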
20
Method
• Cross-modal KD
21
Result and Discussion
• Teacher performance
 GT-based, high-performance
 Not encouraging for ASR results
• Why the ASR-NLU baseline is
borrowed (Wang et al., 2019)
• Comparison with the baseline
 Distillation succeeds given
flexible teacher influence
 Reaches high performance
with only a simple distillation
 Professor model does not
necessarily dominate, but the
Hybrid model is effective with
MAE as the loss function
22
Result and Discussion
• Comparison with the baseline (cont’d)
 Better teacher performance does not guarantee high-quality distillation
• In correspondence with the recent findings in image processing and ASR
distillation
– Tutor might be better than professor?
 MAE overall better than MSE
• Probable correspondence with SpeechBERT
• Why?
– Different nature of input
– MSE might amplify the gap
and lead to collapse
» Partly observed in
data shortage scenarios
24
(Chuang et al., 2019)
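The MAE-vs-MSE point can be illustrated numerically: when student and teacher logits differ greatly (as cross-modal pairs often do early in training), the squared error grows quadratically with the gap while the absolute error grows only linearly. The logit values below are illustrative:

```python
def mae(a, b):
    """Mean absolute error between two logit vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def mse(a, b):
    """Mean squared error between two logit vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

student = [0.1, -0.2, 0.3]        # early-stage speech-side logits (illustrative)
close_teacher = [0.6, 0.3, -0.2]  # per-dimension gap of 0.5
far_teacher = [5.1, 4.8, -4.7]    # per-dimension gap of 5.0

# A 10x larger gap makes MAE 10x larger but MSE 100x larger,
# producing the large gradients that can lead to collapse.
```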
Result and Discussion
• Data shortage scenario
 MSE collapse is more explicit
 Scheduling also matters
• Exp. better than Tri., which
shows that
– Warm up and decay is powerful
– Teacher influence does not
necessarily have to last long
• However, less mechanical
approach is still anticipated
– e.g., Entropy-based?
 Overall results suggest that
distillation from the fine-tuned LM
helps the student learn information regarding uncertainty that is difficult
to obtain from a speech-only end-to-end system
25
Result and Discussion
• Discussion
 Is this cross-modal or multi-modal?
• Probably; though text (either ASR output or GT) comes from the speech, the
format are different by Waveform and Unicode
 Is this knowledge sharing?
• Also yes; though we exploit logit-level information, the different aspects of
uncertainty derived from each modality may affect the distillation process,
making it knowledge sharing rather than mere optimization
 To engage in paralinguistic properties?
• Further study; frame-level acoustic information can be residually connected to
compensate for the loss, though this might not leverage much from text-based LMs
26
Conclusion
• Cross-modal distillation works in SLU, even if the teacher's input
modality is explicitly different from that of the student
• Simple distillation from a fine-tuned LM helps the student learn
uncertainty information that is not obtainable from speech-only training
• MAE loss is effective in speech-to-text adaptation, possibly with
warm-up-and-decay scheduling of the KD loss
27
Reference (in order of appearance)
• Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint
arXiv:1503.02531 (2015).
• Allen, James F., and C. Raymond Perrault. "Analyzing intention in utterances." Artificial intelligence 15.3 (1980): 143-178.
• Hemphill, Charles T., John J. Godfrey, and George R. Doddington. "The ATIS spoken language systems pilot corpus."
Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. 1990.
• Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." arXiv preprint
arXiv:1904.03670 (2019).
• Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP
2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
• Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018).
• Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you
need. In Advances in neural information processing systems (pp. 5998-6008).
• Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint
arXiv:1810.04805 (2018).
• Chuang, Yung-Sung, Chi-Liang Liu, and Hung-Yi Lee. "SpeechBERT: Cross-modal pre-trained language model for end-to-
end spoken question answering." arXiv preprint arXiv:1910.11559 (2019).
• Liu, Yuchen, et al. "End-to-end speech translation with knowledge distillation." arXiv preprint arXiv:1904.08075 (2019).
• Ravanelli, Mirco, and Yoshua Bengio. "Speaker recognition from raw waveform with SincNet." 2018 IEEE Spoken Language
Technology Workshop (SLT). IEEE, 2018.
• Wolf, Thomas, et al. "HuggingFace's Transformers: State-of-the-art Natural Language Processing." arXiv preprint
arXiv:1910.03771 (2019).
28
Thank you!

Minimum and Maximum Modes of microprocessor 8086Minimum and Maximum Modes of microprocessor 8086
Minimum and Maximum Modes of microprocessor 8086
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced LoadsFEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
FEA Based Level 3 Assessment of Deformed Tanks with Fluid Induced Loads
 
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
+97470301568>> buy weed in qatar,buy thc oil qatar,buy weed and vape oil in d...
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Bridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptxBridge Jacking Design Sample Calculation.pptx
Bridge Jacking Design Sample Calculation.pptx
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 

2020 INTERSPEECH

  • 1. Human Interface Laboratory Speech to text adaptation: Towards an efficient cross-modal distillation 2020. 10. 26, @Interspeech Won Ik Cho, Donghyun Kwak, Ji Won Yoon, Nam Soo Kim
  • 2. Contents • Motivation • Task and Dataset • Related Work • Method • Result and Discussion • Conclusion 1
  • 3. Motivation • Text and speech: two main media of communication • But, text resources >> speech resources  Why? • Difficult to control the generation and storage of recordings 2 “THIS IS A SPEECH” Difference in search results for ‘English’ in the ELRA catalog
  • 4. Motivation • Pretrained language models  Mainly developed for text-based systems • ELMo, BERT, GPTs …  Based on huge amounts of raw corpora • Trained with simple but non-task-specific objectives • Pretrained speech models?  Recently suggested • SpeechBERT, Speech XLNet …  Why not prevalent? • Difficulties in problem setting – What is the correspondence of the tokens? • Requires much higher resources than text data 3
  • 5. Motivation • How to leverage pretrained LMs (or the inference thereof) in speech processing?  Direct use? • Only if the ASR output is accurate  Training LMs with erroneous speech transcriptions? • Okay, but cannot cover all the possible cases, and requires scripts for various scenarios  Distillation? 4 (Hinton et al., 2015)
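The distillation idea referenced above (Hinton et al., 2015) trains a student to match a teacher's temperature-softened output distribution. A minimal pure-Python sketch of that soft-target loss follows; the temperature value 2.0 and the logit values are illustrative, not from the slides.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, softened by a temperature T."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student
    distributions: the soft-target term of Hinton et al. (2015)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_teacher, p_student))
```

The loss is smallest when the student's distribution matches the teacher's, so minimizing it pulls the student toward the teacher's (here, the fine-tuned LM's) uncertainty profile.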
  • 6. Task and Dataset • Task: Spoken language understanding  Literally – understanding spoken language?  In the literature – intent identification and slot filling  Our hypothesis: • In either case, abstracted speech data will meet the abstracted representation of text in semantic pathways 5 Lugosch et al. (2019) Hemphill et al. (1990) Allen (1980)
  • 7. Task and Dataset • Freely available benchmark!  Fluent Speech Commands • 16 kHz single-channel, 30,043 audio files • Each audio labeled with three slots: action / object / location • 248 different phrases spoken by 97 speakers (77/10/10) • Multi-label classification problem  Why Fluent Speech Commands? (suggested in Lugosch et al., 2019) • Google Speech Commands: – Only short keywords, thus not an SLU task • ATIS – Not publicly available • Grabo, Domotica, Patcor – Free, but only a small number of speakers and phrases • Snips audio – Variety of phrases, but less audio 6
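Since each utterance carries three slots, the standard metric on this benchmark counts a prediction as correct only when all three slots match. A small sketch of that exact-match scoring; the slot values shown are illustrative examples, not taken from the dataset.

```python
def intent_accuracy(predictions, references):
    """FSC-style intent accuracy: a prediction is correct only if the
    whole (action, object, location) triplet matches the reference."""
    correct = sum(1 for p, r in zip(predictions, references) if p == r)
    return correct / len(references)

# Hypothetical example: the second prediction misses the `object` slot.
preds = [("activate", "lights", "kitchen"), ("deactivate", "music", "none")]
refs  = [("activate", "lights", "kitchen"), ("deactivate", "lamp", "none")]
intent_accuracy(preds, refs)  # 0.5
```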
  • 8. Related Work • ASR-NLU pipelines  Conventional approaches  Best if an accurate ASR is guaranteed  Easier to interpret issues and enhance individual modules • End-to-end SLU  Less prone to ASR errors  Non-textual information might be preserved as well • Pretrained LMs  Take advantage of massive textual knowledge  High performance, freely available modules • Knowledge distillation  Adaptive to various training schemes  Cross-modal application is feasible 7
  • 10. Related Work • End-to-end SLU  Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." INTERSPEECH 2019. 9
  • 11. Related Work • End-to-end SLU  Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding," ICASSP 2020. 10
  • 12. Related Work 11 • Pretrained LMs  Transformer architectures
  • 13. Related Work • End-to-end speech processing + PLM  Chuang, Yung-Sung, et al. "SpeechBERT: Cross-Modal Pre-Trained Language Model for End-to-End Spoken Question Answering." INTERSPEECH 2020. 12
  • 14. Related Work • End-to-end speech processing + KD  Liu, Yuchen, et al. "End-to-End Speech Translation with Knowledge Distillation." INTERSPEECH 2019. 13
  • 15. Method • End-to-end SLU + PLM + Cross-modal KD 14
  • 16. Method • End-to-end SLU  Backbone: Lugosch et al. (2019) • Phoneme module (SincNet layer) • Word module – BiGRU-based, with dropout/pooling • Intent module – Sequential prediction of the three slots – Also implemented with BiGRU 15 (Ravanelli and Bengio, 2018) From a previous version of Wang et al. (2020)
  • 18. Method • PLM  Fine-tuning the pretrained model • BERT-Base (Devlin et al., 2018) – Bidirectional encoder representations from Transformers (BERT) • Hugging Face PyTorch wrapper 17
  • 19. Method • PLM  Fine-tuning with FSC ground-truth transcripts! 18
  • 20. Method • Cross-modal KD  Distillation as teacher-student learning • Loss1 = f(answer, inference_s) • Loss2 = g(inference_s, inference_t) • Different input, same task? – e.g., speech translation 19 Total Loss = Loss1 + Loss2 Distilled knowledge (Liu et al., 2019)
  • 21. Method • Cross-modal KD  What determines the loss? • WHO TEACHES • HOW THE LOSS IS CALCULATED – MAE, MSE • HOW MUCH THE GUIDANCE INFLUENCES (SCHEDULING) 20
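The loss design points above (distance choice, teacher weight) can be sketched as a weighted sum of the task loss (Loss1) and the teacher-student distance (Loss2); the weight 0.5 is illustrative, and the slides do not fix a particular value.

```python
def mae(student, teacher):
    """Mean absolute error between student and teacher logit vectors."""
    return sum(abs(s - t) for s, t in zip(student, teacher)) / len(student)

def mse(student, teacher):
    """Mean squared error between student and teacher logit vectors."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)

def total_loss(task_loss, student_logits, teacher_logits,
               kd_weight=0.5, distance=mae):
    # Loss1 (task loss vs. the answer) + weighted Loss2 (student vs. teacher);
    # `distance` selects MAE or MSE, `kd_weight` sets the teacher influence.
    return task_loss + kd_weight * distance(student_logits, teacher_logits)
```

Note how MSE squares the per-logit gap (4.0 vs. 2.0 for a gap of 2), which is one plausible reading of the slides' observation that MSE can amplify the teacher-student mismatch.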
  • 23. Result and Discussion • Teacher performance  GT-based, high performance  Not encouraging for ASR results • Why the ASR-NLU baseline is borrowed (Wang et al., 2019) • Comparison with the baseline  Distillation is successful given flexible teacher influence  Reaches high performance with only a simple distillation  The Professor model does not necessarily dominate, but the Hybrid model is effective with MAE as the loss function 22
  • 25. Result and Discussion • Comparison with the baseline (cont’d)  Better teacher performance does not guarantee high-quality distillation • In correspondence with recent findings in image processing and ASR distillation – A tutor might be better than a professor?  MAE overall better than MSE • Probable correspondence with SpeechBERT • Why? – Different nature of input – MSE might amplify the gap and lead to collapse » Partly observed in data shortage scenarios 24 (Chuang et al., 2019)
  • 26. Result and Discussion • Data shortage scenario  MSE collapse is more explicit  Scheduling also matters • Exp. being better than Tri. and err shows that – Warm-up and decay is powerful – Teacher influence does not necessarily have to last long • However, a less mechanical approach is still anticipated – e.g., entropy-based?  The overall result suggests that distillation from a fine-tuned LM helps the student learn information regarding uncertainty that is difficult to obtain from a speech-only end-to-end system 25
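The "warm-up and decay" scheduling of the KD weight mentioned above could be sketched as follows; the slides do not give the exact functional form, so the linear warm-up, the exponential decay, and the constants (10% warm-up, decay rate 5.0) are all illustrative assumptions.

```python
import math

def kd_weight(step, total_steps, warmup_frac=0.1, peak=1.0):
    """Warm-up-and-decay schedule for the teacher influence: ramp up
    linearly during warm-up, then decay exponentially so the teacher's
    guidance does not last until the end of training."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * step / warmup_steps  # linear warm-up
    # exponential decay toward zero after the warm-up phase
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * math.exp(-5.0 * progress)
```

At each step, the returned weight would multiply the KD term of the total loss, so the teacher's influence peaks early and fades out.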
  • 27. Result and Discussion • Discussion  Is this cross-modal or multi-modal? • Probably; though the text (either ASR output or GT) comes from the speech, the formats differ: waveform vs. Unicode  Is this knowledge sharing? • Also yes; though we exploit logit-level information, the different aspects of uncertainty derived from each modality might affect the distillation process, making it knowledge sharing rather than mere optimization  To engage in paralinguistic properties? • Further study; frame-level acoustic information can be residually connected to compensate for the loss, though this might not leverage much from the text-based LMs 26
  • 28. Conclusion • Cross-modal distillation works in SLU, even if the teacher input modality is explicitly different from that of the student • Simple distillation from a fine-tuned LM helps the student learn uncertainty information that is not obtainable from speech-only training • MAE loss is effective in speech-to-text adaptation, possibly with warm-up and decay scheduling of the KD loss 27
  • 29. Reference (in order of appearance) • Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015). • Allen, James F., and C. Raymond Perrault. "Analyzing intention in utterances." Artificial Intelligence 15.3 (1980): 143-178. • Hemphill, Charles T., John J. Godfrey, and George R. Doddington. "The ATIS spoken language systems pilot corpus." Speech and Natural Language: Proceedings of a Workshop Held at Hidden Valley, Pennsylvania, June 24-27, 1990. • Lugosch, Loren, et al. "Speech Model Pre-training for End-to-End Spoken Language Understanding." arXiv preprint arXiv:1904.03670 (2019). • Wang, Pengwei, et al. "Large-Scale Unsupervised Pre-Training for End-to-End Spoken Language Understanding." ICASSP 2020, IEEE, 2020. • Peters, Matthew E., et al. "Deep contextualized word representations." arXiv preprint arXiv:1802.05365 (2018). • Vaswani, Ashish, et al. "Attention is all you need." Advances in Neural Information Processing Systems (2017): 5998-6008. • Devlin, Jacob, et al. "BERT: Pre-training of deep bidirectional transformers for language understanding." arXiv preprint arXiv:1810.04805 (2018). • Chuang, Yung-Sung, Chi-Liang Liu, and Hung-Yi Lee. "SpeechBERT: Cross-modal pre-trained language model for end-to-end spoken question answering." arXiv preprint arXiv:1910.11559 (2019). • Liu, Yuchen, et al. "End-to-end speech translation with knowledge distillation." arXiv preprint arXiv:1904.08075 (2019). • Ravanelli, Mirco, and Yoshua Bengio. "Speaker recognition from raw waveform with SincNet." 2018 IEEE Spoken Language Technology Workshop (SLT), IEEE, 2018. • Wolf, Thomas, et al. "HuggingFace's Transformers: State-of-the-art Natural Language Processing." arXiv preprint arXiv:1910.03771 (2019). 28
