Transcription Errors in Context of Intent Detection and Slot Filling

Raphael Schumann
Institute for Computational Linguistics
Heidelberg University
rschuman@cl.uni-heidelberg.de
15.11.18

Outline
1 Introduction
2 Transcription Error and NLU (Schumann and Angkititrakul, 2018)
Model
Data
Baseline
Evaluation Metrics
Results
3 End-to-End SLU (Haghani et al., 2018)
End-to-End Architectures
Data
Evaluation Metrics
Results
4 Compare
Results
Conclusion
Challenges

1 Introduction

Spoken Language Understanding Pipeline
Figure: icons: [1]
ASR errors propagate to the NLU component
ASR as a black box: jointly train an in-domain language model and a robust NLU
ASR trained from scratch: End-to-End SLU model

2 Transcription Error and NLU (Schumann and Angkititrakul, 2018)
Model

High Level Architecture

Encoder
bidirectional RNN with LSTM cell
$h_t = [fh_t, bh_t]$ at each timestep $t \in \{1, \dots, T_x\}$
encodes input sequence $x$ into the vector $s^t_0$ [3]:
$s^t_0 = \tanh(W_s [fh_{T_x}, bh_1])$
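A minimal sketch of the initial-state computation above, in plain Python. The function name, the toy dimensions, and all numeric values are assumptions for illustration; a real model would learn W_s and take the two hidden states from the bidirectional LSTM.

```python
import math

def initial_decoder_state(forward_last, backward_first, W_s):
    """Compute tanh(W_s [fh_Tx, bh_1]) for one input sequence."""
    # Concatenate the last forward state and the first backward state.
    concat = forward_last + backward_first
    # One matrix-vector product followed by an element-wise tanh.
    return [math.tanh(sum(w * v for w, v in zip(row, concat))) for row in W_s]

# Hypothetical toy values: hidden size 2 per direction, state size 2.
fh_Tx = [0.5, -0.5]
bh_1 = [1.0, 0.0]
W_s = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]
s0 = initial_decoder_state(fh_Tx, bh_1, W_s)
```

Both halves of the concatenation have seen the whole sequence, which is presumably why the last forward state and the first backward state are chosen.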

Intent Decoder
text classification on encoded input sequence $x$
intent attention vector $c^i$, a weighted sum over all $h_t$
intent label $y^i$ predicted by a feed-forward network on $[c^i, s^t_0]$
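The intent decoder can be sketched as attention followed by a classifier. The slides do not state the attention scoring function, so the dot-product score below is an assumption, as are the single linear output layer, the toy weights, and the two example labels.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(hidden_states, query):
    """Weighted sum over all h_t; weights come from a dot-product score (assumed)."""
    scores = [sum(h_k * q_k for h_k, q_k in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, hidden_states)) for k in range(dim)]
    return context, weights

def predict_intent(context, s0, W_out, labels):
    """One linear layer plus softmax on the concatenation [c, s0]."""
    features = context + s0
    logits = [sum(w * f for w, f in zip(row, features)) for row in W_out]
    probs = softmax(logits)
    return labels[probs.index(max(probs))], probs

# Hypothetical toy values: three encoder states of size 2.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s0 = [0.0, 2.0]
c, alpha = attention_context(h, s0)
W_out = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
intent, probs = predict_intent(c, s0, W_out, ["flight", "airfare"])
```

The dot product requires the query and the hidden states to have the same size, a simplification that a learned scoring network would not need.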

Intent Decoder Detail

Word Decoder
language model
RNN with LSTM cell
initial state is set to $s^t_0$

Word Decoder
Input at each decoding timestep $i$:
predicted intent label $y^i$
attention vector $c^w_i$, a weighted sum over all $h_t$
previously emitted corrected word $y^w_{i-1}$

Word Decoder Detail

Word Decoder
learns distribution over possible ASR errors
output sequence is sampled during training
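To make the difference between sampling and greedy decoding concrete, here is a small sketch. The vocabulary, the probabilities, and the helper name `next_word` are hypothetical; in the model the distribution would come from the word decoder's softmax.

```python
import random

def next_word(probs, vocab, rng, sample=True):
    """Pick the next word by sampling from the distribution or by argmax."""
    if sample:
        return rng.choices(vocab, weights=probs, k=1)[0]
    return vocab[probs.index(max(probs))]

# Hypothetical decoder output for one timestep.
vocab = ["new", "no", "now"]
probs = [0.6, 0.3, 0.1]
rng = random.Random(0)

sampled = [next_word(probs, vocab, rng) for _ in range(1000)]
greedy = next_word(probs, vocab, rng, sample=False)
```

Greedy decoding always yields "new" here, while sampling also emits the less likely confusions, so downstream components see a variety of word sequences during training.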

Word Decoder
encode new word sequence into hidden states $h$ and $s^w_0$
encoders share weights

Slot Decoder
Input at each decoding timestep $i$:
predicted intent label $y^i$
attention vector $c^s_i$, a weighted sum over all $h_t$
previously emitted slot token $y^s_{i-1}$
corrected-word encoder hidden state $h_i$

Slot Decoder Detail

Full Model
language model conditioned on predicted intent
shared word embeddings across model
weights shared between both encoders
output of the LM is sampled [4] and fed to the second encoder

ATIS
Airline Travel Information Systems (ATIS) dataset [5]
18 different intent labels
128 different slot labels

ATIS Instance
Input:
words:  show me flights from boston to new york
Labels:
intent: flight
slots:  O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name
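The slot labels above follow the begin/inside/outside convention, one tag per word. A short sketch of how such a tag sequence can be grouped back into slot values; the function name is hypothetical and the tags are written with underscores (`city_name`), as in the ATIS label set.

```python
def extract_slots(words, tags):
    """Group IOB-tagged words into (slot_name, phrase) pairs."""
    slots = []
    current_name, current_words = None, []
    for word, tag in zip(words, tags):
        if tag.startswith("B-"):
            # A B- tag closes any open slot and starts a new one.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = tag[2:], [word]
        elif tag.startswith("I-") and current_name == tag[2:]:
            current_words.append(word)
        else:
            # "O", or an I- tag that does not continue the open slot.
            if current_name is not None:
                slots.append((current_name, " ".join(current_words)))
            current_name, current_words = None, []
    if current_name is not None:
        slots.append((current_name, " ".join(current_words)))
    return slots

words = "show me flights from boston to new york".split()
tags = ["O", "O", "O", "O", "B-fromloc.city_name", "O",
        "B-toloc.city_name", "I-toloc.city_name"]
slots = extract_slots(words, tags)
# slots == [("fromloc.city_name", "boston"), ("toloc.city_name", "new york")]
```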

Hypotheses Extended ATIS
create ASR hypotheses from audio
add noise to reach an ASR performance of about 14% word error rate
use the top-3 hypotheses to form new instances
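One plausible reading of the last step is that every hypothesis becomes its own training instance, paired with the gold words and the original labels. The sketch below assumes exactly that; the function name, the dictionary keys, and the second to fourth hypothesis strings are made up for illustration.

```python
def extend_with_hypotheses(gold_words, intent, slots, hypotheses, top_n=3):
    """Build one instance per top-n ASR hypothesis, reusing the gold labels."""
    instances = []
    for hyp in hypotheses[:top_n]:
        instances.append({
            "input_words": hyp.split(),      # possibly erroneous ASR output
            "gold_words": list(gold_words),  # target for the word decoder
            "intent": intent,
            "slots": list(slots),            # aligned with gold_words
        })
    return instances

gold = "show me flights from boston to new york".split()
slot_tags = ["O", "O", "O", "O", "B-fromloc.city_name", "O",
             "B-toloc.city_name", "I-toloc.city_name"]
# Hypothetical n-best list, best hypothesis first.
n_best = [
    "show flights from boston to no work",
    "show me flights from boston to new work",
    "show me flights from austin to new york",
    "so me flights from boston to new york",
]
instances = extend_with_hypotheses(gold, "flight", slot_tags, n_best)
```

Because the slot tags stay aligned with the gold words rather than with the hypothesis, the model has to recover the correct word sequence before tagging it.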

Hypotheses Extended ATIS Instance
Input:
words:  show flights from boston to no work
Labels:
intent: flight
words:  show me flights from boston to new york
slots:  O O O O B-fromloc.city_name O B-toloc.city_name I-toloc.city_name

Data
           train   dev    test   unique words
ATIS       4085    893    893    950
extended   11841   2583   2606   3178

Baseline 1
Figure: Intent Detection + Slot Filling [6] trained on gold transcription only

Baseline 2
subsequent models:
Figure: Language Model
Figure: Intent Detection + Slot Filling [6]

Evaluation Metrics
WER: word error rate
Slot F1: F1-score following the CoNLL Chunking Shared Task [7], using the inside/outside/begin (IOB) tagging scheme [8]
Intent Error: percentage of incorrect intent labels
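The three metrics can be sketched in a few lines. This is an illustrative implementation, not the official CoNLL evaluation script: the chunk matcher ignores an I- tag that has no open chunk, and `slot_f1` scores a single tag sequence, whereas a corpus-level score would pool the counts over all sentences. All function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j]: distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)

def intent_error(gold, predicted):
    """Percentage of incorrect intent labels."""
    wrong = sum(g != p for g, p in zip(gold, predicted))
    return 100.0 * wrong / len(gold)

def chunks(tags):
    """Set of (slot_name, start, end) chunks in an IOB tag sequence."""
    found, start, name = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a final chunk
        continues = tag.startswith("I-") and tag[2:] == name
        if name is not None and not continues:
            found.add((name, start, i))
            name = None
        if tag.startswith("B-"):
            start, name = i, tag[2:]
    return found

def slot_f1(gold_tags, pred_tags):
    """Chunk-level F1: a chunk counts only if name and span both match."""
    gold, pred = chunks(gold_tags), chunks(pred_tags)
    correct = len(gold & pred)
    if correct == 0:
        return 0.0
    precision, recall = correct / len(pred), correct / len(gold)
    return 2 * precision * recall / (precision + recall)

wer = word_error_rate("show me flights from boston to new york",
                      "show flights from boston to no work")
gold_tags = ["O", "O", "O", "O", "B-fromloc.city_name", "O",
             "B-toloc.city_name", "I-toloc.city_name"]
pred_tags = gold_tags[:-1] + ["O"]  # "york" missed by the tagger
f1 = slot_f1(gold_tags, pred_tags)
err = intent_error(["flight", "airfare"], ["flight", "flight"])
```

In this example the hypothesis needs one deletion and two substitutions against eight reference words, so the WER is 3/8. The truncated "new york" chunk counts as both a false positive and a false negative, which gives a slot F1 of 0.5.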

Results
Model                                   WER (%)   Slot F1   Intent Error (%)
Joint Slot&Detection                    14.55     84.26     5.80
Language Model + Joint Slot&Detection   10.43     86.85     5.20
Joint Model                             10.55     87.13     5.04
Table: Experimental results on the hypotheses-extended ATIS dataset, averaged over 10 runs.

3 End-to-End SLU (Haghani et al., 2018)
End-to-End Architectures

Figure: Direct model
P(S|X)

Figure: Joint model
P(S, W|X) = P(S|W, X)P(W|X)

Figure: Multitask model
P(S, W|X) = P(S|X)P(W|X)

Figure: Multistage model
P(S, W|X) = P(S|W)P(W|X)

Figure: Multistage (Argmax) model
Figure: Multistage (SampledSoftmax) model
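The factorizations above differ only in which conditional independence they assume. Reading X as the audio input, W as the transcript, and S as the semantics (an interpretation, since the slides do not define the symbols), the relationship can be written out as follows. The joint model is just the chain rule; the other two drop one conditioning variable.

```latex
\begin{align*}
\text{Joint (chain rule):}\quad
  & P(S, W \mid X) = P(S \mid W, X)\, P(W \mid X) \\
\text{Multitask } (S \perp W \mid X):\quad
  & P(S \mid W, X) = P(S \mid X)
    \;\Rightarrow\; P(S, W \mid X) = P(S \mid X)\, P(W \mid X) \\
\text{Multistage } (S \perp X \mid W):\quad
  & P(S \mid W, X) = P(S \mid W)
    \;\Rightarrow\; P(S, W \mid X) = P(S \mid W)\, P(W \mid X)
\end{align*}
```

The direct model does not predict W at all and models P(S|X) only.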

Data
human-transcribed Google Home queries
24M train
16K test
5 domains (MEDIA, MEDIA CONTROL, PRODUCTIVITY, DELIGHT, NONE)
20 intents (SET ALARM, SELF NOTE, ...)
2 arguments (DATETIME, SUBJECT)
Transcript                          Serialized Semantics
"can you set an alarm for 2 p.m."   <DOMAIN><PRODUCTIVITY><INTENT><SET ALARM><DATETIME>2 p.m.
"remind me to buy milk"             <DOMAIN><PRODUCTIVITY><INTENT><ADD REMINDER><SUBJECT>buy m
"next song please"                  <DOMAIN><MEDIA CONTROL>
"how old is barack obama"           <DOMAIN><NONE>
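The serialized semantics in the table can be read as a flat tag sequence: DOMAIN and INTENT are each followed by a value tag, and an argument tag is followed by free text. A sketch of a parser under that assumption; the function name and the output layout are hypothetical, and tag names are kept exactly as they appear on the slide.

```python
import re

def parse_semantics(serialized):
    """Parse '<DOMAIN><X><INTENT><Y><ARG>text' into a dictionary."""
    # Each token is either a tag (first group) or free text (second group).
    tokens = re.findall(r"<([^>]+)>|([^<]+)", serialized)
    result = {"domain": None, "intent": None, "arguments": {}}
    i = 0
    while i < len(tokens):
        tag, _ = tokens[i]
        has_next = i + 1 < len(tokens)
        if tag in ("DOMAIN", "INTENT") and has_next and tokens[i + 1][0]:
            # Key tag followed by a value tag.
            result[tag.lower()] = tokens[i + 1][0]
            i += 2
        elif tag and has_next and tokens[i + 1][1]:
            # Argument tag followed by its text span.
            result["arguments"][tag] = tokens[i + 1][1].strip()
            i += 2
        else:
            i += 1
    return result

alarm = parse_semantics(
    "<DOMAIN><PRODUCTIVITY><INTENT><SET ALARM><DATETIME>2 p.m.")
song = parse_semantics("<DOMAIN><MEDIA CONTROL>")
```

Queries without an intent, such as the media-control example, simply leave the intent field empty.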

Evaluation Metric
F1 for Domain
F1 for Intent
WER for Arguments

Results
Model                         Domain F1   Intent F1   Arg WER
Baseline                      96.6        95.1        15.04
Direct                        96.2        94.2        18.22
Joint                         96.8        95.7        14.93
Multitask                     96.7        95.8        15.02
Multistage (ArgMax)           96.5        95.4        14.84
Multistage (SampledSoftmax)   96.5        95.2        12.29

4 Compare

Joint Transcript Error Correction and NLU
Figure: Mix of joint and multistage model
P(I, S, W | X) = P(S | W, I) P(W | I, X) P(I | X)
for the comparison, omit the intent decoder and treat it as combined with the slot decoder

Joint Transcript Error Correction and NLU
Figure: Multistage model

Similarity
Figure: End-to-End SLU
Figure: LM + NLU

Results
Model                         Intent F1   Arg WER
Baseline                      95.1        15.04
Multistage (ArgMax)           95.4        14.84
Multistage (SampledSoftmax)   95.2        12.29
Table: End-to-End SLU
Model                         Intent Error   Slot F1
Baseline                      5.80           84.26
Multistage (ArgMax)           5.20           86.85
Multistage (SampledSoftmax)   5.04           87.13
Table: LM + NLU

Results
Figure: Joint Transcript Error Correction and NLU
Figure: End-to-End SLU
Word Decoder of the first model learns a distribution over possible errors in the transcriptions
Semantic decoder is exposed to a variety of (sampled) incorrect transcriptions

Conclusion
Figure: icons: [1]
bridging the gap between ASR and NLU for both black-box ASR and End-to-End SLU

Conclusion
Figure: icons: [1]
important to train NLU with sampled transcriptions
learn a distribution over the transcriptions of a "black box" ASR

Challenges

Differentiable Sampling
Figure: Goyal et al., 2017 [10][11][12]
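Sampling a discrete word blocks gradients, which is the challenge this slide refers to. One relaxation from the cited work [11][12] is the Gumbel-softmax; the sketch below shows the sampling step in plain Python. The function name, the probabilities, and the temperatures are assumptions, and a real model would apply this to decoder logits inside an autodiff framework.

```python
import math
import random

def gumbel_softmax(probs, temperature, rng):
    """Draw a relaxed one-hot sample: softmax((log p + g) / temperature)."""
    noisy = []
    for p in probs:
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        g = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((math.log(p) + g) / temperature)
    m = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(0)
probs = [0.6, 0.3, 0.1]  # hypothetical word distribution
soft = gumbel_softmax(probs, temperature=1.0, rng=rng)
sharp = gumbel_softmax(probs, temperature=0.01, rng=rng)
```

Lower temperatures push the output toward a one-hot vector, at the cost of higher-variance gradients.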

Beam Search

Combined Beam Search 1

Thank You!

Bibliography
[1] M. Aguilar, A. Shirazi, and S. Keating, Voice, voice, write,
[2] R. Schumann and P. Angkititrakul, “Incorporating ASR errors with attention-based, jointly trained RNN for intent detection and slot filling,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 6059–6063. doi: 10.1109/ICASSP.2018.8461598.
[3] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014. arXiv: 1409.0473. [Online]. Available: http://arxiv.org/abs/1409.0473.
[4] S. Bengio, O. Vinyals, N. Jaitly, and N. M. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” in Advances in Neural Information Processing Systems, NIPS, 2015. [Online]. Available: http://arxiv.org/abs/1506.03099.
[5] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington, “The ATIS spoken language systems pilot corpus,” in Proceedings of the Workshop on Speech and Natural Language, ser. HLT ’90, Hidden Valley, Pennsylvania: Association for Computational Linguistics, 1990, pp. 96–101. doi: 10.3115/116580.116613. [Online]. Available: https://doi.org/10.3115/116580.116613.
[6] B. Liu and I. Lane, “Attention-based recurrent neural network models for joint intent detection and slot filling,” CoRR, vol. abs/1609.01454, 2016. [Online]. Available: http://arxiv.org/abs/1609.01454.
[7] E. F. Tjong Kim Sang and S. Buchholz, “Introduction to the CoNLL-2000 shared task: Chunking,” in Proceedings of the 2nd Workshop on Learning Language in Logic and the 4th Conference on Computational Natural Language Learning - Volume 7, ser. ConLL ’00, Lisbon, Portugal: Association for Computational Linguistics, 2000, pp. 127–132. doi: 10.3115/1117601.1117631. [Online]. Available: https://doi.org/10.3115/1117601.1117631.
[8] L. A. Ramshaw and M. P. Marcus, “Text chunking using transformation-based learning,” CoRR, vol. cmp-lg/9505040, 1995. [Online]. Available: http://arxiv.org/abs/cmp-lg/9505040.
[9] P. Haghani, A. Narayanan, M. Bacchiani, G. Chuang, N. Gaur, P. Moreno, R. Prabhavalkar, Z. Qu, and A. Waters, “From audio to semantics: Approaches to end-to-end spoken language understanding,” arXiv preprint arXiv:1809.09190, 2018.
[10] K. Goyal, C. Dyer, and T. Berg-Kirkpatrick, “Differentiable scheduled sampling for credit assignment,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, Canada: Association for Computational Linguistics, 2017, pp. 366–371. doi: 10.18653/v1/P17-2058. [Online]. Available: http://aclweb.org/anthology/P17-2058.
[11] C. J. Maddison, A. Mnih, and Y. W. Teh, “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables,” in International Conference on Learning Representations, 2017.
[12] E. Jang, S. Gu, and B. Poole, “Categorical reparameterization with Gumbel-softmax,” 2017. [Online]. Available: https://arxiv.org/abs/1611.01144.
