9. The IBM 2015 English Conversational Telephone
Speech Recognition System [Saon, et al., 2015]
Speech data
Feature vector sequence
Phoneme sequence
Text
FMLLR (= feature-space Maximum Likelihood Linear Regression)
Gaussian Mixture Model
Context-free grammar
Language model
Phoneme sequence (with duplicates)
Hidden Markov Model
Network combining a DNN and a CNN
RNN language model
10. Network Architecture
l Two neural network streams
- one that includes a CNN
- one with only linear (fully connected) layers
l The output is the HMM states
[Excerpt from the paper, left-hand text cut off: the networks take 40-dimensional FMLLR features plus 100-dimensional i-vectors as input, are trained with cross-entropy followed by sequence-discriminative training, and their score fusion is evaluated on the Hub5'00 test set (SWB and CH). The accompanying table fragment lists CallHome WERs for cross-entropy (CE) vs. sequence training (ST): 18.4 / 17.9, 18.5 / 17.0, 17.7 / 16.3, 17.4 / 16.3, 17.0 / 16.1.]
[...] dividing the resulting matrix by the number of models (assuming uniform weights). An example of a joint CNN/DNN model initialized in such a way is illustrated in Figure 1. For convenience, we have indicated the sizes of the weight matrices in the oval boxes and the dimensionality of the layers is attached to the arrows.
[Figure 1 from the paper: weight-matrix sizes of the joint CNN/DNN. DNN branch: input of 11 x 40 FMLLR frames + 100-dim i-vector (540 dims), a 540 x 2048 input layer, four 2048 x 2048 hidden layers, and a 2048 x 512 bottleneck. CNN branch: 3 x 40 x 11 input, a 243 x 128 convolution (128 feature maps of 11 x 3), a 1536 x 256 convolution (256 feature maps of 8 x 1), fully connected 2048 x 2048 layers, and a 2048 x 512 bottleneck. The two 512-dim bottlenecks feed a shared 1024 x 32000 output layer over the HMM states.]
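As a concrete reading of the figure, here is a minimal PyTorch sketch of the joint CNN/DNN layout. The layer sizes are taken from the figure above; the sigmoid activations, the pooling configuration, and the convolution kernels (9x9 and 4x3) are assumptions made for illustration, not the paper's verified settings.

```python
import torch
import torch.nn as nn


def _fc_stack(sizes):
    """Fully connected stack with sigmoid activations between layers
    (no activation after the final, bottleneck layer)."""
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.Sigmoid()]
    return layers[:-1]


class JointCNNDNN(nn.Module):
    """Minimal sketch of the joint CNN/DNN acoustic model in the figure.

    Layer sizes follow the figure (540 = 11 x 40 fMLLR frames + 100-dim
    i-vector, 512-dim bottlenecks, 32000 HMM-state outputs); everything
    else is an illustrative assumption."""

    def __init__(self, n_states=32000):
        super().__init__()
        # DNN branch: 540 -> five 2048-unit hidden layers -> 512-dim bottleneck.
        self.dnn = nn.Sequential(
            *_fc_stack([540, 2048, 2048, 2048, 2048, 2048, 512]))
        # CNN branch: 3 x 40 x 11 input -> two conv layers -> FC -> 512-dim bottleneck.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=(9, 9)),              # -> 128 x 32 x 3
            nn.Sigmoid(),
            nn.MaxPool2d(kernel_size=(3, 1), ceil_mode=True),   # -> 128 x 11 x 3
            nn.Conv2d(128, 256, kernel_size=(4, 3)),            # -> 256 x 8 x 1
            nn.Sigmoid(),
            nn.Flatten(),                                        # -> 2048
            *_fc_stack([2048, 2048, 2048, 2048, 2048, 512]),
        )
        # Shared output layer over the two concatenated 512-dim bottlenecks.
        self.output = nn.Linear(1024, n_states)

    def forward(self, dnn_in, cnn_in):
        # dnn_in: (batch, 540); cnn_in: (batch, 3, 40, 11)
        joint = torch.cat([self.dnn(dnn_in), self.cnn(cnn_in)], dim=1)
        return self.output(joint)  # per-frame HMM-state logits
```

Calling `JointCNNDNN()(torch.randn(2, 540), torch.randn(2, 3, 40, 11))` yields a (2, 32000) tensor of per-frame HMM-state scores, matching the shared 1024 x 32000 output layer in the figure.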
12. Accuracy Comparison
[...] 0.8% on SWB (8.8% to 8.0%) and 1.2% on CallHome (15.3% to 14.1%).
Lastly, in Table 6 we compare our results with those ob-
tained by various other systems from the literature. For clarity,
we also specify the type of training data that was used for acous-
tic modeling in each case.
System | AM training data | SWB | CH
Vesely et al. [8] | SWB | 12.6 | 24.1
Seide et al. [9] | SWB+Fisher+other | 13.1 | –
Hannun et al. [10] | SWB+Fisher | 12.6 | 19.3
Zhou et al. [11] | SWB | 14.2 | –
Maas et al. [12] | SWB | 14.3 | 26.0
Maas et al. [12] | SWB+Fisher | 15.0 | 23.0
Soltau et al. [13] | SWB | 10.4 | 19.1*
This system | SWB+Fisher+CH | 8.0 | 14.1
Table 6: Comparison of word error rates on Hub5'00 (SWB and CH).
Proposed method: the "This system" row.
DeepSpeech (described later): the Hannun et al. [10] row.
13. Discriminative Method for Recurrent Neural
Network Language Models [Tachioka, et al., 2015]
Speech data
Feature vector sequence
Phoneme sequence
Text
MFCC
Gaussian Mixture Model
Context-free grammar
Language model
Phoneme sequence (with duplicates)
Hidden Markov Model
DNN
RNN language model
14. Discriminative Method for Recurrent Neural
Network Language Models [Tachioka, et al., 2015]
l WER on the test sets of the Corpus of Spontaneous Japanese (CSJ)
- baseline 11.31% -> 10.49%
- on the E2 test set, 9.84% was achieved
17. End-to-end Continuous Speech Recognition using Attention-
based Recurrent NN: First Results [Chorowski, et al., 2014]
l Generates a phoneme sequence from speech data
- no intermediate representations such as HMMs or automata are
generated in order to produce the phoneme sequence
l Uses an attention mechanism
- Encoder: reads the speech data
- Decoder: generates the phoneme sequence
18. Attention Mechanism
[Figure 1 from the paper (graphical illustration of the attention mechanism): encoder inputs x_1 ... x_T, bidirectional annotations h_1 ... h_T, attention weights α_{t,1} ... α_{t,T}, decoder states s_{t-1}, s_t, and outputs y_{t-1}, y_t.]
Here, we define each conditional probability as

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),   (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i-1}, y_{i-1}, c_i).

Unlike the existing encoder-decoder approach, here the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector c_i depends on a sequence of annotations (h_1, ..., h_T) to which an encoder maps the input sentence. Each annotation contains information about the whole input sequence, with a strong focus on the parts surrounding the i-th word of the input.
Encoder: an RNN that reads the input sequence
Context: the information read from the Encoder is embedded here
Decoder: an RNN that generates the output together with the information in the Context
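To make the weights α_{t,j} and the context vector concrete, here is a minimal PyTorch sketch of one attention step. The additive (MLP) scoring and all dimensions are assumptions for illustration, not necessarily the scoring function used in the paper.

```python
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """One attention step: score each encoder annotation h_j against the
    previous decoder state s_{t-1}, normalize the scores with a softmax
    to get weights alpha_{t,j}, and return the context c_t as the
    weighted sum of annotations."""

    def __init__(self, dec_dim, enc_dim, att_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, dec_dim); annotations: (batch, T, enc_dim)
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(annotations)
        )).squeeze(-1)                              # (batch, T)
        alpha = torch.softmax(scores, dim=-1)       # attention weights alpha_{t,j}
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)   # c_t
        return context, alpha
```

The decoder then computes s_t = f(s_{t-1}, y_{t-1}, c_t) and the output distribution g(y_{t-1}, s_t, c_t), matching equation (4) above.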
19. Network Architecture
l Input sequence: frames of 40 fMLLR features.
l Deep Maxout network reads 11 frames (440 features) and uses 3 hidden layers of 1024 maxout units, each using 5 filters (see the maxout sketch below).
l BiRNN: the input is 1024 features per frame; each recurrent layer has 512 hidden units, thus the annotation is 1024-dimensional.
l Encoder RNN: computes an annotation for each input frame.
l The BiRNN is used to initialize the first state of the decoder.
l Context: a score is computed to match the previous hidden state to all input annotations. The context is a weighted combination of the most closely matching annotations.
l Decoder RNN: recurrently predicts the next phoneme; input annotations are accessed through a context computed separately for each output.
l The output predictions are computed with a Maxout network using two filters per unit.
Figure 1 (caption from the paper, cut off): "Proposed model architecture. The system contains three parts: an encoder that computes annotations of input frames (learned features that may depend on the whole sequence), an attention [...]"
Speech data
MLP
Bidirectional RNN
Context: a vector embedding the information from the Encoder side
RNN
MLP
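The maxout units mentioned above take the maximum over several linear "filters" per unit. A minimal PyTorch sketch of such a layer and of the 440 -> 3 x 1024 encoder front-end described on this slide (the class and variable names are made up for the example):

```python
import torch
import torch.nn as nn


class Maxout(nn.Module):
    """Maxout layer: each output unit is the maximum over k linear
    filters (k=5 in the encoder MLP above, k=2 in the output network)."""

    def __init__(self, in_dim, out_dim, k):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x)                               # (..., out_dim * k)
        z = z.view(*x.shape[:-1], self.out_dim, self.k)
        return z.max(dim=-1).values                      # max over the k filters


# Encoder front-end: 11 frames x 40 fMLLR features = 440 inputs,
# three hidden layers of 1024 maxout units with 5 filters each.
frontend = nn.Sequential(
    Maxout(440, 1024, k=5),
    Maxout(1024, 1024, k=5),
    Maxout(1024, 1024, k=5),
)
frames = torch.randn(8, 440)      # a batch of stacked frames
print(frontend(frames).shape)     # torch.Size([8, 1024])
```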
21. Deep Speech: Scaling up end-to-end speech
recognition [Hannun, et al., 2014]
l Published by a group at Baidu Research
l Generates text directly from speech data, without going through a phoneme sequence
- uses the Connectionist Temporal Classification (CTC) function
l Achieves a Word Error Rate of 12.6% on Switchboard
22. Network Architecture
Once we have computed a prediction for P(c_t|x), we compute the CTC loss [13] L(ŷ, y) to measure the error in prediction. During training, we can evaluate the gradient ∇_ŷ L(ŷ, y) with respect to the network outputs given the ground-truth character sequence y. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use Nesterov's Accelerated gradient method for training [41].
Figure 1: Structure of our RNN model and notation.
MLP with clipped
ReLU
Bidirectional RNN
with clipped ReLU
Softmax
CTC loss function
Log filterbank
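A minimal PyTorch sketch of the stack labelled above (MLP with clipped ReLU, bidirectional RNN, softmax over characters for the CTC loss). Hidden sizes, layer counts, the character set size, and the feature dimension are illustrative; the recurrent layer below is a plain tanh RNN rather than the paper's clipped-ReLU recurrence.

```python
import torch
import torch.nn as nn


def clipped_relu(x, cap=20.0):
    # Clipped rectified-linear unit: min(max(x, 0), cap).
    return torch.clamp(x, min=0.0, max=cap)


class DeepSpeechSketch(nn.Module):
    """Fully connected layers with clipped ReLU over log-filterbank
    frames, one bidirectional recurrent layer, and a per-frame
    log-softmax over characters, ready to feed a CTC loss."""

    def __init__(self, n_feats=160, hidden=1024, n_chars=29):
        super().__init__()
        self.fc = nn.ModuleList([
            nn.Linear(n_feats, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        ])
        self.birnn = nn.RNN(hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, x):
        # x: (batch, time, n_feats) log-filterbank features
        h = x
        for layer in self.fc:
            h = clipped_relu(layer(h))
        h, _ = self.birnn(h)                              # (batch, time, 2 * hidden)
        return torch.log_softmax(self.out(h), dim=-1)     # log P(c_t | x) per frame
```

The per-frame log-probabilities are exactly what the CTC loss discussed on the following slides consumes.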
27. The Gradient in Connectionist Temporal Classification
(Excerpt from Graves, Chapter 7: Connectionist Temporal Classification.)

∂y_{k'}^t / ∂a_k^t = y_{k'}^t δ_{kk'} - y_{k'}^t y_k^t      (7.33)

Substituting (7.33) and (7.31) into (7.32) gives

∂L(x, z) / ∂a_k^t = y_k^t - (1 / p(z|x)) Σ_{u ∈ B(z,k)} α(t, u) β(t, u)

This is the 'error signal' backpropagated through the network during training, illustrated in Figure 7.4.

Decoding: once the network is trained, we would ideally label some unknown input sequence x by choosing the most probable labelling l*:

l* = arg max_l p(l|x)
∂L/∂a_k^t: the gradient at time t with respect to the output activation a_k
y_k^t: the network output for label k at time t
p(z|x): the total probability of all paths
Σ α(t,u)β(t,u): the total probability of all paths passing through a point corresponding to output k at time t
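In practice the forward-backward quantities α(t, u), β(t, u) and the resulting gradient are computed by the framework. A small numeric sketch, assuming PyTorch's torch.nn.functional.ctc_loss (which implements these recursions); all sizes and inputs below are made up:

```python
import torch
import torch.nn.functional as F

T, C, L = 50, 6, 12                  # frames, labels incl. blank (index 0), target length
logits = torch.randn(T, 1, C, requires_grad=True)    # pre-softmax activations a_k^t
log_probs = F.log_softmax(logits, dim=-1)             # log y_k^t
targets = torch.randint(1, C, (1, L))                 # target labelling z (no blanks)

# reduction='sum' makes the loss exactly L(x, z) = -ln p(z|x) for this utterance.
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.tensor([T]),
                  target_lengths=torch.tensor([L]),
                  blank=0, reduction='sum')
loss.backward()

# logits.grad[t, 0, k] is the 'error signal' of the formula above:
#   y_k^t - (1 / p(z|x)) * sum_{u in B(z,k)} alpha(t, u) * beta(t, u)
print(logits.grad.shape)             # torch.Size([50, 1, 6])
```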
29. At Inference Time
l The Bidirectional RNN part is applied unchanged
l After the outputs for all time steps have been computed, the character string with the highest probability is found by dynamic programming (the beam search in the excerpt below), taking the language model into account
[Figure caption from the paper, partially cut off: examples of transcriptions directly from the RNN (left), with errors that are fixed by addition of a language model (right).]

Given the output P(c|x) of our RNN, we perform a search to find the sequence of characters c that is most probable according to both the RNN output and the language model (which interprets the string of characters as words). Specifically, we aim to find a c that maximizes the combined objective:

Q(c) = log(P(c|x)) + α log(P_lm(c)) + β word_count(c)

where α and β are tunable parameters (set by cross-validation) that control the trade-off between the RNN, the language model constraint, and the length of the sentence. The term P_lm denotes the probability of the sequence c according to the N-gram model. This objective is maximized with a beam search algorithm, with a typical beam size in the range 1000-8000, as described by Hannun et al. [16].
In the objective above: P(c|x) is the probability from the RNN output, P_lm(c) is the probability under the language model, and word_count(c) is the length of the sentence in words.
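A toy sketch of the combined objective Q(c). The language model, the weights, and the candidate strings below are all made up, and a real decoder applies this score inside the CTC beam search rather than to a handful of finished strings.

```python
import math

ALPHA, BETA = 1.5, 0.5   # hypothetical weights chosen by cross-validation


def q_score(log_p_rnn: float, candidate: str, lm) -> float:
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c)."""
    word_count = len(candidate.split())
    return log_p_rnn + ALPHA * lm.log_prob(candidate) + BETA * word_count


class DummyLM:
    """Placeholder N-gram language model: favors strings made of known words."""
    vocab = {"hello", "world", "speech"}

    def log_prob(self, text: str) -> float:
        words = text.split()
        known = sum(w in self.vocab for w in words)
        return -2.0 * (len(words) - known) - 1.0 * len(words)


lm = DummyLM()
candidates = {"helo world": -3.2, "hello world": -3.5}   # log P(c|x) from the RNN
best = max(candidates, key=lambda c: q_score(candidates[c], c, lm))
print(best)   # the language-model term pushes the decoder toward "hello world"
```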