9. The IBM 2015 English Conversational Telephone
Speech Recognition System [Saon, et al., 2015]
Speech data
Feature vector sequence
Phoneme sequence
Text
FMLLR (= feature-space Maximum Likelihood Linear Regression)
Gaussian Mixture Model
Context-free grammar
Language model
Phoneme sequence (with duplicates)
Hidden Markov Model
Network combining a DNN and a CNN
RNN language model
10. Network Architecture
l Two neural network streams
- one that includes a CNN
- one with only linear (fully connected) layers
l The output is the HMM states
[Excerpt from the paper, left-hand text cut off: the networks take 40-dimensional FMLLR features plus 100-dimensional i-vectors as input, are trained with cross-entropy followed by sequence-discriminative training, and their score fusion is evaluated on the Hub5'00 test set (SWB and CH). The accompanying table fragment lists CallHome WERs for cross-entropy (CE) vs. sequence training (ST): 18.4 / 17.9, 18.5 / 17.0, 17.7 / 16.3, 17.4 / 16.3, 17.0 / 16.1.]
[...] dividing the resulting matrix by the number of models (assuming uniform weights). An example of a joint CNN/DNN model initialized in such a way is illustrated in Figure 1. For convenience, we have indicated the sizes of the weight matrices in the oval boxes and the dimensionality of the layers is attached to the arrows.
[Figure 1 from the paper: weight-matrix sizes of the joint CNN/DNN. DNN branch: input of 11 x 40 FMLLR frames + 100-dim i-vector (540 dims), a 540 x 2048 input layer, four 2048 x 2048 hidden layers, and a 2048 x 512 bottleneck. CNN branch: 3 x 40 x 11 input, a 243 x 128 convolution (128 feature maps of 11 x 3), a 1536 x 256 convolution (256 feature maps of 8 x 1), fully connected 2048 x 2048 layers, and a 2048 x 512 bottleneck. The two 512-dim bottlenecks feed a shared 1024 x 32000 output layer over the HMM states.]
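As a concrete reading of the figure, here is a minimal PyTorch sketch of the joint CNN/DNN layout. The layer sizes are taken from the figure above; the sigmoid activations, the pooling configuration, and the convolution kernels (9x9 and 4x3) are assumptions made for illustration, not the paper's verified settings.

```python
import torch
import torch.nn as nn


def _fc_stack(sizes):
    """Fully connected stack with sigmoid activations between layers
    (no activation after the final, bottleneck layer)."""
    layers = []
    for a, b in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(a, b), nn.Sigmoid()]
    return layers[:-1]


class JointCNNDNN(nn.Module):
    """Minimal sketch of the joint CNN/DNN acoustic model in the figure.

    Layer sizes follow the figure (540 = 11 x 40 fMLLR frames + 100-dim
    i-vector, 512-dim bottlenecks, 32000 HMM-state outputs); everything
    else is an illustrative assumption."""

    def __init__(self, n_states=32000):
        super().__init__()
        # DNN branch: 540 -> five 2048-unit hidden layers -> 512-dim bottleneck.
        self.dnn = nn.Sequential(
            *_fc_stack([540, 2048, 2048, 2048, 2048, 2048, 512]))
        # CNN branch: 3 x 40 x 11 input -> two conv layers -> FC -> 512-dim bottleneck.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 128, kernel_size=(9, 9)),              # -> 128 x 32 x 3
            nn.Sigmoid(),
            nn.MaxPool2d(kernel_size=(3, 1), ceil_mode=True),   # -> 128 x 11 x 3
            nn.Conv2d(128, 256, kernel_size=(4, 3)),            # -> 256 x 8 x 1
            nn.Sigmoid(),
            nn.Flatten(),                                        # -> 2048
            *_fc_stack([2048, 2048, 2048, 2048, 2048, 512]),
        )
        # Shared output layer over the two concatenated 512-dim bottlenecks.
        self.output = nn.Linear(1024, n_states)

    def forward(self, dnn_in, cnn_in):
        # dnn_in: (batch, 540); cnn_in: (batch, 3, 40, 11)
        joint = torch.cat([self.dnn(dnn_in), self.cnn(cnn_in)], dim=1)
        return self.output(joint)  # per-frame HMM-state logits
```

Calling `JointCNNDNN()(torch.randn(2, 540), torch.randn(2, 3, 40, 11))` yields a (2, 32000) tensor of per-frame HMM-state scores, matching the shared 1024 x 32000 output layer in the figure.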
12. Accuracy Comparison
[...] 0.8% on SWB (8.8% to 8.0%) and 1.2% on CallHome (15.3% to 14.1%).
Lastly, in Table 6 we compare our results with those ob-
tained by various other systems from the literature. For clarity,
we also specify the type of training data that was used for acous-
tic modeling in each case.
System | AM training data | SWB | CH
Vesely et al. [8] | SWB | 12.6 | 24.1
Seide et al. [9] | SWB+Fisher+other | 13.1 | –
Hannun et al. [10] | SWB+Fisher | 12.6 | 19.3
Zhou et al. [11] | SWB | 14.2 | –
Maas et al. [12] | SWB | 14.3 | 26.0
Maas et al. [12] | SWB+Fisher | 15.0 | 23.0
Soltau et al. [13] | SWB | 10.4 | 19.1*
This system | SWB+Fisher+CH | 8.0 | 14.1
Table 6: Comparison of word error rates on Hub5'00 (SWB and CH).
Proposed method: the "This system" row.
DeepSpeech (described later): the Hannun et al. [10] row.
13. Discriminative Method for Recurrent Neural
Network Language Models [Tachioka, et al., 2015]
Speech data
Feature vector sequence
Phoneme sequence
Text
MFCC
Gaussian Mixture Model
Context-free grammar
Language model
Phoneme sequence (with duplicates)
Hidden Markov Model
DNN
RNN language model
14. Discriminative Method for Recurrent Neural
Network Language Models [Tachioka, et al., 2015]
l WER on the test sets of the Corpus of Spontaneous Japanese (CSJ)
- baseline 11.31% -> 10.49%
- on the E2 test set, 9.84% was achieved
17. End-to-end Continuous Speech Recognition using Attention-
based Recurrent NN: First Results [Chorowski, et al., 2014]
l Generates a phoneme sequence from speech data
- no intermediate representations such as HMMs or automata are
generated in order to produce the phoneme sequence
l Uses an attention mechanism
- Encoder: reads the speech data
- Decoder: generates the phoneme sequence
18. Attention Mechanism
[Figure 1 from the paper (graphical illustration of the attention mechanism): encoder inputs x_1 ... x_T, bidirectional annotations h_1 ... h_T, attention weights α_{t,1} ... α_{t,T}, decoder states s_{t-1}, s_t, and outputs y_{t-1}, y_t.]
Here, we define each conditional probability as

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c_i),   (4)

where s_i is an RNN hidden state for time i, computed by

s_i = f(s_{i-1}, y_{i-1}, c_i).

Unlike the existing encoder-decoder approach, here the probability is conditioned on a distinct context vector c_i for each target word y_i. The context vector c_i depends on a sequence of annotations (h_1, ..., h_T) to which an encoder maps the input sentence. Each annotation contains information about the whole input sequence, with a strong focus on the parts surrounding the i-th word of the input.
Encoder: an RNN that reads the input sequence
Context: the information read from the Encoder is embedded here
Decoder: an RNN that generates the output together with the information in the Context
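To make the weights α_{t,j} and the context vector concrete, here is a minimal PyTorch sketch of one attention step. The additive (MLP) scoring and all dimensions are assumptions for illustration, not necessarily the scoring function used in the paper.

```python
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    """One attention step: score each encoder annotation h_j against the
    previous decoder state s_{t-1}, normalize the scores with a softmax
    to get weights alpha_{t,j}, and return the context c_t as the
    weighted sum of annotations."""

    def __init__(self, dec_dim, enc_dim, att_dim=128):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, att_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, att_dim, bias=False)
        self.v = nn.Linear(att_dim, 1, bias=False)

    def forward(self, s_prev, annotations):
        # s_prev: (batch, dec_dim); annotations: (batch, T, enc_dim)
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(annotations)
        )).squeeze(-1)                              # (batch, T)
        alpha = torch.softmax(scores, dim=-1)       # attention weights alpha_{t,j}
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)   # c_t
        return context, alpha
```

The decoder then computes s_t = f(s_{t-1}, y_{t-1}, c_t) and the output distribution g(y_{t-1}, s_t, c_t), matching equation (4) above.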
19. Network Architecture
l Input sequence: frames of 40 fMLLR features.
l Deep Maxout network reads 11 frames (440 features) and uses 3 hidden layers of 1024 maxout units, each using 5 filters (see the maxout sketch below).
l BiRNN: the input is 1024 features per frame; each recurrent layer has 512 hidden units, thus the annotation is 1024-dimensional.
l Encoder RNN: computes an annotation for each input frame.
l The BiRNN is used to initialize the first state of the decoder.
l Context: a score is computed to match the previous hidden state to all input annotations. The context is a weighted combination of the most closely matching annotations.
l Decoder RNN: recurrently predicts the next phoneme; input annotations are accessed through a context computed separately for each output.
l The output predictions are computed with a Maxout network using two filters per unit.
Figure 1 (caption from the paper, cut off): "Proposed model architecture. The system contains three parts: an encoder that computes annotations of input frames (learned features that may depend on the whole sequence), an attention [...]"
Speech data
MLP
Bidirectional RNN
Context: a vector embedding the information from the Encoder side
RNN
MLP
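The maxout units mentioned above take the maximum over several linear "filters" per unit. A minimal PyTorch sketch of such a layer and of the 440 -> 3 x 1024 encoder front-end described on this slide (the class and variable names are made up for the example):

```python
import torch
import torch.nn as nn


class Maxout(nn.Module):
    """Maxout layer: each output unit is the maximum over k linear
    filters (k=5 in the encoder MLP above, k=2 in the output network)."""

    def __init__(self, in_dim, out_dim, k):
        super().__init__()
        self.out_dim, self.k = out_dim, k
        self.linear = nn.Linear(in_dim, out_dim * k)

    def forward(self, x):
        z = self.linear(x)                               # (..., out_dim * k)
        z = z.view(*x.shape[:-1], self.out_dim, self.k)
        return z.max(dim=-1).values                      # max over the k filters


# Encoder front-end: 11 frames x 40 fMLLR features = 440 inputs,
# three hidden layers of 1024 maxout units with 5 filters each.
frontend = nn.Sequential(
    Maxout(440, 1024, k=5),
    Maxout(1024, 1024, k=5),
    Maxout(1024, 1024, k=5),
)
frames = torch.randn(8, 440)      # a batch of stacked frames
print(frontend(frames).shape)     # torch.Size([8, 1024])
```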
21. Deep Speech: Scaling up end-to-end speech
recognition [Hannun, et al., 2014]
l Published by a group at Baidu Research
l Generates text directly from speech data, without going through a phoneme sequence
- uses the Connectionist Temporal Classification (CTC) function
l Achieves a Word Error Rate of 12.6% on Switchboard
22. Network Architecture
Once we have computed a prediction for P(c_t|x), we compute the CTC loss [13] L(ŷ, y) to measure the error in prediction. During training, we can evaluate the gradient ∇_ŷ L(ŷ, y) with respect to the network outputs given the ground-truth character sequence y. From this point, computing the gradient with respect to all of the model parameters may be done via back-propagation through the rest of the network. We use Nesterov's Accelerated gradient method for training [41].
Figure 1: Structure of our RNN model and notation.
MLP with clipped
ReLU
Bidirectional RNN
with clipped ReLU
Softmax
CTC loss function
Log filterbank
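A minimal PyTorch sketch of the stack labelled above (MLP with clipped ReLU, bidirectional RNN, softmax over characters for the CTC loss). Hidden sizes, layer counts, the character set size, and the feature dimension are illustrative; the recurrent layer below is a plain tanh RNN rather than the paper's clipped-ReLU recurrence.

```python
import torch
import torch.nn as nn


def clipped_relu(x, cap=20.0):
    # Clipped rectified-linear unit: min(max(x, 0), cap).
    return torch.clamp(x, min=0.0, max=cap)


class DeepSpeechSketch(nn.Module):
    """Fully connected layers with clipped ReLU over log-filterbank
    frames, one bidirectional recurrent layer, and a per-frame
    log-softmax over characters, ready to feed a CTC loss."""

    def __init__(self, n_feats=160, hidden=1024, n_chars=29):
        super().__init__()
        self.fc = nn.ModuleList([
            nn.Linear(n_feats, hidden),
            nn.Linear(hidden, hidden),
            nn.Linear(hidden, hidden),
        ])
        self.birnn = nn.RNN(hidden, hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_chars)

    def forward(self, x):
        # x: (batch, time, n_feats) log-filterbank features
        h = x
        for layer in self.fc:
            h = clipped_relu(layer(h))
        h, _ = self.birnn(h)                              # (batch, time, 2 * hidden)
        return torch.log_softmax(self.out(h), dim=-1)     # log P(c_t | x) per frame
```

The per-frame log-probabilities are exactly what the CTC loss discussed on the following slides consumes.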
27. The Gradient in Connectionist Temporal Classification
(Excerpt from Graves, Chapter 7: Connectionist Temporal Classification.)

∂y_{k'}^t / ∂a_k^t = y_{k'}^t δ_{kk'} - y_{k'}^t y_k^t      (7.33)

Substituting (7.33) and (7.31) into (7.32) gives

∂L(x, z) / ∂a_k^t = y_k^t - (1 / p(z|x)) Σ_{u ∈ B(z,k)} α(t, u) β(t, u)

This is the 'error signal' backpropagated through the network during training, illustrated in Figure 7.4.

Decoding: once the network is trained, we would ideally label some unknown input sequence x by choosing the most probable labelling l*:

l* = arg max_l p(l|x)
∂L/∂a_k^t: the gradient at time t with respect to the output activation a_k
y_k^t: the network output for label k at time t
p(z|x): the total probability of all paths
Σ α(t,u)β(t,u): the total probability of all paths passing through a point corresponding to output k at time t
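In practice the forward-backward quantities α(t, u), β(t, u) and the resulting gradient are computed by the framework. A small numeric sketch, assuming PyTorch's torch.nn.functional.ctc_loss (which implements these recursions); all sizes and inputs below are made up:

```python
import torch
import torch.nn.functional as F

T, C, L = 50, 6, 12                  # frames, labels incl. blank (index 0), target length
logits = torch.randn(T, 1, C, requires_grad=True)    # pre-softmax activations a_k^t
log_probs = F.log_softmax(logits, dim=-1)             # log y_k^t
targets = torch.randint(1, C, (1, L))                 # target labelling z (no blanks)

# reduction='sum' makes the loss exactly L(x, z) = -ln p(z|x) for this utterance.
loss = F.ctc_loss(log_probs, targets,
                  input_lengths=torch.tensor([T]),
                  target_lengths=torch.tensor([L]),
                  blank=0, reduction='sum')
loss.backward()

# logits.grad[t, 0, k] is the 'error signal' of the formula above:
#   y_k^t - (1 / p(z|x)) * sum_{u in B(z,k)} alpha(t, u) * beta(t, u)
print(logits.grad.shape)             # torch.Size([50, 1, 6])
```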
29. At Inference Time
l The Bidirectional RNN part is applied unchanged
l After the outputs for all time steps have been computed, the character string with the highest probability is found by dynamic programming (the beam search in the excerpt below), taking the language model into account
[Figure caption from the paper, partially cut off: examples of transcriptions directly from the RNN (left), with errors that are fixed by addition of a language model (right).]

Given the output P(c|x) of our RNN, we perform a search to find the sequence of characters c that is most probable according to both the RNN output and the language model (which interprets the string of characters as words). Specifically, we aim to find a c that maximizes the combined objective:

Q(c) = log(P(c|x)) + α log(P_lm(c)) + β word_count(c)

where α and β are tunable parameters (set by cross-validation) that control the trade-off between the RNN, the language model constraint, and the length of the sentence. The term P_lm denotes the probability of the sequence c according to the N-gram model. This objective is maximized with a beam search algorithm, with a typical beam size in the range 1000-8000, as described by Hannun et al. [16].
In the objective above: P(c|x) is the probability from the RNN output, P_lm(c) is the probability under the language model, and word_count(c) is the length of the sentence in words.
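A toy sketch of the combined objective Q(c). The language model, the weights, and the candidate strings below are all made up, and a real decoder applies this score inside the CTC beam search rather than to a handful of finished strings.

```python
import math

ALPHA, BETA = 1.5, 0.5   # hypothetical weights chosen by cross-validation


def q_score(log_p_rnn: float, candidate: str, lm) -> float:
    """Q(c) = log P(c|x) + alpha * log P_lm(c) + beta * word_count(c)."""
    word_count = len(candidate.split())
    return log_p_rnn + ALPHA * lm.log_prob(candidate) + BETA * word_count


class DummyLM:
    """Placeholder N-gram language model: favors strings made of known words."""
    vocab = {"hello", "world", "speech"}

    def log_prob(self, text: str) -> float:
        words = text.split()
        known = sum(w in self.vocab for w in words)
        return -2.0 * (len(words) - known) - 1.0 * len(words)


lm = DummyLM()
candidates = {"helo world": -3.2, "hello world": -3.5}   # log P(c|x) from the RNN
best = max(candidates, key=lambda c: q_score(candidates[c], c, lm))
print(best)   # the language-model term pushes the decoder toward "hello world"
```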