Automatic speech recognition system using deep learning

Automatic Speech Recognition System using Deep
Learning
Ankan Dutta
14MCEI03
Guided By
Dr. Sharada Valiveti
Institute of Technology
Nirma University
May 16, 2016
Ankan Dutta (Institute of Technology, Nirma University), Audio Visual Speech Recognition System using Deep Learning, May 16, 2016

Introduction
Definition
Development of Automatic Speech Recognition System using Deep
Learning Techniques
Scope
Automatic Speech Recognition allows computers to interpret human
speech
Lower barrier for computer interactions
Speech recognition allows converting speech to text

Introduction
Objectives
Convert the audio signals into MFCCs (Mel-Frequency Cepstral
Coefficients) [1].
Implement a Convolutional Neural Network [2, 3] for audio feature
extraction.
Use a Gaussian Mixture Model - Hidden Markov Model (GMM-HMM) [4] for
recognition of the audio signal.
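
The mel scale that MFCCs are built on can be illustrated with a short sketch. The 2595 * log10(1 + f/700) formula is one common convention and is assumed here; the slides do not say which variant was used, and the function names are illustrative only.

```python
import math

def hz_to_mel(freq_hz):
    """Convert a frequency in Hz to mels (common 2595*log10 convention)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_centers(num_filters, low_hz, high_hz):
    """Center frequencies (Hz) of filters spaced evenly on the mel scale."""
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    step = (high_mel - low_mel) / (num_filters + 1)
    return [mel_to_hz(low_mel + step * (i + 1)) for i in range(num_filters)]

if __name__ == "__main__":
    # Centers are dense at low frequencies and sparse at high ones.
    for center in mel_filter_centers(num_filters=5, low_hz=0.0, high_hz=8000.0):
        print(round(center, 1))
```

Even spacing in mels gives finer resolution at low frequencies, which is the property the MFCC filterbank relies on.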

1st Review
Applications of Speech Recognition System
Dimensions of Speech Recognition System.
Generic Block Diagram of the system (Fig. 1).
Performance Evaluation of the system.
Conventional Speech Recognition Systems.
Required Machine Learning Techniques.
MFCCs (Fig. 2)
Hidden Markov Models (Fig. 3)
Gaussian Mixture Models (Fig. 4)
Deep De-noising Auto-Encoders (Fig. 5)
Convolutional Neural Networks (Fig. 6)

1st Review
Literature Survey on various proposed
Audio Features Extraction Mechanisms.
Visual Features Extraction Mechanisms.
Integration of Audio and Visual Systems.
Architectures for the System.

Mechanisms for Audio Features Extraction [5, 6, 7]
Approaches:
Using DNN-HMM in recognition phase.
Using DNN in feature-extraction phase and GMM-HMM in
recognition phase.
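
Both approaches keep an HMM in the recognition phase. As a rough illustration of HMM decoding, here is a minimal log-domain Viterbi sketch. The two-state model ("sil"/"speech") and all of its probabilities are made-up toy values, not taken from the project.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    # best[t][s]: best log-probability of any path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Toy model (hypothetical values): silence vs. speech, low vs. high energy.
states = ["sil", "speech"]
log_start = {"sil": math.log(0.8), "speech": math.log(0.2)}
log_trans = {
    "sil": {"sil": math.log(0.7), "speech": math.log(0.3)},
    "speech": {"sil": math.log(0.2), "speech": math.log(0.8)},
}
log_emit = {
    "sil": {"low": math.log(0.9), "high": math.log(0.1)},
    "speech": {"low": math.log(0.2), "high": math.log(0.8)},
}
path = viterbi(["low", "high", "high"], states, log_start, log_trans, log_emit)
```

In a GMM-HMM system the emission scores would come from Gaussian mixtures over MFCC frames, and in a DNN-HMM system from network outputs; the decoding step stays the same in spirit.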

Generic Block Diagram
Figure: System architecture

MFCC[1]
Figure: MFCC

Hidden Markov Model
Figure: Hidden Markov Model

Gaussian Mixture Model
Figure: Gaussian Mixture Model

Deep De-noising Auto-Encoder
Figure: Deep De-noising Auto-Encoder

Convolutional Neural Network
Figure: Convolutional Neural Network

2nd Review
Audio Visual Speech Recognition.
Architecture of the model used in our implementation (Fig. 7).
Required Tools and Datasets.
A Basic implementation Using KALDI.

Architecture of the Model
Figure: System architecture

Requirements
For Automatic Speech Recognition,
NVIDIA CUDA 7.5 (the system should have an NVIDIA GPU) [8]
We use the KALDI Speech Recognition Toolkit for our
implementation.
For Automatic Speaker Recognition,
Python

KALDI’s Dependencies [9]
To use Kaldi, the following libraries and tools have to be
installed:
OpenFst : Used very heavily; most of the compilation is done with
it.
IRSTLM : A language modeling toolkit.
sph2pipe : Converts .sph files to .wav files; required for using
LDC datasets.
sclite : Not essential, but it may come up as one of the
dependencies, so it is better to install it.
ATLAS : A linear algebra library. It will only work if CPU
throttling is not enabled.
CLAPACK : Also a linear algebra library. It can be used as an
alternative if ATLAS is not available.
SRILM : A toolkit for building and applying statistical
language models (LMs), primarily for use in speech recognition,
statistical tagging and segmentation, and machine translation.

Dataset
We built our own dataset to suit the requirements of the project. Its
vocabulary is very small: the digits 0 to 9. The same dataset is used
in both of our implementations, automatic speech recognition and
speaker identification. Sentences that contain only digits are well
suited to this case.
10 different speakers (ASR systems must be trained and tested on
different speakers; the more speakers, the better),
each speaker says 10 sentences,
100 sentences/utterances (100 *.wav files placed in 10 folders, one
per speaker, with 10 *.wav files in each folder),
300 words (digits from zero to nine),
each sentence/utterance consists of 3 words.
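
The dataset arithmetic and the requirement to train and test on different speakers can be sketched as follows. The speaker IDs and the 8/2 split are assumptions made for illustration; the slides do not name the speakers or state the split that was used.

```python
def split_by_speaker(speaker_ids, num_test_speakers):
    """Split speakers so that no speaker appears in both train and test."""
    ordered = sorted(speaker_ids)
    if not 0 < num_test_speakers < len(ordered):
        raise ValueError("need at least one train and one test speaker")
    return ordered[:-num_test_speakers], ordered[-num_test_speakers:]

# Hypothetical speaker IDs; the real names are not given in the slides.
speakers = ["spk%02d" % i for i in range(1, 11)]
train_speakers, test_speakers = split_by_speaker(speakers, num_test_speakers=2)

# Counts stated on the slide.
utterances_per_speaker = 10
words_per_utterance = 3
total_utterances = len(speakers) * utterances_per_speaker
total_words = total_utterances * words_per_utterance
```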

Implementation
Automatic Speech Recognition
Automatic Speaker Recognition

Procedure for making an Automatic Speech Recognition
System Using KALDI
Data Preparation
Audio Data
Acoustic Modelling
Language Modelling
Project Finalization
Tools Attachment
Scoring Script
Configuration Files
Running Scripts Creation
Getting Results

Data Preparation
Audio Data : We use our own Digits dataset for the
implementation of our ASR system.
TASK : Go to the kaldi-trunk/egs/digits directory and create a
'digitsaudio' folder. In kaldi-trunk/egs/digits/digitsaudio create two
folders: 'train' and 'test'.

Data Preparation
Acoustic Modelling
TASK: In the kaldi-trunk/egs/digits directory, create a folder 'data'.
Then create 'test' and 'train' subfolders inside.
spk2gender : This file gives each speaker's gender.
PATTERN: <speakerID> <gender>
wav.scp : This file connects every utterance (a sentence said by one
person during a particular recording session) with the audio file
for that utterance.
PATTERN: <utteranceID> <full-path-to-audio-file>
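
The two patterns above can be produced with a small script. This is a sketch only: the one-folder-per-speaker layout follows the Dataset slide, while the utterance ID scheme (speaker ID, underscore, file name) and the function names are assumptions. Lines are sorted because Kaldi expects its data files in sorted order.

```python
from pathlib import Path

def wav_scp_lines(audio_dir):
    """Build '<utteranceID> <full-path-to-audio-file>' lines from
    audio_dir/<speakerID>/<name>.wav."""
    lines = []
    for wav in Path(audio_dir).glob("*/*.wav"):
        utt_id = f"{wav.parent.name}_{wav.stem}"  # hypothetical ID scheme
        lines.append(f"{utt_id} {wav.resolve()}")
    return sorted(lines)

def spk2gender_lines(genders):
    """Build '<speakerID> <gender>' lines from a {speaker: gender} dict."""
    return [f"{spk} {gender}" for spk, gender in sorted(genders.items())]
```

Prefixing each utterance ID with its speaker ID keeps the files for one speaker together after sorting.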

Data Preparation
Acoustic Modelling
text : This file contains every utterance matched with its text
transcription.
PATTERN: <utteranceID> <text-transcription>
utt2spk : This file tells the ASR system which utterance belongs to
which speaker.
PATTERN: <utteranceID> <speakerID>
corpus.txt : In kaldi-trunk/egs/digits/data/local create a file
corpus.txt, which should contain every utterance transcription
that can occur in our ASR system (in our case, 100 lines
from 100 audio files).
PATTERN: <text-transcription>
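
A minimal sketch of how the three files could be generated from one list of utterances. The tuple layout, the example IDs, and the function name are illustrative assumptions rather than the project's actual script.

```python
def transcription_files(utterances):
    """utterances: list of (speaker_id, utterance_id, transcription) tuples.
    Returns the line lists for text, utt2spk and corpus.txt."""
    ordered = sorted(utterances, key=lambda u: u[1])  # sort by utterance ID
    text = [f"{utt} {words}" for _, utt, words in ordered]
    utt2spk = [f"{utt} {spk}" for spk, utt, _ in ordered]
    corpus = [words for _, _, words in ordered]
    return text, utt2spk, corpus

# Hypothetical entries in the style of the digits dataset.
example = [
    ("spk02", "spk02_4_5_6", "four five six"),
    ("spk01", "spk01_1_2_3", "one two three"),
]
text_lines, utt2spk_lines, corpus_lines = transcription_files(example)
```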

Data Preparation
Language Modelling
TASK: In the kaldi-trunk/egs/digits/data/local directory, create a
folder 'dict'. Then create 'test' and 'train' subfolders inside.
lexicon.txt : This file contains every word from our dictionary with
its phone transcription (taken from /egs/voxforge).
PATTERN: <word> <phone1> <phone2> ...
nonsilence_phones.txt : This file lists the nonsilence phones that
are present in our project.
PATTERN: <phone>
silence_phones.txt : This file lists the silence phones.
PATTERN: <phone>
optional_silence.txt : This file lists the optional silence phones.
PATTERN: <phone>
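
The dictionary files follow from the lexicon, which the sketch below makes explicit. The phone transcriptions shown are CMUdict-style examples and the silence phones "sil" and "spn" are assumed values; the slides do not reproduce the actual lexicon. File names are written with underscores, as Kaldi's language preparation script expects.

```python
def dict_files(lexicon, silence_phones=("sil", "spn"), optional_silence="sil"):
    """Return {file name: lines} for the dictionary folder.
    lexicon maps each word to its list of phones."""
    lexicon_lines = [f"{word} {' '.join(phones)}"
                     for word, phones in sorted(lexicon.items())]
    # Every phone used by some word is a nonsilence phone.
    nonsilence = sorted({ph for phones in lexicon.values() for ph in phones})
    return {
        "lexicon.txt": lexicon_lines,
        "nonsilence_phones.txt": nonsilence,
        "silence_phones.txt": list(silence_phones),
        "optional_silence.txt": [optional_silence],
    }

# Illustrative entries only (assumed, not copied from the project).
files = dict_files({"one": ["W", "AH", "N"], "two": ["T", "UW"]})
```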

Tools Attachment
From kaldi-trunk/egs/wsj/s5 copy two folders (with all of their
content), 'utils' and 'steps', and put them in our
kaldi-trunk/egs/digits directory.
Scoring Script
This script helps to obtain the decoding results.
TASK: From kaldi-trunk/egs/voxforge/local copy the script score.sh
into the same location in our project
(kaldi-trunk/egs/digits/local).
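
The two copy steps can be scripted. The sketch below uses the paths exactly as given on the slide; the function name and the `project` parameter are illustrative, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
from pathlib import Path

def attach_tools(kaldi_egs, project="digits"):
    """Copy 'utils' and 'steps' from wsj/s5 and score.sh from voxforge/local
    into the project directory under kaldi_egs."""
    egs = Path(kaldi_egs)
    target = egs / project
    for folder in ("utils", "steps"):
        shutil.copytree(egs / "wsj" / "s5" / folder, target / folder,
                        dirs_exist_ok=True)
    (target / "local").mkdir(parents=True, exist_ok=True)
    shutil.copy2(egs / "voxforge" / "local" / "score.sh",
                 target / "local" / "score.sh")
    return target
```

It would be called as `attach_tools("kaldi-trunk/egs")` from the directory that contains the Kaldi checkout.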

Configuration Files
TASK: In kaldi-trunk/egs/digits create a folder 'conf'. Inside
kaldi-trunk/egs/digits/conf create two files (for some configuration
changes to the decoding and MFCC feature extraction processes,
taken from /egs/voxforge):
decode.config
mfcc.conf
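
As a rough idea of what mfcc.conf holds, the sketch below renders options in the '--name=value' form that Kaldi's feature extraction tools read. The two options exist in Kaldi's MFCC extraction, but the values are assumptions, since the slides do not list the actual settings; decode.config is left out for the same reason.

```python
def render_conf(options):
    """Render an options dict as Kaldi-style '--name=value' lines."""
    return "".join(f"--{name}={value}\n" for name, value in options.items())

# Assumed values: the slides do not give the real configuration.
mfcc_options = {"use-energy": "false", "sample-frequency": 16000}
mfcc_conf_text = render_conf(mfcc_options)
```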

Running Script Creation
Our last job is to prepare the running scripts that create the ASR
system of our choice.
path.sh
cmd.sh
run.sh

Results of the Automatic Speech Recognition System
When we execute the run.sh file, training is performed and the result
logs generated by the decoding process are placed in the 'log' folder.
Figs. 8, 9 and 10 show the training process, and Fig. 11 shows the
predictions on the test data together with the accuracy.
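
The slides do not define how accuracy was measured. Kaldi scoring scripts typically report word error rate (WER), so the sketch below assumes word accuracy is 1 - WER, computed from a word-level edit distance. The function names and the example sentences are illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def word_error_rate(pairs):
    """pairs: list of (reference, hypothesis) transcription strings."""
    errors = sum(edit_distance(ref.split(), hyp.split()) for ref, hyp in pairs)
    words = sum(len(ref.split()) for ref, _ in pairs)
    return errors / words

# Made-up decoding output: one substituted word out of six.
wer = word_error_rate([("one two three", "one two three"),
                       ("four five six", "four nine six")])
word_accuracy = 1.0 - wer
```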

(Training)
Figure: Training Screenshot 1

(Training)

(Prediction)
Figure: Prediction of our ASR System

Procedure for making an Automatic Speaker Recognition
System Using K-NN in Python
We used our own dataset in our K-NN implementation.
First, we extracted the MFCC features using MATLAB code.
Then we arranged these MFCC features in .csv form.
This .csv feature file is given as input to our K-NN
implementation in Python.
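
A minimal K-NN sketch in plain Python, under stated assumptions: each CSV row holds the MFCC feature values followed by the speaker label in the last column, distance is Euclidean, and k = 3. None of these details are given in the slides, and `math.dist` requires Python 3.8 or later.

```python
import csv
import math
from collections import Counter

def load_features(csv_path):
    """Read rows of 'f1,f2,...,label' (column layout is an assumption)."""
    samples = []
    with open(csv_path, newline="") as handle:
        for row in csv.reader(handle):
            if not row:
                continue  # skip blank lines
            *features, label = row
            samples.append(([float(x) for x in features], label))
    return samples

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to query."""
    nearest = sorted(train, key=lambda s: math.dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

K-NN has no separate training step: the "trained" model is simply the stored feature vectors with their speaker labels.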

Result of the Automatic Speaker Recognition System
When we execute the program, the dataset is split into two parts,
training and testing. After training is completed, speaker
prediction is done on the testing data, and the accuracy of the
predictions is then measured. Figure 13 shows an accuracy of 80%,
obtained using nine speakers and all of the MFCC features generated
for all ten samples. Figure 12 shows the MFCC features.
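
The split and the accuracy measurement can be sketched as below. The 80/20 ratio, the fixed seed, and the function names are assumptions; the slides do not state how the data was divided.

```python
import random

def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples and split them into (train, test) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * (1.0 - test_fraction)))
    return shuffled[:cut], shuffled[cut:]

def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)
```

For example, four correct predictions out of five test samples give `accuracy(...) == 0.8`.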

Figure: MFCC features generated from Digits dataset

Figure: Prediction and Accuracy of Speaker Identification on Digits dataset

Conclusion
For Automatic Speech Recognition,
To implement our own ASR system we used the KALDI speech
recognition framework.
We trained our system on our own dataset, Digits.
Digits contains 10 speakers, each speaking 10 sentences; every
sentence contains 3 words. The vocabulary of the dataset is the
digits 0 to 9.
As a result, we achieved an accuracy of 72% for our
Speech Recognition System.
Our ASR system is text-dependent, as it has a limited
vocabulary of 0 to 9.
A higher recognition rate can be achieved with a larger dataset.
For Automatic Speaker Recognition,
We also implemented speaker identification using K-NN
classification in Python on the same dataset.
After training our K-NN classifier, we achieved an accuracy
of 80% for our Speaker Identification system.

Bibliography I
[1] J. Luettin, N. Thacker, S. W. Beet, et al., “Visual speech
recognition using active shape models and hidden Markov models,”
in Acoustics, Speech, and Signal Processing, 1996. ICASSP-96.
Conference Proceedings., 1996 IEEE International Conference on,
vol. 2, pp. 817–820, IEEE, 1996.
[2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and
D. Yu, “Convolutional neural networks for speech recognition,”
Audio, Speech, and Language Processing, IEEE/ACM Transactions
on, vol. 22, no. 10, pp. 1533–1545, 2014.
[3] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden,
“Improving visual features for lip-reading,” in AVSP, pp. 7–3,
2010.

Bibliography II
[4] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco,
“Connectionist probability estimators in HMM speech recognition,”
Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 1,
pp. 161–174, 1994.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol,
“Extracting and composing robust features with denoising
autoencoders,” in Proceedings of the 25th International Conference
on Machine Learning, pp. 1096–1103, ACM, 2008.
[6] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A.
Manzagol, “Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion,”
The Journal of Machine Learning Research, vol. 11, pp. 3371–3408,
2010.

Bibliography III
[7] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata,
“Audio-visual speech recognition using deep learning,” Applied
Intelligence, vol. 42, no. 4, pp. 722–737, 2015.
[8] ETSI/SAGE, “Specification of the 3GPP Confidentiality and
Integrity Algorithms 128-EEA3 & 128-EIA3. Document 1:
128-EEA3 and 128-EIA3 Specification.”
[9] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, et al.,
“The Kaldi speech recognition toolkit,” 2011.
