Automatic Speech Recognition System using Deep
Learning
Ankan Dutta
14MCEI03
Guided By
Dr. Sharada Valiveti
Institute of Technology
Nirma University
May 16, 2016
Ankan Dutta (Institute of Technology, Nirma University), Audio Visual Speech Recognition System using Deep Learning, May 16, 2016
Introduction
Definition
Development of Automatic Speech Recognition System using Deep
Learning Techniques
Scope
Automatic Speech Recognition lets computers interpret human
speech
It lowers the barrier to human-computer interaction
It converts speech to text
Introduction
Objectives
Convert the audio signals into MFCCs (Mel-Frequency Cepstral
Coefficients) [1]
Implement a Convolutional Neural Network [2, 3] for audio feature
extraction
Use a Gaussian Mixture Model - Hidden Markov Model (GMM-HMM) [4] to
recognize the audio signal
1st Review
Applications of Speech Recognition Systems
Dimensions of Speech Recognition Systems
Generic Block Diagram of the system (fig. 1)
Performance Evaluation of the system
Conventional Speech Recognition Systems
Required Machine Learning Techniques
MFCCs (fig. 2)
Hidden Markov Models (fig. 3)
Gaussian Mixture Models (fig. 4)
Deep De-noising Auto-Encoders (fig. 5)
Convolutional Neural Networks (fig. 6)
1st Review
Literature survey of the mechanisms proposed for:
Audio feature extraction
Visual feature extraction
Integration of audio and visual systems
System architectures
Mechanisms for Audio Feature Extraction [5, 6, 7]
Approaches:
Use a DNN-HMM in the recognition phase.
Use a DNN in the feature-extraction phase and a GMM-HMM in the
recognition phase.
Generic Block Diagram
Figure: System architecture
MFCC[1]
Figure: MFCC
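A minimal sketch of the usual MFCC steps (pre-emphasis, framing, Hamming window, power spectrum, mel filterbank, log, DCT) may help here. It is not the code used in this project: the frame length, hop size, filter count, number of coefficients and the 0.97 pre-emphasis factor are illustrative assumptions, and the naive DFT is chosen only to keep the example dependency-free.

```python
import cmath
import math

def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    n_bins = n_fft // 2 + 1
    top = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points, mapped from mel back to FFT bin indices
    edges = [int((n_fft + 1) * mel_to_hz(top * i / (n_filters + 1)) / sample_rate)
             for i in range(n_filters + 2)]
    bank = [[0.0] * n_bins for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        for k in range(left, centre):        # rising slope
            bank[m - 1][k] = (k - left) / (centre - left)
        for k in range(centre, right):       # falling slope
            bank[m - 1][k] = (right - k) / (right - centre)
    return bank

def power_spectrum(frame):
    """|DFT|^2 / N for bins 0..N/2, using a naive O(N^2) DFT."""
    n_fft = len(frame)
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * n / n_fft)
                  for n, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n_fft)
    return spectrum

def mfcc(signal, sample_rate, frame_len=256, hop=128, n_filters=20, n_ceps=13):
    """Return one list of n_ceps coefficients per frame."""
    # Pre-emphasis boosts the high-frequency part of the signal
    emphasised = [signal[0]] + [signal[i] - 0.97 * signal[i - 1]
                                for i in range(1, len(signal))]
    bank = mel_filterbank(n_filters, frame_len, sample_rate)
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]     # Hamming window
    features = []
    for start in range(0, len(emphasised) - frame_len + 1, hop):
        frame = [emphasised[start + n] * window[n] for n in range(frame_len)]
        power = power_spectrum(frame)
        # Log of the energy captured by each mel filter (floored to avoid log(0))
        log_energy = [math.log(max(sum(w * p for w, p in zip(filt, power)), 1e-10))
                      for filt in bank]
        # Unnormalised DCT-II decorrelates the log filterbank energies
        features.append([sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                             for j, e in enumerate(log_energy))
                         for c in range(n_ceps)])
    return features

# Demo on a synthetic 440 Hz tone (0.1 s at 8 kHz), not on the Digits recordings
SAMPLE_RATE = 8000
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(800)]
features = mfcc(tone, SAMPLE_RATE)
print(len(features), "frames of", len(features[0]), "coefficients")
```

Each row of `features` is the kind of per-frame vector that the later stages (GMM-HMM training, or the K-NN classifier) would consume.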
Hidden Markov Model
Figure: Hidden Markov Model
Gaussian Mixture Model
Figure: Gaussian Mixture Model
Deep De-noising Auto-Encoder
Figure: Deep De-noising Auto-Encoder
Convolutional Neural Network
Figure: Convolutional Neural Network
2nd Review
Audio Visual Speech Recognition
Architecture of the model used in our implementation (fig. 7)
Required tools and datasets
A basic implementation using KALDI
Architecture of the Model
Figure: System architecture
Requirements
For Automatic Speech Recognition,
Nvidia CUDA 7.5 (the system must have an Nvidia GPU) [8]
The KALDI speech recognition toolkit, which we use for our
implementation
For Automatic Speaker Recognition,
Python
KALDI’s Dependencies [9]
To use Kaldi, the following libraries and tools must be installed:
OpenFst: a weighted finite-state transducer library. Kaldi is
compiled against it and uses it heavily.
IRSTLM: a language modeling toolkit.
sph2pipe: converts .sph files to .wav files. It is required for
using LDC datasets.
sclite: a scoring tool. It is not essential, but it can come up as
a dependency, so it is better to install it.
ATLAS: a linear algebra library. It only works if CPU throttling
is not enabled.
CLAPACK: also a linear algebra library. It can be used as an
alternative when ATLAS is not available.
SRILM: a toolkit for building and applying statistical language
models (LMs), primarily for use in speech recognition, statistical
tagging and segmentation, and machine translation.
Dataset
We built our own dataset to fit the requirements of the project. Its
vocabulary is very small: the digits 0 to 9. The same dataset is used
in both of our implementations, automatic speech recognition and
speaker identification. Sentences that contain only digits suit this
case well.
10 different speakers (an ASR system must be trained and tested on
different speakers; the more speakers, the better),
each speaker says 10 sentences,
100 sentences/utterances (100 *.wav files placed in 10 folders, one
per speaker, with 10 *.wav files in each folder),
300 words (digits from zero to nine),
each sentence/utterance consists of 3 words.
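The dataset arithmetic can be checked with a small sketch that builds the folder layout in memory. The folder names (`speaker01` ...) and the file naming scheme (`1_5_6.wav` for the spoken digits "one five six") are assumptions made for illustration; the slides do not say how the recordings were named.

```python
import random

SPEAKERS = 10
SENTENCES_PER_SPEAKER = 10
WORDS_PER_SENTENCE = 3

def build_layout(seed=0):
    """Return {speaker_folder: [wav file names]} for a Digits-style dataset."""
    rng = random.Random(seed)
    layout = {}
    for s in range(1, SPEAKERS + 1):
        names = set()
        # Draw random 3-digit sentences until the speaker has 10 distinct ones
        while len(names) < SENTENCES_PER_SPEAKER:
            digits = [rng.randrange(10) for _ in range(WORDS_PER_SENTENCE)]
            names.add("_".join(str(d) for d in digits) + ".wav")
        layout[f"speaker{s:02d}"] = sorted(names)
    return layout

layout = build_layout()
n_utterances = sum(len(files) for files in layout.values())
n_words = n_utterances * WORDS_PER_SENTENCE
print(len(layout), "speakers,", n_utterances, "utterances,", n_words, "words")
```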
Implementation
Automatic Speech Recognition
Automatic Speaker Recognition
Procedure for making an Automatic Speech Recognition
System Using KALDI
Data Preparation
  Audio Data
  Acoustic Modelling
  Language Modelling
Project Finalization
  Tools Attachment
  Scoring Script
  Configuration Files
Running Scripts Creation
Getting Results
Data Preparation
Audio Data: We use our own Digits dataset for the implementation of
our ASR system.
TASK: Go to the kaldi-trunk/egs/digits directory and create a
'digitsaudio' folder. In kaldi-trunk/egs/digits/digitsaudio, create
two folders: 'train' and 'test'.
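Because the system has to be trained and tested on different speakers, the split into 'train' and 'test' is best made per speaker rather than per file. The sketch below shows one way to do that and to create the two folders. The speaker names and the 8/2 split are assumptions, and a temporary directory stands in for kaldi-trunk/egs/digits.

```python
import tempfile
from pathlib import Path

def split_speakers(speakers, n_test):
    """Hold out the last n_test speakers (in sorted order) for testing."""
    ordered = sorted(speakers)
    if not 0 < n_test < len(ordered):
        raise ValueError("n_test must leave at least one speaker on each side")
    return ordered[:-n_test], ordered[-n_test:]

def make_audio_dirs(root, train, test):
    """Create digitsaudio/train/<speaker> and digitsaudio/test/<speaker>."""
    audio = Path(root) / "digitsaudio"
    for part, speakers in (("train", train), ("test", test)):
        for speaker in speakers:
            (audio / part / speaker).mkdir(parents=True, exist_ok=True)
    return audio

# Hypothetical speaker folder names; the slides do not list them
speakers = [f"speaker{i:02d}" for i in range(1, 11)]
train, test = split_speakers(speakers, n_test=2)

with tempfile.TemporaryDirectory() as tmp:
    audio = make_audio_dirs(tmp, train, test)
    created = sorted(p.relative_to(audio).as_posix() for p in audio.glob("*/*"))

print(created)
```

No speaker appears on both sides of the split, so the test score reflects how the system handles voices it has not heard during training.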
Data Preparation
Acoustic Modelling
TASK: In the kaldi-trunk/egs/digits directory, create a folder 'data'.
Then create 'test' and 'train' subfolders inside it.
spk2gender: This file gives each speaker's gender.
PATTERN: <speakerID> <gender>
wav.scp: This file connects every utterance (a sentence said by one
person during a particular recording session) with the audio file
related to that utterance.
PATTERN: <utteranceID> <full-path-to-audio-file>
Data Preparation
Acoustic Modelling
text: This file matches every utterance with its text
transcription.
PATTERN: <utteranceID> <text-transcription>
utt2spk: This file tells the ASR system which utterance belongs to
which speaker.
PATTERN: <utteranceID> <speakerID>
corpus.txt: In kaldi-trunk/egs/digits/data/local, create a file
corpus.txt containing every utterance transcription that can occur
in our ASR system (in our case, 100 lines from 100 audio files).
PATTERN: <text-transcription>
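The five files described above can be generated from the recordings rather than typed by hand. The sketch below assumes that each file name spells its digits (for example `1_5_6.wav`), that utterance IDs are formed as `<speakerID>_<digits>`, and that genders are given as `m` or `f`; the speaker IDs and paths are hypothetical.

```python
from pathlib import PurePosixPath

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

def kaldi_data_files(recordings, genders):
    """Build the lines of each data file.

    recordings: list of (speaker_id, wav_path) pairs
    genders:    dict mapping speaker_id to "m" or "f"
    """
    wav_scp, text, utt2spk = [], [], []
    for speaker, wav in recordings:
        stem = PurePosixPath(wav).stem              # e.g. "1_5_6"
        utt_id = f"{speaker}_{stem}"                # speaker prefix groups ids by speaker
        words = " ".join(DIGIT_WORDS[int(d)] for d in stem.split("_"))
        wav_scp.append(f"{utt_id} {wav}")
        text.append(f"{utt_id} {words}")
        utt2spk.append(f"{utt_id} {speaker}")
    text.sort()                                     # Kaldi expects sorted files
    return {
        "wav.scp": sorted(wav_scp),
        "text": text,
        "utt2spk": sorted(utt2spk),
        "spk2gender": sorted(f"{spk} {g}" for spk, g in genders.items()),
        # corpus.txt keeps only the transcription part of each text line
        "corpus.txt": [line.split(" ", 1)[1] for line in text],
    }

# Hypothetical speakers and paths, for illustration only
recordings = [("spk02", "/audio/train/spk02/4_4_2.wav"),
              ("spk01", "/audio/train/spk01/1_5_6.wav")]
files = kaldi_data_files(recordings, {"spk01": "m", "spk02": "f"})
for name, lines in files.items():
    print(name, lines)
```

Writing each list to its file, one entry per line, gives the contents of data/train or data/test; corpus.txt goes to data/local.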
Data Preparation
Language Modelling
TASK: In the kaldi-trunk/egs/digits/data/local directory, create a
folder 'dict' and create the following files inside it.
lexicon.txt: This file contains every word from our dictionary with
its 'phone transcriptions' (taken from /egs/voxforge).
PATTERN: <word> <phone1> <phone2> ...
nonsilence_phones.txt: This file lists the nonsilence phones that
are present in our project.
PATTERN: <phone>
silence_phones.txt: This file lists the silence phones.
PATTERN: <phone>
optional_silence.txt: This file lists the optional silence phones.
PATTERN: <phone>
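The four dictionary files are closely related: the nonsilence phone list is simply every phone used in the lexicon that is not a silence phone. The sketch below derives the files from one lexicon. The phone transcriptions are illustrative (CMU-style) and are not copied from the project; the `!SIL` and `<UNK>` entries and the `sil`/`spn` phones are assumptions, and the file names use the underscore spelling that Kaldi's language preparation script looks for.

```python
# Illustrative lexicon: one pronunciation per word
LEXICON = {
    "!SIL": ["sil"],
    "<UNK>": ["spn"],
    "zero": ["z", "iy", "r", "ow"],
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
    "three": ["th", "r", "iy"],
    "four": ["f", "ao", "r"],
    "five": ["f", "ay", "v"],
    "six": ["s", "ih", "k", "s"],
    "seven": ["s", "eh", "v", "ah", "n"],
    "eight": ["ey", "t"],
    "nine": ["n", "ay", "n"],
}
SILENCE_PHONES = ["sil", "spn"]
OPTIONAL_SILENCE = ["sil"]

def dict_files(lexicon, silence, optional):
    """Return {file name: lines} for the contents of data/local/dict."""
    lexicon_lines = [f"{word} {' '.join(phones)}"
                     for word, phones in sorted(lexicon.items())]
    all_phones = {ph for phones in lexicon.values() for ph in phones}
    return {
        "lexicon.txt": lexicon_lines,
        # Every phone in the lexicon that is not a silence phone
        "nonsilence_phones.txt": sorted(all_phones - set(silence)),
        "silence_phones.txt": list(silence),
        "optional_silence.txt": list(optional),
    }

dict_dir = dict_files(LEXICON, SILENCE_PHONES, OPTIONAL_SILENCE)
for name, lines in dict_dir.items():
    print(name, lines)
```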
Project Finalization
Tools Attachment
From kaldi-trunk/egs/wsj/s5, copy two folders with their whole
content, 'utils' and 'steps', into our kaldi-trunk/egs/digits
directory.
Scoring Script
This script helps you get the decoding results.
TASK: From kaldi-trunk/egs/voxforge/local, copy the script score.sh
into exactly the same location in our project
(kaldi-trunk/egs/digits/local).
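Speech recognition output is commonly scored by word error rate (WER): the word-level edit distance between the reference transcription and the decoded hypothesis, divided by the number of reference words. The sketch below illustrates that measure only; it is not the score.sh script, and the example sentences are invented. The slides do not say how the reported accuracy relates to WER.

```python
def word_errors(reference, hypothesis):
    """Return (edit distance in words, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the ref words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1], len(ref)

def wer(pairs):
    """Word error rate in percent over (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words

# Invented decoding results, for illustration only
pairs = [("one five six", "one five six"),       # correct
         ("four four two", "four two"),          # one deletion
         ("nine zero three", "nine zero tree")]  # one substitution
print(f"WER = {wer(pairs):.2f}%")
```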
Project Finalization
Configuration Files
TASK: In kaldi-trunk/egs/digits, create a folder 'conf'. Inside
kaldi-trunk/egs/digits/conf, create two files that adjust the
configuration of the decoding and MFCC feature extraction processes
(taken from /egs/voxforge):
decode.config
mfcc.conf
Running Script Creation
The last step is to prepare the running scripts that build the ASR
system of our choice:
path.sh
cmd.sh
run.sh
Results Automatic Speech Recognition System
Executing run.sh trains the system. The result logs generated by the
decoding process are found in the 'log' folder. Figs. 8, 9 and 10
show the training process, and fig. 11 shows the predictions on the
test data and the accuracy.
Results Automatic Speech Recognition System
(Training)
Figure: Training Screenshot 1
Results Automatic Speech Recognition System
(Training)
Figure: Training Screenshot 2
Results Automatic Speech Recognition System
(Training)
Figure: Training Screenshot 3
Results Automatic Speech Recognition System
(Prediction)
Figure: Prediction of our ASR System
Procedure for making an Automatic Speaker Recognition
System Using K-NN in Python
We used our own dataset in our K-NN implementation.
First, we extracted the MFCC features using MATLAB code.
Then we arranged these MFCC features in .csv form.
This .csv feature file is given as input to our K-NN
implementation in Python.
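The steps above can be sketched as a small K-NN classifier. This is not the project's code: the CSV layout (feature columns followed by the speaker label in the last column), the value of k, the split ratio and the synthetic features are all assumptions, since the slides do not specify them.

```python
import csv
import io
import math
import random
from collections import Counter

def load_features(csv_text):
    """Parse rows of 'f1,f2,...,fn,speaker' into (features, label) pairs."""
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if row:
            rows.append(([float(v) for v in row[:-1]], row[-1]))
    return rows

def knn_predict(train, query, k=3):
    """Majority label among the k training rows closest to query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def evaluate(rows, test_fraction=0.33, k=3, seed=0):
    """Shuffle, split into train/test, and return accuracy in percent."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return 100.0 * correct / len(test)

# Synthetic, well-separated features standing in for the MFCC .csv file
rng = random.Random(1)
lines = []
for label, centre in (("spk01", 0.0), ("spk02", 5.0)):
    for _ in range(10):
        values = ",".join(f"{centre + rng.gauss(0, 0.3):.3f}" for _ in range(3))
        lines.append(f"{values},{label}")

rows = load_features("\n".join(lines))
accuracy = evaluate(rows)
print(f"accuracy = {accuracy:.1f}%")
```

On real MFCC features the speaker clusters overlap, so the accuracy depends on k, on the split and on how the per-frame features are summarised for each recording.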
Result of the Automatic Speaker Recognition System
When we execute the program, the dataset is split into two parts,
training and testing. After training is completed, speaker prediction
is done on the testing data, and the accuracy of the predictions is
measured. Figure 12 shows the MFCC features. Figure 13 shows the
result: an accuracy of 80% when we took nine speakers and all of the
MFCC features generated for all ten samples.
Result of the Automatic Speaker Recognition System
Figure: MFCC features generated from Digits dataset
Result of the Automatic Speaker Recognition System
Figure: Prediction and Accuracy of Speaker Identification on Digits dataset
Conclusion
For Automatic Speech Recognition,
We used the KALDI speech recognition framework to implement our own
ASR system.
We trained the system on our own dataset, Digits.
Digits contains 10 speakers, each speaking 10 sentences; every
sentence contains 3 words. The vocabulary of the dataset is the
digits 0 to 9.
As a result, we achieved an accuracy of 72% for our Speech
Recognition System.
Our ASR system is a text-dependent system, as it has a limited
vocabulary of 0 to 9.
A higher recognition rate could be achieved with a larger dataset.
For Automatic Speaker Recognition,
We also implemented speaker identification using K-NN
classification in Python on the same dataset.
After training our K-NN classifier, we achieved an accuracy of 80%
for our Speaker Identification system.
Bibliography
[1] J. Luettin, N. Thacker, S. W. Beet, et al., “Visual speech
recognition using active shape models and hidden Markov models,”
in Acoustics, Speech, and Signal Processing, 1996. ICASSP-96.
Conference Proceedings., 1996 IEEE International Conference on,
vol. 2, pp. 817–820, IEEE, 1996.
[2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and
D. Yu, “Convolutional neural networks for speech recognition,”
Audio, Speech, and Language Processing, IEEE/ACM Transactions
on, vol. 22, no. 10, pp. 1533–1545, 2014.
[3] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden,
“Improving visual features for lip-reading,” in AVSP, pp. 7–3,
2010.
[4] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco,
“Connectionist probability estimators in HMM speech recognition,”
Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 1,
pp. 161–174, 1994.
[5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol,
“Extracting and composing robust features with denoising
autoencoders,” in Proceedings of the 25th International Conference
on Machine Learning, pp. 1096–1103, ACM, 2008.
[6] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A.
Manzagol, “Stacked denoising autoencoders: Learning useful
representations in a deep network with a local denoising criterion,”
The Journal of Machine Learning Research, vol. 11, pp. 3371–3408,
2010.
[7] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata,
“Audio-visual speech recognition using deep learning,” Applied
Intelligence, vol. 42, no. 4, pp. 722–737, 2015.
[8] ETSI/SAGE, “Specification of the 3GPP Confidentiality and
Integrity Algorithms 128-EEA3 & 128-EIA3. Document 1:
128-EEA3 and 128-EIA3 Specification.”
[9] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek,
N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, et al.,
“The Kaldi speech recognition toolkit,” 2011.

Smart Sound Measurement and Control System for Smart CitySmart Sound Measurement and Control System for Smart City
Smart Sound Measurement and Control System for Smart City
 
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
LIP READING - AN EFFICIENT CROSS AUDIO-VIDEO RECOGNITION USING 3D CONVOLUTION...
 
The CLAM Framework
The CLAM FrameworkThe CLAM Framework
The CLAM Framework
 
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and SynthesizerIRJET- Voice Command Execution with Speech Recognition and Synthesizer
IRJET- Voice Command Execution with Speech Recognition and Synthesizer
 
Simulation of speech recognition using correlation method on matlab software
Simulation of speech recognition using correlation method on matlab softwareSimulation of speech recognition using correlation method on matlab software
Simulation of speech recognition using correlation method on matlab software
 
Free Software for Free Sound
Free Software for Free SoundFree Software for Free Sound
Free Software for Free Sound
 
8th Ethiopian ICT Conference Bazaar and Exhibition.pptx
8th Ethiopian ICT Conference Bazaar and Exhibition.pptx8th Ethiopian ICT Conference Bazaar and Exhibition.pptx
8th Ethiopian ICT Conference Bazaar and Exhibition.pptx
 
Efficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And RecordingEfficient Intralingual Text To Speech Web Podcasting And Recording
Efficient Intralingual Text To Speech Web Podcasting And Recording
 
Ian definitions 3rd try 2
Ian definitions 3rd try 2Ian definitions 3rd try 2
Ian definitions 3rd try 2
 
Ian definitions 3rd try 2
Ian definitions 3rd try 2Ian definitions 3rd try 2
Ian definitions 3rd try 2
 
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
NLP on Hadoop: A Distributed Framework for NLP-Based Keyword and Keyphrase Ex...
 

Recently uploaded

Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTbhaskargani46
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfJiananWang21
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxSCMS School of Architecture
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxNadaHaitham1
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwaitjaanualu31
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdfKamal Acharya
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityMorshed Ahmed Rahath
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsvanyagupta248
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayEpec Engineered Technologies
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startQuintin Balsdon
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptxJIT KUMAR GUPTA
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdfKamal Acharya
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"mphochane1998
 

Recently uploaded (20)

Generative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPTGenerative AI or GenAI technology based PPT
Generative AI or GenAI technology based PPT
 
data_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdfdata_management_and _data_science_cheat_sheet.pdf
data_management_and _data_science_cheat_sheet.pdf
 
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptxHOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
HOA1&2 - Module 3 - PREHISTORCI ARCHITECTURE OF KERALA.pptx
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills KuwaitKuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
Kuwait City MTP kit ((+919101817206)) Buy Abortion Pills Kuwait
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Online food ordering system project report.pdf
Online food ordering system project report.pdfOnline food ordering system project report.pdf
Online food ordering system project report.pdf
 
A Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna MunicipalityA Study of Urban Area Plan for Pabna Municipality
A Study of Urban Area Plan for Pabna Municipality
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Standard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power PlayStandard vs Custom Battery Packs - Decoding the Power Play
Standard vs Custom Battery Packs - Decoding the Power Play
 
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Design For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the startDesign For Accessibility: Getting it right from the start
Design For Accessibility: Getting it right from the start
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
Call Girls in South Ex (delhi) call me [🔝9953056974🔝] escort service 24X7
 
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
COST-EFFETIVE  and Energy Efficient BUILDINGS ptxCOST-EFFETIVE  and Energy Efficient BUILDINGS ptx
COST-EFFETIVE and Energy Efficient BUILDINGS ptx
 
Online electricity billing project report..pdf
Online electricity billing project report..pdfOnline electricity billing project report..pdf
Online electricity billing project report..pdf
 
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments""Lesotho Leaps Forward: A Chronicle of Transformative Developments"
"Lesotho Leaps Forward: A Chronicle of Transformative Developments"
 

Automatic speech recognition system using deep learning

  • 1. Automatic Speech Recognition System using Deep Learning. Ankan Dutta, 14MCEI03. Guided by Dr. Sharada Valiveti. Institute of Technology, Nirma University. May 16, 2016. Ankan Dutta (Institute of Technology, Nirma University), Audio Visual Speech Recognition System using Deep Learning, May 16, 2016, 1 / 39
  • 2. Introduction. Definition: Development of an Automatic Speech Recognition System using Deep Learning Techniques. Scope: Automatic Speech Recognition allows computers to interpret human speech. It lowers the barrier for computer interaction. Speech recognition allows converting speech to text.
  • 3. Introduction. Objectives: The audio signals should be converted into MFCCs (Mel-Frequency Cepstral Coefficients) [1]. Implement a Convolutional Neural Network [2, 3] for audio feature extraction. Then use a Gaussian Mixture Model - Hidden Markov Model [4] for recognition of the audio signal.
  • 4. 1st Review. Applications of Speech Recognition Systems. Dimensions of Speech Recognition Systems. Generic Block Diagram of the system (fig. 1). Performance Evaluation of the system. Conventional Speech Recognition Systems. Required Machine Learning Techniques: MFCCs (fig. 2), Hidden Markov Models (fig. 3), Gaussian Mixture Models (fig. 4), Deep De-noising Auto-Encoders (fig. 5), Convolutional Neural Networks (fig. 6).
  • 5. 1st Review. Literature survey on various proposed Audio Feature Extraction Mechanisms, Visual Feature Extraction Mechanisms, Integration of Audio and Visual Systems, and Architectures for the System.
  • 6. Mechanisms for Audio Feature Extraction [5, 6, 7]. Approaches: Using DNN-HMM in the recognition phase. Using a DNN in the feature-extraction phase and GMM-HMM in the recognition phase.
  • 7. Generic Block Diagram. Figure: System architecture
  • 8. MFCC [1]. Figure: MFCC
  • 9. Hidden Markov Model. Figure: Hidden Markov Model
  • 10. Gaussian Mixture Model. Figure: Gaussian Mixture Model
  • 11. Deep De-noising Auto-Encoder. Figure: Deep De-noising Auto-Encoder
  • 12. Convolutional Neural Network. Figure: Convolutional Neural Network
  • 13. 2nd Review. Audio Visual Speech Recognition. Architecture of the model used in our implementation (fig. 7). Required Tools and Datasets. A basic implementation using KALDI.
  • 14. Architecture of the Model. Figure: System architecture
  • 15. Requirements. For Automatic Speech Recognition: Nvidia CUDA 7.5 (the system should have an Nvidia GPU) [8]. We are using the KALDI Speech Recognition Toolkit for our implementation. For Automatic Speaker Recognition: Python.
  • 16. KALDI's Dependencies [9]. To use Kaldi, the following libraries and tools have to be installed. OpenFst: most of the compilation is done with it, and it is very heavily used. IRSTLM: a language modeling toolkit. sph2pipe: converts .sph files to .wav files; it is required for using LDC datasets. sclite: not that important, but it may still arise as one of the dependencies, so it is better to install it. ATLAS: a linear algebra library. It will only work if CPU throttling is not enabled. CLAPACK: also a linear algebra library. If ATLAS is not available, CLAPACK can be used as an alternative. SRILM: a toolkit for building and applying statistical language models (LMs), primarily for use in speech recognition, statistical tagging and segmentation, and machine translation.
  • 17. Dataset. We made our own dataset according to the requirements of the project. The vocabulary of our dataset is very small, i.e. the digits 0 to 9. The same dataset is used in both of our implementations, automatic speech recognition and speaker identification. Sentences that contain only digits are perfect in this case. The dataset has 10 different speakers (ASR systems must be trained and tested on different speakers; the more speakers you have the better), each speaker says 10 sentences, giving 100 sentences/utterances (in 100 *.wav files placed in 10 folders related to particular speakers, 10 *.wav files in each folder) and 300 words (digits from zero to nine); each sentence/utterance consists of 3 words.
  • 18. Implementation. Automatic Speech Recognition. Automatic Speaker Recognition.
  • 19. Procedure for making an Automatic Speech Recognition System Using KALDI. Data Preparation: Audio Data, Acoustic Modelling, Language Modelling. Project Finalization: Tools Attachment, Scoring Script, Configuration Files. Running Scripts Creation. Getting Results.
  • 20. Data Preparation. Audio Data: We are using our own Digits dataset for the implementation of our ASR system. TASK: Go to the kaldi-trunk/egs/digits directory and create a 'digitsaudio' folder. In kaldi-trunk/egs/digits/digitsaudio create two folders: 'train' and 'test'.
  • 21. Data Preparation. Acoustic Modelling. TASK: In the kaldi-trunk/egs/digits directory, create a folder 'data'. Then create 'test' and 'train' subfolders inside. spk2gender: This file gives each speaker's gender. PATTERN: <speakerID> <gender>. wav.scp: This file connects every utterance (a sentence said by one person during a particular recording session) with the audio file related to this utterance. PATTERN: <utteranceID> <full-path-to-audio-file>
  • 22. Data Preparation. Acoustic Modelling. text: This file contains every utterance matched with its text transcription. PATTERN: <utteranceID> <text-transcription>. utt2spk: This file tells the ASR system which utterance belongs to which speaker. PATTERN: <utteranceID> <speakerID>. corpus.txt: In kaldi-trunk/egs/digits/data/local create a file corpus.txt which should contain every single utterance transcription that can occur in our ASR system (in our case it will be 100 lines from 100 audio files). PATTERN: <text-transcription>
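The four per-utterance tables described above follow directly from the folder layout, so they can be generated rather than typed by hand. The following is a minimal sketch, not the authors' script: it assumes a hypothetical naming scheme in which each speaker folder holds files such as `1_2_5.wav` whose name spells the three digits spoken, and the function names `build_kaldi_tables` and `write_tables` are made up for illustration. Lines are sorted because Kaldi's data-validation scripts expect sorted tables.

```python
from pathlib import Path

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def build_kaldi_tables(audio_root, genders):
    """Return the lines of wav.scp, text, utt2spk and spk2gender.

    Assumed (hypothetical) layout: audio_root/<speakerID>/<d>_<d>_<d>.wav,
    where the file name spells the digits spoken in the utterance.
    `genders` maps each speakerID to "m" or "f".
    """
    tables = {"wav.scp": [], "text": [], "utt2spk": [], "spk2gender": []}
    speaker_dirs = sorted(p for p in Path(audio_root).iterdir() if p.is_dir())
    for spk_dir in speaker_dirs:
        speaker = spk_dir.name
        tables["spk2gender"].append(f"{speaker} {genders[speaker]}")
        for wav in sorted(spk_dir.glob("*.wav")):
            # Prefixing the speaker ID keeps utterances grouped per speaker.
            utt_id = f"{speaker}_{wav.stem}"
            words = " ".join(DIGIT_WORDS[int(d)] for d in wav.stem.split("_"))
            tables["wav.scp"].append(f"{utt_id} {wav.resolve()}")
            tables["text"].append(f"{utt_id} {words}")
            tables["utt2spk"].append(f"{utt_id} {speaker}")
    return {name: sorted(lines) for name, lines in tables.items()}


def write_tables(tables, out_dir):
    """Write each table to out_dir/<table name>, one entry per line."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, lines in tables.items():
        (out / name).write_text("\n".join(lines) + "\n")
```

Calling `write_tables(build_kaldi_tables("digitsaudio/train", genders), "data/train")`, and the same for `test`, would produce one set of tables per split; corpus.txt is then just the transcription column of the `text` tables.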
  • 23. Data Preparation. Language Modelling. TASK: In the kaldi-trunk/egs/digits/data/local directory, create a folder 'dict'. Then create 'test' and 'train' subfolders inside. lexicon.txt: This file contains every word from our dictionary with its 'phone transcriptions' (taken from /egs/voxforge). PATTERN: <word> <phone1> <phone2> ... nonsilence_phones.txt: This file lists the nonsilence phones that are present in our project. PATTERN: <phone>. silence_phones.txt: This file lists silence phones. PATTERN: <phone>. optional_silence.txt: This file lists optional silence phones. PATTERN: <phone>
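Because every nonsilence phone must also appear in some lexicon entry, the nonsilence phone list can be derived from lexicon.txt instead of being maintained separately. The sketch below assumes the silence phones are called `sil` and `spn`, and the sample lexicon entries are illustrative only; neither is taken from the slides.

```python
def nonsilence_phones(lexicon_lines, silence=("sil", "spn")):
    """Collect the distinct phones used in a lexicon, minus silence phones.

    Each lexicon line has the form "<word> <phone1> <phone2> ...".
    The silence phone names are an assumption, not taken from the slides.
    """
    phones = set()
    for line in lexicon_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phones.update(p for p in parts[1:] if p not in silence)
    return sorted(phones)


# Illustrative entries only; the real lexicon covers all ten digits.
lexicon = ["!SIL sil", "<UNK> spn", "one W AH N", "two T UW"]
print(nonsilence_phones(lexicon))
```

Writing the returned list one phone per line gives the nonsilence phone file in the required pattern.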
  • 24. Project Finalization. Tools Attachment: From kaldi-trunk/egs/wsj/s5 copy two folders (with their whole content), 'utils' and 'steps', and put them in our kaldi-trunk/egs/digits directory. Scoring Script: This script will help you to get decoding results. TASK: From kaldi-trunk/egs/voxforge/local copy the script score.sh into exactly the same location in our project (kaldi-trunk/egs/digits/local).
  • 25. Project Finalization. Configuration Files. TASK: In kaldi-trunk/egs/digits create a folder 'conf'. Inside kaldi-trunk/egs/digits/conf create two files (for some configuration modifications in the decoding and MFCC feature extraction processes, taken from /egs/voxforge): decode.config and mfcc.conf.
  • 26. Running Script Creation. Our last job is to prepare the running scripts to create the ASR system of our choice: path.sh, cmd.sh, run.sh.
  • 27. Results. Automatic Speech Recognition System. When we execute run.sh, training is performed, and the result logs generated by the decoding process are found in the 'log' folder. Figures 8, 9 and 10 show the training process, and figure 11 shows the prediction results on the test data and the accuracy.
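The slides do not say how the accuracy was computed. Speech recognisers are commonly scored by word error rate (WER), the word-level edit distance between the reference transcription and the decoded hypothesis divided by the reference length. The following is a minimal sketch of that standard metric, offered as an illustration and not as the scoring script used here; the function name is made up.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of three gives a WER of 1/3.
print(word_error_rate("one two three", "one five three"))
```

Under this definition, word accuracy is often quoted as 1 minus WER, although insertions can push WER above 1.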
  • 28. Results. Automatic Speech Recognition System (Training). Figure: Training Screenshot 1
  • 29. Results. Automatic Speech Recognition System (Training). Figure: Training Screenshot 2
  • 30. Results. Automatic Speech Recognition System (Training). Figure: Training Screenshot 3
  • 31. Results. Automatic Speech Recognition System (Prediction). Figure: Prediction of our ASR System
  • 32. Procedure for making an Automatic Speaker Recognition System Using K-NN in Python. We used our own dataset in our K-NN implementation. First we extracted the MFCC features using MATLAB code. Then we arranged these MFCC features in .csv form. This .csv feature file is given as input to our K-NN implementation in Python.
  • 33. Result of the Automatic Speaker Recognition System. When we execute the program, the dataset is split into two parts, training and testing. After training is completed, speaker prediction is done on the testing data. Then the accuracy of the prediction is measured. Figure 13 shows that the accuracy is 80% when we took nine speakers and all of the MFCC features generated for all ten samples. Figure 12 shows the MFCC features.
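The K-NN step can be sketched in a few lines of standard-library Python. This is not the authors' code: it assumes a hypothetical CSV layout in which each row holds the MFCC values of one sample followed by the speaker label in the last column, and the names `load_rows`, `knn_predict` and `accuracy` are made up for illustration. `math.dist` requires Python 3.8 or later.

```python
import csv
import math
from collections import Counter


def load_rows(path):
    """Read (features, label) pairs; the label is assumed to be the last column."""
    with open(path, newline="") as f:
        return [([float(x) for x in row[:-1]], row[-1])
                for row in csv.reader(f) if row]


def knn_predict(train, query, k=3):
    """Return the majority label among the k training rows closest to query."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of test rows whose predicted speaker matches the true label."""
    correct = sum(knn_predict(train, feats, k) == label for feats, label in test)
    return correct / len(test)
```

K-NN has no real training phase: "training" amounts to storing the feature rows, and all the work happens when the distances to a test sample are computed.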
  • 34. Result of the Automatic Speaker Recognition System. Figure: MFCC features generated from the Digits dataset
  • 35. Result of the Automatic Speaker Recognition System. Figure: Prediction and Accuracy of Speaker Identification on the Digits dataset
  • 36. Conclusion. For Automatic Speech Recognition: To implement our own ASR system we used the KALDI speech recognition framework. We trained our system on our own dataset, Digits, which contains 10 speakers each speaking 10 sentences; every sentence contains 3 words. The vocabulary of our dataset is the digits 0 to 9. As a result, we achieved an accuracy of 72% for our Speech Recognition System. Our ASR system is a text-dependent system, as it has a limited vocabulary of 0 to 9. A higher recognition rate can be achieved with a larger dataset. For Automatic Speaker Recognition: We also implemented speaker identification using K-NN classification in Python on the same dataset. After training our K-NN classifier, we achieved an accuracy of 80% for our Speaker Identification system.
  • 37. Bibliography I. [1] J. Luettin, N. Thacker, S. W. Beet, et al., "Visual speech recognition using active shape models and hidden Markov models," in Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on, vol. 2, pp. 817–820, IEEE, 1996. [2] O. Abdel-Hamid, A.-r. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, "Convolutional neural networks for speech recognition," Audio, Speech, and Language Processing, IEEE/ACM Transactions on, vol. 22, no. 10, pp. 1533–1545, 2014. [3] Y. Lan, B.-J. Theobald, R. Harvey, E.-J. Ong, and R. Bowden, "Improving visual features for lip-reading," in AVSP, pp. 7–3, 2010.
  • 38. Bibliography II. [4] S. Renals, N. Morgan, H. Bourlard, M. Cohen, and H. Franco, "Connectionist probability estimators in HMM speech recognition," Speech and Audio Processing, IEEE Transactions on, vol. 2, no. 1, pp. 161–174, 1994. [5] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, "Extracting and composing robust features with denoising autoencoders," in Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103, ACM, 2008. [6] P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol, "Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion," The Journal of Machine Learning Research, vol. 11, pp. 3371–3408, 2010.
  • 39. Bibliography III. [7] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, "Audio-visual speech recognition using deep learning," Applied Intelligence, vol. 42, no. 4, pp. 722–737, 2015. [8] ETSI/SAGE, "Specification of the 3GPP Confidentiality and Integrity Algorithms 128-EEA3 & 128-EIA3. Document 1: 128-EEA3 and 128-EIA3 Specification." [9] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," 2011.