Presentation from our Meetup of 27 September 2017, given by our colleague Abdelwahab HEBA: Deep Learning in practice: Speech recognition and beyond
4 / 56
Reading Material
Books:
● Bengio, Yoshua (2009). "Learning Deep Architectures for AI".
● L. Deng and D. Yu (2014). "Deep Learning: Methods and Applications". http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf
● D. Yu and L. Deng (2014). "Automatic Speech Recognition: A Deep Learning Approach" (Publisher: Springer).
6 / 56
Part I: Machine Learning (Deep/Shallow) and Signal Processing
7 / 56
Current view of Artificial Intelligence, Machine Learning & Deep Learning
(Edureka blog – what-is-deep-learning)
8 / 56
Current view of Machine Learning foundations & disciplines
(Edureka blog – what-is-deep-learning)
9 / 56
Machine Learning Paradigms: An Overview
(Diagram: machine learning at the intersection of data analysis/statistics and programs)
10 / 56
Supervised Machine Learning (classification)
Training phase (usually offline):
● Training data set: measurements (features) & associated 'class' labels (colors used to show class labels)
● A training algorithm learns the parameters/weights (and sometimes the structure)
● Output: a learned model
11 / 56
Supervised Machine Learning (classification)
Test phase (run time, online):
● Input test data point: measurements (features) only
● Learned model: structure + parameters
● Output: predicted class label or label sequence (e.g. a sentence)
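The two phases above can be sketched end to end. This is a toy illustration only: a nearest-centroid classifier stands in for the unspecified "training algorithm", and the 2-D points and class names are made up.

```python
# Training phase: learn one centroid (the model's "parameters") per class.
def train(points, labels):
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        sx, sy = sums.get(y, (0.0, 0.0))
        sums[y] = (sx + p[0], sy + p[1])
        counts[y] = counts.get(y, 0) + 1
    return {y: (sx / counts[y], sy / counts[y]) for y, (sx, sy) in sums.items()}

# Test phase: assign the class whose centroid is closest to the input point.
def predict(model, point):
    def dist2(c):
        return (c[0] - point[0]) ** 2 + (c[1] - point[1]) ** 2
    return min(model, key=lambda y: dist2(model[y]))

# Two toy classes in a 2-D feature space (colors play the role of labels)
points = [(0.0, 0.0), (0.2, 0.1), (1.0, 1.0), (0.9, 1.1)]
labels = ["red", "red", "blue", "blue"]
model = train(points, labels)
print(predict(model, (0.1, 0.0)))  # → red
```

The split matters: training is done once, offline; `predict` is all that runs at test time.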
12 / 56
What Is Deep Learning?
(Diagram: deep learning as a subset of machine learning)
Deep learning (deep machine learning, deep structured learning, hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations.[1](p198)[2][3][4]
13 / 56
Evolution of Machine Learning
(Slide from: Yoshua Bengio)
15 / 56
(Diagram, modified from Y. LeCun & M. A. Ranzato: models arranged from SHALLOW to DEEP — Perceptron, SVM, Boosting, Decision Tree, GMM, BayesNP, Sparse Coding, RBM, AE on the shallow side; Neural Net, Conv. Net, RNN, D-AE, DBN, DBM, Bayes Nets toward the deep side)
16 / 56
(Same diagram, modified from Y. LeCun & M. A. Ranzato, with a second axis separating Neural Networks from Probabilistic Models)
17 / 56
(Same diagram, modified from Y. LeCun & M. A. Ranzato, further annotated into Supervised vs. Unsupervised models)
18 / 56
Part II: Speech Recognition
19 / 56
Human Communication: verbal & non-verbal information
20 / 56
Speech recognition problem
22 / 56
Speech representation
● Same word: « Appeler » (French for "to call")
23 / 56
Speech representation
We want a low-dimensional representation, invariant to speaker, background noise, speaking rate, etc.
● Fourier analysis shows energy in different frequency bands
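The Fourier point can be made concrete on a synthetic frame: a pure 100 Hz tone puts almost all of its energy into the corresponding frequency bin. This is a sketch with made-up parameters (1 kHz sampling rate, 200-sample frame) and a naive DFT for readability; real systems use an FFT library.

```python
import math

SR = 1000                      # sampling rate in Hz (toy value)
N = 200                        # frame length (0.2 s)
frame = [math.sin(2 * math.pi * 100 * n / SR) for n in range(N)]

def dft_energy(frame, k):
    """Energy of DFT bin k; bin k corresponds to k * SR / N Hz."""
    re = sum(x * math.cos(2 * math.pi * k * n / len(frame)) for n, x in enumerate(frame))
    im = sum(-x * math.sin(2 * math.pi * k * n / len(frame)) for n, x in enumerate(frame))
    return re * re + im * im

energies = [dft_energy(frame, k) for k in range(N // 2)]
peak_bin = max(range(N // 2), key=lambda k: energies[k])
print(peak_bin * SR / N)       # → 100.0 (energy peaks at the tone's frequency)
```

For real speech the energy is spread over many bands, and it is the pattern across bands (per short frame) that the features below summarize.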
24 / 56
Acoustic representation
Vowel triangle as seen in the first two formants (F1 & F2)
25 / 56
Acoustic representation
● Features used in speech recognition:
  ● Mel Frequency Cepstral Coefficients – MFCC
  ● Perceptual Linear Prediction – PLP
  ● RASTA-PLP
  ● Filter Bank Coefficients – F-BANKs
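What MFCC and F-BANK features share is the mel scale: filters spaced evenly in mel are spaced logarithmically in Hz, mimicking the ear's resolution. A minimal sketch of that spacing, using the standard mel formula and a hypothetical 10-filter bank over 0–8000 Hz:

```python
import math

def hz_to_mel(f):
    """Standard mel-scale formula."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Centre frequencies of a hypothetical 10-filter bank over 0–8000 Hz:
# evenly spaced in mel, so the spacing in Hz widens with frequency.
lo, hi, n = hz_to_mel(0.0), hz_to_mel(8000.0), 10
centres = [mel_to_hz(lo + (hi - lo) * (i + 1) / (n + 1)) for i in range(n)]
print([round(c) for c in centres])
```

F-BANKs are the log energies through such filters; MFCCs then apply a discrete cosine transform to decorrelate them and keep the first few coefficients.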
26–28 / 56
Speech Recognition as transduction: from signal to language
(three slides building up the same signal-to-language pipeline diagram)
29 / 56
Probabilistic speech recognition
● The speech signal is represented as an acoustic observation sequence
● We want to find the most likely word sequence W
● We model this with a Hidden Markov Model:
  ● The system has a set of discrete states
  ● It transitions from state to state according to transition probabilities (Markovian: memoryless)
  ● The acoustic observation emitted when making a transition is conditioned on the state alone: P(o|c)
● We seek to recover the state sequence, and consequently the word sequence
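Recovering the most likely state sequence from an HMM is done with the Viterbi algorithm. A minimal sketch on a hypothetical two-state model with made-up transition and emission probabilities (real systems work in log space over much larger graphs):

```python
# Toy HMM: 2 states, 2 observation symbols ("a", "b"); all numbers invented.
states = ["s1", "s2"]
start = {"s1": 0.6, "s2": 0.4}
trans = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
emit = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}

def viterbi(obs):
    # delta[s] = probability of the best path ending in state s
    delta = {s: start[s] * emit[s][obs[0]] for s in states}
    back = []
    for o in obs[1:]:
        # best predecessor of each state, then extend the path probabilities
        prev = {s: max(states, key=lambda p: delta[p] * trans[p][s]) for s in states}
        delta = {s: delta[prev[s]] * trans[prev[s]][s] * emit[s][o] for s in states}
        back.append(prev)
    # backtrack from the best final state
    best = max(states, key=lambda s: delta[s])
    path = [best]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return list(reversed(path))

print(viterbi(["a", "a", "b", "b"]))  # → ['s1', 's1', 's2', 's2']
```

The memoryless (Markov) assumption is what makes this dynamic program work: only the best probability per state must be kept at each step.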
30 / 56
Speech Recognition as transduction – Phone Recognition
● Training algorithm (N iterations):
  ● Align data & text
  ● Compute probabilities P(o|p) of each segment o
  ● Update boundaries
31 / 56
Speech Recognition as transduction – Lexicon
● Construct the graph using Weighted Finite State Transducers (WFST)
32 / 56
Speech Recognition as transduction
● Compose the Lexicon FST with the Grammar FST: L ∘ G
● Transduction via composition:
  ● Map the output labels of the lexicon to the input labels of the language model
  ● Join and optimize the end-to-end graph
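The composition idea can be sketched without a real WFST library: the lexicon maps phone sequences to words, the grammar weights words, and composing them yields a weighted mapping from phone sequences to words. The two-word lexicon and unigram grammar are invented for illustration; real toolkits (e.g. OpenFst) compose full automata with optimization passes.

```python
import math

lexicon = {                           # L: phone sequence -> word
    ("k", "a", "t"): "cat",
    ("k", "a", "p"): "cap",
}
grammar = {"cat": 0.8, "cap": 0.2}    # G: unigram word probabilities

# Composition: follow L's output labels into G, accumulating weights
# as negative log probabilities (the tropical-semiring convention).
composed = {
    phones: (word, -math.log(grammar[word]))
    for phones, word in lexicon.items()
    if word in grammar
}
word, cost = composed[("k", "a", "t")]
print(word, round(cost, 3))           # → cat 0.223
```

The key point survives even in this toy: after composition the decoder never consults L and G separately; one graph carries both pronunciation and language-model costs.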
33 / 56
Different steps of acoustic modeling
35 / 56
Decoding
● We want to find the most likely word sequence W given the observations o in the graph
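This decoding criterion is the standard noisy-channel decomposition, which splits the search into the two models built earlier:

```latex
W^{*} = \arg\max_{W} P(W \mid o)
      = \arg\max_{W} \frac{P(o \mid W)\, P(W)}{P(o)}
      = \arg\max_{W} \underbrace{P(o \mid W)}_{\text{acoustic model}}\;\underbrace{P(W)}_{\text{language model}}
```

P(o) is constant over candidate word sequences, so it drops out of the argmax; the decoder searches the composed graph for the path maximizing the product of acoustic and language-model scores.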
36 / 56
Part III: Neural Networks for Speech Recognition
37 / 56
Three main paradigms for neural networks for speech
● Use neural networks to compute nonlinear feature representations
  ● « Bottleneck » or « tandem » features
● Use neural networks to estimate phonetic unit probabilities (hybrid networks)
● Use end-to-end neural networks
38 / 56
Neural network features
● Train a neural network to discriminate between classes
● Use the output, or a low-dimensional bottleneck layer representation, as features
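A sketch of the bottleneck idea: push one acoustic frame through a small network and keep the low-dimensional hidden activation as the feature vector, discarding the output layer. The network here is hypothetical and randomly initialised (13 inputs standing for MFCCs, 3 bottleneck units, 40 output classes); in practice the network is first trained as a classifier.

```python
import math, random

random.seed(0)

def layer(x, w, b):
    """Fully connected layer with tanh activation."""
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + bi)
            for row, bi in zip(w, b)]

IN, BOTTLENECK, OUT = 13, 3, 40   # e.g. 13 MFCCs in, 40 phonetic units out
w1 = [[random.uniform(-0.1, 0.1) for _ in range(IN)] for _ in range(BOTTLENECK)]
b1 = [0.0] * BOTTLENECK
w2 = [[random.uniform(-0.1, 0.1) for _ in range(BOTTLENECK)] for _ in range(OUT)]
b2 = [0.0] * OUT

frame = [random.gauss(0.0, 1.0) for _ in range(IN)]   # one acoustic frame
bottleneck_features = layer(frame, w1, b1)            # what we keep (3-dim)
class_outputs = layer(bottleneck_features, w2, b2)    # discarded at run time
print(len(bottleneck_features))                       # → 3
```

In a tandem setup these 3-dimensional features (possibly appended to the original MFCCs) would then feed a conventional HMM-GMM system.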
39 / 56
Hybrid Speech Recognition System
● Train the network as a classifier with a softmax across the phonetic units
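The hybrid trick in miniature: the softmax gives posteriors P(state | o), but the HMM decoder needs likelihoods P(o | state), so the posteriors are divided by the state priors (Bayes' rule, up to a constant). Toy numbers for three phonetic units; real priors are estimated from training alignments.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 0.5, -1.0]             # network output for one frame (toy)
posteriors = softmax(logits)           # P(state | o), sums to 1
priors = [0.5, 0.3, 0.2]               # P(state), from training alignments
scaled_likelihoods = [p / q for p, q in zip(posteriors, priors)]
print([round(x, 3) for x in scaled_likelihoods])  # → [1.571, 0.584, 0.196]
```

These scaled likelihoods replace the GMM scores in the decoder; everything else in the HMM pipeline stays unchanged.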
49 / 56
Connectionist Temporal Classification (CTC)
● CTC loss function:
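The CTC loss sums, via the forward algorithm, the probability of every frame-level alignment (with blanks) that collapses to the target label sequence; the loss is the negative log of that sum. A self-contained sketch with tiny hand-made per-frame posteriors, for illustration only:

```python
import math

BLANK = "-"

def ctc_forward(posts, target):
    """posts: per-frame dicts label -> prob; returns P(target | posts)."""
    ext = [BLANK]                       # target with blanks interleaved
    for lab in target:
        ext += [lab, BLANK]
    S, T = len(ext), len(posts)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = posts[0][BLANK]
    if S > 1:
        alpha[0][1] = posts[0][ext[1]]
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                       # stay
            if s >= 1:
                a += alpha[t - 1][s - 1]              # advance one slot
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]              # skip a blank
            alpha[t][s] = a * posts[t][ext[s]]
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# Three frames, label set {a, b, blank} (made-up posteriors)
posts = [{"a": 0.7, "b": 0.2, BLANK: 0.1},
         {"a": 0.2, "b": 0.6, BLANK: 0.2},
         {"a": 0.1, "b": 0.7, BLANK: 0.2}]
p = ctc_forward(posts, ["a", "b"])
print(round(-math.log(p), 3))          # → 0.531 (CTC loss for target "ab")
```

Here p sums the five alignments "aab", "abb", "a-b", "-ab", "ab-"; training backpropagates the gradient of this loss through the network, which is the subject of the next slide.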
50 / 56
Connectionist Temporal Classification (CTC)
● Updating the network with the CTC loss function:
● Backpropagation:
51 / 56
Take-home message
● Speech recognition systems:
  ● Traditional HMM-GMM system
  ● Hybrid ASR system
    ● Use neural networks for feature representation
    ● Or use neural networks for phoneme recognition
  ● End-to-end neural network system
    ● Grapheme-based model
    ● Needs a lot of data to perform well
    ● Complex modeling
53 / 56
The Kaldi Toolkit
● Kaldi is specifically designed for speech recognition research applications
● Kaldi training tools:
  ● Data preparation (linking text to wav files, speakers to utterances, ...)
  ● Feature extraction: MFCC, PLP, F-BANKs, pitch, LDA, HLDA, fMLLR, MLLT, VTLN, etc.
  ● Scripts for building finite state transducers: converting the lexicon & language model to FST format
  ● Traditional HMM-GMM system
  ● Hybrid system
  ● Online decoding
56 / 56
Thanks for your attention
LINAGORA – headquarters
80, rue Roque de Fillol
92800 PUTEAUX
FRANCE
Phone: +33 (0)1 46 96 63 63
Info: info@linagora.com
Web: www.linagora.com
facebook.com/Linagora/
@linagora