
Deep Learning in practice : Speech recognition and beyond - Meetup


Here is the presentation from our Meetup of 27 September 2017, given by our colleague Abdelwahab HEBA: Deep Learning in practice: Speech recognition and beyond.

Published in: Technology


  1. Deep learning in practice: Speech recognition and beyond. Abdel HEBA, 27 September 2017
  2. Outline ● Part 1: Basics of Machine Learning (Deep and Shallow) and of Signal Processing ● Part 2: Speech Recognition ● Acoustic representation ● Probabilistic speech recognition ● Part 3: Neural Network Speech Recognition ● Hybrid neural networks ● End-to-End architecture ● Part 4: Kaldi
  3. Reading Material
  4. Reading Material ● Books: Bengio, Yoshua (2009). "Learning Deep Architectures for AI". ● L. Deng and D. Yu (2014). "Deep Learning: Methods and Applications". http://research.microsoft.com/pubs/209355/DeepLearning-NowPublishing-Vol7-SIG-039.pdf ● D. Yu and L. Deng (2014). "Automatic Speech Recognition: A Deep Learning Approach" (Springer).
  5. Reading Material
  6. Part I: Machine Learning (Deep/Shallow) and Signal Processing
  7. Current view of Artificial Intelligence, Machine Learning & Deep Learning (Edureka blog: what-is-deep-learning)
  8. Current view of Machine Learning foundations & disciplines (Edureka blog: what-is-deep-learning)
  9. Machine Learning Paradigms: An Overview (diagram relating machine learning, data analysis/statistics and programs)
  10. Supervised Machine Learning (classification), training phase (usually offline): measurements (features) & associated 'class' labels (colors used to show class labels) form the training data set; a training algorithm produces the learned model's parameters/weights (and sometimes its structure).
  11. Supervised Machine Learning (classification), test phase (run time, online): the input is a test data point with measurements (features) only; the learned model (structure + parameters) outputs the predicted class label or label sequence (e.g. a sentence).
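
To make the two phases concrete, here is a minimal sketch in Python (the toy data and the choice of a scikit-learn random forest are my own illustration, not from the slides):

```python
from sklearn.ensemble import RandomForestClassifier

# Training phase (offline): measurements (features) with class labels.
X_train = [[5.1, 3.5], [6.2, 2.9], [4.9, 3.1], [6.7, 3.0]]
y_train = ["class_a", "class_b", "class_a", "class_b"]
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Test phase (run time): measurements only; the model predicts the label.
print(model.predict([[6.0, 3.0]]))  # predicted label for the new point
```
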
  12. What Is Deep Learning? Deep learning (deep machine learning, or deep structured learning, or hierarchical learning, or sometimes DL) is a branch of machine learning based on a set of algorithms that attempt to model high-level abstractions in data by using model architectures, with complex structures or otherwise, composed of multiple non-linear transformations. [1](p198)[2][3][4]
  13. Evolution of Machine Learning (slide from Yoshua Bengio)
  14. Face Recognition
  15. (Chart, modified from Y. LeCun & MA Ranzato: models placed along a shallow-to-deep axis: Perceptron, RBM, AE, GMM, BayesNP, SVM, Sparse Coding, Decision Tree, Boosting, Conv. Net, Neural Net, RNN, D-AE, DBN, DBM, Bayes Nets)
  16. (Same chart with a second axis separating neural networks from probabilistic models, Neural Net now labeled Deep Neural Net)
  17. (Same chart with each model additionally tagged as supervised or unsupervised)
  18. Part II: Speech Recognition
  19. Human Communication: verbal & non-verbal information
  20. Speech recognition problem
  21. Speech recognition problem ● Automatic speech recognition ● Spontaneous vs. read speech ● Large vocabulary ● In noise ● Low resource ● Far-field ● Accent-independent ● Speaker-adaptive ● Speaker identification ● Speech enhancement ● Speech separation
  22. Speech representation ● Same word: « Appeler » (French for "to call")
  23. Speech representation ● We want a low-dimensional representation, invariant to speaker, background noise, rate of speaking, etc. ● Fourier analysis shows energy in different frequency bands
  24. Acoustic representation ● Vowel triangle as seen from formants 1 & 2
  25. Acoustic representation ● Features used in speech recognition: ● Mel Frequency Cepstral Coefficients (MFCC) ● Perceptual Linear Prediction (PLP) ● RASTA-PLP ● Filter Bank Coefficients (F-BANKs)
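
As an illustration, a minimal sketch of extracting two of these feature types with the librosa library (the file name and parameter values are placeholders, not from the slides):

```python
import librosa

# Load audio at a 16 kHz sampling rate, typical for ASR.
y, sr = librosa.load("appeler.wav", sr=16000)

# 13 MFCCs per frame: 25 ms windows (400 samples) with a 10 ms hop.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=400, hop_length=160)

# Log mel filter-bank energies (F-BANKs), 40 bands.
fbank = librosa.power_to_db(
    librosa.feature.melspectrogram(y=y, sr=sr, n_mels=40,
                                   n_fft=400, hop_length=160))
print(mfcc.shape, fbank.shape)  # (13, T) and (40, T), T = number of frames
```
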
  26. Speech Recognition as transduction: From signal to language
  27. Speech Recognition as transduction: From signal to language
  28. Speech Recognition as transduction: From signal to language
  29. Probabilistic speech recognition ● The speech signal is represented as an acoustic observation sequence ● We want to find the most likely word sequence W ● We model this with a Hidden Markov Model ● The system has a set of discrete states ● Transitions from state to state follow transition probabilities (Markovian: memoryless) ● The acoustic observation emitted on a transition is conditioned on the state alone: P(o|s) ● We seek to recover the state sequence and consequently the word sequence
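
In symbols (a standard formulation of what the slide describes, written out here for reference), the decoder searches for the word sequence

```latex
W^* = \arg\max_W P(W \mid O) = \arg\max_W P(O \mid W)\, P(W)
```

where the language model supplies P(W) and the HMM acoustic model decomposes P(O|W) over hidden state sequences Q:

```latex
P(O \mid W) = \sum_{Q} \prod_{t=1}^{T} P(o_t \mid q_t)\, P(q_t \mid q_{t-1})
```
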
  30. Speech Recognition as transduction: Phone Recognition ● Training algorithm (N iterations): ● Align data & text ● Compute the probabilities P(o|p) of each segment o ● Update the boundaries
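
A minimal sketch of the "compute P(o|p)" step under simplifying assumptions of my own (one diagonal-covariance GMM per phone, a frame-level alignment already in hand; sklearn's GaussianMixture stands in for Kaldi's estimator):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_phone_gmms(frames, alignment, n_components=4):
    """Fit one GMM per phone from the current frame alignment.

    frames:    (T, D) array of acoustic features (e.g. MFCCs)
    alignment: length-T sequence of phone labels, one per frame
    """
    alignment = np.asarray(alignment)
    gmms = {}
    for phone in np.unique(alignment):
        obs = frames[alignment == phone]
        gmms[phone] = GaussianMixture(n_components=n_components,
                                      covariance_type="diag").fit(obs)
    return gmms

# In the full loop, these models re-align the data (updating segment
# boundaries) and the fit is repeated for N iterations.
```
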
  31. Speech Recognition as transduction: Lexicon ● Construct the graph using Weighted Finite State Transducers (WFST)
  32. Speech Recognition as transduction ● Compose the Lexicon FST with the Grammar FST: L ∘ G ● Transduction via composition: ● Map output labels of the lexicon to input labels of the language model ● Join and optimize the end-to-end graph
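
A hedged sketch of the L ∘ G composition using OpenFst's Python wrapper, pywrapfst (the file names are placeholders; Kaldi builds this graph with its own FST binaries and disambiguation symbols, so this is only the shape of the operation):

```python
import pywrapfst as fst

L = fst.Fst.read("L.fst")  # lexicon: phone sequences -> words
G = fst.Fst.read("G.fst")  # grammar: word-level language model

# Sort L's output arcs so they can be matched against G's input labels,
# then compose, determinize and minimize the joint graph.
L.arcsort(sort_type="olabel")
LG = fst.determinize(fst.compose(L, G))
LG.minimize()
LG.write("LG.fst")
```
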
  33. Different steps of acoustic modeling
  34. Decoding
  35. Decoding ● We want to find the most likely word sequence W given the observations o in the graph
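
Conceptually this is Viterbi search over the graph. A minimal sketch on a toy HMM in log space (the matrices are hypothetical placeholders; real decoders add beam pruning over the WFST):

```python
import numpy as np

def viterbi(log_init, log_trans, log_obs):
    """Most likely state sequence through an HMM.

    log_init:  (S,)   initial log state probabilities
    log_trans: (S, S) log transition probabilities
    log_obs:   (T, S) per-frame log observation likelihoods log P(o_t | s)
    """
    T, S = log_obs.shape
    delta = log_init + log_obs[0]        # best score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # (prev state, current state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_obs[t]
    # Trace the best path backwards from the best final state.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```
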
  36. Part III: Neural Networks for Speech Recognition
  37. Three main paradigms for neural networks for speech ● Use neural networks to compute nonlinear feature representations: « bottleneck » or « tandem » features ● Use neural networks to estimate phonetic unit probabilities (hybrid networks) ● Use end-to-end neural networks
  38. Neural network features ● Train a neural network to discriminate between classes ● Use the output or a low-dimensional bottleneck-layer representation as features
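
A minimal PyTorch sketch of the idea (all layer sizes are hypothetical): the network is trained to classify phonetic units, then the 40-dimensional bottleneck activations are reused as features for a conventional system.

```python
import torch.nn as nn

# Hypothetical sizes: 11 stacked frames of 40 filter-bank coefficients in,
# a softmax over e.g. 2000 tied phonetic states out.
bottleneck_net = nn.Sequential(
    nn.Linear(11 * 40, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 40),     # low-dimensional bottleneck layer
    nn.ReLU(),
    nn.Linear(40, 2000),     # classifier head (used in training, then dropped)
)

# After training, features = activations up to the bottleneck layer.
bottleneck_features = nn.Sequential(*list(bottleneck_net.children())[:6])
```
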
  39. Hybrid Speech Recognition System ● Train the network as a classifier with a softmax across the phonetic units
  40. Hybrid Speech Recognition System
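
In hybrid decoding, the network's softmax posteriors P(s|o) are commonly converted into scaled likelihoods P(o|s) ∝ P(s|o) / P(s) before being plugged into the HMM. A minimal numpy sketch (the flooring constant is a hypothetical choice):

```python
import numpy as np

def scaled_loglikes(posteriors, priors, floor=1e-10):
    """Convert NN posteriors into HMM pseudo-likelihoods.

    posteriors: (T, S) softmax outputs P(s | o_t), one row per frame
    priors:     (S,)  state priors P(s), e.g. counted from alignments
    Returns log P(o_t | s) up to a constant: log P(s | o_t) - log P(s).
    """
    return np.log(posteriors + floor) - np.log(priors + floor)
```
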
  41. Neural network architectures for speech recognition ● Fully connected networks ● Convolutional networks (CNNs) ● Recurrent neural networks (RNNs) ● LSTMs ● GRUs
  42. Neural network architectures for speech recognition ● Convolutional Neural Network
  43. Neural network architectures for speech recognition ● Recurrent Neural Network
  44. Neural network architectures for speech recognition ● Recurrent Neural Network
  45. Neural network architectures for speech recognition ● Recurrent Neural Network
  46. Neural network architectures for speech recognition ● Recurrent Neural Network
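
A minimal PyTorch sketch of a recurrent acoustic model of this kind (sizes hypothetical): a bidirectional LSTM over the feature frames, with a per-frame softmax over the output units.

```python
import torch
import torch.nn as nn

class RNNAcousticModel(nn.Module):
    # Hypothetical sizes: 40 filter-bank features in, 2000 output units.
    def __init__(self, n_feats=40, n_hidden=320, n_out=2000):
        super().__init__()
        self.lstm = nn.LSTM(n_feats, n_hidden, num_layers=3,
                            bidirectional=True, batch_first=True)
        self.proj = nn.Linear(2 * n_hidden, n_out)

    def forward(self, x):                 # x: (batch, T, n_feats)
        h, _ = self.lstm(x)               # h: (batch, T, 2 * n_hidden)
        return self.proj(h).log_softmax(dim=-1)  # per-frame log posteriors

out = RNNAcousticModel()(torch.randn(2, 100, 40))  # -> (2, 100, 2000)
```
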
  47. End-to-End Neural Networks for Speech Recognition: CTC Loss Function
  48. End-to-End Speech Recognition: CTC Input ● Grapheme-based model: c ∈ {A, B, C, …, Z, blank, space} ● P(c = HHH_E_LL_LO___ | x) = P(c₁=H|x) P(c₂=H|x) … P(c₆=blank|x) …
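
The mapping from a frame-level character path to the transcription is: merge repeated labels, then delete blanks. A small sketch of that collapsing rule (the underscore stands for the blank symbol, as on the slide):

```python
def ctc_collapse(path, blank="_"):
    """Collapse a CTC path: merge repeats, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev and c != blank:
            out.append(c)
        prev = c
    return "".join(out)

assert ctc_collapse("HHH_E_LL_LO___") == "HELLO"
```
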
  49. Connectionist Temporal Classification (CTC) ● CTC loss function:
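
The formula on the slide is an image; the standard definition is the negative log probability of the transcription y, marginalized over every frame-level path π that collapses (via the rule B above) to y:

```latex
\mathcal{L}_{\mathrm{CTC}}(x, y)
  = -\log P(y \mid x)
  = -\log \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x)
```
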
  50. Connectionist Temporal Classification (CTC) ● Updating the network with the CTC loss function ● Backpropagation:
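
PyTorch ships this loss as nn.CTCLoss, so updating the network reduces to an ordinary backward pass. A minimal sketch with random tensors standing in for a real model and data:

```python
import torch
import torch.nn as nn

# T=50 frames, N=1 utterance, C=28 classes (blank at index 0 + 27 graphemes).
logits = torch.randn(50, 1, 28, requires_grad=True)
log_probs = logits.log_softmax(dim=2)         # (T, N, C), as CTCLoss expects

targets = torch.tensor([[8, 5, 12, 12, 15]])  # e.g. "HELLO" as label indices
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([50]),
           target_lengths=torch.tensor([5]))
loss.backward()  # gradients of the CTC loss flow back through the network
```
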
  51. Take-home message ● Speech recognition systems: ● HMM-GMM traditional system ● Hybrid ASR system: use neural networks for feature representation, or use neural networks for phoneme recognition ● End-to-end neural network system: grapheme-based model, needs a lot of data to perform well, complex modeling
  52. Part IV: Kaldi
  53. The Kaldi Toolkit ● Kaldi is specifically designed for speech recognition research applications ● Kaldi training tools: ● Data preparation (linking text to wav files, speakers to utterances, ...) ● Feature extraction: MFCC, PLP, F-BANKs, pitch, LDA, HLDA, fMLLR, MLLT, VTLN, etc. ● Scripts for building finite state transducers: converting the lexicon & language model to fst format ● HMM-GMM traditional system ● Hybrid system ● Online decoding
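
The data-preparation step amounts to writing a handful of plain-text map files that Kaldi expects in each data directory (wav.scp, text, utt2spk). A minimal sketch generating them (the paths and example utterance are placeholders):

```python
import os

# One (utterance-id, speaker-id, wav-path, transcript) tuple per recording.
utterances = [
    ("spk1-utt1", "spk1", "audio/spk1-utt1.wav", "bonjour tout le monde"),
]

os.makedirs("data/train", exist_ok=True)
with open("data/train/wav.scp", "w") as wav_scp, \
     open("data/train/text", "w") as text, \
     open("data/train/utt2spk", "w") as utt2spk:
    for utt, spk, path, words in utterances:
        wav_scp.write(f"{utt} {path}\n")   # utterance-id -> audio file
        text.write(f"{utt} {words}\n")     # utterance-id -> transcript
        utt2spk.write(f"{utt} {spk}\n")    # utterance-id -> speaker
```
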
  54. Kaldi Architecture
  55. LinSTT uses Kaldi: comparison across sites

      Site      WER (%)   Audio corpus   #states   #gaussians   #pronunciations
      CLIPS     40.7      90h             1,500     24k          38k
      ENST      45.4      90h               114     14k         118k
      IRENE     35.4      90h             6,000    200k         118k
      LIA       26.7      90h             3,600    230k         130k
      LIMSI     11.9      90h + 100h     12,000    370k         276k
      LIUM      23.6      90h + 90h       7,000    154k         107k
      LORIA     27.6      90h             6,000     90k         112k
      Linagora  26.23     90h            15,000    500k         105k
  56. Thanks for your attention ● LINAGORA headquarters: 80, rue Roque de Fillol, 92800 PUTEAUX, FRANCE ● Phone: +33 (0)1 46 96 63 63 ● Info: info@linagora.com ● Web: www.linagora.com ● facebook.com/Linagora/ ● @linagora
