Deep Learning for Speech Recognition - Vikrant Singh Tomar

WithTheBest
17 Oct 2016

  1. Deep Learning for Speech Recognition - Vikrant Tomar, Founder, Fluent.ai - vt@fluent.ai - We are hiring!
  2. Outline
     - Introduction
     - General overview of speech recognition framework
     - Conventional GMM-HMM based systems
     - Deep neural networks in speech
     - ConvNets
     - RNNs/LSTMs and end-to-end learning
     - New interesting stuff
  3. Intro 1: What is speech recognition?
     - Dream: A machine should be able to develop a functional equivalent of the speaker’s intended message as effortlessly as humans can
     - In other words: the goal is to find the most likely sequence of symbols, such as words or sub-word speech units, from a stream of acoustic data
  4. Intro 2: How is deep learning for speech different from deep learning for images?
     - Speech is a temporal signal; there is information in the sequence
     - A one-dimensional signal carrying many kinds of information:
       - Speaker
       - Accent and language
       - Age and health
       - Environment
     - Issues:
       - Noise and background conditions
       - Accents
       - Recording devices
  5. Overview: Statistical framework for speech recognition
     - Formally, an ASR system maps the sequence of observation vectors, X, to the optimum sequence of words, Ŵ:
       Ŵ = argmax_W P(W | X)
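For context, this decision rule is conventionally expanded with Bayes' rule into an acoustic model term and a language model term, in the standard form:

```latex
\begin{align*}
\hat{W} &= \arg\max_{W} P(W \mid X)
         = \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)} \\
        &= \arg\max_{W} \underbrace{p(X \mid W)}_{\text{acoustic model}} \;\; \underbrace{P(W)}_{\text{language model}}
\end{align*}
```

The denominator p(X) does not depend on W, so it can be dropped from the maximization.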
  6. Overview 2: System Architecture
  7. System Architecture: Feature extraction & spectrogram
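A minimal sketch of the kind of front end this slide covers: framing the waveform and taking a log-magnitude spectrogram with plain NumPy. The frame length, hop, and FFT size below are illustrative defaults, not values from the talk.

```python
import numpy as np

def log_spectrogram(signal, sample_rate=16000, frame_ms=25, hop_ms=10, n_fft=512):
    """Frame the waveform, window each frame, and take log |STFT|."""
    frame_len = int(sample_rate * frame_ms / 1000)   # e.g. 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)       # e.g. 160 samples
    window = np.hamming(frame_len)

    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len]
                       for i in range(n_frames)])
    spectrum = np.abs(np.fft.rfft(frames * window, n=n_fft, axis=1))
    return np.log(spectrum + 1e-10)   # shape: (n_frames, n_fft // 2 + 1)

# Example: 1 second of random "audio" at 16 kHz
features = log_spectrogram(np.random.randn(16000))
print(features.shape)   # (98, 257) with the defaults above
```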
  8. GMM-HMM based systems
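For reference, the GMM half of a GMM-HMM scores each feature frame against per-state Gaussian mixtures; below is a minimal diagonal-covariance sketch of that emission log-likelihood. All parameters are random placeholders, not trained values.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Log p(x | state) for a diagonal-covariance GMM emission model.
    x: (dim,) feature frame; weights: (K,); means, variances: (K, dim)."""
    log_norm = -0.5 * np.log(2 * np.pi * variances).sum(axis=1)
    log_exp = -0.5 * (((x - means) ** 2) / variances).sum(axis=1)
    log_components = np.log(weights) + log_norm + log_exp
    m = log_components.max()
    return m + np.log(np.exp(log_components - m).sum())   # log-sum-exp

# Toy example: 39-dim frame (e.g. MFCCs + deltas), 4 mixture components
dim, K = 39, 4
frame = np.random.randn(dim)
w = np.full(K, 1.0 / K)
mu = np.random.randn(K, dim)
var = np.ones((K, dim))
print(gmm_log_likelihood(frame, w, mu, var))
```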
  9. Deep neural networks in speech
     - A few different approaches:
       - Tandem
       - Hybrid
       - End-to-end
     - Old but new
  10. Tandem DNN: DNN -- GMM -- HMM
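A schematic of the tandem idea, assuming a hypothetical tf.keras frame classifier whose penultimate "bottleneck" layer supplies features to the downstream GMM-HMM; the layer names and sizes here are illustrative only.

```python
import numpy as np
import tensorflow as tf

# Hypothetical frame classifier; in the tandem setup only its internal
# representation is kept, and the GMM-HMM stays as the back end.
dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(440,)),                  # e.g. 11 stacked 40-dim frames
    tf.keras.layers.Dense(1024, activation="sigmoid"),
    tf.keras.layers.Dense(42, activation="linear", name="bottleneck"),
    tf.keras.layers.Dense(2000, activation="softmax"),    # phone/senone targets
])

# Feature extractor: input -> bottleneck activations
bottleneck = tf.keras.Model(dnn.input, dnn.get_layer("bottleneck").output)

frames = np.random.randn(100, 440).astype("float32")      # placeholder acoustic frames
tandem_features = bottleneck.predict(frames)               # (100, 42) features for the GMM-HMM
```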
  11. Hybrid DNN-HMM
     - Good source: Hinton et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition", 2012
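A minimal sketch of the hybrid recipe the Hinton et al. reference describes: a frame-level DNN estimates senone posteriors P(s|x), which are divided by senone priors P(s) to obtain scaled likelihoods for the HMM decoder. Layer sizes and priors below are placeholders, not the paper's configuration.

```python
import numpy as np
import tensorflow as tf

n_senones = 2000
dnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(440,)),             # stacked/spliced acoustic frames
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(n_senones, activation="softmax"),
])
dnn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# dnn.fit(frames, senone_labels, ...)   # frame labels come from a forced alignment

# At decode time: convert posteriors to scaled likelihoods p(x|s) ∝ P(s|x) / P(s)
frames = np.random.randn(100, 440).astype("float32")
posteriors = dnn.predict(frames)                     # (100, n_senones)
priors = np.full(n_senones, 1.0 / n_senones)         # placeholder senone priors
scaled_log_likelihoods = np.log(posteriors + 1e-10) - np.log(priors)
```

The division by the prior is what lets the discriminatively trained network stand in for the generative GMM observation model inside the existing HMM decoder.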
  12. Hybrid CNN-HMM
     - Good source: Abdel-Hamid et al., "Convolutional Neural Networks for Speech Recognition", 2014
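Along the same lines, the CNN variant convolves over the time-frequency plane before the fully connected layers; a small tf.keras sketch with illustrative filter counts and kernel sizes (not the paper's exact setup):

```python
import tensorflow as tf

# Input: a context window of 11 frames x 40 log-mel bands, 1 channel
cnn = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(11, 40, 1)),
    tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation="relu", padding="same"),
    tf.keras.layers.MaxPooling2D(pool_size=(1, 2)),      # pool along frequency only
    tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation="relu", padding="same"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(2000, activation="softmax"),   # senone posteriors, as in the DNN case
])
cnn.summary()
```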
  13. Hybrid CNN-HMM -- Partial weight sharing
  14. Some benchmarks
  15. RNNs and end-to-end models
     - RNNs:
       - Good because they are sequential models
       - However, plain RNNs cannot capture long-term dependencies (vanishing gradients)
       - Solutions: LSTMs and GRUs
     - End-to-end models have an overall simplified architecture
       - CTC: connectionist temporal classification
     - A. Graves et al., "Towards End-to-End Speech Recognition with Recurrent Neural Networks", 2014
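A minimal sketch of a bidirectional LSTM acoustic model trained with the CTC criterion mentioned above, using tf.keras layers and tf.nn.ctc_loss; the vocabulary size, layer width, and sequence lengths are placeholders.

```python
import numpy as np
import tensorflow as tf

n_labels = 29   # e.g. 26 letters + space + apostrophe + one extra token; blank is separate

# Bidirectional LSTM over sequences of 40-dim feature frames, frame-level logits out
inputs = tf.keras.layers.Input(shape=(None, 40))
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(256, return_sequences=True))(inputs)
logits = tf.keras.layers.Dense(n_labels + 1)(x)   # +1 output unit for the CTC blank symbol
model = tf.keras.Model(inputs, logits)

# CTC loss on one toy batch (random data, just to show the shapes involved)
batch, T, U = 2, 100, 12
feats = np.random.randn(batch, T, 40).astype("float32")
labels = np.random.randint(0, n_labels, size=(batch, U), dtype=np.int32)

loss = tf.nn.ctc_loss(
    labels=labels,
    logits=model(feats),
    label_length=np.full(batch, U, dtype=np.int32),
    logit_length=np.full(batch, T, dtype=np.int32),
    logits_time_major=False,
    blank_index=n_labels,            # last output unit acts as the blank
)
print(loss.numpy())                  # per-utterance negative log-likelihood
```

Because CTC marginalizes over all alignments between the label sequence and the frames, no frame-level forced alignment is needed, which is what makes the end-to-end setup simpler than the hybrid one.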
  16. New interesting stuff
     - Baidu Deep Speech: use bidirectional RNNs to map directly to characters
     - IBM 2015/2016 and Microsoft 2016: deep CNNs with 3 x 3 kernels, similar to VGG-style nets
     - CLDNN: Conv + LSTMs + fully connected layers
     - References:
       - Baidu: Deep Speech, 2014, and Deep Speech 2, 2015
       - Sainath et al., "Convolutional, Long Short-Term Memory, Fully Connected Deep Neural Networks", 2015
       - Xiong et al., "The Microsoft 2016 Conversational Speech Recognition System", 2016
       - Saon et al., "The IBM 2015/16 English Conversational Telephone Speech Recognition System", 2015/16
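A compact sketch of the CLDNN layout from the Sainath et al. reference (convolution along frequency, then LSTM layers, then fully connected layers); all dimensions below are illustrative, not the paper's configuration.

```python
import tensorflow as tf

# Input: sequence of frames, each a 40-band log-mel vector treated as (freq, channel)
inputs = tf.keras.layers.Input(shape=(None, 40, 1))
# Convolution applied per frame (along frequency); the time axis is preserved
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Conv1D(64, kernel_size=9, activation="relu"))(inputs)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.MaxPooling1D(pool_size=3))(x)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Flatten())(x)
# Recurrent layers model the temporal dynamics
x = tf.keras.layers.LSTM(512, return_sequences=True)(x)
x = tf.keras.layers.LSTM(512, return_sequences=True)(x)
# Fully connected layers on top; frame-level senone posteriors out
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1024, activation="relu"))(x)
outputs = tf.keras.layers.TimeDistributed(
    tf.keras.layers.Dense(2000, activation="softmax"))(x)
cldnn = tf.keras.Model(inputs, outputs)
cldnn.summary()
```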
  17. Conclusion and resources
     - Lots of exciting stuff; most concepts are similar to those in other deep learning communities
     - Good starting point: http://www.recognize-speech.com
     - You can use any toolbox you like to start:
       - Tensorflow, Torch, Theano, etc.
       - Kaldi, Currennt
       - Older stuff: CMU-Sphinx, RWTH-ASR, HTK
     - Free(-ish) datasets: http://www.openslr.org/resources.php
     - Contact: vt@fluent.ai (hiring scientists)