Introduction to Automatic
Speech Recognition
Outline
Define the problem
What is speech?
Feature Selection
Models
 Early methods
 Modern statistical models
Current State of ASR
Future Work
The ASR Problem
There is no single ASR problem
The problem depends on many factors
 Microphone: Close-mic, throat-mic, microphone
array, audio-visual
 Sources: band-limited, background noise,
reverberation
 Speaker: speaker dependent, speaker
independent
 Language: open/closed vocabulary, vocabulary
size, read/spontaneous speech
 Output: Transcription, speaker id, keywords
Performance Evaluation
Accuracy
 Percentage of tokens correctly recognized
Error Rate
 Complement of accuracy (100% minus accuracy)
Token Type
 Phones
 Words*
 Sentences
 Semantics?
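For word tokens, the error rate is usually reported as word error rate (WER). The sketch below is one common way to compute it, assuming whitespace tokenization and equal cost for substitutions, deletions and insertions; the function names are illustrative and not taken from the slides.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions turning ref into hyp."""
    # prev[j]: distance between the first i-1 reference tokens and the first j hypothesis tokens
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # aligning i reference tokens with an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion (reference token missing)
                            curr[j - 1] + 1,      # insertion (extra hypothesis token)
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Edit distance between the word sequences, divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions are counted, this measure can exceed 100% when the hypothesis is much longer than the reference.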
What is Speech?
Analog signal produced by humans
The speech signal can be thought of as
decomposed into a source and a filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
Speech Production
Speech Visualization
Feature Selection
As in any data-driven task, the data must be
represented in some format
Cepstral features have been found to perform
well
They describe the spectrum of the log
spectrum (the "frequency of the frequencies")
Mel-frequency cepstral coefficients (MFCC)
are the most common variety
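As a rough illustration of how MFCCs are computed for one frame, here is a minimal pure-Python sketch. The Hamming window, the 2595/700 mel formula, 20 filters and 13 coefficients are conventional choices assumed here, not values from the slides, and the naive DFT is only suitable for short frames.

```python
import cmath
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum of a Hamming-windowed frame (bins 0..N/2)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        s = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(windowed))
        spec.append(abs(s) ** 2 / n)
    return spec


def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose edges are evenly spaced on the mel scale."""
    mel_max = hz_to_mel(sample_rate / 2)
    # num_filters + 2 edge points, expressed as (fractional) DFT bin positions
    edges = [mel_to_hz(mel_max * i / (num_filters + 1)) * n_fft / sample_rate
             for i in range(num_filters + 2)]
    bank = []
    for m in range(1, num_filters + 1):
        left, centre, right = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if left < k <= centre:
                row.append((k - left) / (centre - left))
            elif centre < k < right:
                row.append((right - k) / (right - centre))
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate, num_filters=20, num_coeffs=13):
    spec = power_spectrum(frame)
    bank = mel_filterbank(num_filters, len(frame), sample_rate)
    # log energy in each mel band (a small floor avoids log(0))
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10)) for row in bank]
    # DCT-II of the log energies; keep only the first num_coeffs coefficients
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / num_filters)
                for m, e in enumerate(log_e))
            for c in range(num_coeffs)]
```

In practice a toolkit such as HTK computes these features over overlapping frames and typically appends delta coefficients.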
Where do we stand?
Defined the multiple problems associated with
ASR
Described how speech is produced
Illustrated how speech can be represented in
an ASR system
Now that we have the data, how do we
recognize the speech?
Radio Rex
First known attempt at speech recognition
A toy from 1922
Worked by analyzing the signal strength at
500 Hz
Actual speech recognition
systems
 Originally thought to be a relatively simple
task requiring a few years of concerted effort
 1969, “Wither speech recognition” is
published
 A DARPA project ran from 1971-1976 in
response to the statements in the Pierce
article
 We can examine a few general systems
Template-Based ASR
 Originally only worked for isolated words
 Performs best when training and testing
conditions match
 For each word we want to recognize, we
store a template or example based on actual
data
 Each test utterance is checked against the
templates to find the best match
 Uses the Dynamic Time Warping (DTW)
algorithm
Dynamic Time Warping
 Create a similarity matrix for the two
utterances
 Use dynamic programming to find the lowest
cost path
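The two steps above can be sketched as follows. This is a minimal version that works on a matrix of local distances (so the best path is the lowest-cost one) and assumes one-dimensional features with absolute difference as the local distance; real systems compare feature vectors such as MFCCs. The `recognize` helper and the template data in the usage note are hypothetical.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


def recognize(utterance, templates):
    """Return the label of the stored template with the lowest DTW cost."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))
```

For example, with `templates = {"yes": [1, 3, 5, 3, 1], "no": [5, 5, 1, 1, 1]}`, a slower rendition such as `[1, 1, 3, 5, 5, 3, 1]` still aligns to "yes" at zero cost, which is the point of warping the time axis.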
Hearsay-II
 One of the systems developed during the
DARPA program
 A blackboard-based system utilizing symbolic
problem solvers
 Each problem solver was called a knowledge
source (KS)
 A complex scheduler was used to decide
when each KS should be called
DARPA Results
 The Hearsay-II system performed much
better than the two other similar competing
systems
 However, only one system, Harpy, met the
performance goals of the project
 Harpy was also built at CMU
 In many ways it was a predecessor to the
modern statistical systems
Modern Statistical ASR
Acoustic Model
 For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes
 Two methods are commonly used
 Multilayer perceptron (MLP) gives the posterior
probability of a class given the data
 Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class
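To make the GMM case concrete, here is a minimal sketch of the per-frame score log p(x | class). It assumes diagonal covariances, which is a common simplification and not something the slides specify; the parameter layout is illustrative.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) for a mixture: log sum_k w_k N(x; mean_k, var_k)."""
    terms = [math.log(w) + log_gaussian(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

One such mixture is trained per class (for example per HMM state), and a frame is scored against each of them.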
Gaussian Distribution
Pronunciation Model
 While the pronunciation model can be very
complex, it is typically just a dictionary
 The dictionary contains the valid
pronunciations for each word
 Examples:
 Cat: k ae t
 Dog: d ao g
 Fox: f aa k s
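A dictionary of this kind maps directly onto a lookup table. The sketch below stores a list of pronunciations per word so that variants can be added later; "fox" is transcribed with k s for the /ks/ cluster. The `LEXICON` name and the helper functions are illustrative.

```python
# Each word maps to a list of valid pronunciations (phone sequences).
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"]],
    "fox": [["f", "aa", "k", "s"]],
}


def pronunciations(word):
    """Look up all pronunciations of a word (raises KeyError if out of vocabulary)."""
    return LEXICON[word.lower()]


def phone_sequences(words):
    """All phone sequences for a list of words, one per combination of variants."""
    sequences = [[]]
    for word in words:
        sequences = [seq + variant
                     for seq in sequences
                     for variant in pronunciations(word)]
    return sequences
```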
Language Model
 Now we need some way of representing the
likelihood of any given word sequence
 Many methods exist, but n-grams are the
most common
 N-gram models are trained by simply
counting the occurrences of word sequences
in a training set
N-grams
 A unigram is the probability of any word in
isolation
 A bigram is the probability of a word
given the previous word
 Higher order n-grams continue in a similar
fashion
 A backoff probability is used for any unseen
data
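The counting and backoff ideas can be sketched in a few lines. This version uses a fixed backoff factor in the style of "stupid backoff" rather than a properly normalized scheme such as Katz backoff, so the unseen-bigram scores are not true probabilities; the 0.4 factor, the add-one unigram smoothing and the sentence markers are assumptions made for the example.

```python
from collections import Counter


class BigramModel:
    def __init__(self, sentences, backoff=0.4):
        self.unigrams = Counter()
        self.bigrams = Counter()
        self.backoff = backoff
        for sentence in sentences:
            words = ["<s>"] + sentence.split() + ["</s>"]
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.total = sum(self.unigrams.values())
        self.vocab_size = len(self.unigrams)

    def unigram(self, word):
        # add-one smoothing so unseen words still get a small non-zero score
        return (self.unigrams[word] + 1) / (self.total + self.vocab_size + 1)

    def bigram(self, prev, word):
        # seen bigram: relative frequency; unseen: back off to the scaled unigram
        if self.bigrams[(prev, word)] > 0:
            return self.bigrams[(prev, word)] / self.unigrams[prev]
        return self.backoff * self.unigram(word)
```

Trained on "the cat sat", "the dog sat" and "the cat ran", the model scores "cat" after "the" at 2/3, while the unseen pair "cat dog" falls back to a scaled unigram score.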
How do we put it together?
 We now have models to represent the three
parts of our equation
 We need a framework to join these models
together
 The standard framework used is the Hidden
Markov Model (HMM)
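The equation these three models plug into is conventionally written as below. This is the standard formulation, offered here as an assumption about what the slides refer to; the symbols X (acoustic feature sequence), Q (phone sequence) and W (word sequence) are chosen for this sketch.

```latex
\begin{aligned}
\hat{W} &= \arg\max_{W} P(W \mid X) = \arg\max_{W} P(X \mid W)\, P(W) \\
P(X \mid W) &= \sum_{Q} \underbrace{P(X \mid Q)}_{\text{acoustic model}}\;
               \underbrace{P(Q \mid W)}_{\text{pronunciation model}}
\end{aligned}
```

Here P(W) is the language model, and the denominator P(X) from Bayes' rule is dropped because it does not depend on W.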
Markov Model
 A state model using the Markov property
 The Markov property states that, given the
present state, the future is independent of the past
 Models the likelihood of transitions between
states in a model
 Given the model, we can determine the
likelihood of any sequence of states
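Under the Markov property, the likelihood of a state sequence is the initial probability times a product of transition probabilities. A minimal sketch, using a made-up two-state weather model as the example data:

```python
def sequence_probability(states, initial, transitions):
    """P(s_1, ..., s_T) = P(s_1) * product over t of P(s_t | s_{t-1})."""
    prob = initial[states[0]]
    for prev, curr in zip(states, states[1:]):
        prob *= transitions[prev].get(curr, 0.0)
    return prob


# Hypothetical example model: probabilities are illustrative only.
INITIAL = {"sunny": 0.6, "rainy": 0.4}
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}
```

For instance, the sequence sunny, sunny, rainy has probability 0.6 × 0.8 × 0.2 = 0.096 under this model.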
Hidden Markov Model
 Similar to a Markov model except the states
are hidden
 We now have observations tied to the
individual states
 We no longer know the exact state sequence
given the data
 Allows for the modeling of an underlying
unobservable process
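Because the state sequence is hidden, the likelihood of the observations has to sum over every possible state path. The forward algorithm does this efficiently; the sketch below assumes discrete observations and uses a made-up weather/activity model as example data.

```python
def forward_likelihood(observations, states, initial, transitions, emissions):
    """P(observations) summed over all hidden state sequences."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())


# Hypothetical example model: the weather is hidden, the activity is observed.
STATES = ["rainy", "sunny"]
INITIAL = {"rainy": 0.6, "sunny": 0.4}
TRANSITIONS = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
EMISSIONS = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
```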
HMMs for ASR
 First we build an HMM for each phone
 Next we combine the phone models based
on the pronunciation model to create word
level models
 Finally, the word level models are combined
based on the language model
 We now have a giant network with potentially
thousands or even millions of states
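The first two steps can be illustrated by chaining small left-to-right phone models into a word model. The three states per phone, the 0.5 self-loop probability, the state naming and the `<exit>` marker are all assumptions made for this sketch; emission models and the language-model level are left out.

```python
def word_hmm(phones, states_per_phone=3, self_loop=0.5):
    """Chain left-to-right phone models into one word-level model."""
    # The position index keeps state names unique if a phone occurs twice in a word.
    states = [f"{phone}[{pos}].{i}"
              for pos, phone in enumerate(phones)
              for i in range(states_per_phone)]
    transitions = {}
    for idx, state in enumerate(states):
        # Each state either repeats or moves on; the last one leaves the word.
        nxt = states[idx + 1] if idx + 1 < len(states) else "<exit>"
        transitions[state] = {state: self_loop, nxt: 1.0 - self_loop}
    return states, transitions
```

Calling `word_hmm(["k", "ae", "t"])` yields nine states for "cat". Connecting each word's `<exit>` to the entry states of possible following words, weighted by the language model, produces the large network described above.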
Decoding
 Decoding happens in the same way as the
previous example
 For each time frame we need to maintain two
pieces of information
 The likelihood of being at any state
 The previous state for every state
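The two pieces of information listed above are what the Viterbi algorithm tracks, which is presumably the decoding method meant here. A minimal log-domain sketch for a discrete-observation HMM follows; the weather/activity model is made-up example data.

```python
import math


def viterbi(observations, states, initial, transitions, emissions):
    """Return the most likely state path and its log-likelihood."""
    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # score[s]: log-likelihood of the best path ending in state s at the current frame
    score = {s: log(initial[s]) + log(emissions[s][observations[0]]) for s in states}
    backpointers = []  # one dict per frame: best previous state for every state
    for obs in observations[1:]:
        pointers, new_score = {}, {}
        for s in states:
            best_prev = max(states, key=lambda p: score[p] + log(transitions[p][s]))
            pointers[s] = best_prev
            new_score[s] = (score[best_prev] + log(transitions[best_prev][s])
                            + log(emissions[s][obs]))
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state.
    last = max(states, key=lambda s: score[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, score[last]


# Hypothetical example model: the weather is hidden, the activity is observed.
STATES = ["rainy", "sunny"]
INITIAL = {"rainy": 0.6, "sunny": 0.4}
TRANSITIONS = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
EMISSIONS = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}
```

In a full recognizer the states are those of the composed phone, word and language-model network, and low-scoring states are usually pruned at each frame to keep the search tractable.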
State of the Art
 What works well
 Constrained vocabulary systems
 Systems adapted to a given speaker
 Systems in anechoic environments without
background noise
 Systems expecting read speech
 What doesn't work
 Large unconstrained vocabulary
 Noisy environments
 Conversational speech
Future Work
 Better representations of audio based on
human hearing
 Better representation of acoustic elements
based on articulatory phonology
 Segmental models that do not rely on the
simple frame-based approach
Resources
 Hidden Markov Model Toolkit (HTK)
 http://htk.eng.cam.ac.uk/
 CHiME (a freely available dataset)
 http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
 Machine Learning Lectures
 http://www.stanford.edu/class/cs229/
 http://www.youtube.com/watch?v=UzxYlbK2c7E
