2. Outline
Define the problem
What is speech?
Feature Selection
Models
Early methods
Modern statistical models
Current State of ASR
Future Work
3. The ASR Problem
There is no single ASR problem
The problem depends on many factors
Microphone: Close-mic, throat-mic, microphone
array, audio-visual
Sources: band-limited, background noise,
reverberation
Speaker: speaker dependent, speaker
independent
Language: open/closed vocabulary, vocabulary
size, read/spontaneous speech
Output: Transcription, speaker id, keywords
5. What is Speech?
Analog signal produced by humans
You can think about the speech signal being
decomposed into the source and filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators
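A minimal sketch of the source/filter idea, assuming an impulse train as a crude stand-in for the vocal folds and a single two-pole resonator as a crude stand-in for one vocal tract resonance. The function names and all numeric values (8 kHz sampling, 100 Hz pitch, 700 Hz resonance) are illustrative, not from the slides.

```python
import math

def impulse_train(n_samples, sample_rate, f0):
    """Source: one unit impulse per pitch period (voiced excitation)."""
    period = int(round(sample_rate / f0))
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, sample_rate, center_hz, bandwidth_hz):
    """Filter: two-pole resonator approximating a single formant."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)  # pole radius < 1, so stable
    theta = 2.0 * math.pi * center_hz / sample_rate      # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = x + a1 * y1 + a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

source = impulse_train(800, 8000, 100.0)             # 0.1 s of "voicing"
speech_like = resonator(source, 8000, 700.0, 100.0)  # shaped by the "tract"
```

Changing f0 changes the pitch of the source without touching the filter, and changing center_hz changes the resonance without touching the source, which is the point of the decomposition.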
12. Feature Selection
As in any data-driven task, the data must be
represented in some format
Cepstral features have been found to perform
well
They represent the "frequency of the
frequencies": the spectrum of the log spectrum
Mel-frequency cepstral coefficients (MFCC)
are the most common variety
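A minimal sketch of a plain real cepstrum (the inverse DFT of the log magnitude spectrum), assuming a short frame and a naive DFT. This is not a full MFCC front end, which would also apply a mel filterbank and a DCT; the function names and the toy frame are illustrative.

```python
import cmath
import math

def dft(values):
    """Naive O(N^2) discrete Fourier transform (fine for short frames)."""
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log magnitude spectrum of one frame."""
    n = len(frame)
    # floor avoids log(0) for empty frequency bins
    log_mag = [math.log(abs(c) + floor) for c in dft(frame)]
    # log_mag is real, so the real part of a forward DFT divided by n
    # equals the real part of the inverse DFT.
    return [c.real / n for c in dft(log_mag)]

# Toy "voiced" frame: one pulse every 20 samples.
frame = [1.0 if t % 20 == 0 else 0.0 for t in range(200)]
cep = real_cepstrum(frame)
# Strongest non-zero quefrency in the first half of the cepstrum.
peak = max(range(1, 101), key=lambda q: cep[q])
```

For this toy frame the strongest cepstral peak falls on a multiple of the 20-sample pulse period: regular spacing in the spectrum shows up as a peak in the cepstrum.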
13. Where do we stand?
Defined the multiple problems associated with
ASR
Described how speech is produced
Illustrated how speech can be represented in
an ASR system
Now that we have the data, how do we
recognize the speech?
14. Radio Rex
First known attempt at speech recognition
A toy from 1922
Worked by analyzing the signal strength at
500 Hz
15. Actual speech recognition systems
Originally thought to be a relatively simple
task requiring a few years of concerted effort
1969, Pierce's “Whither Speech Recognition?” is
published
A DARPA project ran from 1971-1976 in
response to the statements in the Pierce
article
We can examine a few general systems
16. Template-Based ASR
Originally only worked for isolated words
Performs best when training and testing
conditions are matched
For each word we want to recognize, we
store a template or example based on actual
data
Each test utterance is checked against the
templates to find the best match
Uses the Dynamic Time Warping (DTW)
algorithm
17. Dynamic Time Warping
Create a distance matrix for the two
utterances
Use dynamic programming to find the lowest
cost path
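A minimal sketch of template matching with DTW, assuming each utterance is a list of scalar feature values and the local cost is the absolute difference between two frames. Real systems compare feature vectors; the templates and values below are made up for illustration.

```python
def dtw_cost(a, b):
    """Lowest cumulative cost of aligning sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a only, step in b only, step in both
            cost[i][j] = local + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose template has the lowest DTW cost."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))

templates = {"yes": [1.0, 3.0, 4.0, 3.0, 1.0], "no": [5.0, 5.0, 2.0, 0.0]}
utterance = [1.0, 1.0, 3.0, 4.0, 4.0, 3.0, 1.0]  # a slower "yes"
best_word = recognize(utterance, templates)
```

The slower utterance aligns to the "yes" template at zero cost because DTW lets one template frame absorb several test frames.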
18. Hearsay-II
One of the systems developed during the
DARPA program
A blackboard-based system utilizing symbolic
problem solvers
Each problem solver was called a knowledge
group
A complex scheduler was used to decide
when each KG should be called
20. DARPA Results
The Hearsay-II system performed much
better than the two other similar competing
systems
However, only one system, Harpy, met the
performance goals of the project
The Harpy system was also built at CMU
In many ways it was a predecessor to the
modern statistical systems
23. Acoustic Model
For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes
Two methods are commonly used
Multilayer perceptron (MLP) gives the posterior
probability of a class given the data
Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class
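A minimal sketch of the GMM side, assuming diagonal covariances and one-dimensional toy features. The class names, mixture parameters and priors are invented for illustration. The last function converts likelihoods into class posteriors with Bayes' rule, which is the kind of quantity an MLP outputs directly.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def gmm_log_likelihood(x, components):
    """log p(x | class) for a mixture given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))  # log-sum-exp

def posteriors(x, class_models, priors):
    """P(class | x) from the GMM likelihoods and class priors (Bayes' rule)."""
    joint = {c: math.exp(gmm_log_likelihood(x, class_models[c])) * priors[c]
             for c in class_models}
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

# Hypothetical one-dimensional models for two phone classes.
class_models = {
    "aa": [(1.0, [0.0], [1.0])],
    "iy": [(0.5, [3.0], [1.0]), (0.5, [5.0], [1.0])],
}
priors = {"aa": 0.5, "iy": 0.5}
post = posteriors([0.0], class_models, priors)
```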
25. Pronunciation Model
While the pronunciation model can be very
complex, it is typically just a dictionary
The dictionary contains the valid
pronunciations for each word
Examples:
Cat: k ae t
Dog: d ao g
Fox: f aa k s
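A minimal sketch of a dictionary-style pronunciation model, assuming a plain Python dict that maps each word to a list of phone sequences. The second pronunciation of "dog" is an illustrative assumption added to show a word with more than one entry.

```python
LEXICON = {
    "cat": [["k", "ae", "t"]],
    "dog": [["d", "ao", "g"], ["d", "aa", "g"]],  # second variant is illustrative
}

def pronunciations(word):
    """Return every listed phone sequence for a word (empty if out of vocabulary)."""
    return LEXICON.get(word.lower(), [])

def phones_for(words):
    """Expand a word sequence to phones using the first listed pronunciation."""
    phones = []
    for word in words:
        variants = pronunciations(word)
        if not variants:
            raise KeyError(f"out-of-vocabulary word: {word}")
        phones.extend(variants[0])
    return phones

sequence = phones_for(["dog", "cat"])
```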
26. Language Model
Now we need some way of representing the
likelihood of any given word sequence
Many methods exist, but ngrams are the
most common
Ngram models are trained by simply
counting the occurrences of word sequences in a
training set
27. Ngrams
A unigram is the probability of any word in
isolation
A bigram is the probability of a given word
given the previous word
Higher order ngrams continue in a similar
fashion
A backoff probability (from a lower-order ngram)
is used for any word sequence unseen in training
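A minimal sketch of count-based bigram training with a crude backoff, assuming whitespace-tokenised sentences and a fixed backoff weight of 0.4. The fixed weight is a simplification chosen for brevity: the scores are not a properly normalised distribution, which schemes such as Katz backoff take care of.

```python
from collections import Counter

def train(sentences):
    """Count unigrams and bigrams, with <s> marking the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_score(prev, word, unigrams, bigrams, backoff=0.4):
    """Relative-frequency bigram estimate, backing off to a scaled unigram."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    return backoff * unigrams[word] / total

unigrams, bigrams = train(["the cat sat", "the dog sat"])
seen = bigram_score("the", "cat", unigrams, bigrams)    # bigram was observed
unseen = bigram_score("cat", "dog", unigrams, bigrams)  # falls back to unigram
```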
28. How do we put it together?
We now have models to represent the three
parts of our equation
We need a framework to join these models
together
The standard framework used is the Hidden
Markov Model (HMM)
29. Markov Model
A state model using the Markov property
The Markov property states that, given the present
state, the future is independent of the past
Models the likelihood of transitions between
states in a model
Given the model, we can determine the
likelihood of any sequence of states
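A minimal sketch of scoring a state sequence under a Markov model, assuming a two-state toy model. The state names and probabilities are invented for illustration.

```python
def sequence_probability(states, initial, transitions):
    """Probability of a state sequence: initial term times each transition."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        prob *= transitions[prev][nxt]
    return prob

initial = {"sil": 0.8, "speech": 0.2}
transitions = {
    "sil": {"sil": 0.6, "speech": 0.4},
    "speech": {"sil": 0.1, "speech": 0.9},
}
# 0.8 * 0.6 * 0.4 * 0.9
p = sequence_probability(["sil", "sil", "speech", "speech"], initial, transitions)
```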
30. Hidden Markov Model
Similar to a Markov model except the states
are hidden
We now have observations tied to the
individual states
We no longer know the exact state sequence
given the data
Allows for the modeling of an underlying
unobservable process
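A minimal sketch of what "hidden" means in practice, assuming discrete observations and a two-state toy model with invented probabilities. Because the state sequence is unknown, the likelihood of the observations is summed over all state sequences, done here with the forward algorithm.

```python
def forward(observations, states, initial, transitions, emissions):
    """Likelihood of the observations, summed over all hidden state paths."""
    # alpha[s]: probability of the observations so far, ending in state s
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())

states = ["sil", "speech"]
initial = {"sil": 0.8, "speech": 0.2}
transitions = {
    "sil": {"sil": 0.6, "speech": 0.4},
    "speech": {"sil": 0.1, "speech": 0.9},
}
emissions = {
    "sil": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.2, "loud": 0.8},
}
likelihood = forward(["quiet", "loud"], states, initial, transitions, emissions)
```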
31. HMMs for ASR
First we build an HMM for each phone
Next we combine the phone models based
on the pronunciation model to create word
level models
Finally, the word level models are combined
based on the language model
We now have a giant network with potentially
thousands or even millions of states
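A minimal sketch of the first two steps, assuming three left-to-right states per phone (a common choice, but an assumption here) and one pronunciation per word. The language-model step, which would add weighted arcs from word ends to word starts, is omitted. All names are illustrative.

```python
STATES_PER_PHONE = 3  # assumption: three-state left-to-right phone models

def phone_states(phone):
    """State labels for one phone HMM, e.g. k_0, k_1, k_2."""
    return [f"{phone}_{i}" for i in range(STATES_PER_PHONE)]

def word_states(pronunciation):
    """Concatenate the phone models of a pronunciation into a word model."""
    states = []
    for phone in pronunciation:
        states.extend(phone_states(phone))
    return states

def build_network(lexicon):
    """Map every word in the lexicon to its chain of HMM states."""
    return {word: word_states(pron) for word, pron in lexicon.items()}

lexicon = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}
network = build_network(lexicon)
total_states = sum(len(chain) for chain in network.values())
```

Two three-phone words already need 18 states, which gives a feel for how a large vocabulary reaches thousands or millions of states.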
32. Decoding
Decoding happens in the same way as the
previous example
For each time frame we need to maintain two
pieces of information
The likelihood of being at any state
The previous state for every state
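A minimal sketch of a decoder that keeps exactly these two pieces of information per frame, assuming a Viterbi-style search over a two-state toy model with discrete observations. The model values are invented, and a real decoder would work with log probabilities to avoid underflow.

```python
def viterbi(observations, states, initial, transitions, emissions):
    """Return the most likely state path and its probability."""
    # best[s]: likelihood of the best path ending in state s at this frame
    best = {s: initial[s] * emissions[s][observations[0]] for s in states}
    backpointers = []  # one dict per later frame: state -> best previous state
    for obs in observations[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            p = max(states, key=lambda q: prev_best[q] * transitions[q][s])
            best[s] = prev_best[p] * transitions[p][s] * emissions[s][obs]
            pointers[s] = p
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

states = ["sil", "speech"]
initial = {"sil": 0.8, "speech": 0.2}
transitions = {
    "sil": {"sil": 0.6, "speech": 0.4},
    "speech": {"sil": 0.1, "speech": 0.9},
}
emissions = {
    "sil": {"quiet": 0.9, "loud": 0.1},
    "speech": {"quiet": 0.2, "loud": 0.8},
}
path, prob = viterbi(["quiet", "quiet", "loud", "loud"],
                     states, initial, transitions, emissions)
```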
33. State of the Art
What works well
Constrained vocabulary systems
Systems adapted to a given speaker
Systems in anechoic environments without
background noise
Systems expecting read speech
What doesn't work
Large unconstrained vocabulary
Noisy environments
Conversational speech
34. Future Work
Better representations of audio based on
humans
Better representation of acoustic elements
based on articulatory phonology
Segmental models that do not rely on the
simple frame-based approach
35. Resources
Hidden Markov Model Toolkit (HTK)
http://htk.eng.cam.ac.uk/
CHiME (a freely available dataset)
http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
Machine Learning Lectures
http://www.stanford.edu/class/cs229/
http://www.youtube.com/watch?v=UzxYlbK2c7E