sr.ppt

Introduction to Automatic
Speech Recognition

Outline
Define the problem
What is speech?
Feature Selection
Models
 Early methods
 Modern statistical models
Current State of ASR
Future Work

The ASR Problem
There is no single ASR problem
The problem depends on many factors
 Microphone: Close-mic, throat-mic, microphone
array, audio-visual
 Sources: band-limited, background noise,
reverberation
 Speaker: speaker dependent, speaker
independent
 Language: open/closed vocabulary, vocabulary
size, read/spontaneous speech
 Output: Transcription, speaker id, keywords

Performance Evaluation
Accuracy
 Percentage of tokens correctly recognized
Error Rate
 Complement of accuracy: the percentage of tokens recognized incorrectly
Token Type
 Phones
 Words*
 Sentences
 Semantics?
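
Counting correct tokens requires aligning the recognizer output with a reference transcript. A minimal sketch, assuming the common convention of scoring words by edit distance (substitutions, deletions and insertions); the function name `word_error_rate` is hypothetical and not part of the original slides:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = fewest edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

Under this convention insertions also count as errors, so the rate can exceed 100% when the hypothesis is much longer than the reference.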

What is Speech?
Analog signal produced by humans
The speech signal can be thought of as the combination
of a source and a filter
The source is the vocal folds in voiced speech
The filter is the vocal tract and articulators

Feature Selection
As in any data-driven task, the data must be
represented in some format
Cepstral features have been found to perform
well
They are the spectrum of the log spectrum, loosely the
"frequency of the frequencies"
Mel-frequency cepstral coefficients (MFCC)
are the most common variety
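
The usual MFCC recipe for one frame can be sketched as: window, power spectrum, mel-spaced triangular filters, log, then a discrete cosine transform. This is a dependency-free illustration, not a production front end. The filter count (20), coefficient count (13), Hamming window, and all function names are assumptions, not taken from the slides.

```python
import math

def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (slow, but needs no libraries)."""
    n = len(frame)
    spec = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        spec.append((re * re + im * im) / n)
    return spec

def mel_filterbank(num_filters, n_fft, sample_rate):
    """Triangular filters whose centers are evenly spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (num_filters + 1) for i in range(num_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mels]
    bank = []
    for j in range(1, num_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        filt = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):   # rising edge
            filt[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):   # falling edge, peak of 1.0 at k == mid
            filt[k] = (hi - k) / (hi - mid)
        bank.append(filt)
    return bank

def dct2(values, num_coeffs):
    """Unnormalized DCT-II, keeping the first num_coeffs coefficients."""
    n = len(values)
    return [sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
            for k in range(num_coeffs)]

def mfcc_frame(frame, sample_rate, num_filters=20, num_coeffs=13):
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * t / (n - 1)))
                for t, x in enumerate(frame)]  # Hamming window
    spec = power_spectrum(windowed)
    energies = [sum(f * s for f, s in zip(filt, spec))
                for filt in mel_filterbank(num_filters, n, sample_rate)]
    log_energies = [math.log(e + 1e-10) for e in energies]  # small floor avoids log(0)
    return dct2(log_energies, num_coeffs)
```

Calling `mfcc_frame(samples, 8000)` on a 256-sample frame returns 13 coefficients. A real system would repeat this for overlapping frames every few milliseconds.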

Where do we stand?
Defined the multiple problems associated with
ASR
Described how speech is produced
Illustrated how speech can be represented in
an ASR system
Now that we have the data, how do we
recognize the speech?

Radio Rex
First known attempt at speech recognition
A toy from 1922
Worked by analyzing the signal strength at
500Hz

Actual speech recognition
systems
 Originally thought to be a relatively simple
task requiring a few years of concerted effort
 1969, “Wither speech recognition” is
published
 A DARPA project ran from 1971-1976 in
response to the statements in the Pierce
article
 We can examine a few general systems

Template-Based ASR
 Originally only worked for isolated words
 Performs best when training and testing
conditions are matched
 For each word we want to recognize, we
store a template or example based on actual
data
 Each test utterance is checked against the
templates to find the best match
 Uses the Dynamic Time Warping (DTW)
algorithm

Dynamic Time Warping
 Create a similarity matrix for the two
utterances
 Use dynamic programming to find the lowest
cost path
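
A minimal sketch of DTW for template matching, assuming each utterance is a sequence of one-dimensional feature values and using a distance (cost) matrix rather than a similarity matrix. Real systems compare feature vectors per frame. The names `dtw_cost` and `recognize` are hypothetical.

```python
def dtw_cost(a, b, dist=lambda x, y: abs(x - y)):
    """Cost of the cheapest monotonic alignment between sequences a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            # extend the cheapest of the three allowed predecessor cells
            cost[i][j] = local + min(cost[i - 1][j],      # stay on b[j-1]
                                     cost[i][j - 1],      # stay on a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

def recognize(utterance, templates):
    """Return the word whose stored template aligns most cheaply."""
    return min(templates, key=lambda word: dtw_cost(utterance, templates[word]))
```

Because a frame in one utterance may align with several frames in the other, a slowly spoken word can still match a faster template at low cost.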

Hearsay-II
 One of the systems developed during the
DARPA program
 A blackboard-based system utilizing symbolic
problem solvers
 Each problem solver was called a knowledge
source (KS)
 A complex scheduler was used to decide
when each KS should be called

DARPA Results
 The Hearsay-II system performed much
better than the two other similar competing
systems
 However, only one system, Harpy, met the
performance goals of the project
 Like Hearsay-II, Harpy was built at CMU
 In many ways it was a predecessor to the
modern statistical systems

Acoustic Model
 For each frame of data, we need some way
of describing the likelihood of it belonging to
any of our classes
 Two methods are commonly used
 Multilayer perceptron (MLP) gives the posterior
probability of a class given the data
 Gaussian Mixture Model (GMM) gives the
likelihood of the data given a class
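
The two quantities are linked by Bayes' rule. The sketch below scores a single one-dimensional feature value with a GMM per class, then converts those likelihoods into class posteriors of the kind an MLP would output directly. The 1-D simplification, class priors, and function names are assumptions for illustration; real acoustic models work on feature vectors.

```python
import math

def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def gmm_log_likelihood(x, components):
    """log p(x | class) for a mixture given as (weight, mean, variance) tuples."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def class_posteriors(x, class_gmms, priors):
    """P(class | x) via Bayes' rule from per-class GMM likelihoods and priors."""
    joint = {c: gmm_log_likelihood(x, gmm) + math.log(priors[c])
             for c, gmm in class_gmms.items()}
    peak = max(joint.values())
    log_evidence = peak + math.log(sum(math.exp(v - peak) for v in joint.values()))
    return {c: math.exp(v - log_evidence) for c, v in joint.items()}
```

Going the other way, a posterior divided by the class prior gives a scaled likelihood, which is how MLP outputs are typically used in place of GMM scores.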

Pronunciation Model
 While the pronunciation model can be very
complex, it is typically just a dictionary
 The dictionary contains the valid
pronunciations for each word
 Examples:
 Cat: k ae t
 Dog: d ao g
 Fox: f aa k s

Language Model
 Now we need some way of representing the
likelihood of any given word sequence
 Many methods exist, but n-grams are the
most common
 N-gram models are trained by simply
counting the occurrences of word sequences in a
training set

N-grams
 A unigram model gives the probability of a word in
isolation
 A bigram model gives the probability of a word
given the previous word
 Higher order n-grams continue in a similar
fashion
 A backoff probability is used for any word
sequence not seen in training
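
Counting and backoff can be sketched in a few lines. This is an illustration only: the fixed `backoff_weight` is a crude assumption, not a properly normalized scheme such as Katz backoff, and the function names are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def bigram_prob(prev, word, unigrams, bigrams, backoff_weight=0.4):
    """P(word | prev), backing off to a scaled unigram estimate if unseen."""
    if bigrams[(prev, word)] > 0:
        return bigrams[(prev, word)] / unigrams[prev]
    total = sum(unigrams.values())
    # assumed fixed-weight backoff; words never seen at all still get 0
    return backoff_weight * unigrams[word] / total
```

Trained on "the cat sat" and "the dog sat", the pair ("the", "cat") is scored from its bigram count, while the unseen pair ("cat", "dog") falls back to the scaled unigram probability of "dog".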

How do we put it together?
 We now have models to represent the three
parts of our equation
 We need a framework to join these models
together
 The standard framework used is the Hidden
Markov Model (HMM)
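
The equation itself is not reproduced in this text. A common formulation, given here as an assumption about what is meant, picks the word sequence W that is most probable given the acoustic observations X, summing over phone (state) sequences Q. The symbols W, X and Q are introduced here for illustration.

```latex
\hat{W} = \arg\max_{W} P(W \mid X)
        = \arg\max_{W} \sum_{Q}
          \underbrace{p(X \mid Q)}_{\text{acoustic model}}\,
          \underbrace{P(Q \mid W)}_{\text{pronunciation model}}\,
          \underbrace{P(W)}_{\text{language model}}
```

The denominator p(X) from Bayes' rule is dropped because it is the same for every candidate W.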

Markov Model
 A state model using the Markov property
 The Markov property states that the future
depends only on the present state, not on the states before it
 Models the likelihood of transitions between
states in a model
 Given the model, we can determine the
likelihood of any sequence of states
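
Under the Markov property, the likelihood of a state sequence is the initial probability of the first state times the product of the transition probabilities along the sequence. A minimal sketch, assuming the model is stored as nested dictionaries; the function name and the representation are assumptions.

```python
import math

def sequence_log_prob(states, initial, transitions):
    """Log probability of a fully observed state sequence in a Markov model."""
    logp = math.log(initial[states[0]])
    for prev, cur in zip(states, states[1:]):
        # each step depends only on the state immediately before it
        logp += math.log(transitions[prev][cur])
    return logp
```

Working in log probabilities avoids numerical underflow on long sequences.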

Hidden Markov Model
 Similar to a Markov model except the states
are hidden
 We now have observations tied to the
individual states
 We no longer know the exact state sequence
given the data
 Allows for the modeling of an underlying
unobservable process
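
Since the state sequence is unknown, the likelihood of an observation sequence has to sum over every state sequence that could have produced it. The forward algorithm does this efficiently. A minimal sketch with discrete observations; the dictionary layout and function name are assumptions, and in ASR the emission scores would come from the acoustic model rather than a table.

```python
def forward_likelihood(observations, states, initial, transitions, emissions):
    """P(observations | model), summed over all hidden state sequences."""
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: initial[s] * emissions[s][observations[0]] for s in states}
    for obs in observations[1:]:
        alpha = {s: emissions[s][obs] * sum(alpha[p] * transitions[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())
```

Each time step costs on the order of the number of states squared, instead of enumerating an exponential number of paths.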

HMMs for ASR
 First we build an HMM for each phone
 Next we combine the phone models based
on the pronunciation model to create word
level models
 Finally, the word level models are combined
based on the language model
 We now have a giant network with potentially
thousands or even millions of states
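
The first two steps can be sketched as follows, assuming three states per phone and a left-to-right topology in which each state either repeats or advances. Both are common choices but are assumptions here, as are the function names. Joining word models through the language model is omitted.

```python
# Pronunciations follow the dictionary examples given earlier in the slides.
LEXICON = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}

def phone_hmm(phone, states_per_phone=3):
    """State names for one phone model, e.g. k_0, k_1, k_2."""
    return [f"{phone}_{i}" for i in range(states_per_phone)]

def word_hmm(word, lexicon, states_per_phone=3):
    """Concatenate phone models into one left-to-right word model."""
    states = []
    for phone in lexicon[word]:
        states.extend(phone_hmm(phone, states_per_phone))
    self_loops = [(s, s) for s in states]       # stay in the same state
    advances = list(zip(states, states[1:]))    # move to the next state
    return states, self_loops + advances
```

A three-phone word already needs nine states, which suggests how a large vocabulary combined with language-model context can reach the state counts mentioned above.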

Decoding
 Decoding happens in the same way as the
previous example
 For each time frame we need to maintain two
pieces of information
 The likelihood of being at any state
 The previous state for every state
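
Keeping a best score and a back-pointer per state is the bookkeeping of the Viterbi algorithm, which is assumed here to be the intended decoder. A minimal sketch with discrete observations and nonzero probabilities; the function name and dictionary layout are assumptions.

```python
import math

def viterbi(observations, states, initial, transitions, emissions):
    """Most likely hidden state path and its log probability."""
    # score[t][s]: best log probability of any path ending in s at time t
    score = [{s: math.log(initial[s]) + math.log(emissions[s][observations[0]])
              for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for obs in observations[1:]:
        prev_score = score[-1]
        cur, ptr = {}, {}
        for s in states:
            best = max(states, key=lambda p: prev_score[p] + math.log(transitions[p][s]))
            cur[s] = (prev_score[best] + math.log(transitions[best][s])
                      + math.log(emissions[s][obs]))
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # trace back from the best final state
    last = max(states, key=lambda s: score[-1][s])
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, score[-1][last]
```

On a full recognition network, practical decoders also prune unlikely states at each frame (beam search) to keep this tractable.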

State of the Art
 What works well
 Constrained vocabulary systems
 Systems adapted to a given speaker
 Systems in anechoic environments without
background noise
 Systems expecting read speech
 What doesn't work
 Large unconstrained vocabulary
 Noisy environments
 Conversational speech

Future Work
 Better representations of audio based on
human auditory perception
 Better representation of acoustic elements
based on articulatory phonology
 Segmental models that do not rely on the
simple frame-based approach

Resources
 Hidden Markov Model Toolkit (HTK)
 http://htk.eng.cam.ac.uk/
 CHIME (a freely available dataset)
 http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
 Machine Learning Lectures
 http://www.stanford.edu/class/cs229/
 http://www.youtube.com/watch?v=UzxYlbK2c7E
