Applying Various DSP-Related Techniques for Robust Recognition of Adult and Child Speakers
R. Atachiants
C. Bendermacher
J. Claessen
E. Lesser
S. Karami
January 21, 2009
Faculty of Humanities and Sciences, Maastricht University
Abstract
This paper approaches speaker recognition in a new way. A speaker
recognition system has been realized that works on adult and child
speakers, both male and female. Furthermore, the system employs
text-dependent and text-independent algorithms, which makes robust speaker
recognition possible in many applications. Single-speaker classification
is achieved by age/sex pre-classification and is implemented using classic
text-dependent techniques, as well as a novel technology for
text-independent recognition. This new research uses Evolutionary Stable
Strategies to model human speech and allows speaker recognition by
analyzing just one vowel.
1 Introduction
In the past few years, privacy has become increasingly important to people
all over the world. A major factor is the rise of the Internet: private
aspects of a person's life have become easier to access and easier to
copy. Because money is essential to daily life, it is a frequent target of
theft over the Internet; copying cards and the information belonging to
them is easier than ever and still occurs very often. A proper system for
voice recognition, combined with a password, could increase the security
of our money.

Handling these problems starts with modeling human speech. If an algorithm
can recognize speech on its own, no person is needed to check the sounds
for human speech. Several questions arise: 'How can an algorithm know that
there is speech?', 'How does an algorithm estimate the noise?', 'How does
an algorithm achieve the classification of speech, and when should it do
so?', and 'How can an algorithm notice that there are multiple speakers?'.
These questions lead to an overall problem definition: 'How to identify
one or more speakers?'.
To handle this problem, the paper starts with detecting speech. This is
the subject of section 2. Here speech is detected by using an end-point
detection algorithm, which recognizes speech, and noise reduction that
uses three ways of filtering, namely (a) Finite Impulse Response (FIR)
filtering, (b) Wavelets, and (c) Spectral Subtraction. Combining these
three subjects, the program retrieves a signal that satisfies the
properties required by the algorithms for classification: classifying the
speaker alone or in a conversation. First the speaker has to be recognized
when he is talking alone; this is discussed in section 3. Speaker
recognition is done using (a) discrete word selection, (b) Mel-Frequency
Cepstral Coefficients and Vector Quantization, (c) Age/Sex Classification,
(d) a Voice Model, and (e) the contradiction checks that lead to the
conclusion that a particular person is speaking or not. In the last part,
multiple speakers are identified and classified using the methods Framed
Multi-Speaker Classification and Harmonic Matching Classifier. After these
sections there is a short discussion of the subjects, and then the
conclusions are presented.
2 Speech detection
The very first step in identifying the speaker is detecting speech. This
means that the part of the signal that contains speech has to be separated
from the noise.

Two algorithms can be used to detect speech. The first is endpoint
detection, described in subsection 2.2; the second is noise reduction,
described in subsection 2.3.
2.1 Architectural overview
If a signal contains little noise, the end point detection algorithm can
effectively determine whether the signal contains speech. However, if
there is much noise, noise reduction has to be applied to the signal
first. To estimate the noise level of a signal, the Spectral Subtraction
algorithm is used. This estimation is then compared to the whole signal,
resulting in the signal-to-noise ratio (SNR). If needed, one of three
noise reduction techniques (FIR, Spectral Subtraction and Wavelets) is
selected, based on the weighted SNR of each denoised signal. FIR is
preferred over Wavelets and Spectral Subtraction, while the use of
Wavelets is preferred over Spectral Subtraction. When end point detection
is used on the selected denoised signal, it is safe to say that speech can
be detected accurately. See figure 1 for a schematic overview.
Figure 1: Architectural overview of speech detection
2.2 Endpoint detection
The endpoint detection algorithm filters the noise from the beginning and
end of the signal and detects the beginning and end of speech. If these
two points are the same, the signal contains no speech and consists only
of noise.

It is assumed that the first 100 ms of the signal contains no speech. From
this part of the signal the energy and the zero crossing rate (ZCR) of the
noise can be calculated. Next, the lower threshold (ITL) and upper
threshold (ITU) can be calculated as follows:

I1 = 0.03 * (maxEnergy - avgEnergy) + avgEnergy
I2 = 4 * avgEnergy
ITL = min(I1, I2)
ITU = 5 * ITL
To determine the starting point (N1) and the end point (N2) of the speech,
the ITL and ITU are considered. When the energy of the signal crosses the
ITL for the first time, this point is saved. If the energy then drops
below the ITL again, it was a false alarm. However, when it also crosses
the ITU, speech was found and the saved point is considered N1, see figure
2(a). For N2 a similar procedure is followed, just the other way around.

Finally, N1 and N2 can be determined more precisely by looking at the ZCR.
To be more exact, a closer look is taken at the ZCRs of the 250 ms before
N1. If a high ZCR is found in that interval, that is an indication that
there is speech and N1 needs to be reconsidered, see figure 2(b).
Similarly N2 can be determined more accurately.
Figure 2: (a) Determining N1 and N2. (b) Redetermining N1 and N2.
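The threshold computation and the search for N1 described above can be sketched as follows. This is a minimal illustration, not the system's code; the function and variable names are assumptions.

```python
import numpy as np

def energy_thresholds(noise_energy, signal_energy):
    """Compute ITL/ITU from the first-100-ms noise estimate, following
    the formulas in the text."""
    avg_energy = np.mean(noise_energy)
    max_energy = np.max(signal_energy)
    i1 = 0.03 * (max_energy - avg_energy) + avg_energy
    i2 = 4.0 * avg_energy
    itl = min(i1, i2)
    itu = 5.0 * itl
    return itl, itu

def find_n1(frame_energy, itl, itu):
    """Return the index of the first ITL crossing that later reaches ITU
    without falling back below ITL (None if no speech is found)."""
    candidate = None
    for i, e in enumerate(frame_energy):
        if candidate is None and e > itl:
            candidate = i            # possible start of speech
        if candidate is not None:
            if e > itu:
                return candidate     # confirmed: speech starts here
            if e < itl:
                candidate = None     # false alarm below ITL: reset
    return None
```

The end point N2 would be found the same way with the frame order reversed.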
2.3 Noise reduction
Noise reduction is the other algorithm used for speech detection. It can
be done with the help of FIR filtering, spectral subtraction and wavelets;
more information about these topics can be found in subsections 2.3.1,
2.3.2 and 2.3.3 respectively.
2.3.1 FIR filtering

In signal processing two types of filters are used. As the name suggests,
the impulse response of the FIR filter is finite. The other type's impulse
response is normally not finite because of its feedback structure. The FIR
filter is extensively used in this project in order to remove white
Gaussian noise (WGN) from the signals. The frequencies of the WGN lie
mainly in the low frequency band of the spectrum. A first-order high pass
(FIR) filter has been applied to strengthen the amplitude of the high
frequencies. This is done by decreasing the amplitude of the low
frequencies by up to 20 dB, so the speech becomes stronger and the noise
is reduced.

For filtering, the transfer function in the z-domain is used:

H[z] = (z - α) / z    (1)

The standard form of a transfer function is H(z) = Y(z) / X(z), so it is
clear that Y(z) = z - α and X(z) = z. The working of this formula is shown
in figure 3.
A transfer function whose poles all lie within the unit circle of the
z-plane is always stable; therefore α lies between -1 and 1. To get as
large a decrease as possible with a first-order formula, α is set to 0.95.
In figure 4 the pole and zero are shown: since X(z) = 0 at z = 0 (where
Y(z) = -0.95), the single pole lies at the origin. So the FIR filter is
stable because of the location of the pole and of α in the z-plane.

Figure 3: FIR filter

Figure 4: z-plane
To determine the frequency response of a discrete-time (FIR) filter, the
transfer function is evaluated at z = e^(jωT). From all this, the transfer
function used in this paper for FIR filtering looks as in formula 2:

H[e^(jωT)] = 1 - 0.95 * e^(-jωT)    (2)
This is one way of filtering WGN. In the section about wavelets, 2.3.3,
another approach to filtering WGN is explained.
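As a concrete illustration, the difference equation belonging to H(z) = 1 - 0.95 * z^(-1) is y[n] = x[n] - 0.95 * x[n-1]. A minimal sketch (the function name is an assumption, not the paper's code):

```python
import numpy as np

def fir_highpass(x, alpha=0.95):
    """First-order FIR high-pass: y[n] = x[n] - alpha * x[n-1],
    the difference equation of H(z) = 1 - alpha * z^(-1)."""
    x = np.asarray(x, dtype=float)
    y = x.copy()
    y[1:] -= alpha * x[:-1]   # subtract the scaled previous sample
    return y
```

In this sketch a constant (0 Hz) input is attenuated to 1 - α = 0.05 of its amplitude (about 26 dB down), while an alternating input at the Nyquist rate is amplified by 1 + α = 1.95, which matches the high-pass behaviour described above.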
2.3.2 Spectral subtraction
Spectral subtraction is an advanced form of noise reduction. It is used
for signals that contain non-Gaussian (artificial) noise. After framing
and Hamming windowing (DSP), endpoint detection is used on every frame to
separate the noise frames from the frames with speech. From the noise
frames a noise estimate of the signal is made. After applying the Discrete
Fourier Transform (DFT) to the windowed signal, the noise estimate is
simply subtracted from the signal to obtain the denoised frames. Moreover,
the noise estimate is used to calculate the SNR later on (see section
2.1).

Finally the inverse DFT is taken and the frames can be reassembled to
obtain the denoised signal. A schematic overview of the whole process is
given in figure 5.
Figure 5: Spectral subtraction
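The frame-wise subtraction can be sketched as follows. This is an illustrative sketch under simplifying assumptions (magnitude-domain subtraction floored at zero, reusing the noisy phase), not the system's actual code.

```python
import numpy as np

def spectral_subtract(frames, noise_frames):
    """Subtract an average noise magnitude spectrum from each frame and
    resynthesize with the original (noisy) phase."""
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames],
                        axis=0)
    out = []
    for f in frames:
        spec = np.fft.rfft(f)
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
        out.append(np.fft.irfft(mag * np.exp(1j * np.angle(spec)),
                                n=len(f)))
    return out
```

A frame identical to the noise estimate is cancelled almost completely, which is the intended limiting behaviour of the subtraction step.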
2.3.3 Wavelets
As already suggested in the section about FIR filtering, wavelets are used
to filter white Gaussian noise from the signal. This way of filtering
starts with the original signal and a mother wavelet, which can be chosen
from the many mother wavelets that are available. In this paper one of the
Daubechies wavelets is used, the Daubechies 3, which is often used in
Matlab and is recommended by Matlab, the program used to create the
wavelet filter.

The next step in filtering is the decomposition of the original signal. By
fitting the mother wavelet to the signal at the smallest scale, the filter
produces what is called the first wavelet detail and a remainder which is
called the first approximation. Then the timescale of the mother wavelet
is doubled and again fit to the first approximation. This results in a
second wavelet detail and a second remainder, the second approximation.
Doubling the timescale of the mother wavelet is also known as dilation.
Dilation and splitting the remainders into a new detail and approximation
part, figure 6, is continued until the mother wavelet has been dilated to
such an extent that it covers the entire range of the signal. [9]
There are two ways of thresholding: soft and hard thresholding. With hard
thresholding, the part of the signal below a certain threshold is set to
zero. Soft thresholding is more complicated: it subtracts the value of the
threshold from the values of the signal that are above that threshold,
while the values below the threshold are again set to zero. [10] In Matlab
this is integrated in the functions ddencmp and wdencmp. The function
ddencmp determines, from the sound sample, a threshold value and the way
of thresholding to use. The function wdencmp uses this threshold value and
soft or hard thresholding to create a de-noised signal. So using these two
functions, Matlab generates a denoised signal by itself.

Figure 6: Signal decomposition
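The two thresholding rules themselves are compact enough to state directly. A sketch of the generic rules (not of ddencmp/wdencmp, whose parameter selection Matlab handles internally):

```python
import numpy as np

def hard_threshold(coeffs, t):
    """Zero every wavelet coefficient whose magnitude is below t;
    coefficients above t are kept unchanged."""
    c = np.asarray(coeffs, dtype=float)
    return np.where(np.abs(c) < t, 0.0, c)

def soft_threshold(coeffs, t):
    """Zero coefficients below t and shrink the remaining ones
    toward zero by t."""
    c = np.asarray(coeffs, dtype=float)
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)
```

Applied to the wavelet detail coefficients, both rules suppress the small coefficients that mostly carry noise; soft thresholding additionally avoids the discontinuity at the threshold.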
3 Speaker classification

The speaker classification algorithms described in this paper work best on
discrete words or small signals. First, the Discrete Word Selection (DWS)
algorithm is applied to cut out the part of the signal containing the most
vowel components. Next, the Age/Sex Classification (ASC) algorithm tries
to classify the signal in order to reduce computation by eliminating
database samples that need not be processed. Then text-dependent (T-D)
speaker detection techniques, such as Dynamic Time Warping (DTW) and
Vector Quantization (VQ), and a text-independent (T-I) technique, the
Voice Model algorithm, are run. The outputs are checked for
contradictions; if one is detected, the ASC bias is discarded and the T-D
and T-I algorithms are computed again. If a speaker is detected, the
system proceeds to the classification of multiple speakers, using two
different techniques in parallel: Framed Multi-Speaker Classification and
the Harmonic Matching Classifier. The results of both are combined to
achieve the best result. See figure 7 for a schematic overview.
Figure 7: Architectural overview of speaker recognition
3.1 Discrete word selection
Discrete word selection is used for two reasons. First of all, the
techniques used in the system are mainly valid for discrete speech
processing and not so much for the processing of continuous speech. This
means that the best results will be achieved when working with only one
isolated group of words. Working with discrete speech will also optimize
the performance of the system. The second reason for using discrete word
selection is as a help for the 'Age/Sex Classification' (ASC) block. The
ASC block uses physical properties of the human vocal tract to classify
speech.

The algorithm for discrete word selection is based on the V/C/P (Vowel /
Consonant / Pause) classification algorithm. This algorithm is text
independent and composed of four blocks, see figure 8.
In the first block the main features are extracted; in the second block
the signal is framed and classified for the first time. Next, the noise
level is estimated and the frames are classified again with an updated
noise level parameter.

In order to distinguish a consonant, the V/C/P algorithm proposes the use
of zero crossing rate features and a threshold (ZCR_dyna). When the ZCR is
larger than the threshold, the frame can be classified as a consonant. If
the frame cannot be classified this way, the energy of that frame is
checked: if the energy is smaller than the overall noise level, the frame
is classified as a pause; if the energy is larger, the frame is classified
as a vowel. The results of V/C/P classification on an example speech clip
are shown in figure 9.

Figure 8: V/C/P classification algorithm blocks.
Figure 9: V/C/P classification of an example speech clip (o: consonant,
+: pause, *: vowel). Image from Microsoft Research Asia. [12]
The complete discrete word selection algorithm is implemented as follows:

1. The audio input is segmented into non-overlapping frames of 10 ms, from
which energy and ZCR features are extracted.

2. The energy curve is smoothed using FIR filtering.

3. The Mean_Energy and Std_Energy of the energy curve are calculated to
estimate the background noise energy level, and the threshold of ZCR
(ZCR_dyna), as:

NoiseLevel = Mean_Energy - 0.75 * Std_Energy
ZCR_dyna = Mean_ZCR + 0.5 * Std_ZCR

4. Frames are coarsely classified as V/C/P using the following rules,
where FrameType denotes the type of each frame:

If ZCR > ZCR_dyna
    then FrameType = Consonant
Elseif Energy < NoiseLevel
    then FrameType = Pause
Else FrameType = Vowel

5. Update the NoiseLevel as the weighted average energy of the frames at
each vowel boundary and the background segments.

6. Re-classify the frames using the rules in step 4 with the updated
NoiseLevel. Pauses are merged by removing isolated short consonants. A
vowel is split at its energy valley if its duration is too long.

7. After classification has terminated, select the word with the highest
number of V-frames.
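Steps 3 and 4 above can be sketched as follows, using the 0.75/0.5 weights from the formulas. The names are assumptions and this is an illustration, not the system's code:

```python
import numpy as np

def vcp_classify(energies, zcrs):
    """Coarse V/C/P labeling per steps 3-4: derive NoiseLevel and
    ZCR_dyna from frame statistics, then label each frame."""
    energies = np.asarray(energies, dtype=float)
    zcrs = np.asarray(zcrs, dtype=float)
    noise_level = energies.mean() - 0.75 * energies.std()
    zcr_dyna = zcrs.mean() + 0.5 * zcrs.std()
    labels = []
    for e, z in zip(energies, zcrs):
        if z > zcr_dyna:
            labels.append('C')   # consonant: high zero-crossing rate
        elif e < noise_level:
            labels.append('P')   # pause: energy below noise level
        else:
            labels.append('V')   # vowel: voiced, energetic frame
    return labels
```

Steps 5-6 would then recompute NoiseLevel from the pause/background frames and run the same labeling again.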
3.2 MFCC and vector quantization
Mel-frequency cepstral coefficients (MFCCs) and vector quantization (VQ)
are used to construct a set of highly representative feature vectors from
a speech fragment. These vectors are used to achieve speaker
classification.

Frequencies below 1 kHz contain the most relevant information for speech;
hence human hearing emphasizes these frequencies. To imitate this,
frequencies can be mapped to the Mel frequency scale (Mel scale). The Mel
scale is linear up to 1 kHz, while for higher frequencies it is
logarithmic, thus emphasizing lower frequencies. After converting to the
Mel scale, the MFCCs can be found using the Discrete Cosine Transform. In
this paper 13 MFCCs are obtained from each frame of the speech signal.

Since a speech fragment is generally divided into many frames, this
results in a large set of data. Therefore VQ, implemented as proposed in
[7], is used to compress these data points to a small set of feature
vectors (codevectors). In the case of speech fragments, the set of
codevectors is a representation of the speaker; such a representation is
called a codebook. Here VQ is used to compress each set of MFCCs to 4
points. In the training phase a codebook is generated for every known
speaker, and these codebooks are saved in the database.

When identifying a speaker from a new speech fragment, VQ compares the
MFCCs of the fragment to each codebook in the database, as can be seen in
figure 10. The distance between an MFCC vector and the closest codevector
is called its distortion. The codebook with the smallest total distortion
over all MFCC vectors identifies the speaker.
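The matching step can be sketched directly from this description (a sketch with assumed names; codebook training itself, e.g. the LBG algorithm of [7], is not shown):

```python
import numpy as np

def total_distortion(mfccs, codebook):
    """Sum over frames of the distance to the nearest codevector."""
    mfccs = np.atleast_2d(np.asarray(mfccs, dtype=float))
    codebook = np.atleast_2d(np.asarray(codebook, dtype=float))
    # pairwise distances: frames x codevectors
    d = np.linalg.norm(mfccs[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).sum())

def identify_speaker(mfccs, codebooks):
    """Return the speaker whose codebook yields the smallest total
    distortion; codebooks maps speaker name -> list of codevectors."""
    return min(codebooks,
               key=lambda name: total_distortion(mfccs, codebooks[name]))
```

For example, frames lying near one speaker's codevectors produce a small total distortion for that codebook and a large one for all others.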
3.3 Dynamic time warping
Dynamic Time Warping (DTW) is a generic algorithm used to compare two
signals. In order to find the similarity between two sequences, or as a
preprocessing step before averaging them, the time axis of one (or both)
sequences must be warped to achieve a better alignment, see figure 11.

Figure 10: Matching MFCCs to a codebook

Figure 11: Two sequences of data that have a similar overall shape but are
not aligned on the time axis. [11]

In order to compare two speech signals, the system applies DTW to the 13
Mel-frequency cepstral coefficients (MFCCs) from the Mel scale and
compares them to the database samples.

To find a warping path between two sequences of MFCC data, a few steps are
required:

1. Calculate the distance cost matrix (in this paper the Euclidean
distance was used to compute the cost).

2. Compute the path, starting from a corner of the cost matrix and
processing adjacent cells. This path can be found very efficiently using
dynamic programming. [11]
The warping path W = w1, w2, ..., wk, ..., wK is subject to
max(m, n) <= K < m + n - 1, where m and n are the lengths of the two
sequences.

3. Select the path which minimizes the warping cost:

DTW(Q, C) = min( (Σ_{k=1}^{K} w_k) / K )

4. Repeat the path calculation for each MFCC feature and compute a
difference from each path.
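The dynamic-programming computation of steps 1-3 can be sketched for one-dimensional sequences as follows (a sketch; the system applies it per MFCC dimension, and the names are assumptions):

```python
import numpy as np

def dtw_cost(q, c):
    """Classic DP over the pairwise cost matrix; returns the minimal
    cumulative warping cost between sequences q and c."""
    q = np.asarray(q, dtype=float)
    c = np.asarray(c, dtype=float)
    n, m = len(q), len(c)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(q[i - 1] - c[j - 1])        # local distance
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return float(D[n, m])
```

Note that a sequence aligned against a time-stretched copy of itself still gets cost zero: the warping absorbs the repeated samples.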
3.4 Age/sex classification

The ASC block is based on physical properties of speech and the vocal
tract and will pre-classify the input into one of the following 4
categories: male adult, female adult, male child, female child. This
pre-classification helps the classification algorithms of the system to
classify the speaker more accurately.

The total length L of the vocal tract can be calculated from the first
harmonic of a sound exiting a closed tube:

L = c / (4F)    (3)
where c is the speed of sound and F the fundamental frequency. Once the
length of the vocal tract has been calculated, it is very straightforward
to classify the length according to age and sex. The general assumptions
are that an adult has a longer vocal tract than a child and that a male
has a longer vocal tract than a female [1]. For easier implementation of
the classifier, it was chosen to work with vocal tract length instead of
directly with the fundamental frequencies.

Based on [2] the ASC algorithm has been developed and implemented; it uses
LPC to extract the first formant from the signal. Classification is then
based on heuristic methods, where the length intervals for adult female
and child male are divided into sub-bands, allowing these categories to be
distinguished.
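Equation (3) and the length-based decision can be sketched as follows. The cutoff lengths below are invented purely for illustration; the paper's actual sub-band intervals are not reproduced here.

```python
def vocal_tract_length(f, c=343.0):
    """Quarter-wavelength resonance of a closed tube: L = c / (4 * F),
    with F in Hz and c the speed of sound in m/s (equation 3)."""
    return c / (4.0 * f)

def coarse_age_sex(length, child_cutoff=0.13, male_cutoff=0.16):
    """Illustrative heuristic only: these band edges are assumptions,
    not the sub-bands used by the paper's ASC block."""
    age = 'child' if length < child_cutoff else 'adult'
    sex = 'male' if length >= male_cutoff else 'female'
    return sex + ' ' + age
```

For instance, a first resonance near 500 Hz yields L = 343 / 2000 ≈ 0.17 m, in the range of a typical adult male vocal tract.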
Implementation-wise it is important to note that ASC has been implemented
such that it is only carried out if the number of samples in the database
of the system is larger than the number of classes of speakers. This is
done to prevent the pre-classification block (ASC) from acting as a
classification algorithm and thereby disabling the classification blocks.
3.5 Voice model
Human speech is produced by expelling air from the lungs into the vocal
tract, where the 'air signal' is 'modeled' into the desired utterance by
the glottis, the tongue and the lips, amongst others. Thus, a speech
signal can be seen as a signal evolving over time which is formed by
certain invasions. In this research, it is proposed to use Evolutionary
Stable Strategies (ESS), originating from the field of game theory, to
model human speech and to accurately recognize speakers on a
text-independent basis.

In Appendix A a detailed overview is given of how this theory is
developed; here the general implementation of the algorithm is discussed.
A solution to the following two research problems is attempted:

1. Find an algorithm that, given an utterance of human speech, determines
a fitness matrix, appropriate strategies and invasions so that the speech
utterance is correctly defined by the resulting evolution of the
population of the game.

2. Employ the result of goal 1 to achieve speaker recognition,
text-independent if possible.

Since the filtering effects of the separate speech organs can hardly be
distinguished, a lossless concatenated tube model (n-tube model [3][4]) is
assumed for modeling the vocal tract instead. The n-tube model also allows
sequential modeling of the speech utterance and thus solves the problem of
parallel effects that occur in the vocal tract.
In essence, the algorithm proceeds as follows:

1. Determine the number of tubes in the model and their respective
equations.

2. Start filling out the fitness matrix:

(a) Initially it contains the value 2 in position (1,1).
(b) Determine the equation of the signal after applying the first filter.
(c) Determine the elements of the next column of the fitness matrix.
(d) Determine the correct invasion parameters so that the current signal
will become the desired signal as determined in (b).
(e) Repeat steps (b) to (d) until the desired utterance is modeled (until
all tubes have been passed).

3. Store the values from step 2 in a database format that includes
elements of the fitness matrix as well as strategy information and
invasions.

In order to analyze the feasibility of this algorithm, it is required to
delve a bit deeper into steps (c) and (d). It is obvious that (c) and (d)
are mutually dependent, since the outcome of an invasion will depend on
the offspring parameters. Furthermore, it has to be determined what
strategy to play generally and when to invade. Finally, an ESS that will
simplify the entire process has to be incorporated.
Assume that at every iteration it is decided to carry out a pure invasion;
that is, at time step x + ε the type of column x will invade the existing
population, or more concretely, at that point in time the game will be
played with strategy (0,1), where the 1 is for the type of column x. In
that case, the elements of column x have to be such that filling them out
in equations (A.4) and (A.5) yields the correct population graph.

Using an ESS helps determine at what exact time steps to carry out pure
invasions, since the evolution of the population is then predetermined and
thus known. It is desirable that playing (1,0), where the 1 is for the
first element of the first column, is an ESS. Therefore, all other
elements in the fitness matrix must be smaller than 2.
To tackle the second research goal, it is important to know that the
equations of the filters will partially depend on the physical model of
the speaker. The question is thus how to extract these parameters from the
speech utterance so that the equations for the filters can be established.
3.6 Contradictions
Since three algorithms are employed in the single-speaker classification
stage, their respective outcomes have to be checked for consistency. A
list of contradictions allows the system to detect inconsistencies as well
as indications of multiple speakers.

In the table above, T-D denotes the text-dependent algorithms, while T-I
denotes the text-independent algorithm. The system contains two
text-dependent algorithms and one text-independent algorithm. The binary
value for T-D is defined by the logical AND of T-D1 and T-D2.
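The binary fusion rule can be stated compactly. This sketch encodes only the AND rule described above, not the full contradiction table; the argument names are assumptions.

```python
def is_contradiction(td1_match, td2_match, ti_match):
    """The combined text-dependent value is the logical AND of T-D1 and
    T-D2; a contradiction is flagged when that value disagrees with the
    text-independent outcome."""
    td = td1_match and td2_match
    return td != ti_match
```

When a contradiction is flagged, the system discards the ASC bias and re-runs the classification algorithms, as described in section 3.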
4 Multiple speaker detection
In order to successfully classify multiple speakers in a speech clip, two
use-cases should be analyzed. There are two main types:

1. Non-overlapping speech, where two or more speakers speak in different
time frames (for example a dialogue).

2. Overlapping speech, where two or more speakers speak both in separate
time frames and in the same time frames (for example a debate).

In this section we discuss a technique for each of these use-cases: Framed
Multi-Speaker Classification for non-overlapping speech and the Harmonic
Matching Classifier for overlapping speech. These two techniques are
executed in parallel in the system and their results are combined in order
to detect as many speakers as possible.
4.1 Framed multi-speaker classification

The Framed Multi-Speaker (FMS) classification algorithm is used in the
system to detect and classify multiple speakers in a speech signal. In
order to do this, the whole signal is processed. The algorithm is used on
dialogues and other non-overlapping speech clips. It uses single-speaker
classification techniques to detect each speaker.

Figure 12: FMS classification stages.

The algorithm works in 3 stages, as shown in figure 12:

1. FMS starts by erasing the pauses in the signal and uses this to frame
the signal;

2. It loops over each frame and classifies the frame using the
classification techniques discussed in the previous section. The
text-dependent speaker classification as well as the text-independent
classification algorithms are used. A check for contradictions is also
done to classify the single speaker, as shown in figure 13;

3. Finally, FMS checks the results to extract only distinct speakers.
4.2 Harmonic matching classifier

In order to enable the system to recognize speakers in multi-speaker
speech fragments with overlapping speech, the Harmonic Matching Classifier
(HMC) is used. The HMC was introduced by Radfar et al. in [5] and
separates Unvoiced-Voiced (U-V) frames from Voiced-Voiced (V-V) frames in
mixed speech.

Figure 13: FMS classification, per-frame classification block.

The table above indicates what kind of speech is uttered by the respective
speaker in each frame category. U-V frames are useful in speaker
recognition of mixed speech, since in such a frame the features of the
voiced speaker will dominate. Hence, it will be possible to recognize the
speaker for every frame. However, before being able to separate U-V frames
from V-V frames, first the U-U frames have to be removed from the signal.
To achieve this, an algorithm proposed by Bachu et al. [6] is employed,
which uses energy and ZCR calculations to distinguish unvoiced frames from
voiced frames. Unvoiced/voiced classification is based on heuristic
methods. HMC recognizes U-V frames by fitting a harmonic model, given by
the first equation below, to a mixed analysis frame and then evaluating
the introduced error (second equation) against a threshold σ (third
equation). This process is repeated for all frames of the mixed signal.
1. H_model = Σ_{l=1}^{L(ω_i)} A²_{lω_i} W²(ω - lω_i)

2. e_t = min_{ω_i} | |X^t_mix(ω)|² - H_model |

3. σ = mean({e_t}_{t=1}^{T})

where ω_i is the fundamental frequency and W(ω) is a window applied to the
spectrum. The X in equation 2 denotes the spectrum of the t-th mixed
signal frame.
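The error computation and thresholding can be sketched as follows. This is an illustrative sketch, not the implementation from [5]: the harmonic model builder is passed in as a function of the candidate fundamental, and the decision direction (a low fit error indicating a single dominant voiced speaker, hence a U-V frame) is an assumption.

```python
import numpy as np

def frame_errors(power_frames, harmonic_model, f0_candidates):
    """e_t per frame: the smallest absolute difference between the
    frame's power spectrum and the harmonic model over the candidate
    fundamentals; harmonic_model(f0) is assumed to return a spectrum."""
    return [min(float(np.sum(np.abs(p - harmonic_model(f0))))
                for f0 in f0_candidates)
            for p in power_frames]

def uv_flags(errors):
    """Flag frames whose fit error stays below the mean-error
    threshold sigma."""
    sigma = float(np.mean(errors))
    return [e < sigma for e in errors]
```

The flagged frames would then be handed to the VQ block described below.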
After the U-V frames have been extracted from the mixed speech signal,
they are passed to the Vector Quantization (VQ) block of the system, where
every frame is matched against the relevant database and two speakers are
finally recognized. Our system is currently limited to recognizing at most
2 speakers from a mixed signal, which is an obvious consequence of the
limitations of the methods used, especially harmonic model fitting.
5 Test and Results
The output of the program when comparing a speech file with an identical
existing one. Everything is classified perfectly.
-
START PHASE 1
Starting Endpoint Detection...
.ITL: 1.2448
.ITU: 6.2242
.IZCT: 220.5
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW...
'Adult Female'
Starting VQ...
'Adult Female'
Starting VM...
'Adult Female'
Final result:
'Adult Female'
Trying to classify a different sound file (same person, same text). Once
again, everything is classified and there are no contradictions.
START PHASE 1
Starting Endpoint Detection...
.ITL: 0.66714
.ITU: 3.3357
.IZCT: 120
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Saving file...
Starting MFCC...
Starting DTW...
'Adult Female'
Starting VQ...
'Adult Female'
Starting VM...
'Adult Female'
Final result:
'Adult Female'
-
Classifying a poor quality sound file: VM and VQ classify it correctly,
but DTW fails. The contradictions are verified and the final result is
assigned correctly.
-
START PHASE 1
Starting Endpoint Detection...
.ITL: 0.2542
.ITU: 1.271
.IZCT: 120
.EnergyTotal: 387 elements
.RatesTotal: 387 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW...
'Adult Male'
Starting VQ...
'Child Female'
Starting VM...
'Child Female'
Final result:
'Child Female'
-
Classifying a poor quality sound file: this time DTW and VM classify it
correctly, but VQ fails. The contradictions are verified and the final
result is assigned correctly.
-
START PHASE 1
Starting Endpoint Detection...
.ITL: 1.3699
.ITU: 6.8496
.IZCT: 120
.EnergyTotal: 421 elements
.RatesTotal: 421 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Skipping DWS and loading existing one...
Starting MFCC...
Starting DTW...
'Adult Male'
Starting VQ...
'Child Female'
Starting VM...
'Adult Male'
Final result:
'Adult Male'
6 Discussion
The developed system incorporates classical as well as novel techniques
and is a combination of scientifically proven and heuristic methods. The
techniques used for speech detection and noise reduction are well known
and widely used in speech processing applications. The addition of
Spectral Subtraction to this stage of the system is a novel touch that
improves the accuracy of further steps.

In the processing and single-speaker classification stage, various
DSP-related techniques have been combined with new research. Discrete Word
Selection and Age/Sex Classification both rely on existing methods, but
are used in an entirely new fashion in our implementation. Digital signal
processing, which incorporates windowing, framing and frequency analysis
(MFCC), on the other hand, consists of classical supporting techniques
that are used to prepare the signal for further processing, as is
customary in this kind of system.

Working with pre-classification is very useful for larger databases and
provides the user of the system with information about the speaker even if
the system can find no match. Needless to say, the system relies heavily
on the physical model of speech and the vocal tract to accomplish this for
adult and child, male and female speakers.
For the actual classification, three algorithms have been selected that
fit the requirements of the system best. An originally planned
implementation of Extended Dynamic Time Warping (EDTW), however, had to be
reduced to the simple Dynamic Time Warping implementation due to a lack of
time. Extended Dynamic Time Warping applies dimensionality reduction
algorithms such as Principal Component Analysis before searching for a
cost path, which would have optimized the performance of the system.

The new research that the system incorporates, namely single-speaker,
text-independent classification using Evolutionary Stable Strategies, is a
very interesting technique that needs further development and testing
before its actual use can be proven.
Multi-speaker classification also uses a novel heuristic method (Framed Multi-Speaker Classification) for the recognition of multiple speakers in non-overlapping speech. Harmonic Model Classification is a combination and adaptation of existing methods and is used for recognition in overlapping speech, which is a novelty in its own right that is not easily achieved.
Between several stages of the system, a considerable amount of logic has been incorporated to assure accurate processing of intermediate results. The most striking example of this logic and its use is probably the technique employed to detect multiple speakers in a speech signal. It is implemented via the logical decoding of the results of the multiple classification algorithms. Of course, for this method to be accurate, a reasonable amount of input is necessary; the more classification algorithms the system has, the better the result will be. Hence, incorporating EDTW and possibly other classification algorithms in the system, in addition to the existing algorithms, will prove useful for the switch to multi-speaker recognition, which is currently partially a task for the user to carry out manually.
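The paper does not spell out its decoding rules; one simple form such logic can take, sketched here with our own naming, is a majority vote over the labels proposed by the individual classification algorithms:

```python
from collections import Counter

def fuse_decisions(decisions, min_agreement=2):
    """Combine speaker labels proposed by several classification
    algorithms. Returns the majority label, or None when no label
    reaches the required agreement (interpreted as 'no match', or as
    a hint that multiple speakers may be present)."""
    votes = Counter(d for d in decisions if d is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_agreement else None
```

With more algorithms contributing votes, disagreement becomes a more reliable signal, which is why adding EDTW and other classifiers would strengthen this stage.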
7 Conclusion
In this paper, several techniques to classify and detect single or multiple speakers are discussed. Used in conjunction and properly, these techniques help to identify one or more speakers. Tests and results of such a system have shown that many existing algorithms have different purposes and can only classify a speaker if several conditions are met (for instance, text-dependent algorithms). Thus, to achieve the best results for the speaker classification problem, the algorithms should work together and their outputs should be checked for contradictions.
Appendix A: Using Evolutionary Stable Strategies to Model Human Speech
Let the air signal be called the signal s; then it can be modeled by an evolutionary game with the following fitness matrix:
This matrix can be extended to contain the effects of the speech modeling, as follows:
where g, t, l are the deformation signals of the glottis, tongue and lips, respectively. The question marks in the matrix represent the amount of deformation one signal evokes in another. This value is obviously dependent on the utterance, which leads us to our first conclusion.
Conclusion 1:
Evolutionary games can only be used to model discrete speech utterances. Practically, this means that this technique will be used to model isolated vowels and consonants.
Let's clarify the above a bit by considering an evolutionary game consisting of a population of two types, i and j. The game has the following fitness matrix (not a bimatrix, since only player 1 gets offspring):
Now, let's plot the evolution of the population over time for the following strategies (or strategy pairs; player 1 and player 2 use the same strategy in each of the following cases). Note that for this game we assume that all possible relations occur during one generation (one element of the population has multiple inter- and intra-type relationships, where applicable). It is also obvious that no distinction is made between male and female elements; in fact, all elements are genderless.
Under the pure strategy (1,0), the population consists of type i exclusively. Since the offspring is equal to 2, the population will never grow beyond its initial size, namely 2. Strategy (0,1) yields a similar case, where the entire population consists of type j exclusively. However, the offspring size here is 4, hence the population will grow over time. The number of relationships that can (and will) occur at a certain point $t_x$ in time is:

$$\sum_{n=1}^{P(t_{x-1})-1} n = 1 + 2 + 3 + \dots + \big(P(t_{x-1}) - 1\big),$$

which are all possible combinations, except the element with itself and reversed combinations. This number of relationships can be calculated using the form:

$$1 + 2 + 3 + \dots + n = \frac{n(n+1)}{2},$$
which then yields equation (A.2). Finally, the population when using strategy $(\tfrac{1}{2}, \tfrac{1}{2})$ consists of 50% type i and 50% type j. Equation (A.3) is an extension of equation (A.2) that includes all possible relationships. The term

$$-2\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4}$$

cannot be simplified, because it originates from the form mentioned above; a standard simplification would yield a wrong result.
In this specific case, equation (A.3) can be reduced to (A.3.1):

$$
P(t_x) = \text{offspring}(i,i)\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4}
+ \text{offspring}(i,j)\left[\big(P(t_{x-1}) - 1\big)\,\frac{P(t_{x-1})}{2}
- 2\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4}\right]
+ \text{offspring}(j,j)\left(\frac{P(t_{x-1})}{2} - 1\right)\frac{P(t_{x-1})}{4},
$$

since

$$
\tfrac{1}{2}\,\text{offspring}(i,j) + \tfrac{1}{2}\,\text{offspring}(j,i)
= \text{offspring}(i,j) = \text{offspring}(j,i).
$$

Equation (A.3.1) can then further be reduced to (A.3.2):

$$
P(t_x) = \text{offspring}(j,i)\,\big(P(t_{x-1}) - 1\big)\,\frac{P(t_{x-1})}{2},
$$

which equals equation (A.2), since in this case $\text{offspring}(i,j) = \text{offspring}(j,i) = \frac{\text{offspring}(i,i) + \text{offspring}(j,j)}{2}$.
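The pair-counting arithmetic above is easy to check numerically; the following sketch (our own naming) enumerates the unordered pairs explicitly and applies the closed form:

```python
from itertools import combinations

def pair_count(p):
    """Relationships in one generation: all unordered pairs of distinct
    elements, i.e. P(P-1)/2 -- the closed form behind equation (A.2)."""
    return p * (p - 1) // 2

def population_step(p, offspring_per_pair):
    """One generation of a single-type population: every pair produces
    offspring_per_pair new elements, as in equation (A.2)."""
    return offspring_per_pair * pair_count(p)
```

With offspring 2, a population of 2 reproduces exactly itself each generation; with offspring 4, it grows, matching the behaviour described for strategies (1,0) and (0,1).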
Let us now consider the effect that an invasion would have on the population graph. As it happens, the pure strategy pair ((0,1),(0,1)) that we have examined previously is an Evolutionary Stable Strategy (ESS), because (a) it is a Nash equilibrium and (b) j scores better against i than i scores against itself. (Note that if we remove dominated actions from this game, only strategy (0,1) remains.)
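Since the paper's fitness matrix is not reproduced here, the two textbook ESS conditions can only be sketched generically; the payoff values used below are hypothetical, chosen to echo the offspring sizes 2 and 4 mentioned above:

```python
def is_pure_ess(payoff, s):
    """Check the ESS conditions for pure strategy s (0 or 1) in a
    symmetric 2x2 game, where payoff[row][col] is the payoff of playing
    `row` against `col`:
    (a) s is a Nash equilibrium (a best reply to itself), and
    (b) on a tie, s scores better against the invader than the invader
        scores against itself."""
    o = 1 - s  # the invading strategy
    if payoff[s][s] < payoff[o][s]:
        return False  # not a Nash equilibrium
    if payoff[s][s] == payoff[o][s] and payoff[s][o] <= payoff[o][o]:
        return False  # invader can drift into the population
    return True

# Hypothetical fitness values: type i yields 2 offspring, type j yields 4.
payoff = [[2, 2], [4, 4]]
```

Under these assumed values, only the pure strategy j passes both conditions, consistent with the population stabilizing on type j after an invasion.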
Consider the same strategies (or strategy pairs) again, but now with a pure invasion at some moment in time. The general population function is given by equations (A.1), (A.2) and (A.3), respectively, until $t_3$, and by (A.4) and (A.5), as detailed below, thereafter:
$$
\begin{aligned}
P(t_x) ={}& \text{offspring}(i,i)\,\big(F_i(t_{x-1})P(t_{x-1}) - 1\big)\,\frac{F_i(t_{x-1})P(t_{x-1})}{2} \\
&+ \left(\frac{\text{offspring}(i,j)}{2} + \frac{\text{offspring}(j,i)}{2}\right)
\Bigg[\big(P(t_{x-1}) - 1\big)\,\frac{P(t_{x-1})}{2} \\
&\qquad - \bigg(\big(F_i(t_{x-1})P(t_{x-1}) - 1\big)\,\frac{F_i(t_{x-1})P(t_{x-1})}{2}
+ \big(F_j(t_{x-1})P(t_{x-1}) - 1\big)\,\frac{F_j(t_{x-1})P(t_{x-1})}{2}\bigg)\Bigg] \\
&+ \text{offspring}(j,j)\,\big(F_j(t_{x-1})P(t_{x-1}) - 1\big)\,\frac{F_j(t_{x-1})P(t_{x-1})}{2}
\end{aligned}
\tag{A.4}
$$

with

$$
A = \sum_{y=i,j}\big(F_y(t_{x-1})P(t_{x-1}) - 1\big)\,\frac{F_y(t_{x-1})P(t_{x-1})}{2},
\qquad
B = F_{\text{type}}(t_{x-1})\,P(t_{x-1}),
\qquad
C = \frac{\text{offspring}(i,j)}{2} + \frac{\text{offspring}(j,i)}{2},
$$

$$
F_{\text{type}}(t_x) =
\frac{\text{offspring}(\text{type},\text{type})\,(B - 1)\,\dfrac{B}{2}
+ \dfrac{C\Big(\big(P(t_{x-1}) - 1\big)\frac{P(t_{x-1})}{2} - A\Big)}{2}}
{P(t_x)},
\qquad x = 1 \ldots \infty
\tag{A.5}
$$
The general deformation function is defined by:

$$
D(t_x) =
\begin{cases}
0 & x < \text{inv} \\
P_{\text{inv}}(t_x) - P_{\text{no-inv}}(t_x) & x \geq \text{inv}
\end{cases}
\qquad x = 1 \ldots \infty
$$
Equation (A.4) consists of three components: the first to calculate the number of possible combinations (and the offspring, after multiplication with the offspring factor) of type i, the second for the mixed combinations, and the third for combinations of type j. Equation (A.5) is a function called from equation (A.4); it calculates the fraction (the ratio) of a certain type at a given moment in time. This is achieved by calculating the sum of the offspring of the respective type and half of the mixed offspring, and dividing this sum by the population number.
As can be seen from the 'Type Ratios' graphs, only in the case of an ESS does the evolution of the population restore and stabilize over time.