Applying Various DSP-Related Techniques for
Robust Recognition of Adult and Child
Speakers.
R. Atachiants
C. Bendermacher
J. Claessen
E. Lesser
S. Karami
January 21, 2009
Faculty of Humanity and Science, Maastricht University
Abstract
This paper approaches speaker recognition in a new way. A speaker recognition system has been realized that works on adult and child speakers, both male and female. Furthermore, the system employs text-dependent and text-independent algorithms, which makes robust speaker recognition possible in many applications. Single-speaker classification is achieved by age/sex pre-classification and is implemented using classic text-dependent techniques, as well as a novel technology for text-independent recognition. This new research uses Evolutionary Stable Strategies to model human speech and allows speaker recognition by analyzing just one vowel.
1 Introduction
In the past few years, privacy has become increasingly important to people all over the world. A major factor in this is the rise of the Internet: the private elements of a person's life have become easier to access, and therefore easier to copy. Since money is essential to modern life, it is stolen very often using the Internet; copying cards and the information belonging to them is easier than ever and still occurs frequently. A proper system for voice recognition, in combination with a password, might increase the security of our money.

The key to handling these problems lies in modeling human speech. If an algorithm could recognize speech on its own, no person would be needed to check the sounds for human speech. Several questions arise, namely: 'How can an algorithm know that there is speech?', 'How does an algorithm estimate the noise?', 'How does an algorithm achieve the classification of speech, and when should it do so?', and 'How can an algorithm notice that there are multiple speakers?'. These questions lead to an overall problem definition: 'How to identify one or more speakers?'.
To handle this problem, the paper starts with detecting speech. This is the subject of section 2. Here, speech is detected using an end-point detection algorithm, which recognizes speech, and noise reduction, which uses three ways of filtering: (a) Finite Impulse Response (FIR), (b) Wavelets, and (c) Spectral Subtraction. Combining these three subjects, the program retrieves a signal that satisfies the properties required by the classification algorithms: classifying the speaker alone or in a conversation. First, the speaker has to be recognized when he is talking alone; this is discussed in section 3. Speaker recognition is done using (a) discrete word selection, (b) Mel-Frequency Cepstral Coefficients and Vector Quantization, (c) Age/Sex Classification, (d) a Voice Model and (e) a check for contradictions that leads to the conclusion that a given person is speaking or not. In the last part, multiple speakers are identified and classified using the methods Framed Multi-Speaker Classification and Harmonic Matching Classifier. After these sections there is a short discussion of the subjects, and then the conclusions are presented.
2 Speech detection
The very first step in identifying the speaker is detecting speech. This means that the part of the signal that contains speech has to be separated from the noise part.

Two algorithms are used to detect speech. The first is endpoint detection, which is described in subsection 2.2; the second is noise reduction, which is described in subsection 2.3.
2.1 Architectural overview
If a signal contains little noise, the end point detection algorithm can effectively determine whether the signal contains speech. However, if there is much noise, noise reduction has to be applied to the signal first. To estimate the noise level of a signal, the Spectral Subtraction algorithm is used. This estimation is then compared to the whole signal, resulting in the signal-to-noise ratio (SNR). If needed, one of three noise reduction techniques (FIR, Spectral Subtraction and Wavelets) is selected, based on the weighted SNR of each denoised signal. FIR is preferred over Wavelets and Spectral Subtraction, while the use of Wavelets is preferred over Spectral Subtraction. When end point detection is then applied to the selected denoised signal, it is safe to say that speech can be detected accurately. See figure 1 for a schematic overview.
Figure 1: Architectural overview of speech detection
2.2 Endpoint detection
The endpoint detection algorithm filters the noise from the beginning and end of the signal and detects the beginning and end of speech. If these two points are the same, the signal contains no speech and consists only of noise.

It is assumed that the first 100 ms of the signal contain no speech. From this part of the signal, the energy and the zero crossing rate (ZCR) of the noise can be calculated. Next, the lower threshold (ITL) and upper threshold (ITU) can be calculated as follows:

I1 = 0.03 * (maxEnergy - avgEnergy) + avgEnergy
I2 = 4 * avgEnergy
ITL = min(I1, I2)
ITU = 5 * ITL
To determine the starting point (N1) and the end point (N2) of the speech, the ITL and ITU are considered. When the energy of the signal crosses the ITL for the first time, this point is saved. If the energy then drops below the ITL again, it was a false alarm. However, when it also crosses the ITU, speech was found and the saved point is considered N1; see figure 2(a). For N2 a similar procedure is followed, just the other way around.

Finally, N1 and N2 can be determined more precisely by looking at the ZCR. To be more exact, a closer look is taken at the ZCRs of the 250 ms before N1. A high ZCR in that interval is an indication that there is speech, and N1 needs to be reconsidered; see figure 2(b). Similarly, N2 can be determined more accurately.
Figure 2: (a) Determining N1 and N2. (b) Redetermining N1 and N2.
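The thresholding and search logic above translates directly into code. The following is a minimal sketch, assuming per-frame energies have already been computed and that the first frames cover the silent 100 ms; the function and variable names are illustrative, not taken from the actual implementation.

import numpy as np

def endpoint_thresholds(energy, noise_frames=10):
    # Thresholds from the assumed-silent leading frames (first ~100 ms)
    avg_energy = energy[:noise_frames].mean()
    max_energy = energy.max()
    i1 = 0.03 * (max_energy - avg_energy) + avg_energy
    i2 = 4.0 * avg_energy
    itl = min(i1, i2)
    itu = 5.0 * itl
    return itl, itu

def find_start_point(energy, itl, itu):
    # Return index N1: the first ITL crossing that later also crosses ITU
    candidate = None
    for i, e in enumerate(energy):
        if candidate is None:
            if e > itl:
                candidate = i       # possible start of speech
        elif e < itl:
            candidate = None        # false alarm: energy fell back below ITL
        elif e > itu:
            return candidate        # ITU crossed: speech confirmed at N1
    return None                     # no speech found

The end point N2 can be found by running the same search over the reversed energy sequence.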
2.3 Noise reduction
Noise reduction is the other algorithm used in speech detection. It can be done with the help of FIR filtering, spectral subtraction and wavelets; these topics are covered in subsections 2.3.1, 2.3.2 and 2.3.3 respectively.
2.3.1 FIR filtering

In signal processing, two types of filters are used. As the name suggests, the impulse response of the FIR filter is finite. The other type's impulse response is normally not finite because of its feedback structure. The FIR filter is extensively used in this project in order to remove white Gaussian noise (WGN) from the signals. The frequencies of the WGN lie mainly in the low frequency band of the spectrum. A high-pass first-order FIR filter has been applied to strengthen the amplitude of the high frequencies. This is done by decreasing the amplitude of the low frequencies by up to 20 dB, so the speech becomes stronger and the noise is reduced.
For filtering, the transfer function in the z-domain is used:

H[z] = (z - α) / z    (1)

The standard form of a transfer function is H(z) = Y(z)/X(z). So it is clear that Y(z) = z - α and X(z) = z. The working of this formula is shown in figure 3.
Figure 3: FIR filter

A transfer function with all poles inside the unit circle of the z-plane is always stable. Therefore, α lies between -1 and 1. To get a decrease as large as possible with a first-order filter, α is set to 0.95. Figure 4 shows the pole-zero plot that follows from these facts: X(z) = 0 gives a pole at z = 0, and Y(z) = 0 gives a zero at z = α = 0.95. So the FIR filter is stable because of the place of its pole in the z-plane.

Figure 4: z-plane
To determine the frequency response of a discrete-time FIR filter, the transfer function is evaluated at z = e^(jωT). From this, the transfer function used in this paper for FIR filtering is given by formula 2:

H[e^(jωT)] = 1 - 0.95 * e^(-jωT)    (2)

This is one way of filtering WGN. In the section about wavelets, 2.3.3, another approach to filtering WGN is explained.
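As an illustration, the filter of formula 2 can be applied as a short sketch; scipy's lfilter is used here for convenience, and the example input is synthetic (the coefficient 0.95 is the α from the text, everything else is assumed).

import numpy as np
from scipy.signal import lfilter

def fir_highpass(x, alpha=0.95):
    # y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^-1 (formula 2)
    b = [1.0, -alpha]   # feed-forward coefficients
    a = [1.0]           # no feedback: the filter is FIR and always stable
    return lfilter(b, a, x)

# Example: a low-frequency tone with a weak high-frequency component
fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 50 * t) + 0.1 * np.sin(2 * np.pi * 2000 * t)
y = fir_highpass(x)   # the 50 Hz component is strongly attenuated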
2.3.2 Spectral subtraction
Spectral subtraction is an advanced form of noise reduction. It is used for signals that contain non-Gaussian (artificial) noise. After framing and Hamming windowing (DSP), endpoint detection is used on every frame to separate the noise frames from the frames with speech. From the noise frames, a noise estimate of the signal is made. After applying the Discrete Fourier Transform (DFT) to the windowed signal, the noise estimate is simply subtracted from the signal to obtain the denoised frames. Moreover, the noise estimate is used to calculate the SNR later on (see section 2.1).

Finally, the inverse DFT is taken and the frames can be reassembled to obtain the denoised signal. A schematic overview of the whole process is given in figure 5.
Figure 5: Spectral subtraction
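The per-frame subtraction step can be sketched as follows, assuming the signal has already been split into Hamming-windowed frames and that endpoint detection has marked the noise frames; clamping negative magnitudes to zero is a standard detail that the text does not spell out.

import numpy as np

def spectral_subtraction(frames, noise_frames, window):
    # Average noise magnitude spectrum, estimated from noise-only frames
    noise_mag = np.abs(np.fft.rfft(noise_frames * window, axis=1)).mean(axis=0)
    spectra = np.fft.rfft(frames * window, axis=1)
    mag = np.abs(spectra) - noise_mag     # subtract the noise estimate
    mag = np.maximum(mag, 0.0)            # clamp negative magnitudes
    phase = np.angle(spectra)             # reuse the noisy phase
    return np.fft.irfft(mag * np.exp(1j * phase), n=frames.shape[1], axis=1)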
2.3.3 Wavelets
As already suggested in the section about FIR filtering, wavelets are used to filter white Gaussian noise from the signal. This way of filtering starts with the original signal and a mother wavelet. The mother wavelet can be chosen from many available families; in this paper one of the Daubechies wavelets is used, the Daubechies 3, which is often used in Matlab. This mother wavelet is recommended by Matlab, the program used to create the wavelet filter.

The next step in filtering is the decomposition of the original signal. By fitting the mother wavelet to the signal at the smallest scale, the filter produces what is called the first wavelet detail and a remainder, which is called the first approximation. Then the timescale of the mother wavelet is doubled and again fit to the first approximation. This results in a second wavelet detail and a second remainder, the second approximation. Doubling the timescale of the mother wavelet is also known as dilation. Dilation and splitting the remainders into a new detail and approximation part (figure 6) is continued until the mother wavelet has been dilated to such an extent that it covers the entire range of the signal. [9]
Figure 6: Signal Decomposition

There are two ways of thresholding: soft and hard thresholding. With hard thresholding, the part of the signal below a certain threshold is set to zero. Soft thresholding is more complicated: it subtracts the value of the threshold from the values of the signal that are above that threshold, and the values below the threshold are again set to zero. [10] In Matlab this is integrated in the functions ddencmp and wdencmp. The function ddencmp determines a threshold value and the way of thresholding from the sound sample; the function wdencmp uses this threshold value and soft/hard thresholding to create a de-noised signal. So, using these two functions, Matlab generates a denoised signal by itself.
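Outside Matlab, the same decompose-threshold-reconstruct scheme can be sketched with the PyWavelets package; the universal threshold below is a common default and stands in for Matlab's ddencmp, whose exact rule is not reproduced here.

import numpy as np
import pywt

def wavelet_denoise(signal, wavelet='db3', mode='soft'):
    coeffs = pywt.wavedec(signal, wavelet)        # approximation + details
    # Noise level estimated from the finest detail band (robust MAD estimate)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(signal)))
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode=mode)
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[:len(signal)]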
3 Speaker classification

The speaker classification algorithms described in this paper work best on discrete words or small signals. First, the Discrete Word Selection (DWS) algorithm is applied to cut out the part of the signal containing the most vowel components. Next, the Age/Sex Classification (ASC) algorithm tries to classify the signal in order to reduce computation by eliminating database samples that would otherwise have to be processed. Then text-dependent (T-D) speaker detection techniques, such as Dynamic Time Warping (DTW) and Vector Quantization (VQ), and a text-independent (T-I) technique, the Voice Model algorithm, are run. The results are checked for contradictions; if a contradiction is detected, the ASC bias is discarded and the T-D and T-I algorithms are computed again. If a speaker is detected, the system proceeds to the classification of multiple speakers, using two different techniques in parallel: Framed Multi-Speaker Classification and the Harmonic Matching Classifier. The results of both are combined to achieve the best result. See figure 7 for a schematic overview.
Figure 7: Architectural overview of speaker recognition
3.1 Discrete word selection
Discrete word selection is used for two reasons. First of all, the techniques used in the system are mainly valid for discrete speech processing and not so much for the processing of continuous speech. This means that the best results will be achieved when working with only one isolated group of words. Working with discrete speech also optimizes the performance of the system. The second reason for using discrete word selection is as a help for the 'Age/Sex Classification' (ASC) block. The ASC block uses physical properties of the human vocal tract to classify speech.

The algorithm for discrete word selection is based on the V/C/P (Vowel/Consonant/Pause) classification algorithm. This algorithm is text independent and composed of four blocks, see figure 8.
Figure 8: V/C/P classification algorithm blocks.

In the first block the main features are extracted; in the second block the signal is framed and classified for the first time. Next, the noise level is estimated and the frames are classified again with an updated noise-level parameter.

In order to distinguish a consonant, the V/C/P algorithm proposes the use of a zero-crossing-rate feature and a threshold (ZCR_dyna). When the ZCR is larger than the threshold, the frame is classified as a consonant. If the frame cannot be classified this way, the energy of the frame is checked: if the energy is smaller than the overall noise level, the frame is classified as a pause; if it is larger, the frame is classified as a vowel. The results of V/C/P classification on an example speech clip are shown in figure 9.
Figure 9: V/C/P classification of an example speech clip (o: consonant, +: pause, *: vowel). Image from Microsoft Research Asia. [12]
The complete discrete word selection algorithm is implemented as follows:

1. The audio input is segmented into non-overlapping frames of 10 ms, from which energy and ZCR features are extracted.

2. The energy curve is smoothed, using FIR.

3. The Mean_Energy and Std_Energy of the energy curve are calculated to estimate the background noise energy level, and the threshold of ZCR (ZCR_dyna), as:

NoiseLevel = Mean_Energy - 0.75 * Std_Energy
ZCR_dyna = Mean_ZCR + 0.5 * Std_ZCR

4. Frames are classified as V/C/P coarsely by using the following rules, where FrameType denotes the type of each frame (a sketch of this step follows the list):

If ZCR > ZCR_dyna
then FrameType = Consonant
Elseif Energy < NoiseLevel
then FrameType = Pause
Else FrameType = Vowel

5. The NoiseLevel is updated as the weighted average energy of the frames at each vowel boundary and the background segments.

6. The frames are re-classified using the rules of step 4 with the updated NoiseLevel. Pauses are merged by removing isolated short consonants. A vowel is split at its energy valley if its duration is too long.

7. After classification is terminated, the word with the highest number of V-frames is selected.
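A minimal sketch of the coarse rules in steps 3 and 4, assuming the per-frame energy and ZCR features of step 1 are already available as arrays:

import numpy as np

def classify_frames(energy, zcr):
    # Step 3: noise level and dynamic ZCR threshold from frame statistics
    noise_level = energy.mean() - 0.75 * energy.std()
    zcr_dyna = zcr.mean() + 0.5 * zcr.std()
    # Step 4: coarse V/C/P labels; the ZCR rule takes precedence
    frame_type = np.full(len(energy), 'Vowel', dtype='<U9')
    frame_type[energy < noise_level] = 'Pause'
    frame_type[zcr > zcr_dyna] = 'Consonant'
    return frame_type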
3.2 MFCC and vector quantization
Mel-frequency cepstral coefficients (MFCCs) and vector quantization (VQ) are used to construct a set of highly representative feature vectors from a speech fragment. These vectors are used to achieve speaker classification.

Frequencies below 1 kHz contain the most relevant information for speech; hence human hearing emphasizes these frequencies. To imitate this, frequencies can be mapped to the Mel frequency scale (Mel scale). The Mel scale is linear up to 1 kHz, while for higher frequencies it is logarithmic, thus emphasizing the lower frequencies. After converting to the Mel scale, the MFCCs can be found using the Discrete Cosine Transform. In this paper, 13 MFCCs are obtained from each frame of the speech signal.

Since a speech fragment is generally divided into many frames, this results in a large set of data. Therefore VQ, implemented as proposed in [7], is used to compress these data points to a set of feature vectors (codevectors). In the case of speech fragments, the set of codevectors is a representation of the speaker. Such a representation is called a codebook. Here VQ is used to compress each set of MFCCs to 4 points. In the training phase a codebook is generated for every known speaker. These codebooks are saved in the database.

When identifying a speaker from a new speech fragment, VQ compares the MFCCs of the fragment to each codebook in the database, as can be seen in figure 10. The distance between an MFCC vector and the closest codevector is called its distortion. The codebook with the smallest total distortion over all MFCCs is identified as the speaker.

Figure 10: Matching MFCCs to a codebook
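The matching step can be sketched as follows, assuming the 13-dimensional MFCC frames and the 4-vector codebooks have already been computed; the names are illustrative.

import numpy as np

def total_distortion(mfccs, codebook):
    # Distance from every MFCC vector (n_frames, 13) to its nearest
    # codevector (4, 13), summed over all frames
    dists = np.linalg.norm(mfccs[:, None, :] - codebook[None, :, :], axis=2)
    return dists.min(axis=1).sum()

def identify_speaker(mfccs, codebooks):
    # codebooks: dict mapping speaker label -> (4, 13) codebook
    return min(codebooks, key=lambda s: total_distortion(mfccs, codebooks[s]))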
3.3 Dynamic time warping
Dynamic Time Warping (DTW) is a generic algorithm used to compare two signals. In order to find the similarity between two sequences, or as a preprocessing step before averaging them, the time axis of one (or both) of the sequences must be warped to achieve a better alignment; see figure 11.

Figure 11: Two sequences of data that have a similar overall shape but are not aligned on the time axis. [11]

In order to compare two speech signals in the system, DTW is applied to the 13 Mel-frequency cepstral coefficients (MFCCs) from the Mel scale, which are compared to the database samples.

To find a warping path between two sequences of MFCC data, a few steps are required:

1. Calculate the distance cost matrix (in this paper the Euclidean distance was used to compute the cost).

2. Compute the path, starting from a corner of the cost matrix and processing adjacent cells. This path can be found very efficiently using dynamic programming [11] (a sketch follows the list).
The path is a sequence W = w1, w2, ..., wk, ..., wK with max(m, n) ≤ K < m + n - 1.

3. Select only the path that minimizes the warping cost:

DTW(Q, C) = min( (Σ_{k=1}^{K} wk) / K )

4. Repeat the path calculation for each MFCC feature and compute a difference from each path.
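Steps 1 and 2 can be sketched with a straightforward dynamic program; this minimal version works on one MFCC coefficient sequence at a time, and normalizing by path length is used as an approximation of the cost in step 3.

import numpy as np

def dtw_distance(q, c):
    q, c = np.asarray(q, dtype=float), np.asarray(c, dtype=float)
    m, n = len(q), len(c)
    cost = np.abs(q[:, None] - c[None, :])   # step 1: distance cost matrix
    acc = np.full((m, n), np.inf)            # accumulated warping cost
    acc[0, 0] = cost[0, 0]
    for i in range(m):                       # step 2: dynamic programming
        for j in range(n):
            if i == 0 and j == 0:
                continue
            prev = min(acc[i - 1, j] if i > 0 else np.inf,
                       acc[i, j - 1] if j > 0 else np.inf,
                       acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf)
            acc[i, j] = cost[i, j] + prev
    return acc[-1, -1] / (m + n)             # length-normalized cost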
3.4 Age/sex classification

The ASC block is based on physical properties of speech and the vocal tract and pre-classifies the input into one of the following four categories: male adult, female adult, male child, female child. This pre-classification helps the classification algorithms of the system to classify the speaker more accurately.

The total length L of the vocal tract can be calculated from the first harmonic of a sound exiting a closed tube:

L = c / (4F)    (3)
where c is the speed of sound and F the fundamental frequency. Once the length of the vocal tract has been calculated, it is straightforward to classify the length according to age and sex. The general assumptions are that an adult has a longer vocal tract than a child and that a male has a longer vocal tract than a female [1]. For easier implementation of the classifier, it was chosen to work with vocal tract length instead of directly with the fundamental frequencies.

Based on [2], the ASC algorithm has been developed and implemented; it uses LPC to extract the first formant from the signal. Classification is then based on heuristic methods, where the length intervals for adult female and child male are divided into sub-bands, allowing the classifier to distinguish between these categories.

Implementation-wise, it is important to note that ASC has been implemented such that it is only carried out if the number of samples in the database of the system is larger than the number of classes of speakers. This is done to prevent the pre-classification block (ASC) from acting as a classification algorithm and hence disabling the classification blocks.
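A sketch of the length computation and the heuristic decision follows; the speed of sound and especially the class boundaries are illustrative placeholders, not the tuned sub-band values of the actual implementation.

def vocal_tract_length(f1, c=343.0):
    # Equation (3): closed-tube length from the first formant f1 (Hz)
    return c / (4.0 * f1)

def classify_age_sex(f1):
    # Hypothetical length boundaries (metres), longest to shortest tract
    length = vocal_tract_length(f1)
    if length > 0.16:
        return 'Adult Male'
    elif length > 0.14:
        return 'Adult Female'
    elif length > 0.12:
        return 'Male Child'
    return 'Female Child'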
3.5 Voice model
Human speech is produced by expelling air from the lungs into the vocal tract, where the 'air signal' is 'modeled' into the desired utterance by the glottis, the tongue and the lips, amongst others. Thus, a speech signal can be seen as a signal evolving over time, formed by certain invasions. In this research, it is proposed to use Evolutionary Stable Strategies (ESS), originating from the field of game theory, to model human speech and to accurately recognize speakers on a text-independent basis.

In Appendix A, a detailed overview is given of how this theory is developed. Here the general implementation of the algorithm is discussed.

A solution is sought for the following two research problems:

1. Find an algorithm that, given an utterance of human speech, determines a fitness matrix, appropriate strategies and invasions, so that the speech utterance is correctly defined by the resulting evolution of the population of the game.

2. Employ the result of goal 1 to achieve speaker recognition, text-independent if possible.

Since the filtering effect of the separate speech organs can hardly be distinguished, a lossless concatenated tube model (n-tube model [3][4]) is assumed for modeling the vocal tract instead. The n-tube model also allows sequential modeling of the speech utterance and thus solves the problem of parallel effects that occur in the vocal tract.
In essence, the algorithm proceeds as follows:

1. Determine the number of tubes in the model and their respective equations.

2. Start filling out the fitness matrix:

(a) Initially it contains the value 2 in position (1,1).

(b) Determine the equation of the signal after applying the first filter.

(c) Determine the elements of the next column of the fitness matrix.

(d) Determine the correct invasion parameters so that the current signal will become the desired signal as determined in (b).

(e) Repeat steps (b) to (d) until the desired utterance is modeled (until all tubes have been passed).

3. Store the values from step 2 in a database format that includes elements of the fitness matrix as well as strategy information and invasions.

In order to analyze the feasibility of this algorithm, it is necessary to delve a bit deeper into steps (c) and (d). It is obvious that (c) and (d) are mutually dependent, since the outcome of an invasion will depend on the offspring parameters. Furthermore, it has to be determined what strategy to play in general and when to invade. Finally, an ESS that will simplify the entire process has to be incorporated.
Assume that at every iteration it is decided to carry out a pure invasion; that is, at time step x+e the type of column x will invade the existing population, or, more concretely, at that point in time the game will be played with strategy (0,1), where the 1 is for the type of column x. In that case, the elements of column x have to be such that filling them out in equations (A.4) and (A.5) will yield the correct population graph.

Using an ESS helps determine at what exact time steps to carry out pure invasions, since the evolution of the population is then predetermined and thus known. It is desirable that playing (1,0), where the 1 is for the first element of the first column, is an ESS. Therefore, all other elements in the fitness matrix must be smaller than 2.

To tackle the second research goal, it is important to know that the equations of the filters will partially depend on the physical model of the speaker. It is thus the question how to extract these parameters from the speech utterance so that the equations for the filters can be established.
3.6 Contradictions
Since three algorithms are employed in the single-speaker classification stage, their respective outcomes have to be checked for consistency. A list of contradictions allows the system to detect inconsistencies as well as indications of multiple speakers.

In the contradiction table, T-D denotes the text-dependent algorithms, while T-I denotes the text-independent algorithm. The system contains two text-dependent algorithms and one text-independent algorithm. The binary value for T-D is defined by the logical AND operation of T-D1 and T-D2.
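As a sketch of how the three outcomes can be combined, the following implements the stated AND rule and falls back to a majority vote, which is consistent with the test runs in section 5 where the final result follows two of the three classifiers; the exact decision logic of the system may differ.

def combine_decisions(td1, td2, ti):
    # T-D is the logical AND of the two text-dependent results:
    # it is only defined when DTW (td1) and VQ (td2) agree
    td = td1 if td1 == td2 else None
    if td is not None and td == ti:
        return td, False              # fully consistent, no contradiction
    # Otherwise a contradiction is detected; fall back to a majority vote
    votes = [td1, td2, ti]
    best = max(set(votes), key=votes.count)
    return best, True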
4 Multiple speaker detection
In order to successfully classify multiple speakers in a speech clip, two use-cases have to be analyzed:

1. Non-overlapping speech, where two or more speakers speak in different time frames (for example, a dialogue).

2. Overlapping speech, where two or more speakers speak in both separate and shared time frames (for example, a debate).

In this section we discuss a technique for each of these use-cases: Framed Multi-Speaker Classification for non-overlapping speech and the Harmonic Matching Classifier for overlapping speech. These two techniques are executed in parallel in the system and their results are combined in order to detect as many speakers as possible.
4.1 Framed multi-speaker classification

The Framed Multi-Speaker (FMS) classification algorithm is used in the system to detect and classify multiple speakers in a speech signal. In order to do this, the whole signal is processed. The algorithm is used on dialogues or other non-overlapping speech clips. It uses single-speaker classification techniques to detect each speaker.

Figure 12: FMS classification stages.

The algorithm works in three stages, as shown in figure 12 (a sketch follows the list):

1. FMS starts by erasing the pauses in the signal and uses this to frame the signal;

2. It loops over each frame and classifies the frame using the classification techniques discussed in the previous section. The text-dependent speaker classification as well as the text-independent classification algorithms are used. Also, a check for contradictions is done to classify the single speaker, as shown in figure 13;

3. Finally, FMS checks the results to extract only the distinct speakers.

Figure 13: FMS classification, per-frame classification block.
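A minimal sketch of stages 2 and 3, assuming the pause-free frames from stage 1 and a single-speaker classifier that already includes the contradiction check:

def framed_multi_speaker(frames, classify_single):
    # classify_single: the contradiction-checked pipeline of section 3,
    # returning a speaker label or None for an unclassifiable frame
    speakers = []
    for frame in frames:              # stage 2: per-frame classification
        label = classify_single(frame)
        if label is not None and label not in speakers:
            speakers.append(label)    # stage 3: keep distinct speakers only
    return speakers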
4.2 Harmonic matching classifier

In order to enable the system to recognize speakers in multi-speaker speech fragments with overlapping speech, the Harmonic Matching Classifier (HMC) is used. The HMC was introduced by Radfar et al. in [5] and separates Unvoiced-Voiced (U-V) frames from Voiced-Voiced (V-V) frames in mixed speech.

Each frame category indicates what kind of speech is uttered by the respective speakers in that frame. U-V frames are useful in speaker recognition of mixed speech, since in such a frame the features of the voiced speaker dominate. Hence, it is possible to recognize a speaker for every such frame. However, before the U-V frames can be separated from the V-V frames, the U-U frames first have to be removed from the signal. To achieve this, an algorithm proposed by Bachu et al. [6] is employed, which uses energy and ZCR calculations to distinguish unvoiced frames from voiced frames; this unvoiced/voiced classification is based on heuristic methods. HMC then recognizes U-V frames by fitting a harmonic model, given by equation (1) below, to a mixed analysis frame and then evaluating the introduced error (2) against a threshold derived from σ (3). This process is repeated for all frames of the mixed signal.
1. H_model = Σ_{l=1}^{L(ωi)} A²_{l,ωi} · W²(ω − l·ωi)

2. e^t = min_{ωi} | |X^t_mix(ω)|² − H_model |

3. σ = mean({e^t}_{t=1}^{T})

where ωi is the fundamental frequency and W(ω) is a window applied to the spectrum. The X component of equation (2) denotes the spectrum of the t-th mixed signal frame.
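A sketch of the frame-selection step follows, assuming the per-frame fitting errors e^t have already been computed; the reading that a good single-pitch harmonic fit (a small error) marks a U-V frame, as well as the scaling factor on σ, are assumptions rather than details stated in the text.

import numpy as np

def select_uv_frames(errors, factor=1.0):
    errors = np.asarray(errors, dtype=float)
    sigma = errors.mean()               # equation (3)
    # Small residual: one harmonic structure explains the frame (U-V);
    # large residual: two overlapping harmonic structures (V-V)
    return errors < factor * sigma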
After the U-V frames have been extracted from the mixed speech signal, they are passed to the Vector Quantization (VQ) block of the system, where every frame is matched against the relevant database; in this way two speakers are finally recognized. The system is currently limited to recognizing at most two speakers from a mixed signal, which is an obvious consequence of the limitations of the methods used, especially harmonic model fitting.
5 Tests and Results

The output of the program when comparing the exact same speech file with an existing one; everything is classified perfectly.

---
START PHASE 1
Starting Endpoint Detection...
.ITL: 1.2448
.ITU: 6.2242
.IZCT: 220.5
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW...
'Adult Female'
Starting VQ...
'Adult Female'
Starting VM...
'Adult Female'
Final result:
'Adult Female'
---

Trying to classify a different sound file (same person, same text). Once again, everything is classified correctly and there are no contradictions.

---
START PHASE 1
Starting Endpoint Detection...
.ITL: 0.66714
.ITU: 3.3357
.IZCT: 120
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Saving le...
Starting MFCC...
Starting DTW...
'Adult Female'
Starting VQ...
'Adult Female'
Starting VM...
'Adult Female'
Final result:
'Adult Female'
---

Classifying a poor-quality sound file: VM and VQ classify it correctly, but DTW fails. The contradictions are verified and the final result is assigned correctly.

---
START PHASE 1
Starting Endpoint Detection...
.ITL: 0.2542
.ITU: 1.271
.IZCT: 120
.EnergyTotal: 387 elements
.RatesTotal: 387 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW...
'Adult Male'
Starting VQ...
'Child Female'
Starting VM...
'Child Female'
Final result:
'Child Female'
---

Classifying a poor-quality sound file: this time DTW and VM classify it correctly, but VQ fails. The contradictions are verified and the final result is assigned correctly.

---
START PHASE 1
Starting Endpoint Detection...
.ITL: 1.3699
.ITU: 6.8496
.IZCT: 120
.EnergyTotal: 421 elements
.RatesTotal: 421 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Skipping DWS and loading existing one...
Starting MFCC...
Starting DTW...
'Adult Male'
Starting VQ...
'Child Female'
Starting VM...
'Adult Male'
Final result:
'Adult Male'
6 Discussion
The developed system incorporates classical as well as novel techniques and is a combination of scientifically proven and heuristic methods. The techniques used for speech detection and noise reduction are well known and widely used in speech processing applications. The addition of Spectral Subtraction to this stage of the system is a novel touch that improves the accuracy of further steps.

In the processing and single-speaker classification stage, various DSP-related techniques have been combined with new research. Discrete Word Selection and Age/Sex Classification both rely on existing methods, but are used in an entirely new fashion in our implementation. Digital Signal Processing (which incorporates windowing and framing) and frequency analysis (MFCC), on the other hand, are classical supporting techniques that are used to prepare the signal for further processing, as is customary in this kind of system.

Working with pre-classification is very useful for larger databases and provides the user of the system with information about the speaker even if the system can find no match. Needless to say, the system relies heavily on the physical model of speech and the vocal tract to accomplish this for adult and child, male and female speakers.
For the actual classification, three algorithms have been selected that fit the requirements of the system best. An originally planned implementation of Extended Dynamic Time Warping (EDTW), however, had to be reduced to the simple Dynamic Time Warping implementation, due to a lack of time. Extended Dynamic Time Warping applies dimensionality reduction algorithms such as Principal Component Analysis before searching for a cost path, which would have improved the performance of the system.

The new research that the system incorporates, namely single-speaker, text-independent classification using Evolutionary Stable Strategies, is a very interesting technique that needs further development and testing before its actual use can be proven.
Multi-speaker classification also contributes a novel heuristic method (Framed Multi-Speaker Classification) for the recognition of multiple speakers in non-overlapping speech. Harmonic Model Classification is a combination and adaptation of existing methods and is used for recognition in overlapping speech, which is a novelty in its own right that is not easily achieved.

Between several stages of the system, a considerable amount of logic has been incorporated to ensure accurate processing of intermediate results. The most striking example of this logic and its use is probably the technique employed to detect multiple speakers in a speech signal, which is implemented via the logical decoding of the results of the multiple classification algorithms. Of course, for this method to be accurate, a reasonable amount of input is necessary: the more classification algorithms the system has, the better the result will be. Hence, incorporating EDTW and possibly other classification algorithms in the system, in addition to the existing algorithms, would prove useful for the switch to multi-speaker recognition, which is currently partially a task for the user to carry out manually.
7 Conclusion
In this paper, several techniques to classify and detect single or multiple speakers were discussed. Used together and properly, these techniques help to identify one or more speakers. Tests and results of such a system have shown that many existing algorithms have different purposes and can only classify a speaker if several conditions are met (for instance, text-dependent algorithms). Thus, to achieve the best results for the speaker classification problem, the algorithms should work together and their outputs should be checked for contradictions.
References
[1] Stevens, K.N., Acoustic Phonetics, MIT Press, ISBN 0262692503, 1998.

[2] Kamran, M. and Bruce, I.C., Robust Formant Tracking for Continuous Speech with Speaker Variability, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 2, 2006.

[3] Fant, G., Acoustic Theory of Speech Production, Mouton, The Hague, 1960.

[4] Flanagan, J.L., Speech Analysis, Synthesis and Perception, Springer Verlag, Berlin, Heidelberg, 1972.

[5] Radfar, M.H., Sayadiyan, A. and Dansereau, R.M., A Generalized Approach for Model-Based Speaker-Dependent Single Channel Speech Separation, Iranian Journal of Science & Technology, Transaction B, Engineering, Vol. 31, No. B3, pp. 361-375, The Islamic Republic of Iran, 2007.

[6] Bachu, R.G., Kopparthi, S., Adapa, B. and Barkana, B.D., Separation of Voiced and Unvoiced using Zero-Crossing Rate and Energy of the Speech Signal, American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.

[7] Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani and Md. Saifur Rahman, Speaker Identification Using Mel Frequency Cepstral Coefficients, 2004.

[9] Goring, D. (2006). Orthogonal Wavelet Decomposition. Available: http://www.tideman.co.nz/Salalah/OrthWaveDecomp.html. Last accessed 21 January 2009.

[10] Patrick J. van Fleet (2008). Discrete Wavelet Transformations. New Jersey: John Wiley & Sons, pp. 317-350.

[11] Keogh, E.J. and Pazzani, M.J., Derivative Dynamic Time Warping, 2000.

[12] Dong Wang, Lie L. and Hong-Jiang Zhang, Speech Segmentation Without Speech Recognition, Microsoft Research Asia.
Appendix A: Using Evolutionary Stable Strategies
to Model Human Speech
Let the air signal be called the signal s; it can be modeled by an evolutionary game with the following fitness matrix:

This matrix can be extended to contain the effects of the speech modeling, as follows:

where g, t, l are the deformation signals of the glottis, tongue and lips, respectively. The question marks in the matrix represent the amount of deformation one signal evokes in another. This value is obviously dependent on the utterance, which leads us to our first conclusion.

Conclusion 1: Evolutionary games can only be used to model discrete speech utterances. Practically, this means that this technique will be used to model isolated vowels and consonants.
Let us clarify the above by considering an evolutionary game consisting of a population of two types, i and j. The game has the following fitness matrix (not a bimatrix, since only player 1 gets offspring):

Now, consider the evolution of the population over time for the following strategies (or strategy pairs; player 1 and player 2 use the same strategy in each of the following cases). Note that for this game it is assumed that all possible relations occur during one generation (one element of the population has multiple inter- and intra-type relationships, where applicable). It is also obvious that no distinction is made between male and female elements; in fact, all elements are genderless.
Applying strategy (1,0) means that the entire population consists of type i exclusively. Since the offspring is equal to 2, the population will never grow beyond its initial size, namely 2. Strategy (0,1) yields a similar case, where the entire population consists of type j exclusively. However, the offspring size here is 4, hence the population will grow over time. The number of relationships that can (and will) occur at a certain point tx in time is:
Σ_{n=1}^{P(tx−1)−1} n = 1 + 2 + 3 + ... + (P(tx−1) − 1)

which are all possible combinations, except the element with itself and reversed combinations. This number of relationships can be calculated using the form:

1 + 2 + 3 + ... + n = n(n + 1) / 2
which then yields equation (A.2). Finally, the population when using strategy (1/2, 1/2) consists of 50% type i and 50% type j. Equation (A.3) is an extension of equation (A.2) that includes all possible relationships. The term

−2 · (P(tx−1)/2 − 1) · P(tx−1)/4

cannot be simplified, because it originates from the form mentioned above, and hence a standard simplification would yield a wrong result.
In this specific case, equation (A.3) can be reduced to (A.3.1):

P(tx) = offspring(i,i) · (P(tx−1)/2 − 1) · P(tx−1)/4
      + offspring(i,j) · [ (P(tx−1) − 1) · P(tx−1)/2 − 2 · (P(tx−1)/2 − 1) · P(tx−1)/4 ]
      + offspring(j,j) · (P(tx−1)/2 − 1) · P(tx−1)/4

since (1/2) · offspring(i,j) + (1/2) · offspring(j,i) = offspring(i,j) = offspring(j,i).

Equation (A.3.1) can then further be reduced to (A.3.2):

P(tx) = offspring(j,i) · (P(tx−1) − 1) · P(tx−1)/2,

which equals equation (A.2), since in this case offspring(i,j) = offspring(j,i) = (offspring(i,i) + offspring(j,j)) / 2.
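To make the growth behaviour concrete, the pure-strategy population update of equation (A.2) can be simulated directly; the function names are illustrative.

def population_step(p_prev, offspring):
    # Equation (A.2): every unordered pair of distinct elements
    # contributes `offspring` offspring to the next generation
    pairs = (p_prev - 1) * p_prev / 2.0
    return offspring * pairs

def simulate(p0=2.0, offspring=4.0, steps=5):
    # With offspring 2 and p0 = 2 the population stays at size 2;
    # with offspring 4 (strategy (0,1)) it grows over time
    pop = [p0]
    for _ in range(steps):
        pop.append(population_step(pop[-1], offspring))
    return pop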
Let us now consider the effect that an invasion would have on the population graph. As it happens, the pure strategy pair ((0,1),(0,1)) that was examined previously is an Evolutionary Stable Strategy (ESS), because (a) it is a Nash equilibrium and (b) i scores better against j than against itself. (Note that if dominated actions are removed from this game, only strategy (0,1) remains.)
Consider the same strategies (pairs) again, but now with a pure invasion at some moment in time. The general population function is given by equations (A.1), (A.2) and (A.3), respectively, until t3, and thereafter by (A.4) and (A.5), as detailed below:
P(tx) = offspring(i,i) · (Fi(tx−1) · P(tx−1) − 1) · Fi(tx−1) · P(tx−1) / 2
      + [ offspring(i,j)/2 + offspring(j,i)/2 ] · [ (P(tx−1) − 1) · P(tx−1)/2
        − (Fi(tx−1) · P(tx−1) − 1) · Fi(tx−1) · P(tx−1) / 2
        − (Fj(tx−1) · P(tx−1) − 1) · Fj(tx−1) · P(tx−1) / 2 ]
      + offspring(j,j) · (Fj(tx−1) · P(tx−1) − 1) · Fj(tx−1) · P(tx−1) / 2    (A.4)
A = Σ_{y=i,j} (Fy(tx−1) · P(tx−1) − 1) · Fy(tx−1) · P(tx−1) / 2

B = Ftype(tx−1) · P(tx−1)

C = offspring(i,j)/2 + offspring(j,i)/2

Ftype(tx) = [ offspring(type,type) · (B − 1) · B/2 + C · ( (P(tx−1) − 1) · P(tx−1)/2 − A ) / 2 ] / P(tx),   x = 1...∞    (A.5)
The general deformation function is defined by:

D(tx) = 0                           if x < inv
D(tx) = Pinv(tx) − Pno−inv(tx)      if x ≥ inv,   x = 1...∞
Equation (A.4) consists of three components: the first calculates the number of possible combinations (and, after multiplication with the offspring factor, the offspring) of type i, the second covers the mixed combinations, and the third the combinations of type j. Equation (A.5) is a function called from equation (A.4) that calculates the fraction (the ratio) of a certain type at a given moment in time. This is achieved by calculating the sum of the offspring of the respective type and half of the mixed offspring, and dividing this sum by the population number.

As can be seen from the 'Type Ratios' graphs, only in the case of an ESS does the evolution of the population recover and stabilize over time.
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptxiammrhaywood
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationRosabel UA
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfErwinPantujan2
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptshraddhaparab530
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxRosabel UA
 

Dernier (20)

4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx4.18.24 Movement Legacies, Reflection, and Review.pptx
4.18.24 Movement Legacies, Reflection, and Review.pptx
 
EMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docxEMBODO Lesson Plan Grade 9 Law of Sines.docx
EMBODO Lesson Plan Grade 9 Law of Sines.docx
 
ROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptxROLES IN A STAGE PRODUCTION in arts.pptx
ROLES IN A STAGE PRODUCTION in arts.pptx
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx4.16.24 Poverty and Precarity--Desmond.pptx
4.16.24 Poverty and Precarity--Desmond.pptx
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTSGRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
GRADE 4 - SUMMATIVE TEST QUARTER 4 ALL SUBJECTS
 
Paradigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTAParadigm shift in nursing research by RS MEHTA
Paradigm shift in nursing research by RS MEHTA
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptxAUDIENCE THEORY -CULTIVATION THEORY -  GERBNER.pptx
AUDIENCE THEORY -CULTIVATION THEORY - GERBNER.pptx
 
Activity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translationActivity 2-unit 2-update 2024. English translation
Activity 2-unit 2-update 2024. English translation
 
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdfVirtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
Virtual-Orientation-on-the-Administration-of-NATG12-NATG6-and-ELLNA.pdf
 
Integumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.pptIntegumentary System SMP B. Pharm Sem I.ppt
Integumentary System SMP B. Pharm Sem I.ppt
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptxINCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
INCLUSIVE EDUCATION PRACTICES FOR TEACHERS AND TRAINERS.pptx
 
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptxLEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
LEFT_ON_C'N_ PRELIMS_EL_DORADO_2024.pptx
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Presentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptxPresentation Activity 2. Unit 3 transv.pptx
Presentation Activity 2. Unit 3 transv.pptx
 

Research: Applying Various DSP-Related Techniques for Robust Recognition of Adult and Child Speakers

  • 1. Applying Various DSP-Related Techniques for Robust Recognition of Adult and Child Speakers. R.Atachiants C.Bendermacher J.Claessen E.Lesser S.Karami January 21, 2009 Faculty of Humanity and Science, Maastricht University Abstract This paper approaches speaker recognition in a new way. A speaker recognition system has been realized that works on adult and child speak- ers, both male and female. Furthermore, the system employs text-dependent and text-independent algorithms, which makes robust speaker recognition possible in many applications. Single-speaker classication is achieved by age/sex pre-classication and is implemented using classic text-dependent techniques, as well as a novel technology for text-independent recognition. This new research uses Evolutionary Stable Strategies to model human speech and allows speaker recognition by analyzing just one vowel. 1 Introduction In the past few years privacy became of bigger importance to people all over the world. A great factor for this is the rise of the Internet, all private elements in a persons life became easier to adjust. The privacy of people became easier to copy. Since money is very important to be able to live nowadays, it was stolen very often, using the internet. Copying cards and the information belonging to it, was easier then ever and still occurrs very often. If it would be able to have a proper system for voicerecognation in combination with a password, maybe the security of our money increases. The importance to handle these problems lies with modeling the human speech. If an algorithm recognizes speech on its own, a person to check the sounds for human speech would not be needed. Some questions arise namely 'How can an algorithm know that there is speech?', 'How does an algorithm 1
algorithm is noise reduction. More information about the second algorithm can be found in subsection 2.3.

2.1 Architectural overview

If a signal contains little noise, the endpoint detection algorithm can effectively determine whether the signal contains speech. However, if there is much noise, noise reduction has to be applied to the signal first. To estimate the noise level of a signal, the Spectral Subtraction algorithm is used. This estimate is then compared to the whole signal, resulting in the signal-to-noise ratio (SNR). If needed, one of three noise reduction techniques (FIR, Spectral Subtraction and Wavelets) is selected, based on the weighted SNR of each denoised signal: FIR is preferred over Wavelets and Spectral Subtraction, and Wavelets is preferred over Spectral Subtraction. When endpoint detection is then run on the selected denoised signal, speech can be detected accurately. See figure 1 for a schematic overview.

Figure 1: Architectural overview of speech detection.
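As an illustration of this selection step, the following Python sketch picks the denoised signal with the best weighted SNR. The helper names and the preference weights are assumptions for the example (the actual weighting used by the system, which was written in Matlab, is not specified in the text):

import numpy as np

def snr_db(signal, noise_estimate):
    # SNR in dB from signal power versus estimated noise power.
    p_signal = np.mean(signal ** 2)
    p_noise = np.mean(noise_estimate ** 2) + 1e-12
    return 10.0 * np.log10(p_signal / p_noise)

def select_denoiser(x, noise_estimate, denoisers, weights):
    # denoisers: dict name -> denoising function; weights encode the
    # stated preference order, e.g. {"fir": 1.2, "wavelet": 1.1, "specsub": 1.0}.
    best_name, best_score, best_y = None, -np.inf, None
    for name, fn in denoisers.items():
        y = fn(x)
        score = weights[name] * snr_db(y, noise_estimate)
        if score > best_score:
            best_name, best_score, best_y = name, score, y
    return best_name, best_y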
2.2 Endpoint detection

The endpoint detection algorithm filters the noise from the beginning and the end of the signal and detects the beginning and the end of speech. If these two points are the same, the signal contains no speech and consists of noise only.

It is assumed that the first 100 ms of the signal contain no speech. From this part of the signal the energy and the zero crossing rate (ZCR) of the noise can be calculated. Next, the lower threshold (ITL) and the upper threshold (ITU) can be calculated as follows:

    I1 = 0.03 * (maxEnergy - avgEnergy) + avgEnergy
    I2 = 4 * avgEnergy
    ITL = min(I1, I2)
    ITU = 5 * ITL

To determine the starting point (N1) and the end point (N2) of the speech, the ITL and ITU are considered. When the energy of the signal crosses the ITL for the first time, this point is saved. If the energy then drops below the ITL again, it was a false alarm. However, when it also crosses the ITU, speech was found and the saved point is taken as N1, see figure 2(a). For N2 a similar procedure is followed, just the other way around. Finally, N1 and N2 can be determined more precisely by looking at the ZCR; more exactly, at the ZCRs of the 250 ms before N1. A high ZCR in that interval is an indication that there is speech and that N1 needs to be reconsidered, see figure 2(b). Similarly, N2 can be refined.

Figure 2: (a) Determining N1 and N2. (b) Redetermining N1 and N2.
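A minimal NumPy sketch of the threshold computation and the forward search for N1. The 10 ms frames and the 100 ms noise prefix follow the text; the sampling rate and all names are illustrative:

import numpy as np

def frame_signal(x, frame_len):
    n = len(x) // frame_len
    return x[:n * frame_len].reshape(n, frame_len)

def endpoint_thresholds(energy):
    # First 10 frames (~100 ms at 10 ms frames) are assumed to be noise.
    avg_e = energy[:10].mean()
    max_e = energy.max()
    i1 = 0.03 * (max_e - avg_e) + avg_e
    i2 = 4.0 * avg_e
    itl = min(i1, i2)
    itu = 5.0 * itl
    return itl, itu

def find_n1(energy, itl, itu):
    candidate = None
    for i, e in enumerate(energy):
        if e > itl and candidate is None:
            candidate = i          # possible start of speech
        elif e < itl:
            candidate = None       # dropped below ITL: false alarm
        elif e > itu and candidate is not None:
            return candidate       # ITU crossed: speech confirmed
    return None

# Usage, given a signal x sampled at 16 kHz:
# frames = frame_signal(x, 160)            # 10 ms frames
# energy = (frames ** 2).sum(axis=1)
# itl, itu = endpoint_thresholds(energy)
# n1 = find_n1(energy, itl, itu)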
2.3 Noise reduction

Noise reduction is the other algorithm used for speech detection. It can be done with the help of FIR filtering, spectral subtraction and wavelets; these topics are covered in subsections 2.3.1, 2.3.2 and 2.3.3 respectively.

2.3.1 FIR filtering

In signal processing two types of filters are used. As the name suggests, the impulse response of the FIR filter is finite. The other type's impulse response is normally not finite because of its feedback structure. The FIR filter is extensively used in this project in order to remove the white Gaussian noise (WGN) from the signals. The frequencies of the WGN lie mainly in the low-frequency band of the spectrum. A first-order high-pass (FIR) filter has been applied to strengthen the amplitude of the high frequencies. This is done by attenuating the amplitude of the low frequencies by up to 20 dB, so the speech becomes stronger and the noise is reduced. For filtering, the transfer function in the z-domain is used:

    H(z) = (z - a) / z                                        (1)

The standard form of a transfer function is H(z) = Y(z)/X(z), so Y(z) = z - a and X(z) = z. The working of this formula is shown in figure 3. A transfer function with all poles inside the unit circle of the z-plane is always stable; therefore a must lie between -1 and 1. To get a decrease as large as possible with a first-order filter, a is set to 0.95.

Figure 3: FIR filter.

Figure 4 shows the pole-zero plot: the pole follows from X(z) = 0, giving z = 0, and the zero from Y(z) = 0, giving z = a = 0.95. Since a lies inside the unit circle of the z-plane, the FIR filter is stable.

Figure 4: z-plane.

To determine the frequency response of a discrete-time (FIR) filter, the transfer function is evaluated at z = e^(jwT). The transfer function used for FIR filtering in this paper thus becomes:

    H(e^(jwT)) = 1 - 0.95 * e^(-jwT)                          (2)

This is one way of filtering WGN. In the part about wavelets, subsection 2.3.3, another approach to filter WGN is explained.
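Equation (2) is the familiar first-order pre-emphasis/high-pass filter; with SciPy it is essentially a one-liner (illustrative, since the original system was implemented in Matlab):

import numpy as np
from scipy.signal import lfilter

def fir_highpass(x, alpha=0.95):
    # y[n] = x[n] - alpha * x[n-1], i.e. H(z) = 1 - alpha * z^(-1)
    return lfilter([1.0, -alpha], [1.0], x)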
2.3.2 Spectral subtraction

Spectral subtraction is an advanced form of noise reduction. It is used for signals that contain non-Gaussian (artificial) noise. After framing and Hamming windowing, endpoint detection is used on every frame to separate the noise frames from the frames with speech. From the noise frames a noise estimate of the signal is made. After applying the Discrete Fourier Transform (DFT) to the windowed signal, the noise estimate is simply subtracted from the signal to obtain the denoised frames. Moreover, the noise estimate is used to calculate the SNR later on (see section 2.1). Finally, the inverse DFT is taken and the frames can be reassembled to get the denoised signal. A schematic overview of the whole process is given in figure 5.

Figure 5: Spectral subtraction.
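A NumPy sketch of the per-frame subtraction. The text does not specify how negative magnitudes are handled; clipping them to zero, as done here, is a common choice and therefore an assumption:

import numpy as np

def spectral_subtraction(frames, noise_frames, window):
    # frames: (n, frame_len) noisy speech frames; noise_frames: noise-only
    # frames found by endpoint detection; window: e.g. np.hamming(frame_len).
    noise_mag = np.abs(np.fft.rfft(noise_frames * window, axis=1)).mean(axis=0)
    spec = np.fft.rfft(frames * window, axis=1)
    mag = np.abs(spec) - noise_mag              # subtract the noise estimate
    mag = np.maximum(mag, 0.0)                  # clip negative magnitudes
    clean = mag * np.exp(1j * np.angle(spec))   # keep the noisy phase
    return np.fft.irfft(clean, n=frames.shape[1], axis=1)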
2.3.3 Wavelets

As already suggested in the section about FIR filtering, wavelets are used to filter white Gaussian noise from the signal. This way of filtering starts with the original signal and a mother wavelet. The mother wavelet could be any one of the many mother wavelets that are available; in this paper one of the Daubechies wavelets is used, the Daubechies 3, which is often used in Matlab and is recommended by Matlab, the program used to create the wavelet filter.

The next step in filtering is the decomposition of the original signal. By fitting the mother wavelet to the signal at the smallest scale, the filter produces what is called the first wavelet detail and a remainder called the first approximation. Then the timescale of the mother wavelet is doubled and it is again fit to the first approximation. This results in a second wavelet detail and a second remainder, the second approximation. Doubling the timescale of the mother wavelet is also known as dilation. Dilation and splitting the remainders into a new detail and approximation part (figure 6) is continued until the mother wavelet has been dilated to such an extent that it covers the entire range of the signal. [9]

Figure 6: Signal decomposition.

There are two ways of thresholding, soft and hard thresholding. With hard thresholding, the signal below a certain threshold is set to zero. Soft thresholding is more complicated: it subtracts the value of the threshold from the values of the signal that are above the threshold, while the values below the threshold are again set to zero. [10] In Matlab this is integrated in the functions ddencmp and wdencmp. The function ddencmp determines a threshold and the way of thresholding from the sound sample; the function wdencmp uses this threshold value and the soft/hard thresholding mode to create a denoised signal. Using these two functions, Matlab generates a denoised signal by itself.
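The paper delegates threshold selection to Matlab's ddencmp/wdencmp. A rough Python equivalent using PyWavelets, with the db3 mother wavelet as in the text; the universal-threshold rule below is an assumption for the example, not necessarily what ddencmp computes:

import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db3", mode="soft"):
    coeffs = pywt.wavedec(x, wavelet)                  # decomposition
    # Noise scale estimated from the finest detail level (median rule).
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))        # universal threshold
    # Threshold the detail coefficients, keep the approximation intact.
    denoised = [coeffs[0]] + [pywt.threshold(c, thr, mode=mode)
                              for c in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[:len(x)]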
3 Speaker classification

The speaker classification algorithms described in this paper work best on discrete words or small signals. First, the Discrete Word Selection (DWS) algorithm is applied to cut out the part of the signal containing the most vowel components. Next, the Age/Sex Classification (ASC) algorithm tries to classify the signal in order to reduce computation by limiting the database samples that need to be processed. Then the text-dependent (T-D) speaker detection techniques, Dynamic Time Warping (DTW) and Vector Quantization (VQ), and the text-independent (T-I) Voice Model algorithm are run. The results are checked for contradictions; if any are detected, the ASC bias is discarded and the T-D and T-I algorithms are computed again. If a speaker is detected, the system proceeds to the classification of multiple speakers, using two different techniques in parallel: Framed Multi-Speaker Classification and the Harmonic Matching Classifier. The results of both are combined to achieve the best result. See figure 7 for a schematic overview.
Figure 7: Architectural overview of speaker recognition.

3.1 Discrete word selection

Discrete word selection is used for two reasons. First, the techniques used in the system are mainly valid for discrete speech processing and not so much for the processing of continuous speech. This means that the best results will be achieved when working with only one isolated group of words; working with discrete speech also optimizes the performance of the system. The second reason for using discrete word selection is as a help for the Age/Sex Classification (ASC) block, which uses physical properties of the human vocal tract to classify speech.

The algorithm for discrete word selection is based on the V/C/P (Vowel/Consonant/Pause) classification algorithm. This algorithm is text-independent and composed of four blocks, see figure 8. In the first block the main features are extracted; in the second block the signal is framed and classified for the first time. Next, the noise level is estimated and the frames are classified again with an updated noise-level parameter.

In order to distinguish a consonant, the V/C/P algorithm proposes the use of the zero crossing rate feature and a threshold (ZCR_dyna). If the ZCR is larger than the threshold, the frame is classified as a consonant. If the frame cannot be classified this way, the energy of the frame is checked: if the energy is smaller than the overall noise level, the frame is classified as a pause; if it is larger, the frame is classified as a vowel.

Figure 8: V/C/P classification algorithm blocks.
The results for an example speech clip using V/C/P classification are shown in figure 9.

Figure 9: V/C/P classification of an example speech clip (o: consonant, +: pause, *: vowel). Image from Microsoft Research Asia. [12]

The complete discrete word selection algorithm is implemented as follows:

1. The audio input is segmented into non-overlapping frames of 10 ms, from which energy and ZCR features are extracted.

2. The energy curve is smoothed using an FIR filter.

3. The Mean_Energy and Std_Energy of the energy curve are calculated to estimate the background noise energy level, and the threshold of the ZCR (ZCR_dyna), as:

    NoiseLevel = Mean_Energy - 0.75 * Std_Energy
    ZCR_dyna = Mean_ZCR + 0.5 * Std_ZCR

4. Frames are coarsely classified as V/C/P by using the following rules, where FrameType denotes the type of each frame:

    If ZCR > ZCR_dyna then FrameType = Consonant
    Else if Energy < NoiseLevel then FrameType = Pause
    Else FrameType = Vowel

5. The NoiseLevel is updated as the weighted average energy of the frames at each vowel boundary and the background segments.

6. The frames are re-classified using the rules of step 4 with the updated NoiseLevel. Pauses are merged by removing isolated short consonants. A vowel is split at its energy valleys if its duration is too long.

7. After classification terminates, the word with the highest number of V-frames is selected.
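The framewise rules of steps 3 and 4 are compact enough to sketch directly. This is a vectorized illustration only; the smoothing, noise-level update and merging steps are omitted:

import numpy as np

def vcp_thresholds(energy, zcr):
    noise_level = energy.mean() - 0.75 * energy.std()
    zcr_dyna = zcr.mean() + 0.5 * zcr.std()
    return noise_level, zcr_dyna

def vcp_classify(energy, zcr, noise_level, zcr_dyna):
    # Coarse V/C/P labelling of frames (step 4); the consonant rule
    # takes priority, then the pause rule, vowel is the default.
    labels = np.empty(len(energy), dtype="<U9")
    labels[:] = "vowel"
    labels[energy < noise_level] = "pause"
    labels[zcr > zcr_dyna] = "consonant"
    return labels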
3.2 MFCC and vector quantization

Mel-frequency cepstral coefficients (MFCCs) and vector quantization (VQ) are used to construct a small set of highly representative feature vectors from a speech fragment. These vectors are used to achieve speaker classification.

Frequencies below 1 kHz contain the most relevant information for speech; hence human hearing emphasizes these frequencies. To imitate this, frequencies can be mapped to the Mel frequency scale (Mel scale). The Mel scale is linear up to 1 kHz and logarithmic for higher frequencies, thus emphasizing the lower frequencies. After converting to the Mel scale, the MFCCs can be found using the Discrete Cosine Transform. In this paper 13 MFCCs are obtained from each frame of the speech signal.

Since a speech fragment is generally divided into many frames, this results in a large set of data. Therefore VQ, implemented as proposed in [7], is used to compress these data points to a small set of feature vectors (codevectors). For speech fragments, the set of codevectors is a representation of the speaker; such a representation is called a codebook. Here VQ is used to compress each set of MFCCs to 4 codevectors.

In the training phase a codebook is generated for every known speaker and saved in the database. When identifying a speaker from a new speech fragment, VQ compares the MFCCs of the fragment to each codebook in the database, as can be seen in figure 10. The distance between an MFCC vector and its closest codevector is called its distortion. The codebook with the smallest total distortion over all MFCC vectors identifies the speaker.
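A sketch of the matching step only; codebook training (the VQ design of [7]) is omitted. The shapes follow the text: 13 MFCCs per frame, 4 codevectors per codebook:

import numpy as np

def total_distortion(mfccs, codebook):
    # mfccs: (n_frames, 13), codebook: (4, 13).
    # Distortion of a frame = distance to its closest codevector.
    d = np.linalg.norm(mfccs[:, None, :] - codebook[None, :, :], axis=2)
    return d.min(axis=1).sum()

def identify_speaker(mfccs, codebooks):
    # codebooks: dict speaker_name -> (4, 13) array from the database.
    return min(codebooks, key=lambda s: total_distortion(mfccs, codebooks[s]))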
Figure 10: Matching MFCCs to a codebook.

3.3 Dynamic time warping

Dynamic Time Warping (DTW) is a generic algorithm used to compare two signals. In order to find the similarity between two sequences, or as a preprocessing step before averaging them, the time axis of one (or both) sequences must be warped to achieve a better alignment, see figure 11.

Figure 11: Two sequences of data with a similar overall shape that are not aligned on the time axis. [11]

In order to compare two speech signals, the system applies DTW to the 13 Mel-frequency cepstral coefficients (MFCCs) and compares the result to the database samples. To find a warping path for two sequences of MFCC data, a few steps are required:

1. Calculate the distance cost matrix (in this paper the Euclidean distance was used to compute the cost).

2. Compute the warping path W = w_1, w_2, ..., w_k, ..., w_K, with max(m, n) <= K <= m + n - 1, starting from a corner of the cost matrix and processing adjacent cells. This path can be found very efficiently using dynamic programming. [11]

3. Select the path which minimizes the warping cost:

    DTW(Q, C) = min( sqrt( sum_{k=1..K} w_k ) / K )

4. Repeat the path calculation for each MFCC feature and compute a difference from each path.
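A textbook dynamic-programming sketch of step 2 for two MFCC sequences; backtracking of the actual path and the normalization of step 3 are omitted for brevity:

import numpy as np

def dtw_cost(q, c):
    # q: (m, d) and c: (n, d) sequences of MFCC vectors.
    m, n = len(q), len(c)
    dist = np.linalg.norm(q[:, None, :] - c[None, :, :], axis=2)
    acc = np.full((m + 1, n + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(acc[i - 1, j],      # insertion
                                                 acc[i, j - 1],      # deletion
                                                 acc[i - 1, j - 1])  # match
    return acc[m, n]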
3.4 Age/sex classification

The ASC block is based on physical properties of speech and the vocal tract and pre-classifies the input into one of the following four categories: male adult, female adult, male child, female child. This pre-classification helps the classification algorithms of the system to classify the speaker more accurately. The total length of the vocal tract L can be calculated from the first harmonic of a sound exiting a closed tube:

    L = c / (4F)                                              (3)

where c is the speed of sound and F the fundamental frequency. Once the length of the vocal tract has been calculated, it is very straightforward to classify the length according to age and sex. The general assumptions are that an adult has a longer vocal tract than a child and that a male has a longer vocal tract than a female [1]. For easier implementation of the classifier, it was chosen to work with vocal tract length instead of directly with the fundamental frequencies.

Based on [2] the ASC algorithm has been developed and implemented, which uses LPC to extract the first formant from the signal. Classification is then based on heuristic methods, where the length intervals for adult female and child male are divided into sub-bands, allowing the system to distinguish between these categories.

Implementation-wise it is important to note that ASC is only carried out if the number of samples in the database is larger than the number of speaker classes. This prevents the pre-classification block (ASC) from acting as a classification algorithm itself and thereby disabling the actual classification blocks.
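An illustration of the classification rule. The speed of sound and, in particular, the sub-band boundaries below are invented for the example; the paper's actual heuristic intervals are not listed:

def vocal_tract_length(f1, c=343.0):
    # Quarter-wavelength resonator, equation (3): L = c / (4 * F),
    # with F the first formant in Hz and c the speed of sound in m/s.
    return c / (4.0 * f1)

def classify_age_sex(f1):
    # Hypothetical length bands in metres, for illustration only.
    L = vocal_tract_length(f1)
    if L > 0.16:
        return "adult male"
    elif L > 0.145:
        return "adult female"
    elif L > 0.13:
        return "child male"
    else:
        return "child female"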
3.5 Voice model

Human speech is produced by expelling air from the lungs into the vocal tract, where the 'air signal' is 'modeled' into the desired utterance by, amongst others, the glottis, the tongue and the lips. Thus, a speech signal can be seen as a signal evolving over time which is formed by certain invasions. In this research, it is proposed to use Evolutionary Stable Strategies (ESS), originating from the field of game theory, to model human speech and to accurately recognize speakers on a text-independent basis. Appendix A gives a detailed overview of how this theory is developed; here the general implementation of the algorithm is discussed. A solution is sought for the following two research problems:

1. Find an algorithm that, given an utterance of human speech, determines a fitness matrix, appropriate strategies and invasions such that the speech utterance is correctly defined by the resulting evolution of the population of the game.

2. Employ the result of goal 1 to achieve speaker recognition, text-independent if possible.

Since the filtering effects of the separate speech organs can hardly be distinguished, a lossless concatenated tube model (n-tube model [3][4]) of the vocal tract is assumed instead. The n-tube model also allows sequential modeling of the speech utterance and thus solves the problem of the parallel effects that occur in the vocal tract. In essence, the algorithm proceeds as follows:

1. Determine the number of tubes in the model and their respective equations.

2. Fill out the fitness matrix:

(a) Initially it contains the value 2 in position (1,1).
(b) Determine the equation of the signal after applying the first filter.
(c) Determine the elements of the next column of the fitness matrix.
(d) Determine the correct invasion parameters so that the current signal becomes the desired signal as determined in (b).
(e) Repeat steps (b) to (d) until the desired utterance is modeled (until all tubes have been passed).

3. Store the values from step 2 in a database format that includes the elements of the fitness matrix as well as strategy information and invasions.

In order to analyze the feasibility of this algorithm, steps (c) and (d) deserve a closer look. Clearly (c) and (d) are mutually dependent, since the outcome of an invasion depends on the offspring parameters. Furthermore, it has to be determined what strategy to play in general and when to invade. Finally, an ESS that simplifies the entire process has to be incorporated. Assume that at every iteration a pure invasion is carried out; that is, at time step x + e the type of column x invades the existing population, or more concretely, at that point in time the game is played with strategy (0,1), where 1 is for the type of column x. In that case, the elements of column x have to be such that filling them out in equations (A.4) and (A.5) yields the correct population graph. Using an ESS helps determine at what exact time steps to carry out pure invasions, since the evolution of the population is then predetermined and thus known. It is desirable that playing (1,0), where 1 is for the first element of the first column, is an ESS; therefore, all other elements in the fitness matrix must be smaller than 2.

To tackle the second research goal, it is important to know that the equations of the filters partially depend on the physical model of the speaker. The question is thus how to extract these parameters from the speech utterance so that the equations of the filters can be established.
3.6 Contradictions

Since three algorithms are employed in the single-speaker classification stage, their respective outcomes have to be checked for consistency. A list of contradictions allows the system to detect inconsistencies as well as indications of multiple speakers. In the contradiction table, T-D denotes the text-dependent algorithms, while T-I denotes the text-independent algorithm. The system contains two text-dependent algorithms and one text-independent algorithm; the binary value for T-D is defined by the logical AND of T-D1 and T-D2.
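The contradiction table itself is an image and is not reproduced in this transcript. The sketch below therefore only illustrates the combination rule stated in the text (T-D = T-D1 AND T-D2, compared against T-I), with a majority-vote fallback that is consistent with the behaviour visible in the test logs of section 5, but is an assumption here:

def check_contradiction(td1, td2, ti):
    # td1, td2: labels from the two text-dependent algorithms (DTW, VQ);
    # ti: label from the text-independent Voice Model algorithm.
    td_agree = (td1 == td2)           # binary T-D value: T-D1 AND T-D2
    if td_agree and td1 == ti:
        return td1, False             # consistent, no contradiction
    # Contradiction detected: fall back to a simple majority vote.
    votes = [td1, td2, ti]
    winner = max(set(votes), key=votes.count)
    return winner, True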
4 Multiple speaker detection

In order to successfully classify multiple speakers in a speech clip, two use cases should be analyzed:

1. Non-overlapping speech, where two or more speakers speak in different time frames (for example a dialogue).

2. Overlapping speech, where two or more speakers speak both in separate time frames and in the same time frames (for example a debate).

In this section a technique is discussed for each of these use cases: Framed Multi-Speaker classification for non-overlapping speech and the Harmonic Matching Classifier for overlapping speech. The two techniques are executed in parallel in the system and their results are combined in order to detect as many speakers as possible.

4.1 Framed multi-speaker classification

The Framed Multi-Speaker (FMS) classification algorithm is used in the system to detect and classify multiple speakers in a speech signal; to this end the whole signal is processed. The algorithm is used on dialogues and other non-overlapping speech clips. It uses the single-speaker classification techniques to detect each speaker.

Figure 12: FMS classification stages.

The algorithm works in three stages, as shown in figure 12:

1. FMS starts by erasing the pauses in the signal and uses them to frame the signal;

2. It loops over the frames and classifies each frame using the classification techniques discussed in the previous section. The text-dependent speaker classification as well as the text-independent classification algorithms are used, and a contradiction check is done to classify the single speaker, as shown in figure 13;

3. Finally, FMS checks the results to extract only the distinct speakers.

Figure 13: FMS classification, per-frame classification block.
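A sketch of the three FMS stages, assuming a hypothetical classify_single function that wraps the section 3 pipeline, including the contradiction check:

def framed_multi_speaker(segments, classify_single):
    # segments: list of speech frames obtained after pause removal (stage 1);
    # classify_single: single-speaker pipeline returning a speaker label.
    speakers = []
    for seg in segments:                 # stage 2: classify every frame
        label = classify_single(seg)
        if label not in speakers:        # stage 3: keep distinct speakers
            speakers.append(label)
    return speakers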
4.2 Harmonic matching classifier

In order to enable the system to recognize speakers in multi-speaker speech fragments with overlapping speech, the Harmonic Matching Classifier (HMC) is used. The HMC was introduced by Radfar et al. in [5] and separates Unvoiced-Voiced (U-V) frames from Voiced-Voiced (V-V) frames in mixed speech. The frame-category table (an image in the original) indicates what kind of speech is uttered by the respective speaker in each frame category.

U-V frames are useful for speaker recognition in mixed speech, since in such a frame the features of the voiced speaker dominate; hence it is possible to recognize a speaker for every such frame. However, before the U-V frames can be separated from the V-V frames, the U-U frames first have to be removed from the signal. To achieve this, an algorithm proposed by Bachu et al. [6] is employed, which uses energy and ZCR calculations to distinguish unvoiced frames from voiced frames; the unvoiced/voiced decision is based on heuristic methods.

HMC recognizes U-V frames by fitting a harmonic model, given by equation (4), to a mixed analysis frame and then evaluating the introduced error (5) against a threshold sigma_v (6). This process is repeated for all frames of the mixed signal:

    H_model(w) = sum_{l=1..L(w_i)} A^2_{l w_i} W^2(w - l w_i)     (4)

    e_t = min_{w_i} | |X^mix_t(w)|^2 - H_model(w) |               (5)

    sigma_v = mean( {e_t}_{t=1..T} )                              (6)

where w_i is the fundamental frequency and W(w) is a window applied to the spectrum. The X component of equation (5) denotes the spectrum of the t-th mixed-signal frame.

After the U-V frames have been extracted from the mixed speech signal, they are passed to the Vector Quantization (VQ) block of the system, where every frame is matched against the relevant database and two speakers are finally recognized. The system is currently limited to recognizing at most two speakers from a mixed signal, which is an obvious consequence of the limitations of the methods used, especially the harmonic model fitting.
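To make the fitting step concrete, here is a deliberately crude NumPy sketch of the idea behind equations (4) and (5): for each candidate fundamental frequency, spectral power outside the harmonic bands is treated as the fitting error. The windowed amplitude model of [5] is replaced by a simple band mask, so this is an illustration of the principle, not the authors' method:

import numpy as np

def harmonic_error(frame_spectrum, freqs, f0_candidates, bandwidth=50.0):
    # frame_spectrum: rfft of one analysis frame; freqs: matching bin
    # frequencies in Hz; f0_candidates: candidate fundamentals in Hz.
    power = np.abs(frame_spectrum) ** 2
    best = np.inf
    for f0 in f0_candidates:
        harmonics = np.arange(f0, freqs[-1], f0)
        mask = np.zeros_like(power)
        for h in harmonics:
            mask[np.abs(freqs - h) < bandwidth] = 1.0   # keep harmonic bands
        model = power * mask
        best = min(best, np.sum(np.abs(power - model)))
    return best   # frames with error below the threshold are treated as U-V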
5 Test and Results

Output of the program when comparing the exact same speech file with an existing one; everything is classified perfectly:

START PHASE 1
Starting Endpoint Detection...
.ITL: 1.2448
.ITU: 6.2242
.IZCT: 220.5
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW... 'Adult Female'
Starting VQ... 'Adult Female'
Starting VM... 'Adult Female'
Final result: 'Adult Female'

Trying to classify a different sound file (same person, same text). Once again, everything is classified and there are no contradictions:

START PHASE 1
Starting Endpoint Detection...
.ITL: 0.66714
.ITU: 3.3357
.IZCT: 120
.EnergyTotal: 384 elements
.RatesTotal: 384 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Saving file...
Starting MFCC...
Starting DTW... 'Adult Female'
Starting VQ... 'Adult Female'
Starting VM... 'Adult Female'
Final result: 'Adult Female'

Classifying a poor-quality sound file: VM and VQ classify it correctly, but DTW fails. The contradictions are verified and the final result is assigned correctly:

START PHASE 1
Starting Endpoint Detection...
.ITL: 0.2542
.ITU: 1.271
.IZCT: 120
.EnergyTotal: 387 elements
.RatesTotal: 387 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Starting DWS...
Starting MFCC...
Starting DTW... 'Adult Male'
Starting VQ... 'Child Female'
Starting VM... 'Child Female'
Final result: 'Child Female'

Classifying a poor-quality sound file, this time DTW and VM classify it correctly, but VQ fails. The contradictions are verified and the final result is assigned correctly:
START PHASE 1
Starting Endpoint Detection...
.ITL: 1.3699
.ITU: 6.8496
.IZCT: 120
.EnergyTotal: 421 elements
.RatesTotal: 421 elements
.BackoffLength: 12
Starting FIR...
Starting Spectral Subtraction...
Starting Wavelets...
Entering Phase 1 Select block...
START PHASE 2
Skipping DWS and loading existing one...
Starting MFCC...
Starting DTW... 'Adult Male'
Starting VQ... 'Child Female'
Starting VM... 'Adult Male'
Final result: 'Adult Male'

6 Discussion

The developed system incorporates classical techniques as well as novel techniques and is a combination of scientifically proven and heuristic methods. The techniques used for speech detection and noise reduction are well known and widely used in speech processing applications. The addition of Spectral Subtraction to this stage of the system is a novel touch that improves the accuracy of the further steps.

In the processing and single-speaker classification stage, various DSP-related techniques have been combined with new research. Discrete Word Selection and Age/Sex Classification both rely on existing methods, but are used in an entirely new fashion in this implementation. Digital signal processing, which incorporates windowing, framing and frequency analysis (MFCC), on the other hand, comprises classical supporting techniques that prepare the signal for further processing, as is customary in this kind of system.

Working with pre-classification is very useful for larger databases and provides the user of the system with information about the speaker even if the system can find no match. Needless to say, the system relies heavily on the physical model of speech and the vocal tract to accomplish this for adult and child, male and female speakers.
For the actual classification, three algorithms have been selected that best fit the requirements of the system. An originally planned implementation of Extended Dynamic Time Warping (EDTW), however, had to be reduced to the simple Dynamic Time Warping implementation due to a lack of time. Extended Dynamic Time Warping applies dimensionality-reduction algorithms such as Principal Component Analysis before searching for a cost path, which would have improved the performance of the system.

The new research that the system incorporates, namely single-speaker, text-independent classification using Evolutionary Stable Strategies, is a very interesting technique that needs further development and testing before its actual use can be proven.

Multi-speaker classification also introduces a novel heuristic method (Framed Multi-Speaker Classification) for the recognition of multiple speakers in non-overlapping speech. Harmonic Model Classification is a combination and adaptation of existing methods and is used for recognition in overlapping speech, which is a novelty in its own right that is not easily achieved.

Between several stages of the system, a considerable amount of logic has been incorporated to assure accurate processing of intermediate results. The most striking example of this logic and its use is probably the technique employed to detect multiple speakers in a speech signal, implemented via the logical decoding of the results of the multiple classification algorithms. Of course, for this method to be accurate, a reasonable amount of input is necessary: the more classification algorithms the system contains, the better the result will be. Hence, incorporating EDTW and possibly other classification algorithms in the system, in addition to the existing algorithms, will prove useful for the switch to multi-speaker recognition, which is currently partially a task the user has to carry out manually.

7 Conclusion

In this paper several techniques to classify and detect single or multiple speakers have been discussed. Used together and properly, these techniques help to identify one or more speakers. Tests and results of such a system have shown that many existing algorithms serve different purposes and can only classify a speaker if several conditions are met (for instance, text-dependent algorithms). Thus, to achieve the best results for the speaker classification problem, the algorithms should work together and their outputs should be checked for contradictions.

References

[1] Stevens, K.N., Acoustic Phonetics, MIT Press, 1998. ISBN 0262692503.

[2] Kamran, M. and Bruce, I.C., Robust Formant Tracking for Continuous Speech with Speaker Variability, IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, No. 2, 2006.
[3] Fant, G., Acoustic Theory of Speech Production, Mouton, The Hague, 1960.

[4] Flanagan, J.L., Speech Analysis, Synthesis and Perception, Springer Verlag, Berlin, Heidelberg, 1972.

[5] Radfar, M.H., Sayadiyan, A. and Dansereau, R.M., A Generalized Approach for Model-Based Speaker-Dependent Single Channel Speech Separation, Iranian Journal of Science and Technology, Transaction B, Engineering, Vol. 31, No. B3, pp. 361-375, 2007.

[6] Bachu, R.G., Kopparthi, S., Adapa, B. and Barkana, B.D., Separation of Voiced and Unvoiced using Zero-Crossing Rate and Energy of the Speech Signal, American Society for Engineering Education (ASEE) Zone Conference Proceedings, 2008.

[7] Hasan, Md. R., Jamil, M., Rabbani, Md. G. and Rahman, Md. S., Speaker Identification Using Mel Frequency Cepstral Coefficients, 2004.

[9] Goring, D., Orthogonal Wavelet Decomposition, 2006. Available: http://www.tideman.co.nz/Salalah/OrthWaveDecomp.html. Last accessed 21 January 2009.

[10] Van Fleet, P.J., Discrete Wavelet Transformations, John Wiley & Sons, New Jersey, 2008, pp. 317-350.

[11] Keogh, E.J. and Pazzani, M.J., Derivative Dynamic Time Warping, 2000.

[12] Wang, D., Lu, L. and Zhang, H.-J., Speech Segmentation Without Speech Recognition, Microsoft Research Asia.
Appendix A: Using Evolutionary Stable Strategies to Model Human Speech

Let the air signal be called the signal s; it can then be modeled by an evolutionary game with the following fitness matrix:

[fitness matrix figure not reproduced]

This matrix can be extended to contain the effects of the speech modeling, as follows:

[extended fitness matrix figure not reproduced]

where g, t, l are the deformation signals of the glottis, tongue and lips, respectively. The question marks in the matrix represent the amount of deformation one signal evokes in another. This value is obviously dependent on the utterance, which leads to a first conclusion.

Conclusion 1: Evolutionary games can only be used to model discrete speech utterances. Practically, this means that the technique will be used to model isolated vowels and consonants.

To clarify the above, consider an evolutionary game consisting of a population of two types, i and j. The game has the following fitness matrix (not a bimatrix, since only player 1 gets offspring):

[fitness matrix figure not reproduced]

Now, the evolution of the population over time is plotted for the following strategies (or strategy pairs; player 1 and player 2 use the same strategy in each of the following cases). Note that for this game it is assumed that all possible relations occur during one generation (one element of the population has multiple inter- and intra-type relationships, where applicable). It is also obvious that no distinction is made between male and female elements; in fact, all elements are genderless.
[population evolution graphs not reproduced]

Applying strategy (1,0) means that the entire population consists of type i exclusively. Since the offspring is equal to 2, the population will never grow beyond its initial size, namely 2.
Strategy (0,1) yields a similar case, where the entire population consists of type j exclusively. However, the offspring size here is 4, hence the population will grow over time. The number of relationships that can (and will) occur at a certain point t_x in time is:

    sum_{n=1..P(t_{x-1})-1} n = 1 + 2 + 3 + ... + (P(t_{x-1}) - 1)

which counts all possible combinations, except the combination of an element with itself and reversed combinations. This number of relationships can be calculated using the identity

    1 + 2 + 3 + ... + n = n(n + 1) / 2

which then yields equation (A.2). Finally, the population when using strategy (1/2, 1/2) consists of 50% type i and 50% type j. Equation (A.3) is an extension of equation (A.2) that includes all possible relationships. The term

    -2 * (P(t_{x-1})/2 - 1) * P(t_{x-1})/4

cannot be simplified, because it originates from the identity mentioned above; a standard simplification would yield a wrong result. In this specific case equation (A.3) can be reduced to (A.3.1):

    P(t_x) = offspring_(i,i) * (P(t_{x-1})/2 - 1) * P(t_{x-1})/4
           + offspring_(i,j) * [ (P(t_{x-1}) - 1) * P(t_{x-1})/2 - 2 * (P(t_{x-1})/2 - 1) * P(t_{x-1})/4 ]
           + offspring_(j,j) * (P(t_{x-1})/2 - 1) * P(t_{x-1})/4

since (1/2) * offspring_(i,j) + (1/2) * offspring_(j,i) = offspring_(i,j) = offspring_(j,i). Equation (A.3.1) can then further be reduced to (A.3.2):

    P(t_x) = offspring_(j,i) * (P(t_{x-1}) - 1) * P(t_{x-1})/2

which equals equation (A.2), since in this case offspring_(i,j) = offspring_(j,i) = (offspring_(i,i) + offspring_(j,j)) / 2.

Let us now consider the effect that an invasion would have on the population graph. As it happens, the pure strategy pair ((0,1),(0,1)) examined previously is an Evolutionary Stable Strategy (ESS), because (a) it is a Nash equilibrium and (b) i scores better against j than against itself. (Note that if dominated actions are removed from this game, only strategy (0,1) remains.)
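A tiny simulation of the growth recursion behind equations (A.1) and (A.2), under the assumption, suggested by the formulas, that each generation replaces the previous one:

def population_growth(offspring, p0=2, steps=5):
    # Equation (A.2): every unordered pair of elements produces
    # 'offspring' new elements per generation.
    p = [p0]
    for _ in range(steps):
        pairs = (p[-1] - 1) * p[-1] // 2   # all unordered pairs
        p.append(offspring * pairs)
    return p

# Strategy (1,0), offspring 2: stays at its initial size, [2, 2, 2, ...].
# Strategy (0,1), offspring 4: grows over time, [2, 4, 24, 1104, ...].
print(population_growth(2))
print(population_growth(4))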
Consider the same strategies (pairs) again, but now with a pure invasion at some moment in time. The general population function is given by equations (A.1), (A.2) and (A.3) respectively until t_3, and thereafter by (A.4) and (A.5), as detailed below:
    P(t_x) = offspring_(i,i) * (F_i(t_{x-1}) * P(t_{x-1}) - 1) * F_i(t_{x-1}) * P(t_{x-1})/2
           + (offspring_(i,j)/2 + offspring_(j,i)/2) * [ (P(t_{x-1}) - 1) * P(t_{x-1})/2
               - (F_i(t_{x-1}) * P(t_{x-1}) - 1) * F_i(t_{x-1}) * P(t_{x-1})/2
               - (F_j(t_{x-1}) * P(t_{x-1}) - 1) * F_j(t_{x-1}) * P(t_{x-1})/2 ]
           + offspring_(j,j) * (F_j(t_{x-1}) * P(t_{x-1}) - 1) * F_j(t_{x-1}) * P(t_{x-1})/2        (A.4)

with, for x = 1...infinity,

    A = sum_{y in {i,j}} (F_y(t_{x-1}) * P(t_{x-1}) - 1) * F_y(t_{x-1}) * P(t_{x-1})/2
    B = F_type(t_{x-1}) * P(t_{x-1})
    C = offspring_(i,j)/2 + offspring_(j,i)/2

    F_type(t_x) = [ offspring_(type,type) * (B - 1) * B/2
                    + C * ( (P(t_{x-1}) - 1) * P(t_{x-1})/2 - A ) / 2 ] / P(t_x)                    (A.5)

The general deformation function is defined by:

    D(t_x) = 0                                   for x < inv
    D(t_x) = P_inv(t_x) - P_no-inv(t_x)          for x >= inv,   x = 1...infinity

Equation (A.4) consists of three components: the first calculates the number of possible combinations (and, after multiplication with the offspring factor, the offspring) of type i, the second the mixed combinations and the third the combinations of type j. Equation (A.5) is a function called from equation (A.4) and calculates the fraction (the ratio) of a certain type at a given moment in time. This is achieved by calculating the sum of the offspring of the respective type and half of the mixed offspring, and dividing this sum by the population number. As can be seen from the 'Type Ratios' graphs, only in the case of an ESS does the evolution of the population restore and stabilize over time.