This document summarizes a study on independent speaker recognition for native English vowels. The study used a standard approach for vowel classification based on formant frequencies, which depend on vocal tract shape and dimensions. Formants F1 and F2 were extracted from speech samples and used as features. Euclidean distance was used to measure similarity between test samples and reference formant values. The method achieved 80-95% recognition accuracy for vowels from male and female speakers. Vowels /a/ and /o/ had the highest recognition rates while /e/ and /i/ were more likely to be confused due to inter-speaker variation. The study demonstrated the viability of using formant frequencies for automatic vowel and speaker recognition.
378 G.N. Kodandaramaiah et al
The extracted formants are compared with the thresholds of the reference formants. A Euclidean distance measure is applied to quantify the similarity or dissimilarity between two spoken words, after each spoken word has been quantized against its codebook. An unknown vowel is matched by measuring the Euclidean distance between its feature vector (formants F1, F2) and the reference model (codebook) of the known vowel formants in the database. The goal is to find the codebook entry with the minimum distance in order to identify the unknown vowel (Franti et al., 1997). For example, in the testing (identification) session, the Euclidean distance between the feature vector (formants F1, F2) and the codebook of each spoken vowel is calculated, and the vowel with the smallest distance is picked, as shown in Eq. (1.1):

d(x, y) = \sqrt{ \sum_{i=1}^{D} w_i (x_i - y_i)^2 }    (1.1)

where D = 2 is the number of features, x_i is the i-th input feature (formant F1 or F2), y_i is the i-th feature in the codebook (reference model), w_i is the weight associated with the i-th feature, and d(x, y) is the distance between x and y, which serves as the recognition score.
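The codebook matching described above can be sketched as follows. The reference formant values here are hypothetical placeholders, not the paper's actual codebook; the weights follow the F1 = 2, F2 = 1 scheme normalized to sum to 1.

```python
import math

# Hypothetical reference formants in Hz (illustrative only, not the
# paper's actual codebook).
CODEBOOK = {
    "/a/": (730.0, 1090.0),
    "/i/": (270.0, 2290.0),
    "/u/": (300.0, 870.0),
}
WEIGHTS = (2.0 / 3.0, 1.0 / 3.0)  # F1 weighted twice F2, normalized to 1

def weighted_distance(x, y, w):
    """Weighted Euclidean distance d(x, y) of Eq. (1.1)."""
    return math.sqrt(sum(wi * (xi - yi) ** 2 for wi, xi, yi in zip(w, x, y)))

def classify(x, codebook=CODEBOOK, w=WEIGHTS):
    """Pick the vowel whose reference formants minimize the distance."""
    return min(codebook, key=lambda vowel: weighted_distance(x, codebook[vowel], w))

print(classify((700.0, 1100.0)))  # nearest reference here is /a/
```

A test vector of (700, 1100) Hz lands closest to the /a/ entry, so /a/ is returned as the recognized vowel.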
Decision Rule
The weights w are important when the information contained in the underlying features is not proportional to the feature variances. In this case of vowel recognition based on formants F1 and F2, the two formants do not contribute uniformly to recognition. Based on the study, relative weights of F1 = 2 and F2 = 1 are assigned, normalized so that the sum of the weights is 1.0.
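A minimal sketch of that normalization, using the relative weights stated above:

```python
raw = {"F1": 2.0, "F2": 1.0}  # relative importance given in the study
total = sum(raw.values())
weights = {name: value / total for name, value in raw.items()}

# The normalized weights now sum to 1.0, with F1 carrying twice
# the weight of F2 (2/3 vs. 1/3).
print(weights["F1"], weights["F2"])
```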
We refer to classification based on this distance as Maximum Likelihood Regression (MLR), since it rests on the Gaussian assumptions used to obtain the classifier parameters. To verify that the displayed vowels produce accurate results, the MLR computes the distance of the average features for the given vowels. If the feature distance lies within the threshold criterion D_i(F1, F2), then Eq. (1.1) becomes
D_i(f) < \alpha \sqrt{m}    (1.2)
where m is the number of features (here m = 2, for F1 and F2) and α is an arbitrary scale factor used for performance tuning. When the criterion holds, the vector x_i is identified as the vector y_i; otherwise it is rejected. If α is too small, the MLR rejects many correct vowel samples; if it is too large, out-of-category vowels are not rejected. In our work the threshold α = x has given optimum results.
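The rejection rule of Eq. (1.2) can be sketched as below; the α value used here is purely illustrative, since the tuned value is not reproduced in this excerpt.

```python
import math

def accept(distance, alpha, m=2):
    """Decision rule of Eq. (1.2): accept a match only when the feature
    distance is below alpha * sqrt(m), where m is the number of
    features (F1 and F2)."""
    return distance < alpha * math.sqrt(m)

ALPHA = 50.0  # illustrative scale factor, not the paper's tuned value

print(accept(25.0, ALPHA))   # within threshold -> identified as the codebook vowel
print(accept(120.0, ALPHA))  # beyond threshold -> rejected
```

With this rule, shrinking α rejects more correct samples, while growing it lets out-of-category vowels through, matching the tuning trade-off described above.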
Independent Speaker Recognition for Native English Vowels 379
Figure 1.1: Block diagram of vowel recognition.
Results of Vowel Recognition for Male and Female Speakers
Male Speakers
Table 3.1 gives the results for male vowel recognition based on the MLR method. Vowel /a/ achieved the highest classification rate of the five vowels. The detection rates for vowels /u/ and /e/ are better than those for vowels /o/ and /i/ over all tested samples. Vowels /e/ and /i/ tend to be misclassified as each other because of inter-speaker variation in the utterances. Fig. 3.1 shows vowel 'X' versus % vowel recognition for 50 male samples, where 'X' is the actual vowel.
For vowel /a/: /a/ recognized as /a/ is 46; as /e/ is 0; as /i/ is 4; as /o/ is 0; as /u/ is 0. Hence the percentage of correct recognition of vowel /a/ is (/a/ as /a/) × 100 / (/a/ as any vowel) = 46 × 100 / (46 + 0 + 4 + 0 + 0) = 4600 / 50 = 92%.
Figure 3.1: Vowel vs. % vowel recognition for male speakers.
Table 3.1: Percentage recognition of vowels for male speakers.

Actual \ Predicted    /a/   /e/   /i/   /o/   /u/   % correct
/a/                    46     0     4     0     0     92%
/e/                     2    44     0     4     0     88%
/i/                     6     0    40     0     4     80%
/o/                     0     3     0    44     3     88%
/u/                     3     1     1     0    45     90%
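The per-vowel percentages follow directly from the confusion-matrix row counts, recomputed here exactly as in the worked /a/ example:

```python
# Confusion counts: actual vowel -> counts predicted as /a/, /e/, /i/,
# /o/, /u/, for 50 male samples per vowel (Table 3.1).
ORDER = ["/a/", "/e/", "/i/", "/o/", "/u/"]
CONFUSION = {
    "/a/": [46, 0, 4, 0, 0],
    "/e/": [2, 44, 0, 4, 0],
    "/i/": [6, 0, 40, 0, 4],
    "/o/": [0, 3, 0, 44, 3],
    "/u/": [3, 1, 1, 0, 45],
}

def percent_correct(vowel):
    """Diagonal count times 100 over the row total."""
    row = CONFUSION[vowel]
    return 100.0 * row[ORDER.index(vowel)] / sum(row)

for v in ORDER:
    print(f"{v}: {percent_correct(v):.0f}%")
```

Running this reproduces 92% for /a/ and 80% for /i/, the best and worst male-speaker rates.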
Female Speakers
Table 3.2 gives the results for female vowel recognition based on the MLR method. Vowel /o/ achieved the highest classification rate of the five vowels. The detection rates for vowels /u/ and /e/ are better than those for vowels /a/ and /i/ over all tested samples. Vowels /a/ and /i/ tend to be misclassified as each other because of inter-speaker variation in the utterances. Fig. 3.2 shows the percentage of vowel recognition for 40 female samples. For vowel /o/: /o/ recognized as /a/ is 0; as /e/ is 0; as /i/ is 0; as /o/ is 39; as /u/ is 1. Hence the percentage of correct recognition of vowel /o/ is (/o/ as /o/) × 100 / (/o/ as any vowel) = 39 × 100 / (0 + 0 + 0 + 39 + 1) = 3900 / 40 = 98%.
Figure 3.2: Vowel vs. % vowel recognition for female speakers.
Table 3.2: Percentage recognition of vowels for female speakers.

Actual \ Predicted    /a/   /e/   /i/   /o/   /u/   % correct
/a/                    34     4     0     0     2     85%
/e/                     0    37     0     3     0     92%
/i/                     0     4    34     0     2     85%
/o/                     0     0     0    39     1     98%
/u/                     3     0     0     0    37     93%
Conclusion
The standard method for vocal tract shape estimation has been the basis for many successful automatic speech recognition (ASR) systems. Here we have described a "standard" approach for classifying vowels based on their formants. We achieved 80 to 95 percent speaker recognition using the Euclidean distance measure.
Acknowledgements
We would like to thank the Management and the Principal of Madanapalli Institute of Technology and Science, Madanapalli, A.P., for their cooperation and encouragement.