3rd International Conference on Electrical & Computer Engineering
ICECE 2004, 28-30 December 2004, Dhaka, Bangladesh




SPEAKER IDENTIFICATION USING MEL FREQUENCY CEPSTRAL COEFFICIENTS

Md. Rashidul Hasan, Mustafa Jamil, Md. Golam Rabbani, Md. Saifur Rahman
Electrical and Electronic Engineering, Bangladesh University of Engineering and Technology, Dhaka-1000
E-mail: saif672@yahoo.com


ABSTRACT

This paper presents a security system based on speaker identification. Mel frequency cepstral coefficients (MFCCs) have been used for feature extraction, and a vector quantization technique is used to minimize the amount of data to be handled.

1. INTRODUCTION

Speech is one of the natural forms of communication. Recent developments have made it possible to use speech in security systems. In speaker identification, the task is to use a speech sample to select the identity of the person who produced the speech from among a population of speakers. In speaker verification, the task is to use a speech sample to test whether a person who claims to have produced the speech has in fact done so [1]. This technique makes it possible to use a speaker's voice to verify their identity and to control access to services such as voice dialing, banking by telephone, telephone shopping, database access services, information services, voice mail, security control for confidential information areas, and remote access to computers.
2. PRINCIPLES OF SPEAKER RECOGNITION

Speaker recognition methods can be divided into text-independent and text-dependent methods. In a text-independent system, speaker models capture characteristics of somebody's speech which show up irrespective of what one is saying [1]. In a text-dependent system, on the other hand, the recognition of the speaker's identity is based on his or her speaking one or more specific phrases, such as passwords, card numbers, or PIN codes. Every technology of speaker recognition, identification and verification, whether text-independent or text-dependent, has its own advantages and disadvantages and may require different treatments and techniques. The choice of which technology to use is application-specific. At the highest level, all speaker recognition systems contain two main modules: feature extraction and feature matching [2,3].

3. SPEECH FEATURE EXTRACTION

The purpose of this module is to convert the speech waveform to some type of parametric representation (at a considerably lower information rate). The speech signal is a slowly time-varying signal (it is called quasi-stationary). When examined over a sufficiently short period of time (between 5 and 100 ms), its characteristics are fairly stationary. However, over long periods of time (on the order of 0.2 s or more) the signal characteristics change to reflect the different speech sounds being spoken. Therefore, short-time spectral analysis is the most common way to characterize the speech signal. A wide range of possibilities exists for parametrically representing the speech signal for the speaker recognition task, such as Linear Prediction Coding (LPC), Mel-Frequency Cepstrum Coefficients (MFCC), and others. MFCC is perhaps the best known and most popular, and this feature has been used in this paper. MFCCs are based on the known variation of the human ear's critical bandwidths with frequency. The MFCC technique makes use of two types of filter: linearly spaced filters and logarithmically spaced filters. To capture the phonetically important characteristics of speech, the signal is expressed in the Mel frequency scale. This scale has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. A normal speech waveform may vary from time to time depending on the physical condition of the speaker's vocal cords; MFCCs, rather than the speech waveforms themselves, are less susceptible to these variations [1,4].
3.1 The MFCC processor

A block diagram of the structure of an MFCC processor is given in Figure 1. The speech input is recorded at a sampling rate of 22050 Hz. This sampling frequency is chosen to minimize the effects of aliasing in the analog-to-digital conversion process.

[Figure 1: Block diagram of the MFCC processor: continuous speech → frame blocking → windowing → FFT → mel-frequency wrapping (mel spectrum S_k) → cepstrum → mel cepstrum]
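To make the front end of Figure 1 concrete, the following sketch (in Python with NumPy, not code from the paper) performs frame blocking, Hamming windowing and the FFT on a recorded signal. The 22050 Hz sampling rate and the roughly 30 ms frame length follow the text; the 10 ms frame step is an assumed value, since the paper only states that consecutive frames overlap.

```python
import numpy as np

def frame_spectra(signal, fs=22050, frame_ms=30, step_ms=10):
    """Frame blocking, Hamming windowing and FFT, as in Figure 1.

    frame_ms ~ 30 ms follows the paper; the 10 ms step is an assumed
    amount of overlap, which the paper does not specify.
    """
    frame_len = int(fs * frame_ms / 1000)           # samples per frame
    step = int(fs * step_ms / 1000)                 # samples between frame starts
    window = np.hamming(frame_len)                  # tapering reduces spectral leakage
    n_frames = max(0, (len(signal) - frame_len) // step + 1)
    spectra = []
    for i in range(n_frames):
        frame = signal[i * step : i * step + frame_len] * window
        spectra.append(np.abs(np.fft.rfft(frame)))  # magnitude spectrum of the frame
    return np.array(spectra)                        # shape: (n_frames, frame_len//2 + 1)

# Example: one second of a synthetic 440 Hz tone
if __name__ == "__main__":
    t = np.arange(22050) / 22050.0
    x = np.sin(2 * np.pi * 440 * t)
    print(frame_spectra(x).shape)
```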
3.2 Mel-frequency wrapping

The speech signal consists of tones with different frequencies. For each tone with an actual frequency f, measured in Hz, a subjective pitch is measured on the 'mel' scale. The mel-frequency scale has a linear frequency spacing below 1000 Hz and a logarithmic spacing above 1000 Hz. As a reference point, the pitch of a 1 kHz tone, 40 dB above the perceptual hearing threshold, is defined as 1000 mels. Therefore we can use the following formula to compute the mels for a given frequency f in Hz [5]:

    mel(f) = 2595 * log10(1 + f/700)                (1)

One approach to simulating the subjective spectrum is to use a filter bank, with one filter for each desired mel-frequency component. The filters have a triangular bandpass frequency response, and the spacing as well as the bandwidth is determined by a constant mel-frequency interval.
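The filter bank described above can be sketched as follows, using Eq. (1) and its inverse to place triangular filters at equal mel intervals. The number of filters and the FFT length are illustrative assumptions; the paper does not state them explicitly. (As a sanity check on Eq. (1), mel(1000) ≈ 1000, matching the 1000-mel reference point.)

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)       # Eq. (1)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)     # inverse of Eq. (1)

def mel_filterbank(n_filters=20, n_fft=512, fs=22050):
    """Triangular filters spaced uniformly on the mel scale.

    n_filters and n_fft are illustrative choices; n_fft//2 + 1 must match
    the number of FFT bins produced by the front end.
    """
    # Filter edge frequencies: equally spaced in mels, converted back to Hz
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):               # rising edge of the triangle
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling edge of the triangle
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank

# The mel spectrum of one frame is then: S = fbank @ magnitude_spectrum
```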

3.3 Cepstrum

In the final step, the log mel spectrum has to be converted back to time. The result is called the mel frequency cepstral coefficients (MFCCs). The cepstral representation of the speech spectrum provides a good representation of the local spectral properties of the signal for the given frame analysis. Because the mel spectrum coefficients (and so their logarithms) are real numbers, they may be converted to the time domain using the Discrete Cosine Transform (DCT). The MFCCs may be calculated using this equation [3,5]:

    c̃_n = Σ_{k=1}^{K} (log S̃_k) cos[ n (k − 1/2) π/K ],   n = 1, 2, …, K        (2)

where S̃_k denotes the mel spectrum coefficients. The number of mel cepstrum coefficients, K, is typically chosen as 20. The first component, c̃_0, is excluded from the DCT since it represents the mean value of the input signal, which carries little speaker-specific information. By applying the procedure described above, a set of mel-frequency cepstrum coefficients is computed for each speech frame of about 30 ms with overlap. This set of coefficients is called an acoustic vector; each input utterance is thus transformed into a sequence of acoustic vectors. The next section describes how these acoustic vectors can be used to represent and recognize the voice characteristic of a speaker [4].
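As a small worked counterpart of Eq. (2) (a sketch, not the paper's implementation), the function below computes the cepstral coefficients c̃_1 … c̃_K directly from one frame's log mel spectrum:

```python
import numpy as np

def mfcc_from_log_mel(log_mel):
    """Apply Eq. (2): cosine transform of the log mel spectrum.

    log_mel: array of K log mel spectrum values (log S~_1 .. log S~_K).
    Returns c~_1 .. c~_K; the zeroth coefficient (the frame mean) is
    never computed, matching the paper's exclusion of c~_0.
    """
    K = len(log_mel)
    k = np.arange(1, K + 1)                        # k = 1..K
    mfcc = np.empty(K)
    for n in range(1, K + 1):                      # n = 1..K
        mfcc[n - 1] = np.sum(log_mel * np.cos(n * (k - 0.5) * np.pi / K))
    return mfcc

# Example: 20 filter outputs -> 20 cepstral coefficients
print(mfcc_from_log_mel(np.log(np.random.rand(20) + 1e-6)).shape)
```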
4. FEATURE MATCHING

The state-of-the-art feature matching techniques used in speaker recognition include Dynamic Time Warping (DTW), Hidden Markov Modeling (HMM), and Vector Quantization (VQ). The VQ approach has been used here for its ease of implementation and high accuracy.

4.1 Vector quantization

Vector quantization (VQ) is a lossy data compression method based on the principle of block coding [6]. It is a fixed-to-fixed length algorithm and may be thought of as an approximator. Figure 2 shows an example of a 2-dimensional VQ.

[Figure 2: An example of a 2-dimensional VQ]

Here, every pair of numbers falling in a particular region is approximated by the star associated with that region. In Figure 2, the stars are called codevectors and the regions defined by the borders are called encoding regions. The set of all codevectors is called the codebook, and the set of all encoding regions is called the partition of the space [6].
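The following sketch (illustrative, not from the paper) makes this mapping explicit: each input vector is assigned to its nearest codevector under the Euclidean distance, and the resulting average distortion is reported.

```python
import numpy as np

def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codevector.

    vectors:  (N, d) array of input vectors (d = 2 in Figure 2).
    codebook: (M, d) array of codevectors (the 'stars').
    Returns (indices, distortion), where distortion is the mean squared
    distance from each vector to its chosen codevector.
    """
    # Pairwise squared Euclidean distances, shape (N, M)
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)                          # encoding region of each vector
    distortion = d2[np.arange(len(vectors)), idx].mean()
    return idx, distortion

# Toy example with 4 codevectors in 2-D
codebook = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
data = np.random.rand(100, 2)
idx, D = quantize(data, codebook)
print(idx[:5], D)
```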

4.2 LBG design algorithm

The LBG VQ design algorithm is an iterative algorithm (as proposed by Y. Linde, A. Buzo and R. Gray) which alternately solves the optimality criteria [7]. The algorithm requires an initial codebook, which is obtained by the splitting method. In this method, an initial codevector is set as the average of the entire training sequence. This codevector is then split into two, and the iterative algorithm is run with these two vectors as the initial codebook. The final two codevectors are split into four, and the process is repeated until the desired number of codevectors is obtained. The algorithm is summarized in the flowchart of Figure 3.
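A compact sketch of this splitting procedure is given below. It follows the steps of the Figure 3 flowchart; the perturbation factor eps and the stopping threshold tol are illustrative choices rather than values given in the paper.

```python
import numpy as np

def lbg(training, M, eps=0.01, tol=1e-3):
    """LBG codebook design by splitting (see the Figure 3 flowchart).

    training: (N, d) array of training acoustic vectors.
    M:        desired number of codevectors (a power of two).
    eps, tol: splitting perturbation and stopping threshold (illustrative).
    """
    def nearest(codebook):
        # Assign every training vector to its nearest codevector and
        # return the assignments plus the average distortion.
        d2 = ((training[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        return idx, d2[np.arange(len(training)), idx].mean()

    # Initial codevector: centroid of the whole training sequence
    codebook = training.mean(axis=0, keepdims=True)

    while len(codebook) < M:
        # Split each codevector into two perturbed copies (m = 2*m)
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        idx, D_prev = nearest(codebook)
        while True:
            # Update each codevector to the centroid of its cluster
            for j in range(len(codebook)):
                members = training[idx == j]
                if len(members):
                    codebook[j] = members.mean(axis=0)
            idx, D = nearest(codebook)
            if (D_prev - D) / D < tol:               # distortion has converged
                break
            D_prev = D
    return codebook
```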
[Figure 3: Flowchart of the VQ-LBG algorithm — find the centroid of the training data; split each centroid (m = 2m); cluster the vectors; find the new centroids; compute the distortion D; if (D' − D)/D ≥ ε, set D' = D and repeat the clustering; once the distortion has converged, split again while m < M, and stop when m = M.]

In Figure 4, only two speakers and two dimensions of the acoustic space are shown. The circles refer to the acoustic vectors from speaker 1, while the triangles are from speaker 2.

[Figure 4: Conceptual diagram illustrating the VQ process]

In the training phase, a speaker-specific VQ codebook is generated for each known speaker by clustering his or her training acoustic vectors. The resultant codewords (centroids) are shown in Figure 4 by circles and triangles at the centers of the corresponding blocks for speakers 1 and 2, respectively. The distance from a vector to the closest codeword of a codebook is called the VQ distortion. In the recognition phase, an input utterance of an unknown voice is "vector-quantized" using each trained codebook and the total VQ distortion is computed. The speaker corresponding to the VQ codebook with the smallest total distortion is identified. Figure 5 shows the use of different numbers of centroids for the same data field.

[Figure 5: Pictorial view of codebooks with 5 and 15 centroids, respectively]
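A minimal sketch of this decision rule is shown below; the helper names avg_distortion and identify are introduced here for illustration only. Each enrolled speaker's codebook would be built from his or her training vectors (for example with the LBG sketch above), and the test utterance is scored against every codebook.

```python
import numpy as np

def avg_distortion(vectors, codebook):
    """Average distance from each test vector to its nearest codeword."""
    d2 = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean()

def identify(test_vectors, codebooks):
    """codebooks: dict mapping speaker name -> (M, d) codebook array.

    Returns the speaker whose codebook yields the smallest VQ distortion
    for the test utterance, i.e. the decision rule described above.
    """
    scores = {spk: avg_distortion(test_vectors, cb) for spk, cb in codebooks.items()}
    return min(scores, key=scores.get)
```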
5. RESULTS

The system has been implemented in Matlab 6.1 on a Windows XP platform.
The results of the study are presented in Table 1 and Table 2. The speech database consists of 21 speakers: 13 male and 8 female. Here, the identification rate is defined as the ratio of the number of speakers correctly identified to the total number of speakers tested; for example, correctly identifying 20 of the 21 speakers corresponds to a rate of 95.24%.

Table 1: Identification rate (in %) for different windows [using Linear scale]

Codebook size   Triangular   Rectangular   Hamming
1               66.67        38.95         57.14
2               85.7         42.85         85.7
4               90.47        57.14         90.47
8               95.24        57.14         95.24
16              100          80.95         100
32              100          80.95         100
64              100          85.7          100
Table 2: Identification rate (in %) for different windows [using Mel scale]

Codebook size   Triangular   Rectangular   Hamming
1               57.14        57.14         57.14
2               85.7         66.67         85.7
4               90.47        76.19         100
8               95.24        80.95         100
16              100          85.7          100
32              100          90.47         100
64              100          95.24         100
Table 1 shows the identification rate when a triangular, rectangular, or Hamming window is used for framing with a linear frequency scale. The table clearly shows that as the codebook size increases, the identification rate for each of the three cases increases; when the codebook size is 16, the identification rate is 100% for both the triangular and Hamming windows. In Table 2, the same windows are used along with a Mel scale instead of a linear scale. Here, too, the identification rate increases with the size of the codebook. In this case, a 100% identification rate is obtained with a codebook size of 4 when the Hamming window is used.

6. CONCLUSION

The MFCC technique has been applied for speaker identification, and VQ is used to reduce the amount of data in the extracted features. The study reveals that as the number of centroids increases, the identification rate of the system increases. It has been found that the combination of the Mel frequency scale and the Hamming window gives the best performance. The results also suggest that, in order to obtain satisfactory results, the number of centroids has to be increased as the number of speakers increases. The study shows that the linear scale can also give a reasonable identification rate if a comparatively higher number of centroids is used; however, the recognition rate using a linear scale would be much lower if the number of speakers increased. The Mel scale is also less vulnerable to changes in the speaker's vocal cords over time.

The present study is still ongoing and may include the following further work. HMM may be used to improve the efficiency and precision of the segmentation to deal with crosstalk, laughter and uncharacteristic speech sounds. A more effective normalization algorithm can be adopted on the extracted parametric representations of the acoustic signal, which would improve the identification rate further. Finally, a combination of features (MFCC, LPC, LPCC, formants, etc.) may be used to implement a robust parametric representation for speaker identification.

REFERENCES

[1] L. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, Englewood Cliffs, N.J., 1993.
[2] Z.-X. Yuan, B.-L. Xu and C.-Z. Yu, "Binary Quantization of Feature Vectors for Robust Text-Independent Speaker Identification", IEEE Transactions on Speech and Audio Processing, Vol. 7, No. 1, January 1999.
[3] F. Soong, E. Rosenberg, B. Juang and L. Rabiner, "A Vector Quantization Approach to Speaker Recognition", AT&T Technical Journal, Vol. 66, March/April 1987, pp. 14-26.
[4] Comp.speech Frequently Asked Questions WWW site, http://svr-www.eng.cam.ac.uk/comp.speech/
[5] J. R. Deller Jr., J. Hansen and J. Proakis, "Discrete-Time Processing of Speech Signals", 2nd ed., IEEE Press, New York, 2000.
[6] R. M. Gray, "Vector Quantization", IEEE ASSP Magazine, pp. 4-29, April 1984.
[7] Y. Linde, A. Buzo and R. Gray, "An Algorithm for Vector Quantizer Design", IEEE Transactions on Communications, Vol. 28, pp. 84-95, 1980.
