Speaker Recognition

Under guidance of
Dr. G. Pradhan
NIT PATNA (ECE dept.)
Presented by -
Kamlesh Kalvaniya -(1104080)
Niranjan Kumar –(1104087)
Piyush Kumar-(1104091)
B.TECH 4th yr (ECE dept.)
4/30/2016 N.I.T. PATNA ECE, DEPTT. 1

1. Introduction
2. Baseline speaker verification system
3. Future Plan

Speaker Recognition is the computing task of validating
identity claim of a person from his/her voice.
Applications:-
Authentication
Forensic test
Security system
ATM Security Key
Personalized user interface
Multi speaker tracking
Surveillance

Identification v/s verification

Phase of Speaker Verification
• Enrollment Session or Training Phase
• Operating Session or Testing Phase

Training & Testing Phase
Training Reference model
Speech
Identity claim
Testing
Speech R
Accept/reject
Pre-
processing
Feature
extraction
Model
Building
Pre-
processing
Feature
extraction comparison
Decision
logic

Preprocessing
Preprocessing is an important step in a speaker verification system. This also called
voice activity detection (VAD).
VAD separates speech region from non-speech regions[2-3]
It is very difficult to implement a VAD algorithm which works consistently for
different type of data
VAD algorithms can be classified in two groups
 Feature based approach
 Statistical model based approach
 Each of the VAD method have its own merits and demerits depending on accuracy,
complexity etc.
Due to simplicity most of the speaker verification systems use signal energy for VAD.

The speech signal along with speaker information
contains many other redundant information like
recording sensor, channel, environment etc.
The speaker specific information in the speech
signal[2]
 Unique speech production system
 Physiological
 Behavioral aspects
Feature extraction module transforms speech to a set
of feature vectors of reduce dimensions
 To enhance speaker specific information
 Suppress redundant information.
Feature Extraction

• Robust against noise and distortion
• Occur frequently and naturally in speech
• Be easy to measure from speech signal
• Be difficult to impersonate/mimic
• Not be affected by the speaker’s health or long term variations in voice
Selection of Features

Types Of Features

Feature Extraction Techniques
A wide range of approaches may be used to parametrically represent the speech
signal to be used in the speaker recognition activity.
 Linear Prediction Coding
 Linear Predictive Ceptral Coefficients
 Mel Frequency Ceptral Coefficients
 Perceptual Linear Prediction
 Neural Predictive Coding
Most of the state-of-the-art speaker verification systems use Mel-frequency
Cepstral Coefficient (MFCC) appended to it’s first and second order derivative
as the feature vectors
Easy to extract
Provides best performance compared to other features
 MFCC mostly contains information about the resonance structure of the vocal
tract system

1. Analog to digital conversion
2. Pre emphasis
3. Framing & windowing
4. Fast Fourier Transform
5. Mel scale wrapping
6. MFCC

MFCC
Step 1:- Analog to digital conversion: is transformed to
digital form by sampling it at given frequency.

MFCC
Step 2:- Pre-emphasis: The amount of energy present in
the high frequency (important for speech) are boosted.

MFCC
Step 3:(framing)the signal is divided into frames
of given size.

MFCC FRAMING

MFCC FRAMING
25ms
10ms

MFCC WINDOWING
• The next step is to window individual frame to
minimize the signal discontinuities at the
beginning and end of each frame.
• The concept applied here is to minimize the
spectral distortion by using the window to
taper the signal to zero at the beginning and
end of each frame.
• We have used hamming window

MFCC

MEL FILTERBANK

MFCC
DCT

Speaker Modelling
• Vector Quantization
• Gaussian Mixture Model
• Gaussian Mixture Model-UBM
• Hidden Markov Model
• Artificial Neural Networks
• Super Vector Machines
• I-Vector
 Gaussian model assumes the feature vectors follow a Gaussian distribution,
characterized by mean vectors, covariance matrix and weights
 The data unseen in the training which appear in the test data will trigger a low
score
Speaker models the statistical information present in the
feature vectors it enhances the speaker information and
suppress the redundant information

 A Gaussian mixture density defined as-
A Gaussian function for D dimension is defined as-
where- Unimodal Gaussian
D=8,16,32,64
ʎ i = {wi , ∑i µi }
wi = Weight
µi = Mean ;
∑i = Covariance ;
i-No. of models(M=356)
4/30/2016
N.I.T. PATNA ECE, DEPTT.
27
Gaussian Mixture Model

 For a sequence of T training vector X={x1 , x2 ,…, xT }
the GMM likelihood can be defined as-
 For estimation of speaker specific GMM,
Expectation maximization algorithm is used .

ʎtarget : X(MFCC(TESTING DATA)) is from the hypothesized
speaker S
ʎUBM : X(MFCC(TESTING DATA)) is not from the
hypothesized speaker S
 The likelihood ratio test is given by-
LR(X)=
 The probability of alternative hypothesis
P(X/ʎUBM ) =F( P(X/ʎ1), P(X/ʎ2),..., P(X/ʎM))
F( ) is function such as average or maximum of likelihood
value of Background Speaker set ( P(X/ʎi) ) .
4/30/2016 N.I.T. PATNA ECE, DEPTT.
30

 Score Normalisation
Where-
s- Original Score = log(LR(X));
µI - Estimated mean of s
σI -standard deviation of s

PERFORMANCE EVALUATION
 NIST has conducted speaker recognition
benchmarking activity on annual basis since
1997.
NIST has provided speech files as development
data.
NIST 2003 data-
Testing Speech Data-2559
Train Speech Data-356
UBM Female Speech data-251
UBM male Speech data-251

For Baseline speaker verification the following parameter are
used
 VAD: Energy based VAD (0.6 * average energy)
 Feature vector: 13 dimension MFCC appended with delta
and delta-delta
 Modeling: GMM
 GMM size: 8, 16, 32, 64.0
 Comparison: log Likelihood score

.

DET
PLOT
FOR
TEST
15 Sec
AND
TRAIN
15
SEC

DET
PLOT
FOR
TEST
FULL
AND
TRAIN
15
SEC

DET
PLOT
FOR
TEST
15 Sec
AND
TRAIN
FULL

DET
PLOT
FOR
TEST
FULL
AND
TRAIN
FULL

Comparison of training data model
with Equal Error Rate
.
GAUSSIAN SIZE
8
16
32
64
TEST 15 Sec
TRAIN 15 SEC
Test Full
Train 15 sec
TEST 15 sec
Train Full
Test Full
Train Full
EQUAL ERROR
RATE(%)
EQUAL ERROR
RATE(%)
EQUAL ERROR
RATE(%)
EQUAL ERROR
RATE(%)
34.90 34.24 33.18 27.70
33.05 32.28 30.50 25.67
32.46 32.94 28.78 23.67
32.82 33.06 27.42 22.05

Conclusion
 Performance is more sensitive to training
data.

Future Plan
 Synthetically generating training and testing speech
from limited speech data.
 Validating the results on state-of-the-art i-vector
based speaker verification system.

Thank you

Speaker Recognition

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (9)

Similaire à Speaker Recognition

Similaire à Speaker Recognition (20)

Dernier

Dernier (20)

Speaker Recognition