Text-Independent Speaker Recognition System
1. Project Members
ASHOK SHARMA PAUDEL (066/BEX/405)
DEEPESH LEKHAK (066/BEX/414)
KESHAV BASHYAL (066/BEX/418)
SUSHMA SHRESTHA (066/BEX/444)
2. OVERVIEW OF PRESENTATION
1. Introduction
2. Objective
3. System Architecture
4. Methodology
5. Results and Analysis
6. Application Areas
7. Limitations
8. Problems Faced
9. Conclusion
3. 1. INTRODUCTION
Speech - the universal method of
communication.
Information carried by the speech signal:
1. High-level characteristics - syntax, dialect, style, and the
overall meaning of a spoken message.
2. Low-level characteristics - pitch and phonemic spectra,
associated much more with the physiology of the vocal tract.
5. 1. INTRODUCTION(3)
Speech processing is a diverse field with many
applications.
[Diagram: one speech signal feeds three tasks]
Speech Recognition -> Words ("How are you?")
Language Recognition -> Language Name (English)
Speaker Recognition -> Speaker Name ("Deepesh")
6. 1. INTRODUCTION (4)
What is Speaker Recognition?
Recognition of who is speaking based on
characteristics of their speech signal.
Two modes: text-independent and text-dependent.
Speaker Identification: determines which
registered speaker has spoken.
Speaker Verification: accepts or rejects a
claimed identity of a speaker.
7. 1. INTRODUCTION (5)
Biometric: a human-generated signal or
attribute used to authenticate a person's
identity.
Why voice?
– A natural signal to produce.
– The only biometric that allows users to authenticate
remotely.
– Does not require a specialized input device, so
implementation cost is low.
– Ubiquitous: telephones and microphone-equipped PCs.
8. 1. INTRODUCTION (6)
Strongest security
• Voice biometric combined with other forms of security:
– Something you have - badge
– Something you know - password
– Something you are - voice
Why text-independent speaker recognition?
- Independent of the text: easy to access, cannot be
forgotten or misplaced.
- Independent of language; acceptable to users.
9. 2. OBJECTIVE
The main goal of the project is to design and
implement a text-independent speaker
recognition system on FPGA.
The specific goals can be summarized as:
1. To learn about digital signal processing and FPGA.
2. To implement and analyze the system in MATLAB.
3. To design and implement the system on FPGA.
10. 3. SYSTEM ARCHITECTURE
[Block diagram, in processing order]
Input audio -> Conditioning -> Analog-to-Digital Conversion
-> Double Data Rate SDRAM Storage -> Pre-emphasis
-> Framing and Windowing -> Fast Fourier Transform
-> Mel-Spectrum -> Mel-Frequency Cepstral Coefficients
-> Universal Asynchronous Receiver Transmitter (UART)
11. 4. METHODOLOGY
[Block diagram]
Input signal (training or testing data)
-> Feature extraction -> Feature matching
-> Threshold -> Output
12. 4.1. System Implementation on MATLAB
4.1.1. Voice Capturing and Storage
- Input through a microphone, saved in .wav format.
- Sound format: 22050 Hz, 16-bit PCM, mono channel.
16. 4.1.2. Pre-Processing(4)
1)Silence removal 2) Pre-emphasis 3)Framing 4)Windowing
x[n] = s'[n] · w[n − m],   n = m, m+1, …, m+N−1
where the window w[n] is defined for n = 0, 1, …, N−1. [2]
[2] Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification Using
MFCC and Support Vector Machine"
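The pre-emphasis, framing, and windowing steps above can be sketched in Python with numpy (standing in for the project's MATLAB code). The 512-sample frame length and 50% overlap follow the FPGA section of this deck; the Hamming window and the 0.97 pre-emphasis coefficient are common defaults assumed here, not stated in the slides.

```python
import numpy as np

def preemphasize(s, alpha=0.97):
    """Pre-emphasis filter s'[n] = s[n] - alpha * s[n-1] (alpha assumed)."""
    return np.append(s[0], s[1:] - alpha * s[:-1])

def frame_and_window(s, frame_len=512, overlap=0.5):
    """Split into 50%-overlapped frames and apply a window,
    i.e. x[n] = s'[n] * w[n - m] for each frame start m."""
    step = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(s) - frame_len) // step
    w = np.hamming(frame_len)  # window choice is an assumption
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        m = i * step
        frames[i] = s[m:m + frame_len] * w
    return frames

signal = preemphasize(np.random.randn(22050))  # 1 s at 22050 Hz
frames = frame_and_window(signal)
print(frames.shape)  # (85, 512)
```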
17. 4.1.3. Feature Extraction using MFCC
MFCC: Mel-Frequency Cepstral Coefficients
A perceptual approach: filters modeled on the human
perception of speech are applied to the sample
frames to extract the features of speech.
Steps for calculating MFCCs:
1. Discrete Fourier Transform using the FFT, and the
power spectrum |X[k]|² of the signal.
18. 4.1.3. Feature Extraction using MFCC(2)
2. Mel scaling
Mel scale: linear up to 1 kHz and logarithmic above 1 kHz.
Mapping the powers of the spectrum onto the Mel scale
using a Mel filter bank gives the Mel spectral coefficients G[k].
Filter bank: overlapping windows.
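The slides give no formula for the Mel mapping; one common form consistent with the "linear up to 1 kHz, logarithmic above" description is m = 2595·log10(1 + f/700). A sketch of a triangular, overlapping Mel filter bank built on that assumption (filter count, FFT size, and sample rate chosen to match figures elsewhere in this deck):

```python
import numpy as np

def hz_to_mel(f):
    # Assumed mel mapping (HTK-style); the slides do not specify one.
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters=20, n_fft=512, fs=22050):
    """Triangular filters with centers evenly spaced on the mel scale;
    neighboring filters overlap, as in the slide's description."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    return fb

fb = mel_filter_bank()
print(fb.shape)  # (20, 257)
```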
19. 4.1.3. Feature Extraction using MFCC(3)
3. The log of the Mel spectral coefficients is taken:
log(G[k]).
4. Discrete Cosine Transform (DCT) -> Mel-cepstrum
c[q].
(Source: Shi-Huang Chen and Yu-Ren Luo, "Speaker Verification
Using MFCC and Support Vector Machine")
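Steps 1-4 can be sketched end to end for a single frame (numpy stands in for MATLAB; the filter bank below is a random placeholder, and 13 coefficients is an assumed count within the 8-20 range reported later in the results):

```python
import numpy as np

def mfcc_from_frame(frame, fb, n_coeffs=13):
    """Sketch of steps 1-4: power spectrum -> mel spectrum G[k]
    -> log(G[k]) -> DCT -> mel-cepstrum c[q].
    `fb` is a mel filter bank of shape (n_filters, n_fft//2 + 1)."""
    n_fft = len(frame)
    # 1. DFT via FFT, then the power spectrum |X[k]|^2
    power = np.abs(np.fft.rfft(frame, n_fft)) ** 2
    # 2. Mel spectral coefficients G[k]
    G = fb @ power
    # 3. Log compression (small floor avoids log(0))
    logG = np.log(G + 1e-10)
    # 4. DCT-II (written out explicitly) -> cepstral coefficients c[q]
    M = len(logG)
    q = np.arange(n_coeffs)[:, None]
    k = np.arange(M)[None, :]
    dct = np.cos(np.pi * q * (2 * k + 1) / (2 * M))
    return dct @ logG

# Placeholder filter bank: 20 filters over a 512-point spectrum
fb = np.abs(np.random.randn(20, 257))
c = mfcc_from_frame(np.random.randn(512), fb)
print(c.shape)  # (13,)
```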
21. 4.1.4. Feature Matching using GMM
Gaussian Mixture Model:
- a parametric probability density function
- based on a soft clustering technique
- a mixture of Gaussian components
23. 4.1.4. Feature Matching using GMM(3)
The GMM modeling process consists of two
steps:
1. Initialization:
Initial values of the mean, covariance, and weights are
assigned.
2. Expectation-Maximization (EM):
Values of the mean, covariance, and weights are
updated iteratively to find the maximum
likelihood of the parameters.
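A minimal 1-D, two-component illustration of these two steps (not the project's code, which models multidimensional MFCC feature vectors; the initialization scheme here is an assumption):

```python
import numpy as np

def gmm_em(x, n_iter=50):
    """Two-step sketch: initialize mean/variance/weight, then refine
    them with Expectation-Maximization."""
    # 1. Initialization: means at the data extremes, unit variance, equal weights
    mu = np.array([x.min(), x.max()])
    var = np.ones(2)
    w = np.full(2, 0.5)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
                 / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: maximum-likelihood re-estimates from the responsibilities
        nk = resp.sum(axis=0)
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return w, mu, var

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-3, 1, 500), rng.normal(3, 1, 500)])
w, mu, var = gmm_em(x)
print(mu.round(1))
```

With two well-separated clusters, the estimated means converge near -3 and 3.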
24. 4.1.5. Identification & Verification
For speaker identification, the speaker model with the
maximum a posteriori probability within the group of
S registered speakers is selected.
For verification, a threshold on the log-likelihood
of the claimed speaker has been set on an
adaptive basis.
[Block diagram]
Feature Extraction -> Feature Matching -> Decision:
accept if score > threshold, reject otherwise.
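A toy sketch of both decisions (the speaker names, scores, and fixed threshold are invented for illustration; the project sets its threshold adaptively):

```python
# Hypothetical average log-likelihoods of one test utterance under
# each registered speaker's GMM (illustrative numbers only).
scores = {"alice": -42.1, "bob": -39.7, "carol": -45.3}

# Identification: pick the speaker model with the maximum score.
identified = max(scores, key=scores.get)

# Verification: accept the claimed identity only if its score
# clears the threshold (fixed here for simplicity).
THRESHOLD = -41.0
def verify(claimed):
    return scores[claimed] > THRESHOLD

print(identified, verify("bob"), verify("carol"))  # bob True False
```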
25. 4.2. System Implementation on FPGA
[Block diagram]
Mic -> Pre-amplification -> DC offset shifter
-> Analog-to-digital conversion -> Temporary buffer
-> Framing and windowing -> Fast Fourier Transform
-> Mel spectrum -> Log -> Discrete Cosine Transform
-> MFCC -> (UART) -> Computer (MATLAB)
26. 4.2. System Implementation on FPGA(2)
Sound Capture and Level Shifting
• The audio is captured using a condenser
microphone and amplified using an op-amp.
• The DC offset of the input audio signal is shifted to 1.65
volts.
Analog-to-Digital and Digital-to-Analog Conversion
• The Spartan 3E FPGA board has an ADC module with SPI
operation.
• 14-bit sample values are obtained from the ADC at
a rate of 25000 samples per second.
27. 4.2. System Implementation on FPGA(3)
Double Data Rate SDRAM
- ADC samples are stored in DDR SDRAM
temporarily before further processing.
- Burst mode 4 with burst length 2 is used, i.e., 64
bits are written to the SDRAM per burst.
- The Wishbone communication protocol is
used to communicate with the DDR SDRAM.
28. 4.2. System Implementation on FPGA(4)
Framing and windowing
ADC samples stored in the DDR are pre-emphasized.
50% overlapped frames with a frame
length of 512 samples are used.
Fast Fourier Transform
A 512-point radix-2 Fast Fourier Transform is
performed using a Xilinx LogiCORE block.
30. 4.2. System Implementation on FPGA(6)
Mel Spectrum
Spectrum (linear scale) => Mel spectrum
Log Calculation
The natural log is computed using look-up tables.
Input data: 24 bits
Output: 12 bits
31. 4.2. System Implementation on FPGA(7)
Discrete Cosine Transform (DCT)
DCT core from opencores.org
Input: 1 bit
Output: 16-bit parallel
Universal Asynchronous Receiver
Transmitter (UART)
Baud rate of 19.2 kbps.
Each MFCC (32 bits) is divided into four
8-bit components.
Implemented on an unused jumper pin,
using the UART protocol via CDC.
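The 32-bit-to-four-byte split can be sketched with Python's struct module (the little-endian byte order is an assumption; the slides do not state the order used on the board):

```python
import struct

def mfcc_to_bytes(value):
    """Pack one signed 32-bit MFCC value into four 8-bit UART payload
    bytes (little-endian assumed)."""
    return struct.pack("<i", value)

def bytes_to_mfcc(b):
    """Reassemble the four received bytes into the original int32."""
    return struct.unpack("<i", b)[0]

raw = mfcc_to_bytes(-123456)
print(len(raw), bytes_to_mfcc(raw))  # 4 -123456
```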
32. 4.3. Further processing in Matlab
MFCCs are received in MATLAB in int32
format.
Training phase: MFCC feature vectors =>
Gaussian Mixture Model.
Testing phase: MFCC feature vectors =>
posterior probability (recognition).
33. 5. RESULTS AND ANALYSIS
5.1. Output in MATLAB
Training data: 31 speakers (male - 20, female - 11)
Testing data length = 10-30 seconds
Training data length = 1-10 seconds
No. of MFCCs = 8-20
Up to 99% recognition when:
testing data length = 30 seconds,
training data length = 10 seconds,
No. of MFCCs = 20.
35. 5.1. Output in MATLAB(3)
The largest increase in performance occurs when the training
data increases from 10 to 20 sec; increasing it to 30 sec
improves the performance only slightly.
At most 30 sec of speech is needed to maintain high
performance.
There is an abrupt change in performance on increasing the
testing speech duration from 1 to 5 seconds, and only a slight
increase when it is increased from 5 to 10 seconds.
Using more training data improves the performance.
36. 5.1. Output in MATLAB(4)
77% of unknown female voices were matched with a
female voice; 85% of unknown male voices were matched
with a male voice.
During the experiments, four languages (English,
Nepali, Hindi, and German) gave correct speaker
recognition regardless of the spoken text and
language.
37. 5.1. Output in MATLAB(5)
Total Error Rate (TER) = FAR + FRR, where FAR is the
false acceptance rate and FRR the false rejection rate.
The threshold for speaker verification was
determined empirically using the FAR and FRR.
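An empirical threshold search of this kind can be sketched as follows (the genuine and impostor scores below are invented for illustration; the real system uses log-likelihoods from the GMMs):

```python
import numpy as np

# Hypothetical verification scores for genuine and impostor trials.
genuine = np.array([-38.0, -39.5, -40.2, -37.1, -41.0])
impostor = np.array([-44.0, -42.5, -46.1, -43.3, -45.0])

# Sweep candidate thresholds and keep the one minimizing TER = FAR + FRR.
best_t, best_ter = None, np.inf
for t in np.linspace(-47, -36, 200):
    far = np.mean(impostor > t)   # impostors wrongly accepted
    frr = np.mean(genuine <= t)   # genuine speakers wrongly rejected
    ter = far + frr
    if ter < best_ter:
        best_t, best_ter = t, ter
print(best_t, best_ter)
```

With these separable toy scores, any threshold between the highest impostor score and the lowest genuine score gives TER = 0.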
38. 5.2. Output Analysis in FPGA
The recognition rate is lower than that of the software
implementation.
Overall resource utilization in the FPGA:
i. RAMs : 7
ii. ROMs : 3
iii. Multipliers : 15
iv. Adders/ Subtractors : 18
v. Counters : 9
vi. Registers : 132
vii. Comparators : 20
viii. Multiplexers : 238
39. 5.2. Output Analysis in FPGA (2)
Device Utilization Summary
Logic utilization | Used | Available | Utilization
Number of Slice Flip-Flops | 8225 | 9312 | 88%
Number of 4-input LUTs | 8734 | 9312 | 93%
Number of occupied Slices | 2355 | 4656 | 54%
Number of Slices containing only related logic | 1325 | 1325 | 100%
Number of Slices containing unrelated logic | 0 | 1325 | 0%
Total Number of 4-input LUTs | 8903 | 9312 | 94%
Number of bonded IOBs | 215 | 232 | 94%
Number of RAMB16s | 7 | 20 | 35%
Number of BUFGMUXs | 2 | 24 | 8%
Number of MULT18X18SIOs | 15 | 20 | 75%
Average Fanout of Non-Clock Nets | 272 | |
40. 6. APPLICATIONS
Security
• Forensics (voice sample matching)
• Transaction authentication
• Toll fraud prevention
• Telephone credit card purchases
Access to information and physical facilities
• Access control
• Access to confidential information areas
• Computer and data networks
• Remote access of computers
Monitoring and services
• Remote time and attendance logging
• Information retrieval
• Audio indexing
• Voice dialing and voice mail
41. 7. LIMITATIONS
The duration of the speech signal limits the
performance.
Intrusion based on voice imitation
cannot be detected.
Choosing the optimal model order is difficult.
The silence removal process is not efficient.
42. 8. PROBLEMS FACED
Limited resources in the Spartan 3E.
Lack of sufficient block RAM and ROM memory.
Synchronization problems between different
modules/components.
43. 9. CONCLUSION
The system has been implemented using
MFCC for feature extraction and GMM to
model the speakers.
The performance of the software
implementation of the system is very good.
The implementation on the FPGA is not yet
satisfactory.
Noise reduction algorithms could be used to
improve the performance of the system.