The document describes a student project to implement speech recognition using FPGA technology. It aims to identify single words in a hardware system that is cost-effective, reliable and simple. Background theory on speech recognition is provided, including how sounds are converted to fingerprints using FFT and averaged amplitudes in the frequency domain for training. MATLAB code was created to test the concept before hardware implementation.
5. Objectives
● Hardware implementation of a simple speech recognition system.
● Single word identification.
● Cost efficiency, reliability, and simplicity are the major considerations.
Introduction ● Hardware Implementation ● Demo ● Final Remarks
Carlos Asmat – David López Sanzò – Kanwen Wu
6. Background Theory
● The sound identification is based on its frequency content.
● Two steps:
➔ Training
➔ Recognition
7. Background Theory
● A MATLAB™ implementation was devised to assess the project feasibility.
● Two files were produced:
➔ train.m
➔ recogniz.m
8. Background Theory
● Training:
➔ Input several versions of a sound.
➔ Translate them to the frequency domain by using the FFT.
➔ Average their amplitude in the frequency domain.
● This produces the sound's fingerprint.
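
The training steps above can be sketched in Python as follows. This is only an illustration, not the project's train.m: the function names are hypothetical, a naive DFT stands in for the FFT to keep the example short, and the recordings are assumed to be already aligned and of equal length.

```python
import cmath

def amplitude_spectrum(samples):
    """Magnitude of the DFT of `samples` (naive O(N^2) transform, fine for a demo)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(samples)))
        for k in range(n)
    ]

def train(recordings):
    """Average the amplitude spectra of several equal-length recordings."""
    spectra = [amplitude_spectrum(r) for r in recordings]
    count = len(spectra)
    # zip(*spectra) groups the values of each frequency bin across recordings.
    return [sum(column) / count for column in zip(*spectra)]

# Two hypothetical 8-sample "versions" of the same tone at different loudness.
quiet = [0, 1, 0, -1, 0, 1, 0, -1]
loud = [0, 3, 0, -3, 0, 3, 0, -3]
fingerprint = train([quiet, loud])
```

Both test tones sit in frequency bin 2 (and its mirror, bin 6), so the averaged fingerprint peaks there and is close to zero elsewhere.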
9. Background Theory
● Note on the FFT:
➔ Only half of it is used.
➔ Five 1024-point FFTs are performed per sound sample.
$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-\frac{2\pi i}{N} nk}, \qquad k = 0, \ldots, N-1$$
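
One likely reason only half of the transform is needed is that, for real-valued input, bin N−k is the complex conjugate of bin k, so the second half repeats the magnitudes of the first. The sketch below evaluates the formula directly on a short, arbitrary signal to show this; the helper name `dft` and the sample values are illustrative only.

```python
import cmath

def dft(x):
    """Direct evaluation of X_k = sum_n x_n * exp(-2*pi*i*n*k/N)."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_points) for n in range(n_points))
        for k in range(n_points)
    ]

signal = [3, 1, 4, 1, 5, 9, 2, 6]  # arbitrary real-valued samples
spectrum = dft(signal)
n_points = len(signal)

# For real input the magnitudes of bins k and N-k coincide.
mirrored = all(
    abs(abs(spectrum[k]) - abs(spectrum[n_points - k])) < 1e-9
    for k in range(1, n_points // 2)
)

# Keep bins 0 .. N/2-1; the bin at N/2 is dropped in this sketch.
half = spectrum[: n_points // 2]
```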
10. Background Theory
● User inputs .wav files.
● Decimate and quantize the input sound files.
● Sound acquisition parameters:
➔ Sound samples are quantized down to 8 bits.
➔ The sampling frequency is 5 kHz.
➔ Around one second (1.024 s) of sound is stored.
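
The acquisition parameters above imply 5120 stored samples (1.024 s at 5 kHz, i.e. five blocks of 1024). A minimal sketch of the decimate-and-quantize step follows. The 40 kHz source rate, the function names, and the float input range of [-1, 1] are assumptions made for the example, and no anti-aliasing filter is applied.

```python
TARGET_RATE_HZ = 5000   # sampling frequency from the slide
STORED_SAMPLES = 5120   # 1.024 s at 5 kHz

def decimate(samples, factor):
    """Keep every `factor`-th sample (no anti-aliasing filter in this sketch)."""
    return samples[::factor]

def quantize_8bit(sample):
    """Map a float in [-1.0, 1.0] to a signed 8-bit integer in [-128, 127]."""
    clipped = max(-1.0, min(1.0, sample))
    return max(-128, min(127, round(clipped * 128)))

def acquire(samples, source_rate_hz):
    """Decimate to the target rate, keep the stored length, quantize to 8 bits."""
    factor = source_rate_hz // TARGET_RATE_HZ
    reduced = decimate(samples, factor)[:STORED_SAMPLES]
    return [quantize_8bit(s) for s in reduced]

# Hypothetical input: 2 seconds of a sawtooth sampled at 40 kHz.
source_rate = 40000
raw = [((i % 200) - 100) / 100 for i in range(2 * source_rate)]
stored = acquire(raw, source_rate)
```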
11. Background Theory
● Sound detection:
➔ Compute the average of a window.
➔ Compare it to the average of the next window.
➔ If the difference is significant, then the sound is assumed to start at that point.
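
The detection rule above might look like the following sketch. The slides do not say what is averaged, how long a window is, or what counts as significant, so the mean absolute amplitude, the 64-sample window, and the threshold of 10 are all assumptions.

```python
def window_average(samples, start, length):
    """Mean absolute amplitude of samples[start:start+length]."""
    window = samples[start:start + length]
    return sum(abs(s) for s in window) / len(window)

def find_sound_start(samples, window_len=64, threshold=10.0):
    """Return the index where the sound is assumed to start, or None."""
    last_start = len(samples) - 2 * window_len
    for start in range(0, last_start + 1, window_len):
        current = window_average(samples, start, window_len)
        following = window_average(samples, start + window_len, window_len)
        # A jump in level from one window to the next marks the onset.
        if following - current > threshold:
            return start + window_len
    return None

# Hypothetical 8-bit recording: 256 samples of near silence, then a loud burst.
recording = [1, -1] * 128 + [60, -60] * 128
onset = find_sound_start(recording)  # 256, the first sample of the burst
```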
13. Background Theory
● Store detected sound stream into a vector.
● Apply FFT to the above vector's first 1024 points and put it in 's'.
● Store 's' as the first row in the matrix 'x' and repeat with the following 1024 points until there are five rows in 'x'.
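
A rough Python rendering of the procedure above is given below. The names `s` and `x` follow the slide; the recursive radix-2 FFT and the test signal are illustrative choices, not taken from the project's MATLAB files.

```python
import cmath

def fft(samples):
    """Recursive radix-2 FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def build_spectrum_matrix(sound, block_len=1024, rows=5):
    """Return matrix x holding one FFT row s per consecutive block of the sound."""
    if len(sound) < block_len * rows:
        raise ValueError("sound vector is too short")
    x = []
    for r in range(rows):
        s = fft(sound[r * block_len:(r + 1) * block_len])
        x.append(s)
    return x

# Hypothetical detected sound: 5120 samples of a square wave with a 16-sample period.
sound = [50 if (n // 8) % 2 == 0 else -50 for n in range(5120)]
x = build_spectrum_matrix(sound)
```

With a 16-sample period, the fundamental of the test signal falls in bin 1024/16 = 64 of every row.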
14. Background Theory
● Sound recognition:
➔ Compute the fingerprint of a sound.
➔ Compute the distance between the sound's fingerprint and the reference fingerprint.
➔ If both are close enough, then the sound is assumed to match the reference sound.
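
The recognition decision above reduces to a threshold test on a distance. In the sketch below the function name, the four-element fingerprints, and the threshold value are hypothetical; the slides do not state what "close enough" means numerically.

```python
import math

def is_match(fingerprint, reference, threshold):
    """True when the fingerprint lies within `threshold` of the reference."""
    return math.dist(fingerprint, reference) <= threshold

# Hypothetical, very short fingerprints for illustration.
reference = [0.0, 8.0, 0.0, 1.0]
same_word = [0.5, 7.5, 0.0, 1.0]
other_word = [6.0, 0.0, 2.0, 0.0]

accepted = is_match(same_word, reference, threshold=2.0)
rejected = is_match(other_word, reference, threshold=2.0)
```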
15. Background Theory
● Note on the distance computation:
➔ The sound's fingerprint and the reference fingerprint are considered as 1024-dimensional vectors.
➔ The distance between them is computed using the Euclidean distance formula:
$$D = \sqrt{\sum_{i=1}^{1024} (a_i - b_i)^2}$$
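
As a worked instance of the formula, the sketch below computes the distance between two made-up 1024-dimensional vectors that differ in only two components, so the result can be checked by hand. The function name and the values are illustrative.

```python
import math

def euclidean_distance(a, b):
    """D = sqrt(sum_i (a_i - b_i)^2) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Two hypothetical fingerprints differing in two components.
a = [0.0] * 1024
b = [0.0] * 1024
b[10] = 3.0
b[500] = 4.0
d = euclidean_distance(a, b)  # sqrt(3^2 + 4^2) = 5.0
```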
22. I²C Bus
Source: Wolfson WM8731 data sheets, p.43
● RADDR → Base address = 0011010
● R/W → Read/Write = 0
● B[15-9] → Control Address = 0000100
● B[8-0] → Control Data = 000001101
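
The fields above can be packed into the bytes that travel on the bus. The sketch assumes the usual two-wire layout: the 7-bit device address followed by the R/W bit, then the 16-bit word B[15-0] split into two bytes. Start/stop conditions and the ACK bits are left out, and the function name is hypothetical.

```python
DEVICE_ADDRESS = 0b0011010   # RADDR, 7 bits
READ_WRITE = 0               # R/W bit: 0 = write
CONTROL_ADDRESS = 0b0000100  # B[15-9], 7 bits
CONTROL_DATA = 0b000001101   # B[8-0], 9 bits

def build_frame(device, rw, reg, data):
    """Return the three bytes sent on the bus, in transmission order."""
    address_byte = (device << 1) | rw
    word = (reg << 9) | data  # 16-bit control word B[15-0]
    return [address_byte, (word >> 8) & 0xFF, word & 0xFF]

frame = build_frame(DEVICE_ADDRESS, READ_WRITE, CONTROL_ADDRESS, CONTROL_DATA)
# frame == [0x34, 0x08, 0x0D]
```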
23. I²C Bus
Source: Wolfson WM8731 data sheets, p.43
'MIC BOOST'
'MUTE MIC'
'INSEL'
● B[8-0] → Control Data = 000001101
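
To see what the control data sets, the word can be split into the labelled fields. The bit positions used below (MIC BOOST at bit 0, MUTE MIC at bit 1, INSEL at bit 2) are an assumption not stated on the slide and should be checked against the WM8731 register map.

```python
CONTROL_DATA = 0b000001101  # B[8-0] from the slide

# Assumed bit positions; verify against the data sheet before relying on them.
FIELD_BITS = {"MIC BOOST": 0, "MUTE MIC": 1, "INSEL": 2}

def decode_fields(value, field_bits):
    """Extract single-bit fields from a register value."""
    return {name: (value >> bit) & 1 for name, bit in field_bits.items()}

fields = decode_fields(CONTROL_DATA, FIELD_BITS)
# Under the assumed layout: {'MIC BOOST': 1, 'MUTE MIC': 0, 'INSEL': 1}
```

If the assumed layout holds, this value enables the microphone boost, leaves the microphone unmuted, and selects the microphone as the input.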
26. I²C Bus – ACK Signal
● ACK signal goes from the Wolfson to the FPGA
➔ Opposite direction from rest of data
➔ Only one data line
Solution... Tri-state buffer!
[Figure: LPM_BUSTRI block (inst) with ports data[], enabledt, enabletr, tridata[], result[]]
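
The idea behind the tri-state buffer can be modelled in a few lines: the FPGA drives the data line while sending and releases it during the ACK slot so the codec can answer on the same wire. This is a behavioural sketch in Python, not HDL and not the LPM_BUSTRI interface. The function names are hypothetical, and a pull-up on the line is assumed.

```python
def tri_state(data, enable):
    """Output of a tri-state buffer: `data` when enabled, None (high-Z) otherwise."""
    return data if enable else None

def sda_level(fpga_out, codec_out):
    """Resolve the shared line; an undriven line reads 1 (pull-up assumed)."""
    drivers = [v for v in (fpga_out, codec_out) if v is not None]
    if not drivers:
        return 1
    return 0 if 0 in drivers else 1  # any device pulling low wins

# Data bit: the FPGA drives the line, the codec stays released.
during_data = sda_level(tri_state(1, enable=True), codec_out=None)
# ACK slot: the FPGA releases the line, the codec pulls it low to acknowledge.
during_ack = sda_level(tri_state(1, enable=False), codec_out=0)
# ACK slot without a responding codec: the pull-up leaves the line high.
no_ack = sda_level(tri_state(1, enable=False), codec_out=None)
```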
31. Fast Fourier Transform
● Altera IP MegaCore® 1024-point FFT module:
➔ Natural order streaming data input.
➔ Bit-reversed streaming data output.
➔ Low latency.
➔ Time Limited Version.
[Figure: FFT block (inst1). Inputs: clk, reset_n, inverse, sink_valid, sink_sop, sink_eop, sink_real[7..0], sink_imag[7..0], sink_error[1..0], source_ready. Outputs: sink_ready, source_error[1..0], source_sop, source_eop, source_valid, source_exp[5..0], source_real[7..0], source_imag[7..0].]
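
Because the output arrives in bit-reversed order, each stream position has to be mapped back to its frequency bin before the spectrum can be used. The sketch below shows that index mapping for 1024 points (10 address bits). It is a software illustration with hypothetical function names, not a description of how the project or the MegaCore handles the reordering.

```python
def bit_reverse(index, bits=10):
    """Reverse the lowest `bits` bits of `index` (10 bits for 1024 points)."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index

def to_natural_order(bit_reversed_samples, bits=10):
    """Reorder a bit-reversed output stream into natural bin order."""
    natural = [None] * (1 << bits)
    for position, value in enumerate(bit_reversed_samples):
        natural[bit_reverse(position, bits)] = value
    return natural

# Simulated output stream: the value at position p is the bin number it carries.
stream = [bit_reverse(p) for p in range(1024)]
bins = to_natural_order(stream)  # bins[k] == k for every k
```

For example, stream position 1 carries bin 512 and position 3 carries bin 768.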