Speech signal time frequency representation

Methods and algorithms of speech recognition
course

Nikolay V. Karpov

nkarpov(а)hse.ru

Lection 2

 This lecture is concerned with the spectrogram as a
representation of how the frequency components within a
signal vary with time.
 Define what we mean by normalised time and frequency
 Define the short-term discrete Fourier transformaton
 Look at the effect of different window lengths on time and
frequency resolution
 Derive a quantitative description of the frequency
resolution of the short term DFT and compare the
performance of common windows
 Derive a quantitative description of the time resolution of
the short term DFT
 Explain the uncertainty principle and illustrate its effects
with some examples
 Review some of the properties of the DFT

Sampling
theorem says
that when the
analog signal
x(t) is band-
restricted
between 0 and
W[Hz] and when
x(t) is sampled
at every T =
1/2W, the
original signal
can be
completely
reproduced by

For those signals the
frequency bandwidths of
which are not known, a
low-pass filter is used to
restrict the bandwidths
before sampling. When a
signal is sampled
contrary to the sampling
theorem, aliasing
distortion occurs, which
distorts the high-
frequency components of
the signal, as shown

Designate
sin(2 wt )
g (t )
2 wt

i i i i
x(t ) x( ) g (t ) x( ) x( ) x(iT ) x(i )
i 1 2w 2w 2w Fs

Simple continuous signal x(t )
A cos(2 ft )
i k
x(t ) A cos(2 f ( ) ) A cos( i ) A cos(2 i )
Fs N
1 k
Normalised radian frequency 0 2 f( ) 2 (1) 0.5
Fs N
Normalised time i

With sampled data systems it is customary to change
the time scale and work in units of the sample period.
 In normalised units the sampling frequency (fs) and
period (T) both equal 1. They can therefore be
omitted from equations: any such omissions can be
deduced using dimensional consistency arguments.
 The Nyquist frequency is ½ Hz = πradians/second
 To convert back to real units:
◦ any time quantity must be multiplied by the real sample
period (or divided by the real sample frequency)
◦ any frequency or angular frequency quantity must be
multiplied by the real sample frequency (or divided by the
real sample period)

 The spectrogram shows the energy in a signal at each
frequency and at each time. We calculate this by
evaluating the short-term discrete Fourier transform
Z- tramsform
n
X ( z) x ( n) z
n

1
X ( z) X ( z ) z n 1dz
2 j c

The Fourier transform of x(n) can be computed from the z-
transform as:
i n
X( ) X ( z) z e i x ( n )e
n

The Fourier transform may be viewed as the z-transform evaluated
around the unit circle.
The Discrete Fourier Transform (DFT) is defined as a sampled
version of the Fourier shown above:
N 1 N 1
i 2 kn / N
X (k ) x ( n )e x ( n) X ( k )e i 2 kn / N

n 0 n 0

The Fast Fourier Transform (FFT) is simply an efficient computation
of the DFT.

 The Discrete Cosine Transform (DCT) is simply a
DFT of a signal that is assumed to be real and
even (real and even signals have real spectra!)
N 1
X (k ) x(0) 2 x(n) cos(2 kn / N ); k 0,1,..., N 1
n 0

Note that these are not the only transforms used in
speech processing
 Wavelets;
 Fractals;
 Wigner distributions;
 etc.

 We often want to estimate the “power
spectrum” of a non-stationary signal at a
particular instant of time.
 Multiply by a finite length window and take
the DFT.
 For a window of length N ending on sample m
we have:
N 1
i 2 k (m n) / N
X ( k ; m) w(n) x(m n)e
n 0
k-frequency, m-time

2
 X (k , m) X (k , m) X * (k , m) gives the power at a frequency of
k/N Hz (normalized) for a window centred at m-0.5(N-1).
 the (m-i) term in the exponent means that the phase origin
remains consistent by cancelling out the linear phase shift
introduced by a delay of m samples.
 the window samples are numbered backwards in time (for
convenience later) hence the summation is performed backwards
in time.
 The values X(k;m) are based on the N signal values from m–N+1
to m.
 the frequency resolution is 1/N Hz (normalised units)
 the spectrum is periodic and (since w and x are real) conjugate
symmetric:
X (k ) X (N k) X *( k)
 there are only 0.5N+1independent frequency values

Sound «is»

Hamming window of length
401 samples centred on 400

Line spectrum from larynx oscillation (at about 0.01
normalised Hz) is superimposed on vocal tract resonances

 Wide window (N=400)
 Independent frequency
values=N/2+1=201

 Short window eliminates the fine
detail (N=30)
 N/2+1=16

 Zero padded window gives more
spectrum points and an illusion of
more detail (N=480=30×16). This is
the normal case for a spectrogram
 N/2+1=241

N 1
X ( k ; m) w(n) x(m n) exp i 2 k ( m n) / N
n 0
By setting r= N–1–i we can rewrite this as:
2 j (m N 1) N 1
X ( k ; m) exp( ) * ym (r ) exp 2 jkr / N
N r 0

ym (r ) w( N 1 r ) x(m N 1 r )
This is a standard DFT multiplied by a phase-shift term that
is proportional to k: this compensates for the starting time
of the window: m–N+1

y(r) is a product of two signals so its DFT is the convolution
of the DFT’s of w(N–1–r) and x(m–N+1+r)

Rectangle window
w(n) 1
Slidelobe -13.3 dB
Mainlobe width(-3dB) 0.027
Leakage factor 9.14%
a=1.21, b=2

N
ck cos(2k (n ) / N)
2
Hamming window
w(n) 0.54 046 c1
Slidelobe -42.5 dB
Mainlobe width 0.039
Linkage factor 0.03%
a=1.81, b=4

Bartlett(L)
Blackman(L)
Rectwin(L)
Chebwin(L,R)
Hamming(L)
Hann(L)
Kaiser(L,beta)
taylorwin(L)
triang(L)

Спектр сигнала при использовании окна Блэкмана Спектр сигнала при использовании окна Блэкмана -
Наттала

BW = 45 Hz
NT = 44 ms

BW = 300 Hz
NT = 7 ms

/aI/ from “my” with 45Hz and 300Hz bandwidth spectrograms

45Hz, 44ms 300Hz, 7ms

 Vertical slice through
spectrogram:
 mT= 0.1s.
45 Hz gives finer frequency
resolution

 Horizontal slice through
spectrogram:
 k/NT= 3.5 kHz
300 Hz gives finer time
resolution

You cannot get good time resolution and good
frequency resolution from the same spectrogram.
 Duration of window = N/fs=N*T.
 Frequency resolution = fs×a/N
◦ Equal amplitude frequency components with this separation
will give distinct peaks
◦ Most windows have a≈2
 Time resolution = 2N/(a*fs)
◦ Amplitude variations with this period will be attenuated by
6 dB
 For all window functions, the product of the time and
frequency resolutions is equal to 2.
 Linguistic analysis typically uses a window length of
10–20 ms. The transfer function of the vocal tract
does not change significantly in this time.

 To keep all the information about time variation
of spectral components, you need only sample
the spectrum twice as fast as the spectral
magnitudes are varying. Using normalised
frequencies:
 –∞ dB bandwidth for an N-point window = b/N
◦ b = 4 for a Hamming window
 Significant variation of spectral components
occurs at frequencies below 2b/N
 we must sample the spectrum at a frequency of
b/N
◦ the separation between spectral samples is the window
width divided by b

 Exact line-spectrum of a periodic signal {xm}
 Sampled continuous spectrum of zero-extended {xm}
 Sampled continuous spectrum of infinite {xm} convolved with spectrum of
rectangular window
 FFT is an algorithm for calculating DFT

 Energy Conservation (Parseval’s theorem)

 Symmetries {xm} ⇒ {Xk}
 Discrete⇒Periodic:Xk+N=Xk
 Real⇒Hermitian: X–k= X*k
 Periodic:xm+N/r = x
 m ⇒Discrete:Xk= 0for k ≠ir
 Skew Periodic:xm+N/2r = –xm ⇒Odd Harmonics:
 Even:xm= x N–m ⇒Real
 Odd:xm= –x N–m ⇒Purely Imaginary
 Real & Even⇒Real & Even
 Real & Odd⇒Purely Imaginary and Odd

Speech signal time frequency representation

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (6)

Similaire à Speech signal time frequency representation

Similaire à Speech signal time frequency representation (20)

Plus de Nikolay Karpov

Plus de Nikolay Karpov (9)

Dernier

Dernier (20)

Speech signal time frequency representation