Dereverberation in the STFT and log mel-frequency feature domains

1 April 2012
Takuya Yoshioka

Dereverberation is necessary
for many speech applications

[Figure: ASR (connected digit recognition). Word error rate in % versus T60 in seconds (0.2 to 0.6 s).]

[Figure: ASR (LVCSR using WSJ-20K). Word error rate in % for clean training + MLLR and for multi-style training.]

[Figure: Source separation. SNR in dB at T60 = 0.3 s and T60 = 0.5 s.]

And others…
• Source localization
• Adaptive beamforming
• VAD

Acoustic feature extraction process
Microphone → STFT → |·|² → Mel FB → Log compression → DCT → Δ, ΔΔ → Decoder
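The processing chain on this slide can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the front-end used in the talk: the 257 STFT bins and 24 log mel coefficients match the numbers quoted later in the slides, while the 16 kHz sampling rate, Hann window, 160-sample hop, 13 cepstral coefficients, simple two-frame delta, and all function names are assumptions made for the example.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    """Triangular filters spaced uniformly on the mel scale: (n_mels, n_fft//2 + 1)."""
    n_bins = n_fft // 2 + 1
    hz_pts = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2))
    freqs = np.linspace(0.0, sr / 2.0, n_bins)
    fb = np.zeros((n_mels, n_bins))
    for m in range(n_mels):
        lo, ctr, hi = hz_pts[m], hz_pts[m + 1], hz_pts[m + 2]
        rising = (freqs - lo) / (ctr - lo)
        falling = (hi - freqs) / (hi - ctr)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def stft(signal, n_fft=512, hop=160):
    """Hann-windowed STFT; returns complex array (n_frames, n_fft//2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)

def dct_ii(x, n_out):
    """Unnormalized DCT-II along the feature axis, keeping n_out coefficients."""
    n = x.shape[1]
    k = np.arange(n_out)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    return x @ basis.T

def deltas(feat):
    """Central difference over neighbouring frames (edges are repeated)."""
    padded = np.pad(feat, ((1, 1), (0, 0)), mode="edge")
    return 0.5 * (padded[2:] - padded[:-2])

def extract_features(signal, sr=16000, n_fft=512, hop=160, n_mels=24, n_ceps=13):
    spec = stft(signal, n_fft, hop)                  # STFT coefficients
    power = np.abs(spec) ** 2                        # power spectra
    mel = power @ mel_filterbank(n_mels, n_fft, sr).T
    log_mel = np.log(mel + 1e-10)                    # log mel-frequency features
    ceps = dct_ii(log_mel, n_ceps)                   # cepstra
    d1 = deltas(ceps)
    d2 = deltas(d1)
    return spec, power, log_mel, np.hstack([ceps, d1, d2])
```

The function returns the intermediate representations (STFT coefficients, power spectra, log mel features) alongside the final feature vectors, since each of them is a possible point at which to apply dereverberation.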

STFT coefficients (output of the STFT stage):
fully benefit from the use of microphone arrays

Power spectra (output of the |·|² stage):
easy to combine with noise suppressors

Log mel-frequency features (output of the log compression stage):
efficient for reducing the acoustic mismatch between observations and training data

Notations
$n$ : frame index
$y_n$ : corrupted vector
$x_n$ : clean vector
$\hat{x}_n$ : estimate of $x_n$

Optimal estimation in the MMSE sense
$$\hat{x}_n = \int x_n \, p_{X|Y,Y_{past}}(x_n \mid y_n, \dots, y_1) \, dx_n$$

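Generative approach (using Bayes rule)
$$\hat{x}_n = \int x_n \, p_{X|Y,Y_{past}}(x_n \mid y_n, \dots, y_1) \, dx_n$$
$$p_{X|Y,Y_{past}}(x_n \mid y_n, \dots, y_1) \propto p_{Y|X,Y_{past}}(y_n \mid x_n, y_{n-1}, \dots, y_1) \times p_X(x_n)$$
Reverberation model: $p_{Y|X,Y_{past}}(y_n \mid x_n, y_{n-1}, \dots, y_1)$
Clean speech model: $p_X(x_n)$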

STFT domain (linear prediction): clean speech model, reverberation model
Log mel-frequency feature domain (VTS): clean speech model, reverberation model
In each domain: posterior distribution, parameter estimation

STFT domain: clean speech model, reverberation model
Log mel-frequency feature domain: clean speech model, reverberation model

Notations
$n$ : frame index
$y_n$ : corrupted complex-valued spectrum (consisting of 257 bins)
$x_n$ : clean complex-valued spectrum

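Clean STFT coefficients: normally distributed
$$p_X(x_n; \Lambda_{X,n}) = \prod_j f_{CN}(x_{n,j}; 0, \lambda^X_{n,j})$$
Clean PSD $\lambda^X_{n,j}$ (model, form, parameters):
• No model: no constraint on the form; parameters $\lambda^X_{n,1}, \dots, \lambda^X_{n,J}$
• All-pole model: $\lambda^X_{n,j} = \dfrac{\sigma^X_n}{\left|1 - \sum_p a^X_{n,p} e^{-i\omega_j p}\right|^2}$; parameters $(a^X_{n,p})_{p=1,\dots,P}$, $\sigma^X_n$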

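1-source 1-microphone case:
multi-step LP
$$y_{n,j} = x_{n,j} + \sum_{p \ge \Delta} g^*_{p,j} \, y_{n-p,j}$$
where $(y_{n,j})_{n=1,2,\dots}$ and $(x_{n,j})_{n=1,2,\dots}$ are the corrupted and clean STFT coefficient sequences in bin $j$.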


$$p_{Y|X,Y_{past}}(y_{n,j} \mid x_{n,j}, y_{n-1,j}, \dots, y_{1,j}; \Lambda_R) = \delta\Big(y_{n,j} - \sum_{p \ge \Delta} g^*_{p,j} \, y_{n-p,j} - x_{n,j}\Big)$$


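When model parameters are known
$$p_{X|Y,Y_{past}}(x_{n,j} \mid y_{n,j}, \dots, y_{1,j}; \hat{\Lambda}_X, \hat{\Lambda}_R) = \delta\Big(x_{n,j} - y_{n,j} + \sum_{p \ge \Delta} \hat{g}^*_{p,j} \, y_{n-p,j}\Big)$$
Inverse filtering:
$$\hat{x}_{n,j} = y_{n,j} - \sum_{p \ge \Delta} \hat{g}^*_{p,j} \, y_{n-p,j}$$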
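The multi-step LP model and its inverse filter can be checked numerically for a single frequency bin. The sketch below is an illustration under assumptions, not code from the talk: the function names, the finite filter order, and the convention that `g[k]` is the coefficient for lag `delay + k` are all chosen for the example.

```python
import numpy as np

def reverberate(x, g, delay):
    """Generate y[n] = x[n] + sum_k conj(g[k]) * y[n - delay - k] for one bin."""
    y = np.zeros(len(x), dtype=complex)
    for n in range(len(x)):
        acc = complex(x[n])
        for k, gk in enumerate(g):
            m = n - delay - k
            if m >= 0:
                acc += np.conj(gk) * y[m]
        y[n] = acc
    return y

def inverse_filter(y, g, delay):
    """Recover x_hat[n] = y[n] - sum_k conj(g[k]) * y[n - delay - k]."""
    y = np.asarray(y, dtype=complex)
    x_hat = y.copy()
    for k, gk in enumerate(g):
        p = delay + k
        if p >= len(y):
            break  # lag longer than the signal: nothing to subtract
        x_hat[p:] -= np.conj(gk) * y[:len(y) - p]
    return x_hat
```

When the prediction coefficients are known exactly, `inverse_filter(reverberate(x, g, d), g, d)` returns `x` up to rounding error, which is the deterministic (delta-function) posterior stated on the slide.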

ML for parameter estimation
$$L(\Lambda_X, \Lambda_R) = \sum_j \sum_n \log p_{Y|Y_{past}}(y_{n,j} \mid y_{n-1,j}, \dots, y_{1,j}; \Lambda_X, \Lambda_R)$$


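$$L(\Lambda_X, \Lambda_R) = \sum_j \sum_n \log \int p_{Y|X,Y_{past}}(y_{n,j} \mid x_{n,j}, y_{n-1,j}, \dots, y_{1,j}; \Lambda_R) \times f_{CN}(x_{n,j}; 0, \lambda^X_{n,j}) \, dx_{n,j}$$
with
$$p_{Y|X,Y_{past}}(y_{n,j} \mid x_{n,j}, y_{n-1,j}, \dots, y_{1,j}; \Lambda_R) = \delta\Big(y_{n,j} - \sum_{p \ge \Delta} g^*_{p,j} \, y_{n-p,j} - x_{n,j}\Big)$$
$$p_X(x_n; \Lambda_{X,n}) = \prod_j f_{CN}(x_{n,j}; 0, \lambda^X_{n,j})$$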

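$$L(\Lambda_X, \Lambda_R) = -\sum_j \sum_n \left[ \log \lambda^X_{n,j} + \frac{\left| y_{n,j} - \sum_{p \ge \Delta} g^*_{p,j} \, y_{n-p,j} \right|^2}{\lambda^X_{n,j}} \right] + \text{const.}$$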
If $\hat{\lambda}^X_{n,j}$ is known:
$$\hat{\Lambda}_{R,j} = \arg\min_{\Lambda_{R,j}} \sum_n \frac{\left| y_{n,j} - \sum_{p \ge \Delta} g^*_{p,j} \, y_{n-p,j} \right|^2}{\hat{\lambda}^X_{n,j}}$$
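With the clean PSD fixed, the minimization over the prediction coefficients is a weighted least-squares problem: each frame's squared prediction residual is divided by that frame's PSD value. A minimal sketch for one frequency bin follows. The finite filter order, the function names, and the use of `numpy.linalg.solve` on the normal equations are assumptions made for illustration.

```python
import numpy as np

def delayed_frames(y, delay, order):
    """Matrix whose row n holds y[n-delay], ..., y[n-delay-order+1] (zeros before start)."""
    n = len(y)
    Y = np.zeros((n, order), dtype=complex)
    for k in range(order):
        p = delay + k
        if p < n:
            Y[p:, k] = y[:n - p]
    return Y

def weighted_cost(y, g, lam, delay):
    """sum_n |y[n] - sum_k conj(g[k]) y[n-delay-k]|^2 / lam[n]."""
    Y = delayed_frames(y, delay, len(g))
    residual = y - Y @ np.conj(g)
    return float(np.sum(np.abs(residual) ** 2 / lam))

def update_filter(y, lam, delay, order):
    """Minimize weighted_cost over g by solving the weighted normal equations."""
    Y = delayed_frames(y, delay, order)
    Yw = Y / lam[:, None]            # apply the 1/lam weights row by row
    A = Yw.conj().T @ Y              # Y^H diag(1/lam) Y
    b = Yw.conj().T @ y              # Y^H diag(1/lam) y
    c = np.linalg.solve(A, b)        # c = conj(g)
    return np.conj(c)
```

Because the cost is quadratic in the coefficients, the solution is available in closed form, and no step size or line search is needed.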

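Iterative optimization
1. Initialize $\Lambda_R$.
2. Inverse filtering using the current $\hat{\Lambda}_R$.
3. Update $\Lambda_X$ to obtain $\hat{\Lambda}_X$.
4. Update $\Lambda_R$ to obtain $\hat{\Lambda}_R$.
5. Convergent? If not, return to step 2.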
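The alternating procedure can be sketched for a single frequency bin as follows. This is a hedged illustration rather than the talk's implementation: it assumes the "no model" clean PSD (one value per frame, set to the floored power of the current clean-speech estimate), a fixed number of iterations in place of a convergence test, a small ridge term for numerical stability, and hypothetical function and parameter names.

```python
import numpy as np

def delayed_frames(y, delay, order):
    """Matrix whose row n holds y[n-delay], ..., y[n-delay-order+1] (zeros before start)."""
    n = len(y)
    Y = np.zeros((n, order), dtype=complex)
    for k in range(order):
        p = delay + k
        if p < n:
            Y[p:, k] = y[:n - p]
    return Y

def dereverberate_bin(y, delay=2, order=5, n_iter=3, floor=1e-3):
    """Alternate between PSD updates and weighted least-squares filter updates."""
    y = np.asarray(y, dtype=complex)
    Y = delayed_frames(y, delay, order)
    g = np.zeros(order, dtype=complex)        # initialize the reverberation parameters
    x_hat = y.copy()                          # inverse filtering with g = 0
    lam = np.maximum(np.abs(x_hat) ** 2, floor)
    for _ in range(n_iter):
        # Update the clean speech parameters: per-frame PSD, floored for stability.
        lam = np.maximum(np.abs(x_hat) ** 2, floor)
        # Update the reverberation parameters: weighted least squares.
        Yw = Y / lam[:, None]
        A = Yw.conj().T @ Y + 1e-8 * np.eye(order)
        c = np.linalg.solve(A, Yw.conj().T @ y)   # c = conj(g)
        g = np.conj(c)
        # Inverse filtering with the updated coefficients.
        x_hat = y - Y @ c
    return x_hat, g, lam
```

The returned `lam` is the PSD used in the final filter update, so the weighted residual power of `x_hat` under those weights is never larger than that of the unprocessed `y`.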

Why LP model for reverberation?
The chain rule can be applied to derive the
likelihood function

Drawback
Non-minimum phase terms cannot be
accurately modeled
Solution:
using extra microphones

Extensions
• Integration with source separation
• Integration with additive noise reduction
• Adaptive inverse filtering
– Using an RLS-like algorithm
• Application to music signals
– Using a clean source model accounting for strong
harmonic structures
• Exploiting prior knowledge on room properties

Notations
$n$ : frame index
$y_n$ : corrupted log mel-frequency feature (consisting of 24 coefficients)
$x_n$ : clean log mel-frequency feature

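Clean features: pre-trained GMM
$$p_X(x_n; \Lambda_X) = \sum_k \pi_k \, f_N(x_n; \mu^X_k, \Sigma^X_k)$$
The component density $f_N(x_n; \mu^X_k, \Sigma^X_k)$ is denoted by $p_{X|K}(x_n \mid k; \Lambda_X)$.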
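As a concrete reading of the GMM clean-feature model, the sketch below evaluates the log density of one feature vector and the posterior probability of each mixture component. Diagonal covariances are an assumption made to keep the example short (the slides do not state the covariance structure), and the function name is hypothetical.

```python
import numpy as np

def gmm_log_density(x, weights, means, variances):
    """Return (log p(x), component posteriors) for a diagonal-covariance GMM.

    weights: (K,), means: (K, D), variances: (K, D), x: (D,)
    """
    x = np.asarray(x, dtype=float)
    diff = x[None, :] - means
    # log of each Gaussian component density, evaluated at x
    log_comp = -0.5 * (np.sum(np.log(2.0 * np.pi * variances), axis=1)
                       + np.sum(diff ** 2 / variances, axis=1))
    log_joint = np.log(weights) + log_comp
    # log-sum-exp over components for numerical stability
    m = np.max(log_joint)
    log_p = m + np.log(np.sum(np.exp(log_joint - m)))
    return log_p, np.exp(log_joint - log_p)
```

Working in the log domain matters for large mixtures, where individual component densities of a 24-dimensional feature can underflow.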

Reverberation model
[Figure: room impulse response, divided into direct sound, early reflections, and late reverberation.]

Reverberation model
$$Y_n = H \cdot X_n + R_n$$
• $H \cdot X_n$: direct sound and early reflections
• $R_n$: late reverberation (clean speech ∗ RIR > 50 ms)

Reverberation model
$$y_n = x_n + h + \log\big(1 + \exp(r_n - x_n - h)\big) = g(x_n, r_n, h)$$
$$p_{Y|X,R}(y_n \mid x_n, r_n; \Lambda_R) = \delta\big(y_n - g(x_n, r_n, h)\big)$$
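The mismatch function $g(x_n, r_n, h) = x_n + h + \log(1 + \exp(r_n - x_n - h))$ is the log of a sum of two power terms, $\exp(x_n + h)$ and $\exp(r_n)$, so it can be evaluated stably with a log-add-exp. The sketch below also gives its element-wise partial derivatives and a first-order Taylor expansion, which is the kind of linearization a VTS-based method relies on. The function names are hypothetical, and treating the mel channels independently is an assumption of this example.

```python
import numpy as np

def mismatch(x, r, h):
    """g(x, r, h) = x + h + log(1 + exp(r - x - h)), computed as log(e^(x+h) + e^r)."""
    return np.logaddexp(x + h, r)

def mismatch_jacobians(x, r, h):
    """Element-wise partial derivatives dg/dx and dg/dr (they sum to one)."""
    gx = 1.0 / (1.0 + np.exp(r - x - h))
    return gx, 1.0 - gx

def vts_linearize(x, r, h, x0, r0):
    """First-order Taylor expansion of g around the point (x0, r0)."""
    gx, gr = mismatch_jacobians(x0, r0, h)
    return mismatch(x0, r0, h) + gx * (x - x0) + gr * (r - r0)
```

When the late reverberation is much weaker than the early part (`r` far below `x + h`), `dg/dx` approaches one and the observation is close to `x + h`; in the opposite case the observation is dominated by `r`.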

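Reverberation model
$$p_{Y|X,Y_{past}}(y_n \mid x_n, y_{n-1}, \dots, y_1; \Lambda_R) = \int p_{Y|X,R}(y_n \mid x_n, r_n; \Lambda_R) \times p_{R|Y_{past}}(r_n \mid y_{n-1}, \dots, y_1; \Lambda_R) \, dr_n$$
$$p_{Y|X,R}(y_n \mid x_n, r_n; \Lambda_R) = \delta\big(y_n - g(x_n, r_n, h)\big)$$
$$p_{R|Y_{past}}(r_n \mid y_{n-1}, \dots, y_1; \Lambda_R) = f_N(r_n; y_{n-\Delta} + \beta^R, \Sigma^R)$$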

Connected digit recognition
• 1024-component GMM for VTS
• Clean complex back-end defined in Aurora2
• Evaluation data set consisting of 4004
reverberant utterances
– Simulated data
– Impulse responses measured in a varechoic room
– Speaker-microphone distance = 3.5 m
– T60 = 0.2 to 0.6 s

[Figure: Word error rate in % versus T60 in seconds (0.2 to 0.6 s) for unprocessed speech, dereverberated speech, and dereverberated speech (lower bound).]

Concluding remarks
• Dereverberation can be performed in
different domains
• The reverberation model must account for
the strong statistical dependencies
between consecutive observation frames
