DNN-based frequency component prediction for frequency-domain audio source separation. The paper proposes a new framework that combines frequency-domain audio source separation with a DNN to achieve high-quality separation at lower computational cost. The framework applies multichannel NMF to separate sources in the low-frequency band. A DNN then predicts the separated source components in the high-frequency band from the low-frequency separated sources and the mixture. Experiments show that the mixture components help the DNN expand the bandwidth of the separated sources, and the proposed framework achieves separation quality similar to fullband MNMF at half the computational cost.
1. DNN-based frequency component prediction for frequency-domain audio source separation
Rui Watanabe, Daichi Kitamura (National Institute of Technology, Japan)
Hiroshi Saruwatari (The University of Tokyo, Japan)
Yu Takahashi, Kazunobu Kondo (Yamaha Corporation, Japan)
28th European Signal Processing Conference (EUSIPCO) SS-2.4
2. Background
Audio source separation
– aims to separate audio sources such as speech, singing voice, and musical instruments
Products with audio source separation
– Intelligent speakers
– Hearing-aid systems
– Music editing by users, etc.
3. Background
Multichannel audio source separation (MASS)
– estimates the separation system from multichannel observations without knowing the mixing system
Popular methods for each condition
– Underdetermined (number of mics < number of sources)
• Multichannel nonnegative matrix factorization (MNMF) [Sawada+, 2013]
• Approaches based on deep neural networks (DNNs)
– Overdetermined (number of mics ≥ number of sources)
• Frequency-domain independent component analysis [Smaragdis, 1998]
• Independent vector analysis [Kim+, 2007]
• Independent low-rank matrix analysis (ILRMA) [Kitamura+, 2016]
4. Background
Frequency-domain MASS
– applies a short-time Fourier transform to the observed time-domain signal to obtain spectrograms
– estimates a frequency-wise separation filter
5. Conventional frequency-domain MASS
Multichannel nonnegative matrix factorization (MNMF) [Sawada+, 2013]
– Unsupervised source separation algorithm requiring no prior information or training
– Achieves high-quality MASS
– Huge computational cost for estimating the parameters
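MNMF's cost comes from iterating multiplicative updates over a very large parameter set. As a hedged, much-simplified illustration (single-channel Euclidean NMF rather than the full multichannel model with spatial covariance matrices; all dimensions are toy values, not from the paper), the update loop looks like:

```python
import numpy as np

def nmf(V, n_bases, n_iter=300, eps=1e-12, seed=0):
    """Euclidean-distance NMF via Lee-Seung multiplicative updates: V ~= W @ H.
    MNMF iterates analogous updates, but over far more parameters (spatial
    covariance matrices per source and frequency), hence its large cost."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, n_bases)) + eps   # basis spectra (F x K)
    H = rng.random((n_bases, T)) + eps   # activations   (K x T)
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy power spectrogram with an exact nonnegative rank-5 structure
rng = np.random.default_rng(1)
V = rng.random((64, 5)) @ rng.random((5, 32))
W, H = nmf(V, n_bases=5)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

Each iteration touches every element of W and H, so the cost grows with the number of frequency bins, which is what the proposed framework exploits by running MNMF on the low band only.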
6. Proposed method: motivation
High-quality MASS with low computational cost
A new framework combining frequency-domain MASS and a DNN
– Separate specific frequencies via MNMF to obtain the separated source components
– The source components of the remaining frequencies are predicted by the DNN
7. Proposed method: interpretation of DNN
The DNN in the proposed framework can be interpreted in two ways
1. Audio source separation of specific frequencies (the high-frequency band)
• The low-frequency bands can be used for predicting the high-frequency separated components
2. Audio bandwidth expansion of each source
• The high-frequency band of the mixture is a strong cue for expanding the bandwidth
8. Proposed method: details of framework
The observed multichannel spectrograms M1 and M2 are divided into low- and high-frequency bands
Apply MNMF to the low-frequency bands M1(L) and M2(L) to obtain the separated source components Y1(L) and Y2(L)
– The high-frequency bands M1(H) and M2(H) are not separated in this step
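The band split above can be sketched as below. The 4 kHz boundary being half the Nyquist frequency implies a 16 kHz sampling rate, and the 128 ms FFT length then corresponds to 2048 samples; the names `M1` and `split_bands` are illustrative, not from the paper:

```python
import numpy as np

fs = 16000            # implied by 4 kHz = half the Nyquist frequency
n_fft = 2048          # 128 ms frame at 16 kHz
boundary_hz = 4000

# Index of the first STFT bin at or above the boundary frequency
split = int(round(boundary_hz * n_fft / fs))

def split_bands(spec):
    """Split a (frequency x time) spectrogram into low and high bands."""
    return spec[:split], spec[split:]

# Dummy one-sided magnitude spectrogram: n_fft // 2 + 1 frequency bins
M1 = np.random.default_rng(0).random((n_fft // 2 + 1, 100))
M1_low, M1_high = split_bands(M1)
```

MNMF then sees only `M1_low` (and the corresponding second channel), roughly halving the number of frequency bins it must model.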
9. Proposed method: details of framework
Input M1(H), Y1(L), and Y2(L) to the DNN
– The DNN outputs softmasks W1 and W2 such that the high-frequency bands Y1(H) and Y2(H) are estimated from M1(H)
Apply the softmasks to M1(H)
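Applying the soft masks is an elementwise multiplication against the mixture high band; because the two masks sum to one in every time-frequency bin (the speaker notes state the sum of W1 and W2 is always unity), the two estimates add back up to the mixture. A minimal sketch with made-up shapes:

```python
import numpy as np

def apply_softmasks(M1_high, W1, W2):
    """Estimate the high-band source components from the mixture high band.
    W1 + W2 == 1 elementwise, so Y1_high + Y2_high reconstructs M1_high."""
    return W1 * M1_high, W2 * M1_high

rng = np.random.default_rng(0)
M1_high = rng.random((513, 100))                  # mixture high band (toy)
logits = rng.standard_normal((2, 513, 100))       # stand-in for DNN outputs
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)
Y1_high, Y2_high = apply_softmasks(M1_high, masks[0], masks[1])
```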
10. Proposed method: input vector of DNN
DNN prediction is performed for each time frame j (each column of the spectrograms)
– The input vector is a concatenation of several time frames around the j-th frame in M1(H), Y1(L), and Y2(L)
– The concatenated vector is normalized; the normalization coefficient is kept to preserve volume information
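The per-frame input vector might be assembled as follows; the context width and the edge handling (clamping) are assumptions, while returning the normalization coefficient so volume information survives is taken from the paper:

```python
import numpy as np

def make_input_vector(M_high, Y1_low, Y2_low, j, context=2, eps=1e-12):
    """Concatenate frames j-context..j+context of the mixture high band and
    the two separated low bands, then normalize to unit norm.  The norm is
    returned as well, preserving the signal-volume information."""
    frames = []
    for spec in (M_high, Y1_low, Y2_low):
        T = spec.shape[1]
        for t in range(j - context, j + context + 1):
            frames.append(spec[:, min(max(t, 0), T - 1)])  # clamp at edges
    v = np.concatenate(frames)
    coeff = np.linalg.norm(v) + eps
    return v / coeff, coeff

rng = np.random.default_rng(0)
v, coeff = make_input_vector(rng.random((513, 50)), rng.random((512, 50)),
                             rng.random((512, 50)), j=10)
```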
11. Proposed method: DNN architecture
Simple fully connected network
– Four hidden layers with Swish activations, followed by a frequency-wise Softmax output layer
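A NumPy sketch of that architecture (layer widths and the input dimension are made-up toy values; the paper specifies only four hidden layers with Swish and a frequency-wise Softmax producing masks that sum to one per frequency):

```python
import numpy as np

def swish(x):
    return x / (1.0 + np.exp(-x))

def forward(x, layers, n_freq):
    """Four Swish hidden layers, then a frequency-wise Softmax producing two
    masks that sum to one in every frequency bin."""
    h = x
    for W, b in layers[:-1]:
        h = swish(h @ W + b)
    W, b = layers[-1]
    logits = (h @ W + b).reshape(2, n_freq)        # two sources x n_freq bins
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
n_freq = 513
dims = [100, 64, 64, 64, 64, 2 * n_freq]           # toy layer widths
layers = [(0.1 * rng.standard_normal((a, b)), np.zeros(b))
          for a, b in zip(dims[:-1], dims[1:])]
masks = forward(rng.standard_normal(100), layers, n_freq)
```

Training would minimize the mean squared error between the masked mixture and the label spectra, per the speaker notes.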
12. Experiment 1: bandwidth expansion
Validation of the proposed framework
– Evaluate bandwidth expansion performance from the low-frequency band of the true sources, with and without the mixture
– Confirm the validity of the proposed framework, which utilizes mixture components for predicting the separated sources
– Use the sources-to-artifact ratio (SAR) [Vincent+, 2006]
13. Experiment 1: bandwidth expansion
Training conditions of the DNN
– Training dataset: 100 drums (Dr.) and vocals (Vo.) songs in the SiSEC2016 database [Liutkus+, 2016]
– FFT length / shift length: 128 ms / 64 ms
– Boundary frequency: 4 kHz (half of the Nyquist frequency)
– Epochs / batch size: 1000 / 128
– Optimizer: Adam (learning rate = 0.001)
Test dataset (SiSEC2011) [Araki+, 2012] for evaluation
– Song ID 1: dev1__bearlin-roads (Dr. & Vo.), 14.0 s
– Song ID 2: dev2__another_dreamer-the_ones_we_love (Dr. & Vo.), 25.0 s
– Song ID 3: dev2__fort_minor-remember_the_name (Dr. & Vo.), 24.0 s
– Song ID 4: dev2_ultimate_nz_tour (Dr. & Vo.), 18.0 s
14. Experiment 1: bandwidth expansion
Mixture components help to predict the high-frequency band of the separated sources (SAR)
– Song ID 1: DNN w/o mixture Dr. 21.1 dB, Vo. 21.8 dB; DNN w/ mixture Dr. 28.0 dB, Vo. 31.5 dB
– Song ID 2: DNN w/o mixture Dr. 22.0 dB, Vo. 12.7 dB; DNN w/ mixture Dr. 21.8 dB, Vo. 19.6 dB
– Song ID 3: DNN w/o mixture Dr. 15.0 dB, Vo. 11.2 dB; DNN w/ mixture Dr. 20.4 dB, Vo. 18.5 dB
– Song ID 4: DNN w/o mixture Dr. 11.0 dB, Vo. 10.4 dB; DNN w/ mixture Dr. 18.2 dB, Vo. 15.3 dB
15. Experiment 2: evaluate proposed MASS framework
Compare conventional fullband MNMF and the proposed framework
– In terms of separation accuracy (source-to-distortion ratio: SDR [Vincent+, 2006]) and computational efficiency
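For intuition, a simplified SDR can be computed as the energy ratio between the reference and the residual; the actual evaluation uses the full BSS Eval decomposition [Vincent+, 2006], which further splits the residual into interference and artifact terms, so this sketch is only an approximation:

```python
import numpy as np

def sdr_db(reference, estimate, eps=1e-12):
    """Simplified source-to-distortion ratio in dB: treats the entire
    residual (reference - estimate) as distortion."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2) + eps
    return 10.0 * np.log10(num / den + eps)

# Toy example: a clean 440 Hz tone vs. a slightly noisy estimate of it
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
ref = np.sin(2.0 * np.pi * 440.0 * t)
est = ref + 0.01 * np.random.default_rng(0).standard_normal(t.size)
```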
16. Experiment 2: evaluate proposed MASS framework
Experimental conditions of MNMF
– Multichannel observed signal: two-channel mixtures produced by convolving E2A impulse responses (reverberation time: 300 ms) with the sources of the test dataset
– Boundary frequency: 4 kHz
– Number of bases in MNMF: 13
17. Experiment 2: evaluate proposed MASS framework
(Results plot: SDR averaged over 10 random initializations vs. elapsed time for each song; black line: fullband MNMF, red circles: proposed framework plotted every 10 MNMF iterations)
18. Experiment 2: Song ID 4
– Since the number of frequencies is reduced by half, the proposed method is about twice as fast
– Fullband MNMF achieved 13 dB in 120 s
– The proposed method achieved 13 dB in less than 50 s
20. Conclusion
In this paper
– We proposed a computationally efficient audio source separation framework combining frequency-domain MASS and DNN-based frequency component prediction
– In the proposed framework, MASS is applied only to a limited frequency band, and the DNN predicts the remaining frequency components of the sources
– Compared with fullband MNMF, the proposed method achieves almost the same quality at half the computational cost
Thank you for your attention!
Editor's notes
Hi everyone, I'm Rui Watanabe from National Institute of Technology, Kagawa College, Japan.
I'm going to talk about DNN-based frequency component prediction for frequency-domain audio source separation.
Audio source separation is a technique to separate audio sources such as speech, singing voice, musical instruments, and so on.
This technology can be used for many products, including intelligent speakers, hearing-aid systems, and music editing by users.
In particular, multichannel audio source separation, MASS in short, estimates a separation system W using multichannel observation without knowing the mixing system A. (pointing at W and A)
This technique can be divided into two categories, for underdetermined and overdetermined situations.
The underdetermined situation is that the number of microphones is less than the number of sources in the mixture.
For this case, multichannel nonnegative matrix factorization, MNMF in short, is a popular algorithm.
Also, many DNN-based approaches have been proposed so far in this case.
On the other hand, in the overdetermined situation, the number of microphones is equal to or larger than the number of sources.
In this case, frequency-domain independent component analysis and independent low-rank matrix analysis are the most reliable approaches.
In this presentation, we only treat frequency-domain MASS.
In this algorithm, we perform a short-time Fourier transform on the observed time-domain signal and obtain the multichannel spectrograms. (pointing at the purple part of the figure)
Then, we estimate a frequency-wise separation filter, which is applied to each frequency like this (pointing at the center of the figure) to estimate the separated source signals.
Let me introduce the conventional frequency-domain MASS called multichannel nonnegative matrix factorization, MNMF in short.
This is an unsupervised source separation algorithm and does not require any prior information or training.
As an unsupervised technique, MNMF tends to provide high-quality separation performance.
In MNMF, the observed multichannel signal is represented by the time-frequency-wise channel correlation matrices denoted by X.
Since X is a frequency-by-time matrix whose elements are channel-by-channel matrices, this is a matrix of matrices, which is a fourth-order tensor. (pointing at the frequency-by-time part)
MNMF decomposes X into the source-wise spatial model and the low-rank spectral model of all the sources.
Thus, by clustering the spectral model into each source using the estimated spatial model, the source separation is achieved.
However, it requires a huge computational cost for estimating the parameters because there are so many parameters in this model.
In this presentation, our motivation is that we want to achieve high-quality MASS with a low computational cost.
And we propose a new source separation framework combining frequency-domain MASS and deep neural networks.
In this framework, as an initial process, the mixture signal in specific frequencies is separated by MNMF, and we obtain the separated source components in those frequencies.
In this figure, since only the low-frequency band of the mixture is input to MNMF, we can get the separated components in the low-frequency band.
Of course, the high-frequency bands of the separated sources are missing.
(pause)
As a post process, we apply DNN-based frequency component prediction; namely, the missing high-frequency bands of the separated sources are predicted by the DNN, where we input not only the separated low-frequency bands but also the high-frequency band of the mixture. (pointing at each input arrow)
Since the DNN prediction process is much faster than the MNMF process, we can reduce the total computational cost in this framework.
For example, if we divide the frequency bands in half like the figure, we can reduce the computational time by almost half.
In our framework, the post DNN process can be interpreted in two ways.
First, the DNN is an audio source separation of specific frequencies, the high-frequency band in this figure.
Please note that the low-frequency bands can be used for predicting the high-frequency separated components in our DNN model.
Second, the DNN can be seen as a bandwidth expansion of each source because the high-frequency bands are predicted.
In general, bandwidth expansion is a hard task even for a DNN.
However, in our model, the high-frequency band of the mixture becomes a strong cue to achieve the bandwidth expansion.
The details of the proposed method are as follows.
First, the observed multichannel spectrograms M1 and M2 are divided into low- and high-frequency bands.
Then, we apply MNMF to only the low-frequency band, M1(L) and M2(L), to obtain the separated source components Y1(L) and Y2(L).
The high-frequency bands M1(H) and M2(H) are not separated in this step.
Next, we input the high-frequency band of the mixture and the low-frequency bands of the separated sources, like this figure.
The DNN outputs two soft masks, W1 and W2, such that the high-frequency bands of the separated sources are calculated from M1(H) by multiplying them.
Of course, the masks are matrices with elements between zero and one, and the sum of each element in W1 and W2 is always unity.
The DNN prediction is performed for each time frame j, which is each column of the spectrograms.
To utilize the information along time in the prediction, the input vector for the DNN is a concatenation of several time frames around j in the mixture and the separated sources.
Also, before we input the vector to the DNN, we normalize it to stabilize the model training, where the normalization coefficient is added to keep the information of the signal volume.
The DNN model in the proposed method is very simple.
We have four fully connected hidden layers, and we apply the Swish function to each hidden layer.
Just before the output, we apply a frequency-wise Softmax function to ensure that the sum of the masks equals unity in each frequency.
The mean squared error between the separated source vector and the label vector is used as the loss function of the DNN training.
To confirm the validity of the proposed method, we have done two experiments.
In the first experiment, we evaluate the performance of the DNN model as bandwidth expansion.
That is, the DNN restores the high-frequency band from the low-frequency band of the completely separated sources, where we confirm whether the high-frequency band of the mixture is effective by comparing these two models. (pointing at the figure)
Therefore, we can confirm the validity of the proposed framework that utilizes mixture components for predicting the separated sources.
As an evaluation score, we use the sources-to-artifact ratio, SAR, which shows the absence of artificial distortions in the estimated audio signals.
This slide shows the experimental conditions.
For the training of the DNN, we used 100 songs with drums and vocals in the SiSEC2016 database.
The boundary frequency between the low- and high-frequency bands was set to 4 kHz, which is half of the Nyquist frequency.
As the test dataset, we used four songs included in the SiSEC2011 database, where these songs are mixtures of drums and vocals.
This is the result of bandwidth expansion.
For each song, / we showed the SAR values of Drums and Vocals.
Higher SAR indicates better audio quality.
Two columns show the results of DNN without mixture / and DNN with mixture.
In almost all results, / the DNN with mixture outperforms the DNN without mixture.
From this result, / we can confirm that / the mixture components help to predict the high-frequency band of the separated sources.
Thus, / we can expect that / the proposed framework will perform effectively / in a source separation task.
Next, we conducted the MASS experiment.
We compare the conventional MNMF and the proposed framework.
The conventional method separates the fullband mixture by MNMF, whereas the proposed framework separates only the low-frequency band by MNMF, and the high-frequency band is predicted by the DNN post process.
We expect that the computational time is reduced by skipping half the number of frequencies in the MNMF process while the separation performance is almost the same.
As a source separation score, we used the source-to-distortion ratio, SDR, which represents the total performance of source separation, including both the degree of separation and the quality of the separated signals.
The other conditions are shown in this slide.
The DNN is trained using the same dataset as in the previous experiment.
For the MASS test data, we produced two-channel observed mixtures by convolving E2A impulse responses with the drums and vocals sources of the test dataset, where the recording condition of the E2A impulse response is depicted here. (pointing at the figure)
The reverberation time of E2A is 300 ms.
The number of bases in MNMF was set to 13, which provides the best result for both the conventional and proposed methods.
This is the result for each song.
The vertical axis indicates the SDR score averaged over 10 random initial values. (pointing)
The horizontal axis shows the average elapsed time. (pointing)
The black line is the conventional method, fullband MNMF, and the red circles are the results of the proposed framework.
Since the elapsed time depends on the number of iterations of the parameter update in MNMF, for the proposed framework, we plot the results at every 10 iterations in the MNMF process.
Of course, the computational time for the DNN prediction process is included in each red circle, although the DNN process requires less than 0.1 s.
In all the results, / we can confirm the efficacy of the proposed method.
In particular, / Song ID 4 shows the result just as we expected, / so let me explain the result of Song ID 4.
In the case of Song ID 4, / the proposed method achieves 13 dB / in less than 50 s, / whereas fullband MNMF converged to 13 dB in 120 s.
This is because / the number of frequencies in MNMF is reduced by half.
In addition, / the proposed method outperforms fullband MNMF in Song IDs 1, 2, and 4.
In particular, / the improvement in Song ID 1 is very large.
The reason for these improvements might be that the proposed method performed more accurate estimation of the high-frequency band sources based on the training with 100 songs.
Also, / in the case of Song ID 1, / fullband MNMF might be trapped into a bad local minimum during the iterative optimization.