Conv-TasNet:
Surpassing Ideal Time-Frequency
Magnitude Masking for Speech Separation
Presenter : 何冠勳 61047017s
Date : 2022/04/14
Authors: Yi Luo, Nima Mesgarani
Published in arXiv:1809.07454v3 [cs.SD] 15 May 2019
Outline
1. Introduction
2. Architecture
3. Experiment
4. Results
5. Samples
6. Discussions & Conclusions
Introduction
Why Conv-TasNet? What is Conv-TasNet?
Why Conv-TasNet?
Plot shows scale-invariant SDR improvement (SI-SDRi) on WSJ0-2mix (higher is better)
Source: https://paperswithcode.com/task/speech-separation
Why Conv-TasNet?
[Figures: Milestone 1, Milestone 2]
Why Conv-TasNet?
◉ Robust speech processing in real-world acoustic environments often requires automatic speech
separation.
◉ Most previous speech separation approaches have been formulated in the time-frequency
representation (spectrogram) of the mixture signal.
◉ Alternatively, a weighting function (mask) can be estimated for each source to multiply each T-F bin in
the mixture spectrogram to recover the individual sources.
[Diagram: a noisy spectrogram is fed to nonlinear regression techniques, producing separated spectrograms 1 and 2, which are compared against clean spectrograms 1 and 2 by the loss.]
Why Conv-TasNet?
◉ In both the direct method and the mask estimation method, the waveform of each source is
calculated using the STFT and iSTFT.
◉ This method has several shortcomings.
○ STFT is a generic signal transformation, not necessarily optimal for speech separation.
○ Phase problem!! Accurate reconstruction of the phase of the clean sources is a
nontrivial problem.
○ Successful separation from the time-frequency representation requires a
high-resolution frequency decomposition.
◉ Because these issues arise from formulating the separation problem in the time-frequency
domain, a logical approach is to formulate the separation directly in the time domain.
What is Conv-TasNet?
◉ Convolutional Time-domain Audio Separation Network
◉ The main idea is to replace the STFT and iSTFT for feature extraction with a data-driven representation
that is jointly optimized with an end-to-end training paradigm.
◉ While the original TasNet uses an LSTM as its separation module, this significantly limits its applicability:
○ Choosing the kernel (segment) size is difficult; small kernels make training unmanageable.
○ High computational cost.
○ Inconsistent accuracy when the starting point of the same mixture is changed.
◉ Conv-TasNet uses a TCN to replace the deep LSTM networks for the separation step.
◉ The structure is simple: Conv-TasNet consists only of an encoder, a separation (mask estimation) module, and a decoder.
What is Conv-TasNet?
Architecture
- Architecture
- Formulating
- Separation module
- Normalization
Architecture
Architecture
◉ The encoder module transforms short segments of the mixture waveform into their
corresponding representations in an intermediate feature space.
◉ This representation is then used to estimate a multiplicative function (separation mask) for
each source at each time step.
◉ The source waveforms are then reconstructed by transforming the masked encoder features
using a decoder module.
Formulating
◉ The mixture is the sum of C sources; it is divided into overlapping segments x ∈ ℝ^(1×L).
◉ Encoder: w = H(xU), where U ∈ ℝ^(L×N) contains the N encoder basis functions and H(·) is an
optional nonlinear function; w ∈ ℝ^(1×N) is the encoder representation of each segment.
◉ Separation: a TCN, which uses depthwise separable convolution in every layer, produces a mask
m_i ∈ ℝ^(1×N) for each source; the masked representation is d_i = w ⊙ m_i.
If C = 2 (two speakers involved), two masks are estimated; if C = 3, three masks are estimated,
along with the constraint Σ_{i=1..C} m_i = 1 when Softmax is used.
◉ Decoder: each separated source segment is reconstructed as ŝ_i = d_i V, where V ∈ ℝ^(N×L)
contains the decoder basis functions.
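To make the data flow concrete, here is a minimal PyTorch sketch of this formulation. The shapes (N = 512 filters, L = 16-sample kernels, 50% stride, C = 2 sources) and all names are illustrative, and the random mask stands in for the TCN output:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch of the Conv-TasNet formulation (names are ours):
# encoder  w = ReLU(Conv1d(x))   mixture segments -> N-dim representation
# masks    m_i from the TCN,     one per source, applied multiplicatively
# decoder  s_i = ConvTranspose1d(w * m_i)
N, L, C = 512, 16, 2                  # basis filters, kernel length, sources
encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

x = torch.randn(1, 1, 32000)          # 4 s of an 8 kHz mixture
w = F.relu(encoder(x))                # (1, N, K) encoder representation
masks = torch.sigmoid(torch.randn(1, C, N, w.shape[-1]))  # stand-in for the TCN
sources = [decoder(w * masks[:, i]) for i in range(C)]    # masked -> waveforms
```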
Architecture
Separation module
◉ Motivated by TCN, we propose a fully-convolutional separation module that consists of stacked
1-D dilated convolutional blocks.
◉ Each layer in a TCN consists of 1-D convolutional blocks with increasing dilation factors. The
dilation factors increase exponentially to ensure a sufficiently large temporal context window to
take advantage of the long-range dependencies of the speech signal.
◉ The input to each block is zero padded accordingly to ensure the output length is the same as the
input.
◉ Each 1-D convolutional block has two outputs (see the sketch below):
○ a residual path, which serves as the input to the next block;
○ a skip-connection path; the skip-connections from all blocks are summed up and used as the output of the TCN.
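A minimal sketch of one such block, assuming the paper's overall layout (1×1 conv, depthwise dilated conv, PReLU activations, separate residual and skip 1×1 convs) but with illustrative channel sizes and the normalization layers omitted:

```python
import torch
import torch.nn as nn

class Conv1DBlock(nn.Module):
    """One TCN block (sketch): 1x1 conv -> depthwise dilated conv -> two 1x1
    convs producing a residual output and a skip output."""
    def __init__(self, in_ch=128, hid_ch=512, kernel=3, dilation=1):
        super().__init__()
        self.conv1x1 = nn.Conv1d(in_ch, hid_ch, 1)
        self.dconv = nn.Conv1d(hid_ch, hid_ch, kernel, dilation=dilation,
                               padding=dilation * (kernel - 1) // 2,
                               groups=hid_ch)        # depthwise: one filter/channel
        self.prelu1, self.prelu2 = nn.PReLU(), nn.PReLU()
        self.res_out = nn.Conv1d(hid_ch, in_ch, 1)   # residual path
        self.skip_out = nn.Conv1d(hid_ch, in_ch, 1)  # skip path

    def forward(self, x):                            # x: (batch, in_ch, time)
        h = self.prelu2(self.dconv(self.prelu1(self.conv1x1(x))))
        return x + self.res_out(h), self.skip_out(h)
```

Stacking X such blocks with dilations 1, 2, 4, ..., 2**(X-1) and summing the skip outputs gives one repeat of the separation module.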
Dilated?
[Figure: normal convolution vs. dilated convolution]
◉ Enables the network to use context beyond the immediate neighborhood.
◉ Preserves the input resolution throughout the network.
◉ Increasing the dilation factor exponentially results in exponential receptive-field growth with depth (see the snippet below).
“The receptive field, or sensory space, is a delimited medium where some physiological
stimuli can evoke a sensory neuronal response in specific organisms.”
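The receptive-field arithmetic can be checked directly; kernel size P = 3 and X = 8 layers are assumed values here:

```python
import torch
import torch.nn as nn

# With kernel size P and dilations 1, 2, 4, ..., 2**(X-1), each layer adds
# (P - 1) * dilation samples of context, so the receptive field is
# 1 + (P - 1) * (2**X - 1).
P, X = 3, 8
print(1 + (P - 1) * (2 ** X - 1))   # 511 samples of context

# A dilated Conv1d; padding = dilation * (P - 1) // 2 keeps the length fixed.
layer = nn.Conv1d(16, 16, kernel_size=P, dilation=4, padding=4)
print(layer(torch.randn(1, 16, 100)).shape)  # torch.Size([1, 16, 100])
```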
Residual & Skip connections
◉ Both residual and parameterised skip connections are
used throughout the network, to speed up convergence
and enable training of much deeper models.
◉ To further decrease the number of parameters, depthwise
separable convolution is used to replace standard
convolution in each convolutional block (see the sketch below).
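A sketch of the depthwise separable decomposition (a depthwise conv with groups equal to channels, followed by a pointwise 1×1 conv); channel and kernel arguments are illustrative:

```python
import torch.nn as nn

# Depthwise separable convolution: parameter count drops from roughly
# in_ch * out_ch * kernel (standard conv) to in_ch * kernel + in_ch * out_ch.
def depthwise_separable(in_ch, out_ch, kernel, dilation=1):
    return nn.Sequential(
        nn.Conv1d(in_ch, in_ch, kernel, dilation=dilation,
                  padding=dilation * (kernel - 1) // 2, groups=in_ch),
        nn.Conv1d(in_ch, out_ch, 1),   # pointwise mixing across channels
    )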
Residual?
◉ In traditional neural networks, each layer feeds into the next layer. In a network with residual
blocks, each layer feeds into the next layer and directly into the layers about 2–3 hops away.
Optimizing the residual F(x) := H(x) − x is more tractable for
deeper networks.
Normalization
The choice of the normalization method in the network depends on the causality requirement.
◉ For noncausal configuration, we found
empirically that global layer normalization
(gLN) outperforms all other normalization
methods.
◉ In the causal configuration, gLN cannot be applied,
since it relies on future values of the signal at
every time step. Instead, we designed a cumulative
layer normalization (cLN) operation (both are sketched below).
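Sketches of the two variants following these definitions; the learnable gain and bias of cLN are omitted for brevity:

```python
import torch
import torch.nn as nn

class GlobalLayerNorm(nn.Module):
    """gLN (sketch): normalize each utterance over both channel and time
    dimensions, with per-channel gain and bias. Non-causal: the statistics
    use the whole signal."""
    def __init__(self, channels, eps=1e-8):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1))
        self.eps = eps

    def forward(self, x):                        # x: (batch, channels, time)
        mean = x.mean(dim=(1, 2), keepdim=True)
        var = x.var(dim=(1, 2), keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / (var + self.eps).sqrt() + self.beta

def cumulative_layer_norm(x, eps=1e-8):
    """cLN (sketch): at each time step, normalize by statistics accumulated
    over channels and all frames up to that step, so no future values are used."""
    B, C, T = x.shape
    csum = x.cumsum(dim=2).sum(dim=1, keepdim=True)           # (B, 1, T)
    csum2 = (x ** 2).cumsum(dim=2).sum(dim=1, keepdim=True)
    count = C * torch.arange(1, T + 1, device=x.device).view(1, 1, T)
    mean = csum / count
    var = (csum2 / count - mean ** 2).clamp(min=0)
    return (x - mean) / (var + eps).sqrt()
```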
Experiments
- Dataset
- Training configurations
- Evaluation metrics
- Comparison
Dataset
◉ We evaluated our system on two-speaker and three-speaker speech separation problems using
the WSJ0-2mix and WSJ0-3mix datasets (30 hours of training and 10 hours of validation data).
◉ The speech mixtures are generated by randomly selecting utterances from different speakers in
WSJ0 and mixing them at random signal-to-noise ratios (SNR) between -5 dB and 5 dB.
◉ A 5-hour evaluation set is generated in the same way using utterances from 16 unseen speakers.
◉ All waveforms are resampled to 8 kHz.
Training configurations
◉ The networks are trained for 100 epochs on
4-second-long segments.
◉ The initial learning rate is set to 1e-3 and is
halved if the validation accuracy does not
improve for 3 consecutive epochs.
◉ Adam is used as the optimizer. A 50% stride size
is used in the convolutional autoencoder.
◉ Gradient clipping with a maximum L2-norm of 5 is
applied during training. (This setup is sketched below.)
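A PyTorch sketch of this training loop; `model`, `train_loader`, `criterion`, and `validate` are hypothetical stand-ins, not names from the paper's code:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=3)   # halve lr after 3 stalled epochs

for epoch in range(100):
    for mixture, sources in train_loader:            # 4-second segments
        loss = criterion(model(mixture), sources)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
        optimizer.step()
    scheduler.step(validate(model))                  # e.g. validation SI-SNR
```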
Training objectives
◉ The objective of training the end-to-end system is maximizing the improvement of the
scale-invariant source-to-noise ratio (SI-SNR), replacing the standard source-to-distortion
ratio (SDR).
◉ Utterance-level permutation invariant training (uPIT) is applied during training to address the
source permutation problem. (Both are sketched in the code below.)
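A sketch of negative SI-SNR with uPIT as the training loss, assuming estimates and references of shape (batch, C, time):

```python
import itertools
import torch

def si_snr(est, ref, eps=1e-8):
    # Scale-invariant SNR in dB; est, ref: (..., time). Means are removed first.
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) * ref \
               / (ref.pow(2).sum(-1, keepdim=True) + eps)
    e_noise = est - s_target
    return 10 * torch.log10(
        s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps))

def upit_si_snr_loss(est, ref):
    # uPIT: score every speaker permutation over the whole utterance, keep
    # the best one, and minimize the negative SI-SNR.
    C = est.shape[1]
    scores = torch.stack([
        torch.stack([si_snr(est[:, p[c]], ref[:, c]) for c in range(C)]).mean(0)
        for p in itertools.permutations(range(C))
    ])                                       # (num_perms, batch)
    return -scores.max(dim=0).values.mean()
```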
Evaluation metrics
◉ Report SI-SNRi & SDRi as objective measures of separation accuracy.
◉ Evaluate the quality of the separated mixtures using:
○ perceptual evaluation of speech quality (PESQ)
○ mean opinion score (MOS), obtained by asking 40 normal-hearing subjects to rate the quality of the
separated mixtures.
Comparison
◉ Following the common configurations mentioned above, the ideal time-frequency masks were
calculated using STFT with a 32 ms window size and 8 ms hop size with a Hanning window.
◉ The ideal masks include the ideal binary mask (IBM), ideal ratio mask (IRM), and Wiener
filter-like mask (WFM).
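A sketch of how these ideal masks can be computed from the clean sources; the IRM and WFM follow one common pair of definitions (magnitude ratio and squared-magnitude ratio), which may differ in exponent from other formulations:

```python
import torch

def ideal_masks(sources, n_fft=256, hop=64, eps=1e-8):
    """Ideal T-F masks for C clean sources (list of 1-D waveforms).
    At 8 kHz, a 32 ms window is 256 samples and an 8 ms hop is 64."""
    window = torch.hann_window(n_fft)
    specs = torch.stack([torch.stft(s, n_fft, hop, window=window,
                                    return_complex=True) for s in sources])
    mag = specs.abs()                                       # (C, freq, frames)
    ibm = (mag == mag.max(dim=0, keepdim=True).values).float()   # binary mask
    irm = mag / (mag.sum(dim=0, keepdim=True) + eps)             # ratio mask
    wfm = mag ** 2 / (mag.pow(2).sum(dim=0, keepdim=True) + eps) # Wiener-like
    return ibm, irm, wfm
```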
Results
- Encoder & decoder
- Best parameters
- Comparison
- Dissection of autoencoder
Visualization of the encoder and decoder basis functions,
encoder representation, and source masks for a sample
2-speaker mixture.
The estimated masks for the two speakers highly resemble
their encoder representations.
Encoder & decoder
◉ Review: w = H(xU), where the term H(·) is an optional nonlinear function.
◉ Compare five different encoder-decoder configurations:
a. Linear encoder with its pseudo-inverse (Pinv) as decoder, with a Softmax function for mask
estimation: w = xU and ŝ = (w ⊙ m)U†, where U† is the pseudo-inverse of U.
b. Linear encoder and decoder, with a Softmax or Sigmoid function for mask estimation:
w = xU and ŝ = (w ⊙ m)V.
c. Encoder with ReLU activation and linear decoder, with a Softmax or Sigmoid function for
mask estimation: w = ReLU(xU) and ŝ = (w ⊙ m)V.
Encoder & decoder
◉ We can see that the pseudo-inverse
autoencoder leads to the worst
performance, indicating that an explicit
autoencoder configuration does not
necessarily improve the separation
score.
◉ We use a linear encoder and decoder with
the Sigmoid function in all the following
experiments.
Best parameters
[Table: hyperparameter search results; the best 5 configurations are highlighted.]
Best parameters
Increasing the number of
basis signals in the
encoder/decoder increases
the overcompleteness of the
basis signals and improves
the performance.
Best parameters
Shorter segment length
consistently improves
performance.
Note that the best system
uses a filter length of only 2
ms (L = 16 samples / 8 kHz = 0.002 s),
which makes it very difficult
to train a deep LSTM network
with the same L due to the
large number of time steps in
the encoder output.
Best parameters
The bottleneck H/B was found
to be best around 5.
Regardless of receptive field,
deeper networks lead to better
performance, possibly due to
the increased model capacity.
Different models
◉ While the BLSTM-TasNet already outperformed the IRM
and IBM, the non-causal Conv-TasNet significantly
surpasses the performance of all three ideal T-F masks
in both SI-SNRi and SDRi.
◉ Moreover, it does so with a significantly smaller model size
compared with all previous methods.
Different models
◉ The non-causal Conv-TasNet system
significantly outperforms all previous
STFT-based systems in SDRi.
◉ The causal Conv-TasNet significantly
outperforms even the other two non-causal
STFT-based systems.
Quality evaluation - PESQ
◉ In addition to SDRi and SI-SNRi, we evaluated the subjective (by MOS) and objective (by PESQ)
quality of the separated speech and compared it with the three ideal time-frequency magnitude
masks.
◉ However, since PESQ aims to predict the subjective quality of speech, human quality evaluation
can be considered the ground truth.
Quality evaluation - MOS
◉ Therefore, we conducted a psychophysics experiment in which we asked 40 normal-hearing
subjects to listen to and rate the quality of the separated speech sounds.
◉ We randomly chose 25 two-speaker mixture sounds from WSJ0-2mix. We avoided a possible
selection bias by ensuring that the average PESQ scores for the IRM and Conv-TasNet separated
sounds for the selected 25 samples were equal to the average PESQ scores over the entire test
set.
◉ The subjects were asked to rate the quality of the clean utterances, the IRM-separated
utterances, and the Conv-TasNet separated utterances on the scale of 1 to 5.
1: bad, 2: poor, 3: fair, 4: good, 5: excellent
Quality evaluation - MOS
◉ The MOS for Conv-TasNet is higher than the MOS for the IRM.
Quality evaluation - MOS
◉ The subjective ratings of almost all utterances for
Conv-TasNet are higher than their corresponding PESQ
scores.
◉ This observation shows that PESQ consistently underestimates
the MOS for Conv-TasNet-separated utterances, which may be
due to PESQ's dependence on the magnitude spectrogram of
speech; this could produce lower scores for time-domain
approaches.
Speed
◉ Here we compare the processing speed of LSTM-TasNet and
causal Conv-TasNet. The speed is evaluated as the
average processing time for the systems to separate each
frame: time per frame (TPF).
◉ The processing in LSTM-TasNet is done sequentially, meaning each time frame must wait for the
completion of the previous time frame.
◉ Since Conv-TasNet decouples the processing of consecutive frames, the processing of subsequent
frames does not have to wait, allowing the possibility of parallel computing.
◉ TPF << frame length (2 ms) in the CPU configuration, so Conv-TasNet can perform real-time separation.
Stability
◉ Unlike language processing tasks, where sentences have well-defined starting words, it is difficult
to define a general starting sample or frame for speech separation and enhancement tasks.
◉ A robust audio processing system should therefore be insensitive to the starting point of the
mixture.
◉ We systematically examined the robustness of LSTM-TasNet and causal Conv-TasNet to the
starting point of the mixture by evaluating the separation accuracy for each mixture in the
WSJ0-2mix test set with different sample shifts of the input.
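A sketch of this robustness check; `model` and `metric` (e.g. an SDRi function) are hypothetical stand-ins:

```python
import torch

def stability_check(model, metric, mixture, sources, shifts=range(0, 128, 16)):
    # Separate the same mixture starting at different sample offsets and
    # collect the separation accuracy at each shift; a low spread across
    # shifts means the system is insensitive to the starting point.
    scores = [metric(model(mixture[..., s:]), sources[..., s:]) for s in shifts]
    return torch.tensor(scores)
```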
Stability
◉ Figure (A) shows how an example mixture performs
in the different models.
◉ Figure (B) shows the standard deviation of SDRi
across the whole test set.
◉ The causal Conv-TasNet performs consistently
well for all shift values of the input mixture.
◉ One explanation for this inconsistency is the sequential processing constraint in
LSTM-TasNet, which means that failures in previous frames can accumulate and affect the separation
performance in all following frames.
Dissection of autoencoder
◉ One of the motivations for replacing the STFT with
the convolutional encoder was to construct a
representation of the audio that is optimized for
speech separation.
◉ The majority of the filters are tuned to lower
frequencies. In addition, it shows that filters with
the same frequency tuning express various phase
values for that frequency.
◉ This result suggests an important role for
low-frequency features of speech such as pitch as
well as explicit encoding of the phase information
to achieve superior performance.
Samples
Enough numbers, let’s listen.
Samples
◉ Official samples: tasnet.html
◉ Inference samples (using a pre-trained model, not the official one, but the one here):
[Table of audio samples: mixture, spk1, spk2, clean1, clean2]
Discussions &
Conclusions
Discussions
◉ Most of the filters are tuned to low acoustic frequencies. This pattern of frequency representation,
which we found using a data-driven method, roughly resembles the well-known mel-frequency scale.
◉ In addition, the overexpression of lower frequencies may indicate the importance of accurate pitch
tracking in speech separation, similar to what has been reported in human multitalker perception
studies.
◉ The filters with the same frequency tuning explicitly express various phase information. In contrast, this
information is implicit in the STFT operation.
◉ However, there are limitations:
○ The long-term tracking of an individual speaker may fail.
○ The generalization to noisy and reverberant conditions must be further tested.
Conclusion
◉ The improvements are accomplished by replacing the STFT with a convolutional
encoder-decoder architecture.
◉ Separation in Conv-TasNet is done using TCN architecture together with a depthwise
separable convolution operation to address the challenges of LSTM networks.
◉ Conv-TasNet significantly outperforms STFT-based speech separation systems, not to mention
its much smaller model size and shorter minimum latency.
◉ In summary, Conv-TasNet represents a significant step toward the realization of speech
separation algorithms and opens many future research directions that would further
improve it, which could eventually make automatic speech separation a common and
necessary feature of every speech processing technology designed for real-world
applications.
Architecture of WaveNet
There are four mechanisms in WaveNet:
causal dilated convolutions, gated activation units, residual connections, and skip connections.
Architecture of Conv-TasNet
Gated activation units
◉ Gated activation units, the same as used in gated PixelCNN:
z = tanh(W_{f,k} ∗ x) ⊙ σ(W_{g,k} ∗ x)
∗: convolution; ⊙: element-wise multiplication
σ: sigmoid function
k: layer index; f: filter; g: gate
W: convolution filter
◉ Empirically found to outperform ReLUs.
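As a sketch, the unit in PyTorch, with `w_f` and `w_g` standing in for one layer's filter and gate convolutions:

```python
import torch
import torch.nn as nn

# Gated activation unit: z = tanh(W_f * x) ⊙ σ(W_g * x), with * a convolution.
def gated_activation(x, w_f, w_g):
    return torch.tanh(w_f(x)) * torch.sigmoid(w_g(x))

# Example with dilated convolutions (channel sizes are illustrative):
w_f = nn.Conv1d(32, 32, kernel_size=2, dilation=2)
w_g = nn.Conv1d(32, 32, kernel_size=2, dilation=2)
z = gated_activation(torch.randn(1, 32, 100), w_f, w_g)
```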
1-D Convolutional block
◉ Both residual and parameterised skip connections are
used throughout the network, to speed up convergence
and enable training of much deeper models.
◉ To further decrease the number of parameters, depthwise
separable convolution is used to replace standard
convolution in each convolutional block.
Any questions ?
You can find me at
◉ jasonho610@gmail.com
◉ NTNU-SMIL
Thanks!