1. Real-time neural text-to-speech with sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders
Takuma Okamoto (1), Tomoki Toda (2,1), Yoshinori Shiga (1) and Hisashi Kawai (1)
(1) National Institute of Information and Communications Technology (NICT), Japan
(2) Nagoya University, Japan
2. Outline
Introduction
Problems and purpose
Sequence-to-sequence acoustic model with full-context label input
Real-time neural vocoders
  WaveGlow vocoder
  Proposed single Gaussian WaveRNN vocoder
Experiments
Alternative sequence-to-sequence acoustic model (NOT included in the proceedings)
Conclusions
3. Introduction
High-fidelity text-to-speech (TTS) systems
WaveNet outperformed conventional TTS systems in 2016 -> end-to-end neural TTS
Tacotron 2 (+ WaveNet vocoder) [J. Shen et al., ICASSP 2018]
  Text (English) -> [Tacotron 2] -> mel-spectrogram -> [WaveNet vocoder] -> speech waveform
  Jointly optimizes text analysis, duration, and acoustic models with a single neural network
  No text analysis, no phoneme alignment, and no fundamental frequency analysis required
Problem: NOT directly applicable to pitch accent languages
Tacotron for a pitch accent language (Japanese) [Y. Yasuda et al., ICASSP 2019]
  Phoneme and accentual-type sequence input (instead of a character sequence)
  A conventional pipeline model with full-context label input still outperforms the sequence-to-sequence acoustic model
Goal: realizing high-fidelity synthesis comparable to human speech!
4. Problems and purpose
Problems in real-time neural TTS systems
  Results of the sequence-to-sequence acoustic model for a pitch accent language: full-context label input outperforms phoneme and accentual-type sequence input
  Many investigations of end-to-end TTS introduce an autoregressive (AR) WaveNet vocoder -> CANNOT realize real-time synthesis
  Parallel WaveNet with linguistic feature input: high-quality real-time TTS, but complicated teacher-student training with additional loss functions is required
Purpose: developing real-time neural TTS for pitch accent languages
  Sequence-to-sequence acoustic model with full-context label input based on the Tacotron structure
    Jointly optimizing phoneme duration and acoustic models
  Real-time neural vocoders without complicated teacher-student training
    WaveGlow vocoder
    Proposed single Gaussian WaveRNN vocoder
5. Sequence-to-sequence acoustic model
Sequence-to-sequence acoustic model with full-context label input based on the Tacotron structure
  Input: full-context label vector (phoneme-level sequence)
  Reducing the past and future two contexts, relying on the bidirectional LSTM structure (478 dims -> 130 dims)
  1 x 1 convolution layer instead of an embedding layer (see the sketch after this slide)
[Figure: Tacotron-style architecture. Input text -> text analyzer -> full-context label vector -> 1 x 1 conv -> 3 conv layers -> bidirectional LSTM -> location-sensitive attention -> 2 LSTM layers with a 2-layer pre-net -> linear projections for the mel-spectrogram and stop token -> 5-conv-layer post-net; mel-spectrogram -> neural vocoder -> speech waveform. Components replaced relative to Tacotron 2 are highlighted.]
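A minimal PyTorch sketch of the replaced input layer: a 1 x 1 convolution projects the continuous 130-dim full-context label vectors into the encoder, playing the role of the embedding lookup in vanilla Tacotron 2. The class name and channel width are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class FullContextFrontEnd(nn.Module):
    """Hypothetical encoder front-end: 1 x 1 conv instead of an embedding."""

    def __init__(self, label_dim=130, channels=512):
        super().__init__()
        # A kernel-size-1 convolution is a per-phoneme linear projection,
        # so continuous label vectors can replace discrete symbol IDs.
        self.proj = nn.Conv1d(label_dim, channels, kernel_size=1)

    def forward(self, labels):
        # labels: (batch, label_dim, num_phonemes)
        return self.proj(labels)  # (batch, channels, num_phonemes)

x = torch.randn(2, 130, 40)   # 40 phonemes, 130-dim full-context labels
h = FullContextFrontEnd()(x)  # (2, 512, 40), fed to the conv/BLSTM encoder
```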
6. WaveGlow [R. Prenger et al., ICASSP 2019]
Generative flow-based model
  Combines an image generative model (Glow) with a raw audio generative model (WaveNet)
  Training stage: speech waveform + acoustic feature -> white noise
  Synthesis stage: white noise + acoustic feature -> speech waveform
  Directly trains a real-time parallel generative model without teacher-student training
Investigated WaveGlow vocoder
  Acoustic feature: mel-spectrogram (80 dims)
  Training time: about 1 month using 4 GPUs (NVIDIA V100)
  Inference time as real-time factor (RTF): 0.1 using a GPU (NVIDIA V100); 4.0 using CPUs (Intel Xeon Gold 6148)
[Figure: WaveGlow flow. The ground-truth waveform x is squeezed to vectors and passed through 12 flow steps, each consisting of an invertible 1 x 1 convolution (W_k) and an affine coupling layer: x is split into (x_a, x_b), a WaveNet conditioned on x_a and the upsampled acoustic feature h predicts (log s_j, t_j), and an affine transform produces x'_b; the output is the latent z. A coupling-step sketch follows this slide.]
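A minimal sketch of one WaveGlow-style flow step (invertible 1 x 1 convolution followed by an affine coupling layer), with a toy conditioning network standing in for the real WaveNet-like one; all names and sizes here are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class FlowStep(nn.Module):
    """One toy flow step: invertible 1 x 1 conv + affine coupling."""

    def __init__(self, channels=8, cond_dim=80):
        super().__init__()
        self.conv1x1 = nn.Conv1d(channels, channels, kernel_size=1, bias=False)
        nn.init.orthogonal_(self.conv1x1.weight)  # keeps the 1 x 1 conv invertible
        half = channels // 2
        # Stand-in for the WaveNet that predicts (log s, t) from x_a and h.
        self.cond_net = nn.Conv1d(half + cond_dim, channels, kernel_size=3, padding=1)

    def forward(self, x, h):
        # x: (batch, channels, time) squeezed waveform; h: upsampled features
        x = self.conv1x1(x)
        x_a, x_b = x.chunk(2, dim=1)                 # coupling split
        log_s, t = self.cond_net(torch.cat([x_a, h], dim=1)).chunk(2, dim=1)
        x_b = x_b * torch.exp(log_s) + t             # affine transform
        # log_s.sum() contributes the log-determinant term to the flow loss.
        return torch.cat([x_a, x_b], dim=1), log_s.sum()
```

At synthesis time each step is simply inverted (subtract t, divide by exp(log s), apply the inverse 1 x 1 convolution), which is why the model generates all samples in parallel.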
7. WaveRNN vocoders for CPU inference
WaveRNN [N. Kalchbrenner et al., ICML 2018]: an early investigation of real-time synthesis using a CPU
  Sparse WaveRNN: real-time inference with a mobile CPU
  Dual-softmax: 16-bit linear PCM is split into coarse and fine 8 bits -> two samplings are required to synthesize one audio sample
Proposed single Gaussian WaveRNN
  Predicts the mean and standard deviation of the next sample, so continuous values can be predicted
  Initially proposed in ClariNet [W. Ping et al., ICLR 2019] and applied to FFTNet [T. Okamoto et al., ICASSP 2019]
  Only one sampling is sufficient to synthesize one audio sample (see the sampling sketch after this slide)
[Figure: (a) WaveRNN with dual-softmax — the acoustic feature h (37 or 80 dims) is upsampled and fed with the past coarse c_{t-1}, past fine f_{t-1}, and current coarse c_t 8-bit samples to a masked GRU (1024 units), followed by separate output heads (512 -> 256 -> 256) giving a softmax for c_t and a softmax for f_t; (b) proposed SG-WaveRNN — the upsampled h is concatenated with the previous waveform sample x_{t-1}, and a GRU (1024 units) with output layers (1024 -> 256 -> 2) predicts mu_t and log sigma_t.]
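A minimal sketch of why the single Gaussian output halves the sampling cost: the network emits (mu_t, log sigma_t) and the next continuous sample is drawn once from N(mu_t, sigma_t^2), instead of two categorical draws for the coarse and fine 8-bit parts. The function below is illustrative, not the authors' code.

```python
import torch

def sample_next(mu_t: torch.Tensor, log_sigma_t: torch.Tensor) -> torch.Tensor:
    """Draw one continuous audio sample from the predicted Gaussian."""
    sigma_t = torch.exp(log_sigma_t)                 # network predicts log sigma
    return mu_t + sigma_t * torch.randn_like(mu_t)   # single reparameterized draw

# One synthesis step: (mu, log sigma) come from the GRU's 2-dim output head.
mu, log_sigma = torch.tensor(0.02), torch.tensor(-4.0)
x_t = sample_next(mu, log_sigma)                     # next waveform sample
```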
8. Noise shaping for neural vocoders [K. Tachibana et al., ICASSP 2018]
Noise shaping method considering auditory perception
  Improves synthesis quality by reducing spectral distortion due to prediction error
  Implemented with an MLSA filter using time-averaged mel-cepstra
  Effective for categorical and single Gaussian WaveNet and FFTNet vocoders [T. Okamoto et al., SLT 2018, ICASSP 2019]
This work investigates its impact on the WaveGlow and WaveRNN vocoders (a simplified sketch follows this slide)
[Figure: (a) training stage — acoustic features are extracted from the speech corpus, a time-invariant noise shaping filter is computed from the averaged mel-cepstra, and the source signal is inverse-filtered (time-invariant noise weighting filter) and quantized to generate the residual signal used to train WaveNet / FFTNet; (b) synthesis stage — WaveNet / FFTNet generates the residual signal from the acoustic features, which is dequantized and inverse-filtered to reconstruct the speech signal; amplitude spectra illustrate the flattened residual and the restored envelope.]
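The paper implements the time-invariant filter as an MLSA filter with averaged mel-cepstra; the sketch below is a loose frequency-domain approximation of the same idea (whiten the speech with a fixed averaged envelope at training time, re-apply it at synthesis time), using only NumPy. It is an assumed simplification, not the authors' implementation.

```python
import numpy as np

def average_envelope(frames: np.ndarray, n_fft: int = 1024) -> np.ndarray:
    """Time-averaged amplitude envelope from windowed frames (num_frames, frame_len)."""
    log_spec = np.log(np.abs(np.fft.rfft(frames, n_fft)) + 1e-8)
    return np.exp(log_spec.mean(axis=0))      # fixed, time-invariant envelope

def inverse_filter(x: np.ndarray, env: np.ndarray) -> np.ndarray:
    """Training stage: flatten the spectrum to obtain the residual signal."""
    n = int(2 ** np.ceil(np.log2(len(x))))
    X = np.fft.rfft(x, n)
    H = np.interp(np.linspace(0, 1, len(X)), np.linspace(0, 1, len(env)), env)
    return np.fft.irfft(X / H, n)[: len(x)]   # residual the vocoder is trained on

def synthesis_filter(r: np.ndarray, env: np.ndarray) -> np.ndarray:
    """Synthesis stage: reshape the generated residual with the same envelope."""
    n = int(2 ** np.ceil(np.log2(len(r))))
    R = np.fft.rfft(r, n)
    H = np.interp(np.linspace(0, 1, len(R)), np.linspace(0, 1, len(env)), env)
    return np.fft.irfft(R * H, n)[: len(r)]   # reconstructed speech signal
```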
9. Experimental conditions
Speech corpus
  Japanese female corpus: about 22 h (test set: 20 utterances); sampling frequency: 24 kHz
Sequence-to-sequence acoustic model (following Tacotron 2's settings)
  Input: full-context label vector (130 dims)
Neural vocoders (with / without noise shaping)
  Single Gaussian AR WaveNet
  Vanilla WaveRNN with dual softmax
  Proposed single Gaussian WaveRNN
  WaveGlow
Acoustic features
  Simple acoustic features (SAF): fundamental frequency + mel-cepstra (37 dims)
  Mel-spectrograms (MELSPC): 80 dims (an extraction sketch follows this slide)
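A minimal sketch of extracting the 80-dim MELSPC feature at 24 kHz with librosa; the FFT and hop sizes are illustrative assumptions, not necessarily the paper's analysis settings.

```python
import librosa
import numpy as np

# Load at the corpus sampling frequency and compute an 80-band mel-spectrogram.
y, sr = librosa.load("utterance.wav", sr=24000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=300, n_mels=80)
log_mel = np.log(np.maximum(mel, 1e-5))   # (80, num_frames) log mel-spectrogram
```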
10. MOS results and demo
Subjective evaluation
  Listening subjects: 15 native Japanese speakers
  18 conditions x 20 utterances = 360 sentences per subject
Results
  Vanilla and single Gaussian WaveRNNs require noise shaping
  Noise shaping is NOT effective for WaveGlow
  Neural TTS systems with the sequence-to-sequence acoustic model and neural vocoders can realize higher-quality synthesis than the STRAIGHT vocoder under the analysis-synthesis condition
[Figure: five-point MOS scores for AR SG-WaveNet, WaveRNN, SG-WaveRNN, and WaveGlow under the SAF, MELSPC, and TTS conditions, each with and without noise shaping (NS), compared with STRAIGHT and original speech.]
11. Results of real-time factor (RTF)
Evaluation condition
  Using a GPU (NVIDIA V100) with a simple PyTorch implementation
Results
  The sequence-to-sequence acoustic model + WaveGlow realizes real-time neural TTS with an RTF of 0.16
  The single Gaussian WaveRNN synthesizes about twice as fast as the vanilla WaveRNN
Real-time high-fidelity neural TTS for Japanese can be realized (an RTF measurement sketch follows this slide)
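A minimal sketch of how an RTF can be measured: RTF = synthesis time / duration of the generated audio, so values below 1.0 mean faster than real time. The vocoder interface below is a placeholder, not the paper's implementation.

```python
import time
import torch

def measure_rtf(vocoder, mel: torch.Tensor, sr: int = 24000) -> float:
    """RTF of a (placeholder) vocoder mapping a mel-spectrogram to a waveform."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()           # finish pending GPU work before timing
    start = time.perf_counter()
    with torch.no_grad():
        audio = vocoder(mel)               # 1-D waveform tensor
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed / (audio.numel() / sr)  # e.g. 0.16 -> about 6x real time
```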
12. Conclusions
Real-time neural TTS with a sequence-to-sequence acoustic model and WaveGlow or single Gaussian WaveRNN vocoders
  Sequence-to-sequence acoustic model with full-context label input
  WaveGlow and proposed single Gaussian WaveRNN vocoders
  Realized real-time high-fidelity neural TTS using the sequence-to-sequence acoustic model and WaveGlow vocoder with a real-time factor of 0.16
Future work
  Implementing real-time inference with a CPU (such as sparse WaveRNN and LPCNet)
  Comparing the sequence-to-sequence acoustic model with conventional pipeline TTS models
    T. Okamoto, T. Toda, Y. Shiga and H. Kawai, "Tacotron-based acoustic model using phoneme alignment for practical neural text-to-speech systems," IEEE ASRU 2019, Singapore, Dec. 2019 (to appear)