1. UTTERANCE-LEVEL SEQUENTIAL MODELING FOR DEEP GAUSSIAN PROCESS BASED SPEECH SYNTHESIS USING SIMPLE RECURRENT UNIT
Tomoki Koriyama, Hiroshi Saruwatari
The University of Tokyo, Japan
May 7, 2020
TH2.PB.10, SPE-P12, ICASSP 2020
2. Background: deep learning for speech synthesis
‣ Neural network (NN)-based speech synthesis
•Model the relationship between text and speech parameters
‣ Differentiable components enable complicated models
•RNN (LSTM, GRU), CNN, Self-attention, Attention
‣ RNN for speech synthesis
•Can capture continuously changing speech parameters
•Was used in the best framework in Blizzard Challenge 2019
[Jiang2019]
•Is included in end-to-end frameworks (e.g. Tacotron [Wang2017])
3. Background: deep Gaussian process
‣ Deep Gaussian process (DGP) [Damianou2013][Salimbeni2018]
•Multi-layer Gaussian process regressions (GPRs)
•Nonlinear regression by kernel methods
•Bayesian learning considering model complexity
•Differentiable by variational approximation
‣ DGP-based speech synthesis [Koriyama2019]
•Outperformed the DNN-based method
•Restricted the model to a feedforward architecture
[Figure: feedforward DGP architecture; stacked GPRs map x → p(h₁|x) → p(h₂|x) → p(y|x)]
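For reference, the stacked-GPR composition shown in the figure can be written in standard DGP notation (not taken verbatim from the slides) as:

\[
  \mathbf{h}_1 = f_1(\mathbf{x}), \quad
  \mathbf{h}_2 = f_2(\mathbf{h}_1), \quad
  \mathbf{y} = f_3(\mathbf{h}_2) + \boldsymbol{\epsilon},
  \qquad f_\ell \sim \mathcal{GP}(0, k_\ell)
\]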
Is it possible to apply a recurrent architecture to DGP?
4. Extension of DGPs
‣ Convolutional DGP [Kumar2018], TICK-GP [Dutordoir2019]
•Incorporate CNN architecture into DGP
‣ Probabilistic recurrent state space model (PR-SSM)
[Doerr2018]
•Incorporate RNN architecture into DGP
•Perform GPR at each time step
•Require a long time for utterance-level training and generation
5. Purpose of study
To incorporate a recurrent architecture into DGP with fast
computation, enabling utterance-level sequential modeling
‣ Approach
•Utilize simple recurrent unit (SRU) [Lei2018]
•Separate the parallel computation of GPR from the recurrent architecture
6. Simple recurrent unit (SRU) [Lei2018]
SRU does not use the past hidden-layer value
to calculate gates or to update the memory cell
[Figure: LSTM vs. SRU update equations; the LSTM gates depend on the past hidden-layer value h^ℓ_{t−1}, whereas the SRU gates depend only on the current input, and only the memory cell c^ℓ_t carries the recurrence]
7. Simple recurrent unit (SRU) [Lei2018]
SRU can be decomposed into two blocks:
a parallel computation block and a light recurrent block (see the equations below)
[Figure: SRU block diagram; the parallel computation block applies a Linear transformation to the layer input to produce states and gates, and the light recurrent block combines them over time into the layer output]
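For reference, a common form of the SRU update from [Lei2018] is sketched below (biases included, the optional elementwise c_{t−1} terms in the gates omitted). The first line contains all matrix products, which have no time dependency and form the parallel computation block; the second line is the elementwise light recurrent block:

\[
  \tilde{\mathbf{x}}_t = \mathbf{W}\mathbf{x}_t, \quad
  \mathbf{f}_t = \sigma(\mathbf{W}_f \mathbf{x}_t + \mathbf{b}_f), \quad
  \mathbf{r}_t = \sigma(\mathbf{W}_r \mathbf{x}_t + \mathbf{b}_r)
\]
\[
  \mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + (1 - \mathbf{f}_t) \odot \tilde{\mathbf{x}}_t, \qquad
  \mathbf{h}_t = \mathbf{r}_t \odot g(\mathbf{c}_t) + (1 - \mathbf{r}_t) \odot \mathbf{x}_t
\]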
8. Simple recurrent unit for DGP
Replace the linear transformation with GPR
in the parallel computation block (a minimal code sketch follows the figure below)
[Figure: GP-SRU block diagram; the Linear transformation in the SRU parallel computation block is replaced with GPR, while the light recurrent block is unchanged]
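A minimal NumPy sketch of this idea, assuming an RBF kernel and a GPR predictive mean computed from inducing points (the paper uses the ArcCos kernel and variationally learned inducing outputs; gp_sru_layer and its arguments are illustrative names, and the highway path is simplified to use the GPR output):

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Hypothetical stand-in kernel; the paper uses the ArcCos kernel [Cho09].
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gp_sru_layer(X, Z, U):
    # One GP-SRU layer (illustrative only).
    #   X: (T, D)  input frames of one utterance
    #   Z: (M, D)  inducing inputs
    #   U: (M, 3H) inducing outputs for the [state, f-gate, r-gate] pre-activations
    T, H = X.shape[0], U.shape[1] // 3
    # Parallel computation block: GPR predictive mean for all frames at once
    Kzz = rbf_kernel(Z, Z) + 1e-6 * np.eye(Z.shape[0])
    Kxz = rbf_kernel(X, Z)
    pre = Kxz @ np.linalg.solve(Kzz, U)              # (T, 3H), no time dependency
    x_tilde, f_pre, r_pre = np.split(pre, 3, axis=1)
    f, r = sigmoid(f_pre), sigmoid(r_pre)
    # Light recurrent block: cheap elementwise recurrence over time
    c = np.zeros(H)
    h = np.zeros((T, H))
    for t in range(T):
        c = f[t] * c + (1.0 - f[t]) * x_tilde[t]
        h[t] = r[t] * np.tanh(c) + (1.0 - r[t]) * x_tilde[t]
    return h

# Example call with random inputs (T=100 frames, D=64, M=32 inducing points, H=256):
# h = gp_sru_layer(np.random.randn(100, 64), np.random.randn(32, 64), np.random.randn(32, 3 * 256))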
10. Utterance-level sampling for training
‣ In the training process of DGP, inference and sampling are
performed repeatedly for each layer [Salimbeni19]
‣ The utterance-level predictive distribution is a multivariate
Gaussian distribution 𝒩(h; μ, Σ):
•Hidden-layer values of adjacent frames are correlated
‣ Although the sampling can be performed using the
Cholesky decomposition of Σ, this is often unstable
‣ Use random feature expansion [Rahimi2008, Cutajar2017] for
stability of training (a minimal sketch follows)
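A minimal sketch of drawing a correlated utterance-level sample with random feature expansion instead of a Cholesky factor of Σ, assuming an RBF kernel (the experiments use the ArcCos kernel; sample_gp_rff is a hypothetical helper name):

import numpy as np

def sample_gp_rff(X, lengthscale=1.0, variance=1.0, n_features=256, seed=None):
    # Approximate GP sample at inputs X (T x D) using random Fourier
    # features [Rahimi2008]; avoids factorizing the T x T covariance Σ.
    rng = np.random.default_rng(seed)
    T, D = X.shape
    # Spectral frequencies and phases of the RBF kernel approximation
    W = rng.normal(scale=1.0 / lengthscale, size=(D, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    Phi = np.sqrt(2.0 * variance / n_features) * np.cos(X @ W + b)
    # A sample is a random linear combination of the features:
    # cov(Phi @ w) = Phi Phi^T ≈ Σ for w ~ N(0, I)
    w = rng.normal(size=n_features)
    return Phi @ w  # one correlated sample across all T frames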
13. Experimental conditions: model configurations
DGP
•Hidden layer dim.: 256
•# of inducing points: 1024
•Kernel function: ArcCos [Cho09]
•Optimizer: Adam (learning rate: 0.01)
BayesNN
•Hidden units: 1024
•Activation: ReLU
•Optimizer: Adam (learning rate: 10⁻⁵)
NN: Hyperparameters were tuned by Optuna [Akiba2019] with 100 trials.
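A minimal Optuna sketch of the 100-trial tuning mentioned above; the search space and training loop are not given on the slides, so the hyperparameters and the placeholder objective below are assumptions:

import optuna

def train_and_validate(learning_rate, hidden_units):
    # Placeholder for training the Bayesian NN and returning a validation loss;
    # the real training loop is not shown on the slides.
    return (learning_rate - 1e-5) ** 2 + abs(hidden_units - 1024) * 1e-9

def objective(trial):
    # Hypothetical search space
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-3, log=True)
    units = trial.suggest_categorical("hidden_units", [256, 512, 1024])
    return train_and_validate(lr, units)

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)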