Slide 3/46
Speech synthesis and its benefits
• Speech synthesis: methods for synthesizing speech by computer
– Text-To-Speech (TTS) [Sagisaka et al., 1988]
– Voice Conversion (VC) [Stylianou et al., 1998]
• What is required?
– Flexible control of the voice, beyond the ability of a single human
– High-quality speech generation comparable to a human
[Diagram: TTS turns text into speech; VC turns one speaker's voice into another's.]
Slide 4/46
Statistical parametric speech synthesis
• Statistical parametric speech synthesis [Zen et al., 2009]
– Statistical modeling of the relationship between input and output
– Better flexibility than unit selection synthesis [Iwahashi et al., 1993]
• HMM-based TTS & GMM-based VC* [Tokuda et al., 2013][Toda et al., 2007]
– Mathematical support for the flexibility
– Applications from other research areas
– But…
*HMM: Hidden Markov Model, GMM: Gaussian Mixture Model
Slide 5/46
Natural speech vs. synthetic speech in speech quality
[Audio demo: natural speech spoken by a human vs. synthetic speech from HMM-based TTS and GMM-based VC. Why the quality gap?]
Slide 6/46
Problem definition and rest of this talk
[Diagram: the pipeline text → text analysis / speech → speech analysis → acoustic modeling → speech parameter generation → waveform synthesis, annotated with its error sources: parameterization error (in analysis and synthesis), insufficient modeling (in acoustic modeling), and over-smoothing (in parameter generation). Approaches in this thesis: modeling of individual speech segments (Chapter 3) and the modulation spectrum for over-smoothing (Chapters 4 & 5), building on Chapter 2.]
Slide 7/46
Speech synthesis
[Outline diagram: Analysis → Modeling → Generation → Synthesis (Chapter 2), with Chapter 3 addressing the modeling of individual speech segments and Chapters 4 & 5 the modulation spectrum for over-smoothing.]
Slide 8/46
Two approaches to speech synthesis
• Unit selection synthesis [Iwahashi et al., 1993]
– High quality but low flexibility
[Diagram: text analysis → segment selection from a pre-recorded speech database → synthetic speech]
• Statistical parametric speech synthesis [Zen et al., 2009]
– High flexibility but low quality
[Diagram: text analysis / speech analysis → acoustic modeling → parameter generation → waveform synthesis]
Slide 9/46
Text/speech analysis and waveform synthesis
• Text analysis (e.g., [Sagisaka et al., 1990])
• Speech analysis (e.g., [Kawahara et al., 1999])
[Figure: text analysis decomposes a sentence into accent phrases (with High/Low accent labels) and phonemes (e.g., "a r a y u r u g e n j i ts u o"); speech analysis applies the Fourier transform and power computation, extracting the spectral envelope (spectral parameters) and the fine periodic structure (pitch, F0).]
Slide 10/46
Acoustic modeling in HMM-based TTS
• ML training of the HMM parameter sets λ:
  λ̂ = argmax_λ p(Y | X, λ)
[Diagram: text ("Hello") → text analysis → context labels X (e.g., sil-h+e, h-e+l, e-l+o); speech → speech analysis → speech features Y over time. Each HMM state holds a context-tied Gaussian N(·; μ, Σ) shared across similar contexts (e-l+o, a-l+o, o-l+o).]
[Zen et al., 2007]
Slide 11/46
Acoustic modeling in GMM-based VC
• ML training of the GMM parameter sets λ:
  λ̂ = argmax_λ p(Y_t, X_t | λ)
[Diagram: source and target speech → speech analysis → speech features X_t and Y_t; the joint vector [X_t; Y_t] at time t is modeled by a GMM of Gaussians N(·; μ, Σ).]
[Stylianou et al., 1998]
Slide 12/46
Probability to generate features in HMM-based TTS
• Probability of generating the synthetic speech features y:
  p(y | X, q, λ) = N(y; E_q, D_q)
[Diagram: the context labels X from text analysis ("Hello") determine the state sequence q = (q_1, q_2, …, q_t, …); the per-state means μ_{q_t} are stacked into the mean vector E_q, and the per-state precisions Σ_{q_t}^{-1} into the inverse covariance matrix D_q^{-1}.]
[Tokuda et al., 2000]
Slide 13/46
Probability to generate features in GMM-based VC
• Probability of generating the synthetic speech features y:
  p(y | X, q, λ) = N(y; E_q, D_q)
[Diagram: the source speech features X from speech analysis determine the mixture-component sequence q = (q_1, q_2, …, q_t, …); the per-component means μ_{q_t} are stacked into the mean vector E_q, and the per-component precisions Σ_{q_t}^{-1} into the inverse covariance matrix D_q^{-1}.]
[Toda et al., 2007]
Slide 14/46
Speech parameter generation
• ML generation of the synthetic speech parameters ŷ:
  ŷ = argmax_y p(o | X, q, λ) = argmax_y p(y, Δy | X, q, λ)
– Computationally efficient generation (solved in closed form)
[Figure: the static features y and their temporal deltas Δy over time, generated from the per-frame means and variances.]
[Tokuda et al., 2000]
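The closed-form solution above can be sketched numerically. Assuming per-frame Gaussians over stacked static and delta features with diagonal covariance, the ML static trajectory is (WᵀD⁻¹W)⁻¹WᵀD⁻¹E, where W appends central-difference deltas. This is a minimal illustration (names and the delta window are assumptions, not taken from the thesis):

```python
import numpy as np

def mlpg(means, variances):
    """Sketch of closed-form ML parameter generation (MLPG-style).
    means, variances: (T, 2) per-frame Gaussian statistics over
    [static, delta] features (diagonal covariance assumed).
    Returns the static trajectory c maximizing N(Wc; E, D)."""
    T = means.shape[0]
    W = np.zeros((2 * T, T))          # maps static c to stacked [static; delta]
    for t in range(T):
        W[2 * t, t] = 1.0             # static row: copies c_t
        if 0 < t < T - 1:             # delta row: central difference
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    E = means.reshape(-1)             # stacked mean vector
    Dinv = np.diag(1.0 / variances.reshape(-1))
    # Normal equations: c = (W^T D^-1 W)^-1 W^T D^-1 E
    return np.linalg.solve(W.T @ Dinv @ W, W.T @ Dinv @ E)
```

Because the objective is a Gaussian log-likelihood in c, the maximizer comes from one banded linear system, which is why generation is fast.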
Slide 15/46
Statistical sample-based speech synthesis
[Outline diagram: Analysis → Modeling → Generation → Synthesis (Chapter 2); this part (Chapter 3) addresses the modeling of individual speech segments, while Chapters 4 & 5 address the modulation spectrum for over-smoothing.]
Slide 16/46
Quality degradation by acoustic modeling
• Averaging across input features
– A context-tied Gaussian N(·; μ, Σ) in HMM-based TTS is shared across contexts (e.g., e-l+o, a-l+o, o-l+o)
– Robust to unseen contexts
– But loses the information of the individual speech parameters
• Proposed approach
– Models the individual speech parameters while keeping robustness
– Selects one model during parameter generation
– → Alleviates the quality degradation caused by averaging
Slide 17/46
Acoustic modeling of the proposed method
• From the tied model N(·; μ, Σ) to the Rich context-GMM (R-GMM)
– Rich context models [Yan et al., 2009]: less-averaged models that retain robustness; the mean is updated per context (e.g., e-l+o, a-l+o, o-l+o) while the covariance stays tied
– R-GMM: gathers the rich context models with equal mixture weights, forming a model with the same structure as the conventional tied model
Slide 18/46
Speech parameter generation from R-GMMs
• ML generation of the synthetic speech parameters ŷ
– Iterative generation with explicit model selection*
  ŷ = argmax_y p(y, Δy | m, X) p(m | y, Δy, X)
[Figure: generated static-feature trajectories (mean ± variance) over time, for the tied model and for the R-GMM.]
* λ (the HMM/GMM parameter sets) is omitted.
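The alternation between model selection and regeneration can be illustrated with a toy loop. Everything below is an assumed simplification: one scalar stream, a shared tied variance, and a moving average standing in for the actual MLPG regeneration step:

```python
import numpy as np

def generate_with_selection(cand_means, iters=5):
    """Toy sketch of iterative generation with explicit model selection.
    cand_means: (T, M) candidate rich-context means per frame (a shared,
    tied variance is assumed). Alternates between (1) selecting, per frame,
    the candidate closest to the current trajectory and (2) regenerating a
    smooth trajectory from the selected means (crude stand-in for MLPG)."""
    y = cand_means.mean(axis=1)  # initialize from the tied (averaged) model
    for _ in range(iters):
        # Explicit model selection: max likelihood under the shared variance
        sel = np.argmin((cand_means - y[:, None]) ** 2, axis=1)
        mu = cand_means[np.arange(cand_means.shape[0]), sel]
        # Regenerate a smooth trajectory from the selected means
        y = np.convolve(mu, np.ones(3) / 3.0, mode="same")
    return y, sel
```

The point of the loop is that the final trajectory follows one concrete (less-averaged) model per frame instead of the tied mean, which is what alleviates the averaging effect.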
Slide 19/46
Discussion
• Initialization of the parameter generation (Sec. 3.5)
– Uses speech parameters from over-trained statistics
– → Avoids averaging through the initialization, and alleviates over-training through the parameter generation
• Comparison to unit selection synthesis (Sec. 2.2)
– The model selection corresponds to waveform segment selection
– → Integrates unit selection into the statistical modeling
• Comparison to conventional hybrid methods
– Voice control methods remain applicable, e.g., [Yamagishi et al., 2007]
– → Better flexibility than [Yan et al., 2009][Ling et al., 2007] (Sec. 2.8)
Slide 20/46
Subjective evaluation (preference test on speech quality)
[Bar charts: preference scores from 0.0 to 1.0 with 95% confidence intervals. HMM-based TTS: conditions pair the spectrum and F0 from H, R, or T; GMM-based VC: conditions G, R, and T. Legend: H/G = HMM/GMM (tied model), R = R-GMM, T = target ("R" using reference).]
Slide 21/46
Modulation spectrum-based post-filter
[Outline diagram: Analysis → Modeling → Generation → Synthesis (Chapter 2), Chapter 3 (modeling of individual speech segments); this part (Chapter 4) addresses the modulation spectrum for over-smoothing, continued in Chapter 5.]
Slide 26/46
Modulation Spectrum (MS) definition
• MS: the power spectrum of the parameter sequence
– Represents temporal fluctuation [Atlas et al., 2003]
– Used as segment features in speech recognition [Thomas et al., 2009]
– Captures speech intelligibility [Drullman et al., 1994]
[Figure: from a speech-parameter sequence over time, the 2nd moment yields the GV (a scalar) and the DFT & power yield the MS (a vector). DFT: Discrete Fourier Transform.]
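Both statistics take only a few lines to compute. A minimal sketch (function name is illustrative), which also makes the "sum of MSs = GV" identity concrete: by Parseval's theorem, the MS components of a mean-removed length-T sequence sum to T² times the GV under NumPy's unnormalized DFT:

```python
import numpy as np

def gv_and_ms(x):
    """GV: temporal 2nd moment of the parameter sequence (a scalar).
    MS: DFT power of the mean-removed sequence (a vector over
    modulation frequencies)."""
    gv = np.var(x)                              # global variance over time
    ms = np.abs(np.fft.fft(x - x.mean())) ** 2  # power per modulation freq.
    return gv, ms
```

So the GV is a single aggregate of temporal fluctuation, while the MS resolves the same fluctuation per modulation frequency.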
Slide 28/46
Post-filtering process
• Post-filtering in the MS domain
– Linear conversion (interpolation) between two Gaussian distributions
[Diagram: training — HMMs and Gaussian statistics of the MS are estimated from the speech parameters of the training data; synthesis — the MS of the generated parameters is filtered toward the natural statistics.]
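A minimal sketch of such a filter, assuming (as one plausible form, not necessarily the thesis's exact formulation) that per-frequency log-MS values are mapped between a "generated" and a "natural" Gaussian and interpolated with a weight k:

```python
import numpy as np

def ms_postfilter(x, mu_nat, sig_nat, mu_gen, sig_gen, k=0.85):
    """Sketch of MS-domain post-filtering (assumed form). Maps the log
    modulation spectrum of trajectory x from the generated-speech Gaussian
    (mu_gen, sig_gen) toward the natural-speech one (mu_nat, sig_nat),
    interpolated by weight k in [0, 1]; k = 0 leaves x unchanged."""
    mean = x.mean()
    X = np.fft.rfft(x - mean)
    log_ms = np.log(np.abs(X) ** 2 + 1e-12)
    # Linear conversion between the two Gaussians, per modulation frequency
    mapped = mu_nat + sig_nat / sig_gen * (log_ms - mu_gen)
    filtered = k * mapped + (1.0 - k) * log_ms
    gain = np.exp(0.5 * (filtered - log_ms))    # amplitude-domain scaling
    return np.fft.irfft(X * gain, n=len(x)) + mean
```

Because only the per-frequency magnitudes are rescaled (phases are kept), the filtered trajectory stays time-aligned with the input while regaining natural-speech fluctuation.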
Slide 29/46
Filtered speech parameter sequence
[Figure: spectral-parameter trajectories over time for natural speech, HMM, HMM+GV, and HMM → post-filter.]
Post-filtering generates fluctuating speech parameters!
Slide 30/46
Discussion 1: What is the MS?
[Figure: the Fourier transform decomposes a speech-parameter trajectory over time into components at modulation frequencies 1 … D_s; the GV is the temporal power, the MS is the power per modulation frequency, and the sum of the MS components equals the GV.]
Slide 31/46
Discussion 2
• Why a post-filter?
– Independent of the original speech synthesis process
– → High portability and high quality
• Further applications
– Spectrum, F0 (discontinuous), and duration (not an actual acoustic parameter)
– Segment-level filtering (faster processing)
• Advantages over conventional post-filters
– Automatic design/tuning [Eyben et al., 2014][Yoshimura et al., 1999]
Slide 32/46
Subjective evaluation (preference test on speech quality)
[Bar charts: preference scores from 0.0 to 1.0 for the spectrum in HMM-based TTS (HMM, HMM+GV, post-filtering) and in GMM-based VC (GMM, GMM+GV, post-filtering).]
Slide 33/46
Speech synthesis integrating modulation spectrum
[Outline diagram: Analysis → Modeling → Generation → Synthesis (Chapter 2), Chapter 3 (modeling of individual speech segments), Chapter 4 (MS-based post-filter); this part (Chapter 5) integrates the modulation spectrum into synthesis itself.]
Slide 34/46
Problems of the MS-based post-filter
• MS-based post-filter
– An external process for MS emphasis
– → Causes over-emphasis that ignores the speech synthesis criteria
– → Difficult to utilize the flexibility that HMMs/GMMs have
• Approaches: joint optimization using HMMs/GMMs and the MS
– Integrates the MS statistics as one of the acoustic models
– Speech parameter generation with the MS … high quality
– Acoustic model training with the MS … high quality and fast
Slide 35/46
Speech parameter generation considering the MS
• ML generation with the MS constraint:
  ŷ = argmax_y p(y, Δy | X) p(s(y))^ω
  where s(y): the MS (= power spectrum) of y, and ω: the weight of the MS term
  p(y, Δy | X) = N([y, Δy]; E, D),  p(s(y)) = N(s(y); μ_s, Σ_s)
[Figure: the natural MS over modulation frequency. Note that s(y) is a quadratic function of y, so the combined objective is no longer quadratic in y.]
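Since s(y) is quadratic in y, the combined objective has no closed-form maximizer, and an iterative scheme such as gradient ascent is a natural fit. The sketch below is an illustrative simplification (statics only, diagonal precisions, DFT normalization factors folded into the step size, all names assumed), not the thesis's exact algorithm:

```python
import numpy as np

def generate_with_ms(E, dinv, mu_ms, prec_ms, w=1e-4, iters=100, lr=1e-3):
    """Gradient-ascent sketch of ML generation with an MS constraint:
    maximize log N(c; E, diag(1/dinv)) + w * log N(MS(c); mu_ms, diag(1/prec_ms))
    over the static trajectory c. E, dinv: per-frame means and precisions;
    mu_ms, prec_ms: Gaussian MS statistics over rfft bins."""
    c = E.copy()                      # initialize from the basic ML solution
    for _ in range(iters):
        grad_hmm = -dinv * (c - E)    # gradient of the HMM (Gaussian) term
        X = np.fft.rfft(c - c.mean())
        ms = np.abs(X) ** 2           # current modulation spectrum
        dL_dms = -prec_ms * (ms - mu_ms)
        # Chain rule through the DFT (constant factors folded into lr)
        grad_ms = np.fft.irfft(2.0 * X * dL_dms, n=len(c))
        c = c + lr * (grad_hmm + w * grad_ms)
    return c
```

With w = 0 this reduces to the basic ML solution; increasing w trades closeness to the HMM means for a more natural MS, which is exactly the balance the constraint is meant to strike.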
Slide 36/46
Discussion (comparison to the MS-based post-filter)
• Initialization
– Basic ML generation ("HMM") → MS-based post-filter
– → Partial optimization by the initialization, joint optimization by the iteration
[Figure: spectral-parameter trajectories over time for HMM, HMM → post-filter, and HMM+MS.]
Slide 37/46
Effect in the MS
[Figure: modulation spectrum vs. modulation frequency for natural speech, HMM, HMM+GV, and HMM+MS.]
The proposed generation algorithm fills the MS gap!
Slide 38/46
Effect in the GV
[Figure: log GV per speech-parameter index for natural speech, HMM, HMM+GV, and HMM+MS.]
Recovers the GV without explicitly considering the GV!
Slide 40/46
Problems of parameter generation and MS-constrained training
• Speech parameter generation considering the MS
– An iterative process at synthesis time
– → Computationally inefficient speech synthesis
• Acoustic model training constrained by the MS
– Trains the HMM/GMM parameters λ so that the generated parameters ŷ have a natural MS:
  λ̂ = argmax_λ p(y | X) p(s(y))^ω
  p(y | X) = N(y; ŷ, G): trajectory likelihood (Sec. 2.8), which minimizes the difference between y and ŷ
  p(s(y)) = N(s(y); s(ŷ), G_s): MS likelihood, which minimizes the difference between s(y) and s(ŷ)
Slide 41/46
Trained HMM parameters
[Figure: delta-feature trajectories over time under basic training (Sec. 2.4–5), trajectory training (Sec. 2.8), and MS-constrained training.]
Updates the HMM/GMM parameters to generate fluctuating parameters!
Slide 42/46
Discussion
• Computational efficiency in parameter generation
– The basic generation algorithm (Sec. 2.6) can be used without the MS
– → Not only high-quality but also computationally efficient
• Which gives better quality, the proposed parameter generation or the training?
– The structure of the HMMs/GMMs limits how well the MS can be recovered
– → The parameter generation considering the MS is better

| Method      | Portability                    | Quality                          | Computation time |
| Post-filter | Best (no dependency on models) | Better                           | Better (120 ms)  |
| Param. gen. | Better                         | Best (optimization in synthesis) | Worse (~1 min)   |
| Training    | Worse                          | Better                           | Best (5 ms)      |
Slide 43/46
Subjective evaluation (preference test on speech quality)
[Bar charts: preference scores from 0.0 to 1.0 for HMM-based TTS (HMM, TRJ, GV, MS-TRJ) and GMM-based VC (GMM, TRJ, GV, MS-TRJ).]
HMM/GMM: basic HMM/GMM training (Sec. 2.4–5)
TRJ: trajectory HMM training (Sec. 2.8)
GV: GV-constrained training (Sec. 2.9)
MS-TRJ: MS-constrained trajectory training
Slide 45/46
Conclusion
• Problem in this thesis
– Quality degradation of synthetic speech, caused by parameterization error, insufficient modeling, and over-smoothing
• Chapter 3: statistical sample-based speech synthesis
– Addresses the insufficiency of the acoustic modeling
– Models the individual speech parameters with rich context models
• Chapters 4 & 5: approaches using the Modulation Spectrum (MS)
– Address the over-smoothing in the parameter generation
– 1. MS-based post-filter: high portability
– 2. Parameter generation with the MS: highest quality
– 3. MS-constrained training: computationally efficient generation
Slide 46/46
Future work
• Improvements to rich context modeling
– Quality degradation remains even when the best models are selected (Sec. A.5)
• Theoretical analysis of the MS
– Why does the MS improve speech quality?
• MS for DNN-based speech synthesis
– More flexible structures for integrating the MS
• GPU implementation of the proposed methods
– Rich-context-model selection & parameter generation with the MS