© 2015 Shinnosuke Takamichi
12/22/2015
Acoustic modeling and speech parameter generation
for high-quality statistical parametric speech synthesis
Nara Institute of Science and Technology
Shinnosuke Takamichi
Ph.D. defense
Research target
(Diagram: speech as the research target.)
Speech synthesis and its benefits
• Speech synthesis: a method to synthesize speech by computer
– Text-To-Speech (TTS) [Sagisaka et al., 1988]
– Voice Conversion (VC) [Stylianou et al., 1988]
• What is required?
– Flexible control of voice, beyond the ability of a single human
– High-quality speech generation, like a human
(Diagram: text → TTS → speech; speech → VC → speech.)
Statistical parametric speech synthesis
• Statistical parametric speech synthesis [Zen et al., 2009]
– Statistical modeling of the relationship between input and output
– Better flexibility than unit selection synthesis [Iwahashi et al., 1993]
• HMM-based TTS & GMM-based VC* [Tokuda et al., 2013] [Toda et al., 2007]
– Mathematical support for the flexibility
– Applications from other research areas
– But…
*HMM: Hidden Markov Model, GMM: Gaussian Mixture Model
Natural speech vs. synthetic speech in speech quality
(Audio comparison: natural speech spoken by a human vs. synthetic speech from HMM-based TTS & GMM-based VC.)
Why?
Problem definition and rest of this talk
(Diagram: text → text analysis → acoustic modeling → speech parameter generation → waveform synthesis → speech.
Error sources: parameterization error (in text/speech analysis and waveform synthesis), insufficient modeling (in acoustic modeling), and over-smoothing (in speech parameter generation).
Approaches in this thesis: modeling of individual speech segments (Chapter 3); modulation spectrum for over-smoothing (Chapters 4 and 5); background in Chapter 2.)
Speech synthesis
(Chapter overview: analysis → modeling → generation → synthesis. Chapter 2 covers the pipeline; Chapter 3 models individual speech segments; Chapters 4 and 5 use the modulation spectrum against over-smoothing.)
2 approaches to speech synthesis
• Unit selection synthesis [Iwahashi et al., 1993]
– High quality but low flexibility
– (Text → segment selection from a pre-recorded speech database → synthetic speech)
• Statistical parametric speech synthesis [Zen et al., 2009]
– High flexibility but low quality
– (Text → text analysis → acoustic modeling → parameter generation → waveform synthesis)
Text/speech analysis and waveform synthesis
• Text analysis (e.g., [Sagisaka et al., 1990])
– Sentence (e.g., "あらゆる現実を…") → accent phrases → phonemes /a r a y u r u g e N j i ts u o/
• Speech analysis (e.g., [Kawahara et al., 1999])
– Fourier transform & power → power spectrum over frequency
– Envelope = spectral parameters
– Periodicity in detail = pitch (F0)
Acoustic modeling in HMM-based TTS
• ML training of the HMM parameter set λ [Zen et al., 2007]:
  λ̂ = argmax_λ P(Y | X, λ)
– "Hello" → text analysis → context labels X (e.g., sil-h+e, h-e+l, e-l+o)
– Speech → speech analysis → speech features Y over time
– Each context-tied Gaussian distribution N(·; μ, Σ) is shared across contexts (e.g., e-l+o, a-l+o, o-l+o)
Acoustic modeling in GMM-based VC
• ML training of the GMM parameter set λ [Stylianou et al., 1988]:
  λ̂ = argmax_λ P(Y_t, X_t | λ)
– Source speech → speech analysis → features X_t; target speech → speech analysis → features Y_t
– The GMM N(·; μ, Σ) models the joint vector [X_t, Y_t] at each time t
Probability to generate features in HMM-based TTS
• Probability to generate the synthetic speech features Y [Tokuda et al., 2000]:
  P(Y | X, q, λ) = N(Y; E_q, D_q)
– q: HMM state sequence obtained by text analysis (e.g., "Hello": "h" … "o") with HMM parameter set λ
– E_q = [μ_1, μ_2, …, μ_t, …, μ_T]: mean vector sequence
– D_q⁻¹ = diag(Σ_1⁻¹, Σ_2⁻¹, …, Σ_t⁻¹, …, Σ_T⁻¹): covariance matrix sequence
Probability to generate features in GMM-based VC
• Probability to generate the synthetic speech features Y [Toda et al., 2007]:
  P(Y | X, q, λ) = N(Y; E_q, D_q)
– q: mixture component sequence given the source features X from speech analysis, with GMM parameter set λ
– E_q = [μ_1, …, μ_T]: mean vector sequence; D_q⁻¹ = diag(Σ_1⁻¹, …, Σ_T⁻¹): covariance matrix sequence
Speech parameter generation
• ML generation of synthetic speech parameters ŷ_q [Tokuda et al., 2000]
– Computationally efficient generation (solved in closed form):
  ŷ_q = argmax_y P(Y | X, q, λ) = argmax_y P(y, Δy | X, q, λ)
– The static features y and their temporal deltas Δy are generated jointly from the mean and variance sequences.
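The closed-form solution above can be sketched as follows, assuming a single 1-D parameter stream, unit-variance Gaussians, and a simple central-difference delta window (all illustrative choices, not the thesis's exact configuration):

```python
# Minimal MLPG sketch: solve argmax_y N(Wy; E, D) in closed form,
# where W stacks static and delta (central-difference) windows.
import numpy as np

def mlpg(means, variances):
    """means, variances: (T, 2) arrays of [static, delta] stats for a 1-D parameter."""
    T = means.shape[0]
    # W maps the static trajectory y (T,) to stacked [static; delta] features (2T,)
    W = np.zeros((2 * T, T))
    for t in range(T):
        W[2 * t, t] = 1.0                      # static row: y_t
        if 0 < t < T - 1:                      # delta row: (y_{t+1} - y_{t-1}) / 2
            W[2 * t + 1, t - 1] = -0.5
            W[2 * t + 1, t + 1] = 0.5
    mu = means.reshape(-1)
    prec = 1.0 / variances.reshape(-1)         # diagonal precision D^-1
    A = W.T @ (prec[:, None] * W)              # W' D^-1 W
    b = W.T @ (prec * mu)                      # W' D^-1 E
    return np.linalg.solve(A, b)               # closed-form ML trajectory

# Toy usage: zero-mean deltas smooth the peaky static means
means = np.stack([np.array([0., 1., 2., 1., 0.]), np.zeros(5)], axis=1)
y = mlpg(means, np.ones((5, 2)))
```

The delta constraint pulls the generated trajectory toward a smoother curve than the raw static means, which is exactly the over-smoothing discussed in later chapters.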
Statistical sample-based speech synthesis
(Chapter overview: analysis → modeling → generation → synthesis; this part (Chapter 3) models individual speech segments.)
Quality degradation by acoustic modeling
• Averaging across input features
– A context-tied Gaussian N(·; μ, Σ) in HMM-based TTS is shared across contexts (e.g., e-l+o, a-l+o, o-l+o)
– → Robust to unseen contexts
– → But loses information of the individual speech parameters
• Proposed approach
– Models individual speech parameters while keeping robustness
– Selects one model during parameter generation
– → Alleviates the quality degradation caused by averaging
Acoustic modeling of the proposed method
• From the tied model N(·; μ, Σ) to the rich context-GMM (R-GMM)
– Rich context models [Yan et al., 2009]: less-averaged models that retain robustness; the mean is updated per context (e.g., e-l+o, a-l+o, o-l+o) while the covariance stays tied
– R-GMM: gathers the rich context models with equal mixture weights, so it keeps the same form as the conventional tied model
Speech parameter generation from R-GMMs
• ML generation of synthetic speech parameters ŷ_q*
– Iterative generation with explicit model selection:
  ŷ_q = argmax_{y,m} P(y, Δy | m, X) P(m | y, Δy, X)
– m: selected model; the mean ± variance of the static feature over time is sharper for the R-GMM than for the tied model
*λ (the HMM/GMM parameter set) is omitted.
Discussion
• Initialization of the parameter generation (Sec. 3.5)
– Uses speech parameters from over-trained statistics
– → Avoids averaging via the initialization, and alleviates over-training via the parameter generation
• Comparison to unit selection synthesis (Sec. 2.2)
– The model selection corresponds to waveform segment selection
– → Integrates unit selection into the statistical modeling
• Comparison to conventional hybrid methods
– Voice control methods remain applicable, e.g., [Yamagishi et al., 2007]
– → Better flexibility than [Yan et al., 2009] [Ling et al., 2007] (Sec. 2.8)
Subjective evaluation (preference test on speech quality)
(Bar charts of preference scores with 95% confidence intervals.
HMM-based TTS: spectrum/F0 combinations H/H, H/R, R/H, R/R, T/T.
GMM-based VC: G, R, T.
H/G: HMM/GMM (= tied model), R: R-GMM, T: target ("R" using reference).)
Modulation spectrum-based post-filter
(Chapter overview: this part (Chapter 4) addresses over-smoothing in parameter generation with the modulation spectrum.)
Over-smoothing in parameter generation
(Plots: natural speech parameters over time vs. over-smoothed synthetic speech parameters produced by acoustic modeling and speech parameter generation.)
Revisiting speech parameter generation (Sec. 2.6)
• ML generation of synthetic speech parameters ŷ_q* [Tokuda et al., 2000]:
  ŷ_q = argmax_y P(y, Δy | X)
– X: input features; the generated spectral-parameter trajectory ("HMM") is smoother than the natural one
*λ (the HMM/GMM parameter set) is omitted.
Global Variance (GV) and parameter generation w/ GV
• ML generation with the GV constraint [Toda et al., 2007]:
  ŷ_q = argmax_y P(y, Δy | X) P(v(y))^ω
– v(y): GV (= 2nd moment), ω: weight of the GV term
– "HMM+GV" recovers the dynamic range of the natural spectral-parameter trajectory, but something is still different between them… → What is it?
Modulation Spectrum (MS) definition
• MS: power spectrum of the parameter sequence
– Represents temporal fluctuation [Atlas et al., 2003]
– Used as segment features in speech recognition [Thomas et al., 2009]
– Captures speech intelligibility [Drullman et al., 1994]
– GV (scalar): 2nd moment of the sequence; MS (vector): DFT & power of the sequence
(DFT: Discrete Fourier Transform)
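Under these definitions, the GV and MS of a 1-D parameter trajectory can be computed as follows (a sketch; the DFT normalization is this example's own convention):

```python
# GV = 2nd central moment over time (one scalar per parameter dimension);
# MS = DFT power of the trajectory (one value per modulation frequency).
import numpy as np

def global_variance(y):
    return np.mean((y - y.mean()) ** 2)

def modulation_spectrum(y):
    return np.abs(np.fft.rfft(y)) ** 2

# Toy trajectory: a slow oscillation (period 16 frames) plus noise
np.random.seed(0)
t = np.arange(64)
y = np.sin(2 * np.pi * t / 16) + 0.3 * np.random.randn(64)

gv = global_variance(y)       # temporal power (scalar)
ms = modulation_spectrum(y)   # fluctuation per modulation frequency (vector)
```

The MS resolves *where* in modulation frequency the fluctuation lives (here, a peak at the oscillation's frequency bin), whereas the GV only gives the total amount.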
Example of the MS
(Plot: modulation spectrum vs. modulation frequency for natural, HMM, and HMM+GV.
Speech quality will be improved by filling this gap to the natural MS!)
Post-filtering process
• Post-filtering in the MS domain
– Training: Gaussian statistics of the MS are estimated from the training data (speech parameters → MS), alongside the HMMs
– Synthesis: linear conversion (interpolation) between the 2 Gaussian distributions (synthetic and natural MS statistics)
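One plausible form of such a filter is sketched below: mean/variance matching of the log-MS toward the natural Gaussian statistics, with interpolation weight k, keeping the DFT phase. The exact mapping, statistics, and normalization here are assumptions for illustration, not the thesis's formulation:

```python
# Sketch of an MS-domain post-filter using assumed Gaussian log-MS statistics
# (gen_* for generated trajectories, nat_* for natural ones).
import numpy as np

def ms_postfilter(y, gen_mean, gen_std, nat_mean, nat_std, k=1.0):
    Y = np.fft.rfft(y)
    log_ms = np.log(np.abs(Y) ** 2 + 1e-12)
    # Interpolate toward the natural log-MS statistics with weight k in [0, 1]
    filt = (1 - k) * log_ms + k * (nat_mean + nat_std / gen_std * (log_ms - gen_mean))
    gain = np.exp(0.5 * (filt - log_ms))   # amplitude scaling per modulation frequency
    return np.fft.irfft(gain * Y, n=len(y))

# Toy usage: boost an over-smoothed sine toward a higher "natural" MS level
# (the statistics below are made-up scalars, not trained values)
t = np.arange(64)
y = 0.2 * np.sin(2 * np.pi * t / 8)
z = ms_postfilter(y, gen_mean=-2.0, gen_std=1.0, nat_mean=1.0, nat_std=1.0)
```

Because the filter touches only the MS (magnitudes), it leaves the phase, and hence the coarse shape of the trajectory, intact while restoring fluctuation.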
Filtered speech parameter sequence
(Plot: spectral-parameter trajectories over time for HMM, HMM+GV, natural, and HMM → post-filter.
The post-filtering generates fluctuating speech parameters!)
Discussion 1: What is the MS?
• The Fourier transform decomposes the speech parameter sequence into components at modulation frequencies 1, 2, …, D_s
– GV: temporal power (one scalar)
– MS: power per modulation frequency (a vector)
– Sum of the MSs = GV
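The claim "sum of MSs = GV" is Parseval's theorem: the variance over time equals the DFT power summed over the non-zero modulation frequencies, up to a scaling. A quick numerical check (the 1/T² normalization is this sketch's convention):

```python
# Parseval check: GV (variance over time) equals the summed DFT power
# over non-DC bins, divided by T^2 under NumPy's unnormalized FFT.
import numpy as np

T = 128
rng = np.random.default_rng(0)
y = rng.standard_normal(T)

gv = np.mean((y - y.mean()) ** 2)
ms_full = np.abs(np.fft.fft(y)) ** 2   # full DFT power spectrum
gv_from_ms = ms_full[1:].sum() / T ** 2  # drop the DC bin (the mean), rescale
```

The DC bin carries only the mean, which the GV subtracts out; every other bin contributes its share of the total fluctuation.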
Discussion 2
• Why a post-filter?
– Independent of the original speech synthesis process
– → High portability and high quality
• Further applications
– For spectrum, F0 (non-continuous), and duration (not an actual acoustic parameter)
– Segment-level filter (faster processing)
• Advantages over conventional post-filters
– Automatic design/tuning [Eyben et al., 2014] [Yoshimura et al., 1999]
Subjective evaluation (preference test on speech quality)
(Bar charts of preference scores: spectrum in HMM-based TTS (HMM, +GV, post-filtering) and spectrum in GMM-based VC (GMM, +GV, post-filtering).)
Speech synthesis integrating modulation spectrum
(Chapter overview: this part (Chapter 5) integrates the modulation spectrum into modeling and generation.)
Problems of the MS-based post-filter
• MS-based post-filter
– An external process for MS emphasis
– → Causes over-emphasis, ignoring the speech synthesis criteria
– → Difficult to utilize the flexibility that HMMs/GMMs have
• Approach: joint optimization using the HMM/GMM and MS criteria
– Integrate the MS statistics as one of the acoustic models
– Speech parameter generation with the MS … high quality
– Acoustic model training with the MS … high quality and fast
Speech parameter generation considering the MS
• ML generation with the MS constraint:
  ŷ_q = argmax_y P(y, Δy | X) P(s(y))^ω
– s(y): MS (= power spectrum), ω: weight of the MS term
– P(y, Δy | X) = N([y, Δy]; E_q, D_q); P(s(y)) = N(s(y); μ_s, Σ_s)
– s(y) is a quadratic function of y, so the generation is iterative
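A toy stand-in for this iterative generation is plain gradient ascent on the combined log-likelihood; the thesis derives an analytic update, and the numerical gradient, diagonal covariances, and all constants below are illustrative assumptions only:

```python
# Toy MS-constrained generation: gradient ascent on
#   log N(y; mu, sig2) + w * log N(s(y); ms_mu, ms_sig2)
import numpy as np

def ms(y):
    return np.abs(np.fft.rfft(y)) ** 2

def objective(y, mu, sig2, ms_mu, ms_sig2, w):
    return (-np.sum((y - mu) ** 2 / (2 * sig2))
            - w * np.sum((ms(y) - ms_mu) ** 2 / (2 * ms_sig2)))

def generate(mu, sig2, ms_mu, ms_sig2, w=1e-3, steps=300, lr=0.05, eps=1e-4):
    y = mu.copy()                      # initialize at the basic ML trajectory
    for _ in range(steps):
        grad = np.zeros_like(y)
        for i in range(len(y)):        # numerical gradient (fine at toy scale)
            d = np.zeros_like(y); d[i] = eps
            grad[i] = (objective(y + d, mu, sig2, ms_mu, ms_sig2, w)
                       - objective(y - d, mu, sig2, ms_mu, ms_sig2, w)) / (2 * eps)
        y += lr * grad
    return y

# Usage: pull an over-smoothed trajectory toward a "natural" MS target
t = np.arange(16)
mu = 0.1 * np.sin(2 * np.pi * t / 8)   # over-smoothed means
nat = np.sin(2 * np.pi * t / 8)        # natural counterpart
y = generate(mu, np.ones(16), ms(nat), np.ones(9), w=1e-3)
```

The MS term pushes the trajectory's fluctuation toward the natural level, while the HMM term keeps it near the original means, matching the trade-off the slide describes.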
Discussion (comparison to the MS-based post-filter)
• Initialization
– Basic ML generation ("HMM") → MS-based post-filter
– → Partial optimization by the initialization, then joint optimization by the iterations
(Plot: spectral-parameter trajectories over time for HMM, HMM → post-filter, and HMM+MS.)
Effect in the MS
(Plot: modulation spectrum vs. modulation frequency for HMM, HMM+GV, HMM+MS, and natural.
The proposed generation algorithm fills the gap!)
Effect in the GV
(Plot: log GV per speech-parameter index for HMM, HMM+GV, HMM+MS, and natural.
HMM+MS recovers the GV without considering the GV!)
Subjective evaluation (preference test on speech quality)
(Bar charts of preference scores: HMM-based TTS (HMM+GV vs. HMM+MS) and GMM-based VC (GMM+GV vs. GMM+MS).
*+GV: parameter generation w/ GV (Sec. 2.9); *+MS: parameter generation w/ MS.)
Problems of the parameter generation, and MS-constrained training
• Speech parameter generation considering the MS
– Iterative process at synthesis time
– → Computationally inefficient speech synthesis
• Acoustic model training constrained with the MS
– Trains the HMM/GMM parameters λ to generate parameters ŷ_q having a natural MS:
  λ̂ = argmax_λ P(y | X) P(s(y))^ω
– P(y | X) = N(y; ŷ_q, Σ): trajectory likelihood (Sec. 2.8) … minimizes the difference between y and ŷ_q
– P(s(y)) = N(s(y); s(ŷ_q), Σ_s): MS likelihood … minimizes the difference between s(y) and s(ŷ_q)
Trained HMM parameters
(Plot: delta-feature trajectories over time under basic training (Sec. 2.4-5), trajectory training (Sec. 2.8), and MS-constrained training.
MS-constrained training updates the HMM/GMM parameters to generate fluctuating parameters!)
Discussion
• Computational efficiency in parameter generation
– The basic generation algorithm (Sec. 2.6) can be used without the MS
– → Not only high quality but also computationally efficient
• Which gives better quality, the proposed parameter generation or the training?
– The structures of HMMs/GMMs limit how well the MS can be recovered
– → The parameter generation considering the MS is better

Method      | Portability                       | Quality                            | Computation time
Post-filter | Best! (no dependency on models)   | Better                             | Better (120 ms)
Param. gen. | Better                            | Best! (optimization in synthesis)  | Worse (1 min~)
Training    | Worse                             | Better                             | Best! (5 ms)
Subjective evaluation (preference test on speech quality)
(Bar charts of preference scores: HMM-based TTS (HMM, TRJ, GV, MS-TRJ) and GMM-based VC (GMM, TRJ, GV, MS-TRJ).
HMM/GMM: basic HMM/GMM training (Sec. 2.4-5); TRJ: trajectory HMM training (Sec. 2.8); GV: GV-constrained training (Sec. 2.9); MS-TRJ: MS-constrained trajectory training.)
Conclusion
• Problem in this thesis
– Quality degradation in synthetic speech, caused by parameterization error, insufficient modeling, and over-smoothing
• Chapter 3: statistical sample-based speech synthesis
– Addresses the insufficiency in the acoustic modeling
– Models the individual speech parameters with rich context models
• Chapters 4 & 5: approaches using the Modulation Spectrum (MS)
– Address the over-smoothing in the parameter generation
– 1. MS-based post-filter: high portability
– 2. Parameter generation w/ MS: highest quality
– 3. MS-constrained training: computationally efficient generation
Future work
• Improvements of rich context modeling
– Quality degradation remains even when the best models are selected (Sec. A.5)
• Theoretical analysis of the MS
– Why does the MS improve speech quality?
• MS for DNN-based speech synthesis
– More flexible structures to integrate the MS
• GPU implementation of the proposed methods
– Rich-context-model selection & parameter generation with the MS
ย 
ใ“ใ“ใพใงๆฅใŸ๏ผ†ใ“ใ‚Œใ‹ใ‚‰ๆฅใ‚‹้Ÿณๅฃฐๅˆๆˆ (ๆ˜Žๆฒปๅคงๅญฆ ๅ…ˆ็ซฏใƒกใƒ‡ใ‚ฃใ‚ขใ‚ณใƒญใ‚ญใ‚ฆใƒ )
ใ“ใ“ใพใงๆฅใŸ๏ผ†ใ“ใ‚Œใ‹ใ‚‰ๆฅใ‚‹้Ÿณๅฃฐๅˆๆˆ (ๆ˜Žๆฒปๅคงๅญฆ ๅ…ˆ็ซฏใƒกใƒ‡ใ‚ฃใ‚ขใ‚ณใƒญใ‚ญใ‚ฆใƒ )ใ“ใ“ใพใงๆฅใŸ๏ผ†ใ“ใ‚Œใ‹ใ‚‰ๆฅใ‚‹้Ÿณๅฃฐๅˆๆˆ (ๆ˜Žๆฒปๅคงๅญฆ ๅ…ˆ็ซฏใƒกใƒ‡ใ‚ฃใ‚ขใ‚ณใƒญใ‚ญใ‚ฆใƒ )
ใ“ใ“ใพใงๆฅใŸ๏ผ†ใ“ใ‚Œใ‹ใ‚‰ๆฅใ‚‹้Ÿณๅฃฐๅˆๆˆ (ๆ˜Žๆฒปๅคงๅญฆ ๅ…ˆ็ซฏใƒกใƒ‡ใ‚ฃใ‚ขใ‚ณใƒญใ‚ญใ‚ฆใƒ )
ย 
ๅ›ฝ้š›ไผš่ญฐ interspeech 2020 ๅ ฑๅ‘Š
ๅ›ฝ้š›ไผš่ญฐ interspeech 2020 ๅ ฑๅ‘Šๅ›ฝ้š›ไผš่ญฐ interspeech 2020 ๅ ฑๅ‘Š
ๅ›ฝ้š›ไผš่ญฐ interspeech 2020 ๅ ฑๅ‘Š
ย 
Interspeech 2020 ่ชญใฟไผš "Incremental Text to Speech for Neural Sequence-to-Sequ...
Interspeech 2020 ่ชญใฟไผš "Incremental Text to Speech for Neural  Sequence-to-Sequ...Interspeech 2020 ่ชญใฟไผš "Incremental Text to Speech for Neural  Sequence-to-Sequ...
Interspeech 2020 ่ชญใฟไผš "Incremental Text to Speech for Neural Sequence-to-Sequ...
ย 
ใ‚ตใƒ–ใƒใƒณใƒ‰ใƒ•ใ‚ฃใƒซใ‚ฟใƒชใƒณใ‚ฐใซๅŸบใฅใใƒชใ‚ขใƒซใ‚ฟใ‚คใƒ ๅบƒๅธฏๅŸŸDNNๅฃฐ่ณชๅค‰ๆ›ใฎๅฎŸ่ฃ…ใจ่ฉ•ไพก
ใ‚ตใƒ–ใƒใƒณใƒ‰ใƒ•ใ‚ฃใƒซใ‚ฟใƒชใƒณใ‚ฐใซๅŸบใฅใใƒชใ‚ขใƒซใ‚ฟใ‚คใƒ ๅบƒๅธฏๅŸŸDNNๅฃฐ่ณชๅค‰ๆ›ใฎๅฎŸ่ฃ…ใจ่ฉ•ไพกใ‚ตใƒ–ใƒใƒณใƒ‰ใƒ•ใ‚ฃใƒซใ‚ฟใƒชใƒณใ‚ฐใซๅŸบใฅใใƒชใ‚ขใƒซใ‚ฟใ‚คใƒ ๅบƒๅธฏๅŸŸDNNๅฃฐ่ณชๅค‰ๆ›ใฎๅฎŸ่ฃ…ใจ่ฉ•ไพก
ใ‚ตใƒ–ใƒใƒณใƒ‰ใƒ•ใ‚ฃใƒซใ‚ฟใƒชใƒณใ‚ฐใซๅŸบใฅใใƒชใ‚ขใƒซใ‚ฟใ‚คใƒ ๅบƒๅธฏๅŸŸDNNๅฃฐ่ณชๅค‰ๆ›ใฎๅฎŸ่ฃ…ใจ่ฉ•ไพก
ย 
P J S๏ผš ้Ÿณ็ด ใƒใƒฉใƒณใ‚นใ‚’่€ƒๆ…ฎใ—ใŸๆ—ฅๆœฌ่ชžๆญŒๅฃฐใ‚ณใƒผใƒ‘ใ‚น
P J S๏ผš ้Ÿณ็ด ใƒใƒฉใƒณใ‚นใ‚’่€ƒๆ…ฎใ—ใŸๆ—ฅๆœฌ่ชžๆญŒๅฃฐใ‚ณใƒผใƒ‘ใ‚นP J S๏ผš ้Ÿณ็ด ใƒใƒฉใƒณใ‚นใ‚’่€ƒๆ…ฎใ—ใŸๆ—ฅๆœฌ่ชžๆญŒๅฃฐใ‚ณใƒผใƒ‘ใ‚น
P J S๏ผš ้Ÿณ็ด ใƒใƒฉใƒณใ‚นใ‚’่€ƒๆ…ฎใ—ใŸๆ—ฅๆœฌ่ชžๆญŒๅฃฐใ‚ณใƒผใƒ‘ใ‚น
ย 
้Ÿณ้Ÿฟใƒขใƒ‡ใƒซๅฐคๅบฆใซๅŸบใฅใsubwordๅˆ†ๅ‰ฒใฎ้Ÿปๅพ‹ๆŽจๅฎš็ฒพๅบฆใซใŠใ‘ใ‚‹่ฉ•ไพก
้Ÿณ้Ÿฟใƒขใƒ‡ใƒซๅฐคๅบฆใซๅŸบใฅใsubwordๅˆ†ๅ‰ฒใฎ้Ÿปๅพ‹ๆŽจๅฎš็ฒพๅบฆใซใŠใ‘ใ‚‹่ฉ•ไพก้Ÿณ้Ÿฟใƒขใƒ‡ใƒซๅฐคๅบฆใซๅŸบใฅใsubwordๅˆ†ๅ‰ฒใฎ้Ÿปๅพ‹ๆŽจๅฎš็ฒพๅบฆใซใŠใ‘ใ‚‹่ฉ•ไพก
้Ÿณ้Ÿฟใƒขใƒ‡ใƒซๅฐคๅบฆใซๅŸบใฅใsubwordๅˆ†ๅ‰ฒใฎ้Ÿปๅพ‹ๆŽจๅฎš็ฒพๅบฆใซใŠใ‘ใ‚‹่ฉ•ไพก
ย 
้Ÿณๅฃฐๅˆๆˆ็ ”็ฉถใ‚’ๅŠ ้€Ÿใ•ใ›ใ‚‹ใŸใ‚ใฎใ‚ณใƒผใƒ‘ใ‚นใƒ‡ใ‚ถใ‚คใƒณ
้Ÿณๅฃฐๅˆๆˆ็ ”็ฉถใ‚’ๅŠ ้€Ÿใ•ใ›ใ‚‹ใŸใ‚ใฎใ‚ณใƒผใƒ‘ใ‚นใƒ‡ใ‚ถใ‚คใƒณ้Ÿณๅฃฐๅˆๆˆ็ ”็ฉถใ‚’ๅŠ ้€Ÿใ•ใ›ใ‚‹ใŸใ‚ใฎใ‚ณใƒผใƒ‘ใ‚นใƒ‡ใ‚ถใ‚คใƒณ
้Ÿณๅฃฐๅˆๆˆ็ ”็ฉถใ‚’ๅŠ ้€Ÿใ•ใ›ใ‚‹ใŸใ‚ใฎใ‚ณใƒผใƒ‘ใ‚นใƒ‡ใ‚ถใ‚คใƒณ
ย 
่ซ–ๆ–‡็ดนไป‹ Unsupervised training of neural mask-based beamforming
่ซ–ๆ–‡็ดนไป‹ Unsupervised training of neural  mask-based beamforming่ซ–ๆ–‡็ดนไป‹ Unsupervised training of neural  mask-based beamforming
่ซ–ๆ–‡็ดนไป‹ Unsupervised training of neural mask-based beamforming
ย 
่ซ–ๆ–‡็ดนไป‹ Building the Singapore English National Speech Corpus
่ซ–ๆ–‡็ดนไป‹ Building the Singapore English National Speech Corpus่ซ–ๆ–‡็ดนไป‹ Building the Singapore English National Speech Corpus
่ซ–ๆ–‡็ดนไป‹ Building the Singapore English National Speech Corpus
ย 
่ซ–ๆ–‡็ดนไป‹ SANTLR: Speech Annotation Toolkit for Low Resource Languages
่ซ–ๆ–‡็ดนไป‹ SANTLR: Speech Annotation Toolkit for Low Resource Languages่ซ–ๆ–‡็ดนไป‹ SANTLR: Speech Annotation Toolkit for Low Resource Languages
่ซ–ๆ–‡็ดนไป‹ SANTLR: Speech Annotation Toolkit for Low Resource Languages
ย 
่ฉฑ่€…V2Sๆ”ปๆ’ƒ๏ผš ่ฉฑ่€…่ช่จผใ‹ใ‚‰ๆง‹็ฏ‰ใ•ใ‚Œใ‚‹ ๅฃฐ่ณชๅค‰ๆ›ใจใใฎ้Ÿณๅฃฐใชใ‚Šใ™ใพใ—ๅฏ่ƒฝๆ€งใฎ่ฉ•ไพก
่ฉฑ่€…V2Sๆ”ปๆ’ƒ๏ผš ่ฉฑ่€…่ช่จผใ‹ใ‚‰ๆง‹็ฏ‰ใ•ใ‚Œใ‚‹ ๅฃฐ่ณชๅค‰ๆ›ใจใใฎ้Ÿณๅฃฐใชใ‚Šใ™ใพใ—ๅฏ่ƒฝๆ€งใฎ่ฉ•ไพก่ฉฑ่€…V2Sๆ”ปๆ’ƒ๏ผš ่ฉฑ่€…่ช่จผใ‹ใ‚‰ๆง‹็ฏ‰ใ•ใ‚Œใ‚‹ ๅฃฐ่ณชๅค‰ๆ›ใจใใฎ้Ÿณๅฃฐใชใ‚Šใ™ใพใ—ๅฏ่ƒฝๆ€งใฎ่ฉ•ไพก
่ฉฑ่€…V2Sๆ”ปๆ’ƒ๏ผš ่ฉฑ่€…่ช่จผใ‹ใ‚‰ๆง‹็ฏ‰ใ•ใ‚Œใ‚‹ ๅฃฐ่ณชๅค‰ๆ›ใจใใฎ้Ÿณๅฃฐใชใ‚Šใ™ใพใ—ๅฏ่ƒฝๆ€งใฎ่ฉ•ไพก
ย 
JVS๏ผšใƒ•ใƒชใƒผใฎๆ—ฅๆœฌ่ชžๅคšๆ•ฐ่ฉฑ่€…้Ÿณๅฃฐใ‚ณใƒผใƒ‘ใ‚น
JVS๏ผšใƒ•ใƒชใƒผใฎๆ—ฅๆœฌ่ชžๅคšๆ•ฐ่ฉฑ่€…้Ÿณๅฃฐใ‚ณใƒผใƒ‘ใ‚น JVS๏ผšใƒ•ใƒชใƒผใฎๆ—ฅๆœฌ่ชžๅคšๆ•ฐ่ฉฑ่€…้Ÿณๅฃฐใ‚ณใƒผใƒ‘ใ‚น
JVS๏ผšใƒ•ใƒชใƒผใฎๆ—ฅๆœฌ่ชžๅคšๆ•ฐ่ฉฑ่€…้Ÿณๅฃฐใ‚ณใƒผใƒ‘ใ‚น
ย 
ๅทฎๅˆ†ใ‚นใƒšใ‚ฏใƒˆใƒซๆณ•ใซๅŸบใฅใ DNN ๅฃฐ่ณชๅค‰ๆ›ใฎ่จˆ็ฎ—้‡ๅ‰Šๆธ›ใซๅ‘ใ‘ใŸใƒ•ใ‚ฃใƒซใ‚ฟๆŽจๅฎš
ๅทฎๅˆ†ใ‚นใƒšใ‚ฏใƒˆใƒซๆณ•ใซๅŸบใฅใ DNN ๅฃฐ่ณชๅค‰ๆ›ใฎ่จˆ็ฎ—้‡ๅ‰Šๆธ›ใซๅ‘ใ‘ใŸใƒ•ใ‚ฃใƒซใ‚ฟๆŽจๅฎšๅทฎๅˆ†ใ‚นใƒšใ‚ฏใƒˆใƒซๆณ•ใซๅŸบใฅใ DNN ๅฃฐ่ณชๅค‰ๆ›ใฎ่จˆ็ฎ—้‡ๅ‰Šๆธ›ใซๅ‘ใ‘ใŸใƒ•ใ‚ฃใƒซใ‚ฟๆŽจๅฎš
ๅทฎๅˆ†ใ‚นใƒšใ‚ฏใƒˆใƒซๆณ•ใซๅŸบใฅใ DNN ๅฃฐ่ณชๅค‰ๆ›ใฎ่จˆ็ฎ—้‡ๅ‰Šๆธ›ใซๅ‘ใ‘ใŸใƒ•ใ‚ฃใƒซใ‚ฟๆŽจๅฎš
ย 
้Ÿณๅฃฐๅˆๆˆใƒปๅค‰ๆ›ใฎๅ›ฝ้š›ใ‚ณใƒณใƒšใƒ†ใ‚ฃใ‚ทใƒงใƒณใธใฎ ๅ‚ๅŠ ใ‚’ๆŒฏใ‚Š่ฟ”ใฃใฆ
้Ÿณๅฃฐๅˆๆˆใƒปๅค‰ๆ›ใฎๅ›ฝ้š›ใ‚ณใƒณใƒšใƒ†ใ‚ฃใ‚ทใƒงใƒณใธใฎ  ๅ‚ๅŠ ใ‚’ๆŒฏใ‚Š่ฟ”ใฃใฆ้Ÿณๅฃฐๅˆๆˆใƒปๅค‰ๆ›ใฎๅ›ฝ้š›ใ‚ณใƒณใƒšใƒ†ใ‚ฃใ‚ทใƒงใƒณใธใฎ  ๅ‚ๅŠ ใ‚’ๆŒฏใ‚Š่ฟ”ใฃใฆ
้Ÿณๅฃฐๅˆๆˆใƒปๅค‰ๆ›ใฎๅ›ฝ้š›ใ‚ณใƒณใƒšใƒ†ใ‚ฃใ‚ทใƒงใƒณใธใฎ ๅ‚ๅŠ ใ‚’ๆŒฏใ‚Š่ฟ”ใฃใฆ
ย 
ใƒฆใƒผใ‚ถๆญŒๅ”ฑใฎใŸใ‚ใฎ generative moment matching network ใซๅŸบใฅใ neural double-tracking
ใƒฆใƒผใ‚ถๆญŒๅ”ฑใฎใŸใ‚ใฎ generative moment matching network ใซๅŸบใฅใ neural double-trackingใƒฆใƒผใ‚ถๆญŒๅ”ฑใฎใŸใ‚ใฎ generative moment matching network ใซๅŸบใฅใ neural double-tracking
ใƒฆใƒผใ‚ถๆญŒๅ”ฑใฎใŸใ‚ใฎ generative moment matching network ใซๅŸบใฅใ neural double-tracking
ย 

Recently uploaded

Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
Areesha Ahmad
ย 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sรฉrgio Sacani
ย 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
ย 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
ย 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
ย 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
ย 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
MohamedFarag457087
ย 

Recently uploaded (20)

Bacterial Identification and Classifications
Bacterial Identification and ClassificationsBacterial Identification and Classifications
Bacterial Identification and Classifications
ย 
Introduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptxIntroduction of DNA analysis in Forensic's .pptx
Introduction of DNA analysis in Forensic's .pptx
ย 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
ย 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
ย 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
ย 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
ย 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
ย 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
ย 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
ย 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
ย 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
ย 
Velocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.pptVelocity and Acceleration PowerPoint.ppt
Velocity and Acceleration PowerPoint.ppt
ย 
300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx300003-World Science Day For Peace And Development.pptx
300003-World Science Day For Peace And Development.pptx
ย 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
ย 
Use of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptxUse of mutants in understanding seedling development.pptx
Use of mutants in understanding seedling development.pptx
ย 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
ย 
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate ProfessorThyroid Physiology_Dr.E. Muralinath_ Associate Professor
Thyroid Physiology_Dr.E. Muralinath_ Associate Professor
ย 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
ย 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
ย 
Digital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptxDigital Dentistry.Digital Dentistryvv.pptx
Digital Dentistry.Digital Dentistryvv.pptx
ย 

Ph.D. defense (Shinnosuke Takamichi)

  • 1. 2015 © Shinnosuke TAKAMICHI. 12/22/2015. Acoustic modeling and speech parameter generation for high-quality statistical parametric speech synthesis（高音質な統計的パラメトリック音声合成のための音響モデリング法と音声パラメータ生成法）. Nara Institute of Science and Technology, Shinnosuke Takamichi. Ph.D. defense.
  • 3. Speech synthesis and its benefits. Speech synthesis: a method to synthesize speech by a computer – Text-To-Speech (TTS) [Sagisaka et al., 1988] and Voice Conversion (VC) [Stylianou et al., 1998]. What is required? Flexible control of voice beyond the ability of a single human, and high-quality speech generation like a human.
  • 4. Statistical parametric speech synthesis [Zen et al., 2009]: statistical modeling of the relationship between input and output, with better flexibility than unit selection synthesis [Iwahashi et al., 1993]. HMM-based TTS & GMM-based VC* [Tokuda et al., 2013][Toda et al., 2007]: mathematical support for that flexibility and applications from other research areas. But… (*HMM: Hidden Markov Model, GMM: Gaussian Mixture Model)
  • 5. Natural speech vs. synthetic speech in speech quality: natural speech spoken by a human vs. synthetic speech of HMM-based TTS & GMM-based VC. Why?
  • 6. Problem definition and rest of this talk. Pipeline: text analysis → speech analysis → acoustic modeling → speech parameter generation → waveform synthesis (Chapter 2). Quality is degraded by parameterization error (in analysis and waveform synthesis), insufficient modeling (in acoustic modeling), and over-smoothing (in parameter generation). Approaches in this thesis: modeling of individual speech segments (Chapter 3) and the modulation spectrum for over-smoothing (Chapters 4 and 5).
  • 7. Speech synthesis (Chapter 2): text → analysis → modeling → generation → synthesis. Thesis topics: modeling of individual speech segments (Chapter 3); modulation spectrum for over-smoothing (Chapters 4 and 5).
  • 8. Two approaches to speech synthesis. Unit selection synthesis [Iwahashi et al., 1993]: segment selection from a pre-recorded speech database; high quality but low flexibility. Statistical parametric speech synthesis [Zen et al., 2009]: text analysis → acoustic modeling → parameter generation → waveform synthesis; high flexibility but low quality.
  • 9. Text/speech analysis and waveform synthesis. Text analysis (e.g., [Sagisaka et al., 1990]): a sentence such as あらゆる現実を… is decomposed into accent phrases (with low/high accent labels) and a phoneme sequence (a-r-a-y-u-r-u g-e-N-j-i-ts-u-o). Speech analysis (e.g., [Kawahara et al., 1999]): Fourier transform and power; the spectral envelope gives the spectral parameters, and the fine periodic structure gives the pitch (F0).
  • 10. Acoustic modeling in HMM-based TTS [Zen et al., 2007]: ML training of the HMM parameter set λ, λ̂ = argmax_λ P(Y | X, λ). Text analysis of the input ("Hello") yields context labels X; speech analysis yields speech features Y over time. Context-dependent phones (sil-h+e, h-e+l, e-l+o, …) share context-tied Gaussian distributions N(·; μ, Σ) (e.g., e-l+o, a-l+o, and o-l+o tied together).
  • 11. Acoustic modeling in GMM-based VC [Stylianou et al., 1998]: ML training of the GMM parameter set λ, λ̂ = argmax_λ Π_t P(Y_t, X_t | λ). Source and target speech features X and Y come from speech analysis, and the joint vector [X_t; Y_t] at time t is modeled by GMM components N(·; μ, Σ).
  • 12. Probability to generate features in HMM-based TTS [Tokuda et al., 2000]: given context labels X from text analysis ("Hello"), the HMM parameter set λ, and a state sequence q (from "h" to "o"), the synthetic speech features Y follow P(Y | X, q, λ) = N(Y; E_q, D_q), where the mean vector E_q stacks the state means μ_1, μ_2, …, μ_T and the precision matrix D_q⁻¹ stacks the state precisions Σ_1⁻¹, Σ_2⁻¹, …, Σ_T⁻¹.
  • 13. Probability to generate features in GMM-based VC [Toda et al., 2007]: given source features X from speech analysis, the GMM parameter set λ, and a mixture sequence q, the synthetic speech features Y follow P(Y | X, q, λ) = N(Y; E_q, D_q), with mean vector E_q = [μ_1, μ_2, …, μ_T] and precision matrix D_q⁻¹ built from Σ_1⁻¹, Σ_2⁻¹, …, Σ_T⁻¹.
  • 14. Speech parameter generation [Tokuda et al., 2000]: ML generation of the synthetic speech parameters, ŷ_q = argmax_y P(Y | X, q, λ) = argmax_y P(y, Δy | X, q, λ), where Y consists of the static features y and their temporal deltas Δy, each with per-frame mean and variance. Computationally efficient: solved in closed form.
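The closed-form generation on this slide can be sketched numerically. The following is a minimal pure-Python toy (not the thesis code): per-frame Gaussians over the static feature y_t and a simple delta Δy_t = 0.5·(y_{t+1} − y_{t−1}) are stacked through a window matrix W, and the ML trajectory solves the normal equations (Wᵀ D⁻¹ W) y = Wᵀ D⁻¹ E. All means and variances below are made-up toy numbers.

```python
T = 5
mu_s  = [0.0, 1.0, 2.0, 1.0, 0.0]    # static means (toy)
mu_d  = [0.5, 0.5, 0.0, -0.5, -0.5]  # delta means (toy)
var_s = [0.1] * T                    # static variances (diagonal D_q)
var_d = [0.2] * T                    # delta variances

# W stacks one static row and one delta row per frame: 2T x T.
W = []
for t in range(T):
    row = [0.0] * T
    row[t] = 1.0
    W.append(row)                    # static row: picks y_t
    row = [0.0] * T
    if t > 0:     row[t - 1] = -0.5  # delta row: 0.5*(y_{t+1} - y_{t-1})
    if t < T - 1: row[t + 1] = 0.5
    W.append(row)

mu   = [v for pair in zip(mu_s, mu_d) for v in pair]          # E_q (2T)
prec = [1.0 / v for pair in zip(var_s, var_d) for v in pair]  # diag D_q^-1

# Normal equations A y = b with A = W' D^-1 W, b = W' D^-1 E.
A = [[sum(W[r][i] * prec[r] * W[r][j] for r in range(2 * T))
      for j in range(T)] for i in range(T)]
b = [sum(W[r][i] * prec[r] * mu[r] for r in range(2 * T)) for i in range(T)]

# Plain Gauss-Jordan elimination with partial pivoting.
M = [A[i][:] + [b[i]] for i in range(T)]
for col in range(T):
    piv = max(range(col, T), key=lambda r: abs(M[r][col]))
    M[col], M[piv] = M[piv], M[col]
    for r in range(T):
        if r != col:
            f = M[r][col] / M[col][col]
            M[r] = [a - f * c for a, c in zip(M[r], M[col])]
y = [M[i][T] / M[i][i] for i in range(T)]

# Sanity check: the solution satisfies the normal equations.
residual = max(abs(sum(A[i][j] * y[j] for j in range(T)) - b[i])
               for i in range(T))
```

In practice this solve exploits the band structure of Wᵀ D⁻¹ W, which is what makes the generation fast; the dense solve above is only for clarity.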
  • 15. Statistical sample-based speech synthesis (Chapter 3): analysis → modeling → generation → synthesis, focusing on modeling of individual speech segments; the modulation spectrum for over-smoothing follows in Chapters 4 and 5.
  • 16. Quality degradation by acoustic modeling: the context-tied Gaussian N(·; μ, Σ) in HMM-based TTS averages across input features (e.g., e-l+o, a-l+o, o-l+o). This makes it robust to unseen contexts, but it loses the information of individual speech parameters. Proposed approach: model individual speech parameters while keeping the robustness, and select one model during parameter generation; this alleviates the quality degradation caused by averaging.
  • 17. Acoustic modeling of the proposed method: from the tied model to the rich-context GMM (R-GMM). Rich context models [Yan et al., 2009] are less-averaged models that retain robustness: starting from a model with the same form as the conventional tied model, the means are updated per context while the covariance stays tied, and the resulting models are gathered into an R-GMM with equal mixture weights.
  • 18. Speech parameter generation from R-GMMs: ML generation of the synthetic speech parameters by iterative generation with explicit model selection*: ŷ_q = argmax_y P(y, Δy | m, X) P(m | y, Δy, X), alternating between generating the static trajectory from the selected models m (tied model vs. R-GMM, mean ± variance over time) and re-selecting the models given the trajectory. (*λ, the HMM/GMM parameter set, is omitted.)
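A toy sketch of the explicit model selection in the iteration above: because the covariance is tied across the rich context models, choosing the maximum-likelihood model m for a current parameter value reduces to a nearest-mean search under that shared variance. This is a one-dimensional illustration with made-up numbers, not the thesis implementation.

```python
import math

def select_model(y, means, var):
    """Return the index of the Gaussian N(mu_m, var) most likely for y."""
    def loglik(mu):
        return -0.5 * ((y - mu) ** 2 / var + math.log(2 * math.pi * var))
    return max(range(len(means)), key=lambda m: loglik(means[m]))

# Hypothetical rich-context candidate means (e-l+o / a-l+o / o-l+o style).
rich_means = [0.2, 1.5, 3.0]
best = select_model(1.4, rich_means, var=0.3)
```

With the covariance tied, the log-likelihoods differ only in the squared distance to each mean, which is why the selection stays cheap even with many rich context models.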
  • 19. Discussion. Initialization of the parameter generation (Sec. 3.5): uses speech parameters from over-trained statistics, so the initialization avoids averaging and the parameter generation alleviates the over-training. Comparison to unit selection synthesis (Sec. 2.2): the model selection corresponds to waveform segment selection, integrating unit selection into the statistical modeling. Comparison to conventional hybrid methods: voice-control methods (e.g., [Yamagishi et al., 2007]) remain applicable, giving better flexibility than [Yan et al., 2009][Ling et al., 2007] (Sec. 2.8).
  • 20. Subjective evaluation (preference tests on speech quality, with 95% confidence intervals) for HMM-based TTS (spectrum and F0) and GMM-based VC, comparing H/G (HMM/GMM tied model), R (R-GMM), and T (target: R-GMM using reference). Results shown as preference-score bar charts.
  • 21. Modulation spectrum-based post-filter (Chapter 4): analysis → modeling → generation → synthesis; the modulation spectrum addresses over-smoothing.
  • 22. Over-smoothing in parameter generation: after acoustic modeling and speech parameter generation, the synthetic speech parameter trajectory is much smoother over time than the natural one.
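The over-smoothing above can be seen numerically: averaging, which statistical modeling effectively performs, shrinks the temporal variance of a parameter trajectory (the global variance of Sec. 2.9). A toy trajectory and a 3-point moving average standing in for the "smoother":

```python
# Toy parameter trajectory (made-up values).
natural = [0.0, 2.0, -1.0, 3.0, 0.5, 2.5, -0.5, 1.5]

def moving_average(x):
    """3-point moving average (shorter windows at the edges)."""
    return [sum(x[max(0, t - 1):t + 2]) / len(x[max(0, t - 1):t + 2])
            for t in range(len(x))]

def gv(x):
    """Global variance: temporal variance around the trajectory mean."""
    m = sum(x) / len(x)
    return sum((v - m) ** 2 for v in x) / len(x)

smoothed = moving_average(natural)
```

The smoothed trajectory always ends up with a smaller GV than the original here, which is exactly the gap the GV- and MS-based methods in the following slides try to close.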
  • 23. Revisiting speech parameter generation (Sec. 2.6) [Tokuda et al., 2000]: ML generation of the synthetic parameters, ŷ_q = argmax_y P(y, Δy | X), where X are the input features (λ, the HMM/GMM parameter set, is omitted). The generated spectral-parameter trajectory ("HMM") is smoother over time than the natural one.
  • 24. Global variance (GV) and parameter generation with the GV [Toda et al., 2007]: ML generation with a GV constraint, ŷ_q = argmax_y P(y, Δy | X) P(v(y))^ω, where v(y) is the GV (the second moment of the trajectory) and ω is the weight of the GV term. The "HMM+GV" spectral-parameter trajectory fluctuates more than the plain "HMM" one.
  • 25. Something is still different between them... → What is it?
  • 26. Modulation spectrum (MS) definition: the MS is the power spectrum of the parameter sequence (DFT and power), a vector, whereas the GV (second moment) is a scalar. The MS represents temporal fluctuation [Atlas et al., 2003], serves as a segment feature in speech recognition [Thomas et al., 2009], and captures speech intelligibility [Drullman et al., 1994]. (DFT: Discrete Fourier Transform)
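The MS as defined on this slide, the power spectrum of a speech-parameter trajectory, can be sketched with a plain DFT. The 1/T² normalization below is our own choice so the per-frequency powers are comparable across sequence lengths; it is not dictated by the slide.

```python
import cmath

def modulation_spectrum(x):
    """Power |DFT(x)|^2 / T^2 at each modulation-frequency bin."""
    T = len(x)
    ms = []
    for k in range(T):
        X = sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / T)
                for t in range(T))
        ms.append(abs(X) ** 2 / T ** 2)
    return ms

# A constant trajectory has no temporal fluctuation, so all of its
# MS power sits in the DC bin.
ms_flat = modulation_spectrum([1.0, 1.0, 1.0, 1.0])
```

An over-smoothed trajectory loses power in the higher modulation-frequency bins, which is the gap visualized on the next slide.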
  • 27. Example of the MS: plotted against modulation frequency, the MS of natural speech stays above that of "HMM" and "HMM+GV", which fall off at higher modulation frequencies. Speech quality will be improved by filling this gap!
  • 28. Post-filtering process: in training, MS statistics (Gaussians) are estimated from the training-data speech parameters alongside the HMMs; in synthesis, the generated parameters are post-filtered in the MS domain by a linear conversion (interpolation) using the two Gaussian distributions.
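A minimal sketch of the MS-domain post-filter above: per modulation-frequency bin, the generated (log-)MS is mapped toward the natural Gaussian by mean/variance matching, interpolated with a weight k (k=0 leaves it untouched, k=1 fully matches the natural statistics). The interpolation-weight formulation and all statistics below are toy assumptions, not the thesis's exact parameterization.

```python
def ms_postfilter(s, mu_syn, sig_syn, mu_nat, sig_nat, k):
    """Linear conversion between two Gaussians in the (log-)MS domain."""
    out = []
    for i, v in enumerate(s):
        # Mean/variance matching: synthetic Gaussian -> natural Gaussian.
        matched = mu_nat[i] + (sig_nat[i] / sig_syn[i]) * (v - mu_syn[i])
        out.append((1.0 - k) * v + k * matched)
    return out

# Toy per-bin statistics of generated vs. natural log-MS.
mu_syn, sig_syn = [-4.0, -6.0], [0.5, 0.4]
mu_nat, sig_nat = [-3.0, -4.5], [0.8, 0.9]

# Feeding the synthetic means with k=1 yields the natural means.
filt = ms_postfilter(mu_syn, mu_syn, sig_syn, mu_nat, sig_nat, 1.0)
```

Because the filter touches only the MS statistics, it is agnostic to how the parameters were generated, which is the portability argument made on slide 31.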
  • 29. Filtered speech parameter sequence: compared with the "HMM", "HMM+GV", and natural trajectories, the "HMM → post-filter" trajectory recovers the temporal fluctuation. The post-filtering generates fluctuating speech parameters!
  • 30. Discussion 1: what is the MS? The GV measures temporal power, while the MS distributes that power over modulation frequencies 1, 2, …, D_s via the Fourier transform; the sum of the MS components equals the GV.
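The "sum of MS = GV" relation on this slide is just Parseval's theorem, and it can be checked numerically. We take the GV as the variance of the trajectory around its mean and MS_k = |DFT_k(x − mean)|² / T² (the normalization is our assumption, chosen so the identity holds exactly); the trajectory values are arbitrary toy numbers.

```python
import cmath

x = [0.3, 1.7, 2.2, 0.9, -0.4, 1.1]   # toy parameter trajectory
T = len(x)
mean = sum(x) / T
xc = [v - mean for v in x]             # mean-removed sequence

gv = sum(v * v for v in xc) / T        # temporal power (variance)

ms = [abs(sum(xc[t] * cmath.exp(-2j * cmath.pi * k * t / T)
              for t in range(T))) ** 2 / T ** 2
      for k in range(T)]               # per-frequency power
```

Parseval gives Σ_t xc_t² = (1/T) Σ_k |X_k|², so dividing both sides by T makes the GV equal the sum of the MS bins.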
  • 31. Discussion 2. Why a post-filter? It is independent of the original speech synthesis process, giving high portability and high quality. Further applications: the spectrum, F0 (a non-continuous parameter), and duration (not an acoustic parameter as such); a segment-level filter for faster processing. Advantage over conventional post-filters [Eyben et al., 2014][Yoshimura et al., 1999]: automatic design and tuning.
  • 32. Subjective evaluation (preference tests on speech quality) for the spectrum in HMM-based TTS and in GMM-based VC, comparing plain HMM/GMM, +GV, and the proposed post-filtering. Results shown as preference-score bar charts.
  • 33. Speech synthesis integrating the modulation spectrum (Chapter 5): analysis → modeling → generation → synthesis; the modulation spectrum addresses over-smoothing.
  • 34. Problems of the MS-based post-filter: it is an external process for MS emphasis, so it can over-emphasize while ignoring the speech synthesis criteria, and it is difficult to exploit the flexibility that HMMs/GMMs have. Approach: joint optimization using the HMMs/GMMs and the MS, integrating the MS statistics as one of the acoustic models; speech parameter generation with the MS (high quality) and acoustic model training with the MS (high quality and fast).
  • 35. Speech parameter generation considering the MS: ML generation with an MS constraint, ŷ_q = argmax_y P(y, Δy | X) P(s(y))^ω, where s(y) is the MS (the power spectrum of y, a quadratic function of y) and ω is the weight of the MS term; P(y, Δy | X) = N([y; Δy]; E_q, D_q) and P(s(y)) = N(s(y); μ_s, Σ_s) with natural-speech MS statistics μ_s, Σ_s.
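The objective above can be sketched on a toy problem: maximize log N(y; E, D) + ω·log N(s(y); μ_s, Σ_s) over the static trajectory y, where s(y) is the MS. We use a crude finite-difference gradient ascent for transparency; the thesis derives an analytic iterative update, deltas are ignored for brevity, and every number here is a toy assumption.

```python
import cmath

E = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]   # per-frame means (toy)
D = 0.5                               # shared per-frame variance
w = 0.2                               # weight of the MS term
T = len(E)

def ms(y):
    """MS: per-bin power |DFT|^2 / T^2 of the trajectory."""
    return [abs(sum(y[t] * cmath.exp(-2j * cmath.pi * k * t / T)
                    for t in range(T))) ** 2 / T ** 2 for k in range(T)]

# "Natural" MS target: same shape as E but with larger fluctuation.
mu_s  = ms([0.0, 1.4, 0.0, -1.4, 0.0, 1.4])
sig_s = 0.05

def objective(y):
    hmm = -sum((y[t] - E[t]) ** 2 for t in range(T)) / (2 * D)
    msl = -sum((a - b) ** 2 for a, b in zip(ms(y), mu_s)) / (2 * sig_s)
    return hmm + w * msl

y = E[:]                 # initialize from the plain ML solution
j0 = objective(y)
eps, step = 1e-5, 0.01
for _ in range(200):     # finite-difference gradient ascent
    grad = []
    for t in range(T):
        yp = y[:]
        yp[t] += eps
        grad.append((objective(yp) - objective(y)) / eps)
    y = [v + step * g for v, g in zip(y, grad)]
j1 = objective(y)
```

The MS term pulls the trajectory away from the over-smooth mean sequence toward natural-like fluctuation, trading off against the HMM likelihood exactly as the weighted product on the slide prescribes.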
  • 36. Discussion (comparison to the MS-based post-filter). Initialization: basic ML generation ("HMM") followed by the MS-based post-filter; the initialization gives a partial optimization, and the iteration then performs the joint optimization ("HMM+MS").
  • 37. Effect in the MS: over modulation frequency, "HMM+MS" fills the gap to the natural MS that "HMM" and "HMM+GV" leave. The proposed generation algorithm fills the gap!
  • 38. Effect in the GV: across the indices of the speech parameters, the log GV of "HMM+MS" approaches natural. The GV is recovered without explicitly considering the GV!
  • 39. Subjective evaluation (preference tests on speech quality) for HMM-based TTS and GMM-based VC, comparing +GV (parameter generation with the GV, Sec. 2.9) and +MS (parameter generation with the MS). Results shown as preference-score bar charts.
  • 40. Problems of the generation algorithm, and MS-constrained training. Speech parameter generation considering the MS is an iterative process at synthesis time, i.e., computationally inefficient. Acoustic model training constrained with the MS instead trains the HMM/GMM parameter set λ to generate parameters ŷ_q having a natural MS: λ̂ = argmax_λ P(y | X) P(s(y))^ω, where P(y | X) = N(y; ŷ_q, Σ) is the trajectory likelihood (Sec. 2.8), minimizing the difference between y and ŷ_q, and P(s(y)) = N(s(y); s(ŷ_q), Σ_s) is the MS likelihood, minimizing the difference between s(y) and s(ŷ_q).
  • 41. Trained HMM parameters: compared with basic training (Sec. 2.4-5) and trajectory training (Sec. 2.8), MS-constrained training updates the HMM/GMM parameters so that the generated delta features fluctuate!
  • 42. Discussion. Computational efficiency in parameter generation: the basic generation algorithm (Sec. 2.6) can be used without the MS, so the result is not only high-quality but also computationally efficient. Which gives better quality, the proposed parameter generation or the training? The structure of HMMs/GMMs limits how much of the MS can be recovered, so parameter generation considering the MS is better. Summary – post-filter: best portability (no dependency on models), better quality, better computation time (120 ms); parameter generation: better portability, best quality (optimization at synthesis), worse computation time (1 min or more); training: worse portability, better quality, best computation time (5 ms).
  • 43. Subjective evaluation (preference tests on speech quality) for HMM-based TTS and GMM-based VC, comparing HMM/GMM (basic training, Sec. 2.4-5), TRJ (trajectory HMM training, Sec. 2.8), GV (GV-constrained training, Sec. 2.9), and MS-TRJ (MS-constrained trajectory training). Results shown as preference-score bar charts.
  • 45. Conclusion. Problem in this thesis: quality degradation of synthetic speech caused by parameterization error, insufficient modeling, and over-smoothing. Chapter 3 (statistical parametric speech synthesis): addresses the insufficiency in the acoustic modeling by modeling individual speech parameters with rich context models. Chapters 4 & 5 (approaches using the modulation spectrum, MS): address the over-smoothing in parameter generation via (1) the MS-based post-filter (high portability), (2) parameter generation with the MS (highest quality), and (3) MS-constrained training (computationally efficient generation).
  • 46. Future work. Improvements of rich context modeling: quality still degrades even when the best models are selected (Sec. A.5). Theoretical analysis of the MS: why does the MS improve speech quality? MS for DNN-based speech synthesis: more flexible structures for integrating the MS. GPU implementation of the proposed methods: rich-context model selection and parameter generation with the MS.