AVATAR SYMBIOTIC SOCIETY
Adaptive End-to-End Text-to-Speech Synthesis Based on Error Correction Feedback from Humans
Kazuki Fujii, Yuki Saito, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 ThPM1-3.5
7 - 10 Nov. 2022・Chiang Mai, Thailand
1 / 17
Research background
2 / 17
• End-to-end (E2E) Text-to-Speech (TTS) [Wang+17]
・High quality but low controllability
・Difficulty in correcting accent errors
(a cause of miscommunication, especially in a pitch-accent language such as Japanese)
[Figure: E2E TTS model converting the text "a me" (phonemes "a m e") into speech, with a High/Low pitch axis]
Prosody modeling and control in E2E TTS is an important and challenging task.
Improving controllability using linguistic features
3 / 17
• Examples of linguistic features
・Full-context labels [Okamoto+19], [Okamoto+20]
・Phonetic and prosodic labels [Kurihara+21]
• High controllability, but prerequisite expertise is needed
“Adaptability”, the ability to easily correct mistakes, is not taken into consideration.
Input text → Text analysis → Linguistic features → E2E TTS → Synthesized speech
Outline of this talk
4 / 17
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrects accent errors.
・Our method successfully achieves the same quality of synthetic speech as the conventional method.
Overview of proposed TTS model
5 / 17
• Backbone TTS model: FastSpeech2 [Ren+21] with trainable prosody predictor
Backbone TTS model: FastSpeech2
6 / 17
• FastSpeech2 [Ren+21]
・Modules for generating a mel-spectrogram of speech from a phoneme sequence
・Stable learning and inference
[Figure: FastSpeech2 architecture, including the variance adaptor with duration/pitch/energy predictors]
Proposed method
7 / 17
• Prosody predictor: DNN to estimate pitch changes per syllable (mora)
• Input: Phoneme embedding + Word embedding from BERT [Devlin+19]
• Output: One of three prosodic symbols ("[" / "]": raising / lowering pitch, "_": keeping accent unchanged)
• Loss function: MSE between predicted symbols and ground truth (text analysis results)
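As a rough illustration of the three-symbol output, the sketch below derives one symbol per mora from a High/Low accent sequence and computes an MSE against one-hot targets. The mapping rule (each symbol describes the pitch change from the previous mora, and the first mora is always "_"), the one-hot encoding, and the names `accents_to_symbols` and `symbol_mse` are assumptions made for illustration; the slides do not specify these details.

```python
from typing import List

SYMBOLS = ["[", "]", "_"]  # raise pitch, lower pitch, keep (symbols from the slide)


def accents_to_symbols(accents: List[str]) -> List[str]:
    """Map per-mora "L"/"H" labels to one prosodic symbol per mora.

    Assumption: the symbol for mora i describes the pitch change from
    mora i-1 to mora i, and the first mora is always "_".
    """
    if not accents:
        return []
    symbols = ["_"]
    for prev, cur in zip(accents, accents[1:]):
        if prev == "L" and cur == "H":
            symbols.append("[")
        elif prev == "H" and cur == "L":
            symbols.append("]")
        else:
            symbols.append("_")
    return symbols


def one_hot(symbol: str) -> List[float]:
    """Encode a prosodic symbol as a 3-dimensional one-hot vector."""
    return [1.0 if symbol == s else 0.0 for s in SYMBOLS]


def symbol_mse(predicted: List[List[float]], target: List[str]) -> float:
    """Mean squared error between predicted scores and one-hot targets."""
    total, count = 0.0, 0
    for scores, symbol in zip(predicted, target):
        for p, t in zip(scores, one_hot(symbol)):
            total += (p - t) ** 2
            count += 1
    if count == 0:
        raise ValueError("no predictions to score")
    return total / count


target = accents_to_symbols(list("LHHH"))
print(target)
print(symbol_mse([one_hot(s) for s in target], target))
```

In a trained model the `predicted` rows would come from the DNN rather than from `one_hot`; the call above only checks that a perfect prediction yields zero loss.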
Overview of proposed HITL framework
8 / 17
• Motivation: Improve the adaptability of synthetic speech
• Approach: Involve multiple listeners in the process of correcting accent errors
[Figure: listeners correcting the accent of "a me" in the HITL loop]
HITL accent error correction: Feedback aggregation
9 / 17
• Challenge: Listeners differ in their error correction abilities
・Simple way: Choose one of the multiple accent annotations at random
(a bad accent may be selected, resulting in low-quality speech)
・Proposed: Aggregate the annotations in the following ways (the actual accent is Low (L), High (H), H, H)

                 Annotated accent   Estimated ability by MACE
Listener 1       L L L L            Low ability
Listener 2       L L L L            Low ability
Listener 3       L H H H            High ability
Mode             L L L L
MACE [Hovy+13]   L H H H
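The difference between the two aggregation rows can be sketched with the three annotations from the slide. `mode_aggregate` takes a per-mora majority vote, and `weighted_aggregate` weights each listener's vote by an ability score. This is not MACE itself: MACE estimates listener ability from the annotations, whereas the weights below are hypothetical values supplied by hand. Voting per mora rather than per whole sequence is also an assumption.

```python
from collections import Counter
from typing import Dict, List


def mode_aggregate(annotations: List[List[str]]) -> List[str]:
    """Per-mora majority vote over all listeners."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]


def weighted_aggregate(annotations: List[List[str]], weights: List[float]) -> List[str]:
    """Per-mora vote in which each listener's label counts with their weight."""
    result = []
    for column in zip(*annotations):
        score: Dict[str, float] = {}
        for label, weight in zip(column, weights):
            score[label] = score.get(label, 0.0) + weight
        result.append(max(score, key=score.get))
    return result


# Annotations from the slide: listeners 1 and 2 miss the accent, listener 3 is correct.
annotations = [list("LLLL"), list("LLLL"), list("LHHH")]
# Hypothetical ability weights (low, low, high); MACE would estimate these.
abilities = [0.1, 0.1, 0.9]

print(mode_aggregate(annotations))
print(weighted_aggregate(annotations, abilities))
```

With these inputs the majority vote follows the two low-ability listeners, while the ability-weighted vote recovers the L H H H pattern that the slide lists as the actual accent.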
Experimental evaluation
10 / 17
• Evaluation targets
・TTS model
・HITL framework
• Objective evaluation criterion
・RMSE of the logF0 between synthetic and natural speech
• Subjective evaluation criterion
・Mean Opinion Score (MOS) test
  ・Listeners rated the naturalness of each sample on a 5-point scale (1: very poor to 5: very good).
  ・There were 50 listeners, and each listened to 20 speech samples.
・Preference AB test
  ・Listeners evaluated 10 pairs of speech samples synthesized by a specific pair of methods.
  ・There were 25 listeners for each AB test.
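The objective criterion can be sketched as follows. The function assumes two time-aligned F0 contours of equal length in Hz, with unvoiced frames stored as 0 and skipped; the slides do not describe the alignment or the voiced/unvoiced handling, so these are assumptions, and `logf0_rmse` is an illustrative name.

```python
import math
from typing import Sequence


def logf0_rmse(f0_syn: Sequence[float], f0_nat: Sequence[float]) -> float:
    """RMSE of log-F0 over frames that are voiced (F0 > 0) in both contours."""
    squared_errors = [
        (math.log(s) - math.log(n)) ** 2
        for s, n in zip(f0_syn, f0_nat)
        if s > 0 and n > 0
    ]
    if not squared_errors:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy contours (Hz); 0.0 marks an unvoiced frame.
natural = [0.0, 220.0, 230.0, 240.0, 0.0]
synthetic = [0.0, 210.0, 235.0, 250.0, 180.0]
print(logf0_rmse(synthetic, natural))
```

Working in the log domain makes the error depend on pitch ratios rather than absolute Hz differences, so a contour that is uniformly one octave too high gives an RMSE of ln 2 regardless of the speaker's range.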
Experimental setting for TTS model
11 / 17
• Experimental conditions for TTS model
TTS model: FastSpeech2 [Ren+21]
Train/eval. data of TTS model and prosody predictor: JSUT corpus [Takamichi+20], BASIC5000 subset, 4,488 / 512 sentences
• Compared methods for TTS model
・FS2
・FS2+Symbol
・FS2+Predictor (target)
・FS2+Predictor … Proposed
Result of objective evaluation
12 / 17
• Objective evaluation of TTS model
[Figure: logF0 RMSE for each compared method, with the axis marked from Good to Bad; the annotations read "worsen" and "comparable".]
・"worsen": The prediction error of the prosody predictor significantly degrades quality.
・"comparable": This suggests that natural speech can be synthesized if the correct accent is obtained through feedback.
Result of subjective evaluation
13 / 17
• Subjective evaluation of TTS model
・Preference score for prosody naturalness of synthetic speech
(bold in the original slides denotes a significant difference between the two methods)

Compared methods                          Preference score
FS2 vs. FS2+Predictor                     0.552 vs. 0.448
FS2 vs. FS2+Predictor (target)            0.268 vs. 0.732
FS2+Symbol vs. FS2+Predictor              0.768 vs. 0.232
FS2+Symbol vs. FS2+Predictor (target)     0.520 vs. 0.480
FS2 vs. FS2+Symbol                        0.248 vs. 0.752

[Slide annotations: "better" (twice) and "No significant difference"]
These results suggest that our method improves the prosodic naturalness of synthetic speech.
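One way to check whether a preference score differs from chance is an exact two-sided binomial test, sketched below with the scores from the table. The slides do not say which significance test the authors used, so this is an illustrative choice. The sketch also assumes that every one of the 25 listeners judged all 10 pairs (250 judgments per comparison) and that the judgments are independent.

```python
from math import comb


def two_sided_binomial_p(k: int, n: int) -> float:
    """Exact two-sided p-value for k successes in n trials under p = 0.5."""
    hi = max(k, n - k)  # the null distribution is symmetric around n / 2
    tail = sum(comb(n, i) for i in range(hi, n + 1))
    return min(1.0, 2 * tail / 2 ** n)


N_JUDGMENTS = 25 * 10  # 25 listeners x 10 pairs per AB test (from the slides)

# Preference score of the first-named method in each comparison (from the table).
first_method_scores = {
    "FS2 vs. FS2+Predictor": 0.552,
    "FS2 vs. FS2+Predictor (target)": 0.268,
    "FS2+Symbol vs. FS2+Predictor": 0.768,
    "FS2+Symbol vs. FS2+Predictor (target)": 0.520,
    "FS2 vs. FS2+Symbol": 0.248,
}

for name, score in first_method_scores.items():
    k = round(score * N_JUDGMENTS)  # number of judgments preferring the first method
    p = two_sided_binomial_p(k, N_JUDGMENTS)
    print(f"{name}: {k}/{N_JUDGMENTS} preferred, p = {p:.3g}")
```

Because the test treats all 250 judgments as independent, it ignores the fact that each listener contributes 10 of them; a per-listener analysis would be more conservative.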
Experimental setting for HITL framework
14 / 17
• Experimental conditions for HITL framework
Listener hiring platform: Lancers (crowdsourcing platform)
Dataset, participants: Japanese female speaker (JSUT corpus; data not used for training the TTS models and prosody predictor), 100 sentences, 15 persons / sentence (= 1,500 crowdworkers)
HITL framework interface: Radio buttons initialized with text analysis-derived accent information
• Compared methods for HITL framework
・Mode, MACE, Best, Median, Worst
[Figure: the accent sequences entered by the 15 crowdworkers (e.g., L H H H L L …) are each synthesized with the E2E TTS model, sorted by logF0 RMSE against the ground truth from Good to Bad, and the Best, Median, and Worst input accents are selected.]
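The Best, Median, and Worst conditions can be sketched as a sort over per-annotation scores. The sketch assumes that each crowdworker's accent sequence has already been synthesized and scored by logF0 RMSE against the ground truth. The function name and the RMSE values are hypothetical; only the accent strings follow the figure.

```python
from typing import Dict, List, Tuple


def pick_best_median_worst(scored: List[Tuple[str, float]]) -> Dict[str, str]:
    """Select annotations by rank after sorting on RMSE (lower is better)."""
    if not scored:
        raise ValueError("at least one scored annotation is required")
    ranked = sorted(scored, key=lambda item: item[1])
    return {
        "Best": ranked[0][0],
        "Median": ranked[len(ranked) // 2][0],
        "Worst": ranked[-1][0],
    }


# (accent annotation, logF0 RMSE) pairs; the RMSE values are invented for illustration.
scored_annotations = [
    ("L H H H L L", 0.12),
    ("H H H H L L", 0.31),
    ("L H H H H L", 0.20),
]
print(pick_best_median_worst(scored_annotations))
```

With 15 annotations per sentence, `len(ranked) // 2` picks the 8th-ranked annotation, which is the exact median for an odd count.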
Result of objective evaluation
15 / 17
• Objective evaluation of HITL framework
[Figure: logF0 RMSE for each method, with the axis marked from Good to Bad; the methods with HITL are marked, and the annotations read "worsen" and "comparable".]
These results suggest that error correction feedback can improve TTS quality.
Result of subjective evaluation
16 / 17
• Subjective evaluation of HITL framework
・Mean opinion score (MOS) for naturalness of synthetic speech
(bold in the original slides marks scores higher than that of FS2+Predictor)

Method                   MOS    95% confidence interval
FS2+Predictor            2.87   0.163
FS2+Predictor (target)   3.54   0.156
Best                     3.62   0.151
Median                   3.38   0.156
Worst                    2.84   0.174
Mode                     3.63   0.159
MACE                     3.49   0.161

・"obviously lower": These results suggest that the prediction error of the prosody predictor degrades the naturalness of speech.
・"significantly higher": These results suggest that our method improves the prosodic naturalness of synthetic speech.
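A conservative way to read the MOS table is to ask which methods have a confidence interval lying entirely above that of FS2+Predictor. The sketch below applies this check to the values from the table. It assumes that the "95% confidence interval" column is the half-width of the interval around the mean. Non-overlap of intervals is only a rough stand-in for a significance test and is not necessarily the test the authors used.

```python
from typing import Dict, Tuple

# (MOS, assumed 95% CI half-width) for each method, copied from the table.
MOS: Dict[str, Tuple[float, float]] = {
    "FS2+Predictor": (2.87, 0.163),
    "FS2+Predictor (target)": (3.54, 0.156),
    "Best": (3.62, 0.151),
    "Median": (3.38, 0.156),
    "Worst": (2.84, 0.174),
    "Mode": (3.63, 0.159),
    "MACE": (3.49, 0.161),
}


def clearly_higher(method: str, baseline: str = "FS2+Predictor") -> bool:
    """True if the method's CI lower bound exceeds the baseline's CI upper bound."""
    mean, half = MOS[method]
    base_mean, base_half = MOS[baseline]
    return mean - half > base_mean + base_half


higher = [m for m in MOS if m != "FS2+Predictor" and clearly_higher(m)]
print(higher)
```

With the values above, every method except Worst passes this check against FS2+Predictor.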
Summary
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrected accent errors.
・Our method successfully achieved the same quality of synthetic speech as the conventional method.
• Future work
・User interface of feedback
・How to integrate the obtained prosodic sequences.
17 / 17
Thank you for your attention!

Contenu connexe

Similaire à fujii22apsipa_asc

The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015Shinnosuke Takamichi
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdfRamya Nellutla
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdfsimonp16
 
Evaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsEvaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsSajeed Mahaboob
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH WarNik Chow
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Kotaro Hara
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
 
Selecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenSelecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenTomoyuki Kajiwara
 
Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15sekizawayuuki
 
A Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis SystemA Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis Systemiosrjce
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Reviewinscit2006
 
Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010woransa
 
Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...Kosuke Futamata
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...Lifeng (Aaron) Han
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]Sagar Ahire
 

Similaire à fujii22apsipa_asc (20)

The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdf
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
 
Selecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenSelecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for Children
 
Evaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsEvaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutions
 
The NLP Muppets revolution!
The NLP Muppets revolution!The NLP Muppets revolution!
The NLP Muppets revolution!
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Selecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenSelecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for Children
 
Selecting proper lexical paraphrase for children
Selecting proper lexical paraphrase for childrenSelecting proper lexical paraphrase for children
Selecting proper lexical paraphrase for children
 
Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15
 
A Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis SystemA Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis System
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010
 
Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]
 

Plus de Yuki Saito

hirai23slp03.pdf
hirai23slp03.pdfhirai23slp03.pdf
hirai23slp03.pdfYuki Saito
 
Interspeech2022 参加報告
Interspeech2022 参加報告Interspeech2022 参加報告
Interspeech2022 参加報告Yuki Saito
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUSYuki Saito
 
Neural text-to-speech and voice conversion
Neural text-to-speech and voice conversionNeural text-to-speech and voice conversion
Neural text-to-speech and voice conversionYuki Saito
 
Nishimura22slp03 presentation
Nishimura22slp03 presentationNishimura22slp03 presentation
Nishimura22slp03 presentationYuki Saito
 
Nakai22sp03 presentation
Nakai22sp03 presentationNakai22sp03 presentation
Nakai22sp03 presentationYuki Saito
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)Yuki Saito
 
Saito21asj Autumn Meeting
Saito21asj Autumn MeetingSaito21asj Autumn Meeting
Saito21asj Autumn MeetingYuki Saito
 
Interspeech2020 reading
Interspeech2020 readingInterspeech2020 reading
Interspeech2020 readingYuki Saito
 
Saito20asj_autumn
Saito20asj_autumnSaito20asj_autumn
Saito20asj_autumnYuki Saito
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020Yuki Saito
 
Saito20asj s slide_published
Saito20asj s slide_publishedSaito20asj s slide_published
Saito20asj s slide_publishedYuki Saito
 
Saito19asjAutumn_DeNA
Saito19asjAutumn_DeNASaito19asjAutumn_DeNA
Saito19asjAutumn_DeNAYuki Saito
 
Deep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generationDeep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generationYuki Saito
 

Plus de Yuki Saito (20)

hirai23slp03.pdf
hirai23slp03.pdfhirai23slp03.pdf
hirai23slp03.pdf
 
Interspeech2022 参加報告
Interspeech2022 参加報告Interspeech2022 参加報告
Interspeech2022 参加報告
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUS
 
Neural text-to-speech and voice conversion
Neural text-to-speech and voice conversionNeural text-to-speech and voice conversion
Neural text-to-speech and voice conversion
 
Nishimura22slp03 presentation
Nishimura22slp03 presentationNishimura22slp03 presentation
Nishimura22slp03 presentation
 
Nakai22sp03 presentation
Nakai22sp03 presentationNakai22sp03 presentation
Nakai22sp03 presentation
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 
Saito21asj Autumn Meeting
Saito21asj Autumn MeetingSaito21asj Autumn Meeting
Saito21asj Autumn Meeting
 
Saito2103slp
Saito2103slpSaito2103slp
Saito2103slp
 
Interspeech2020 reading
Interspeech2020 readingInterspeech2020 reading
Interspeech2020 reading
 
Saito20asj_autumn
Saito20asj_autumnSaito20asj_autumn
Saito20asj_autumn
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020
 
Saito20asj s slide_published
Saito20asj s slide_publishedSaito20asj s slide_published
Saito20asj s slide_published
 
Saito19asjAutumn_DeNA
Saito19asjAutumn_DeNASaito19asjAutumn_DeNA
Saito19asjAutumn_DeNA
 
Deep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generationDeep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generation
 
Saito19asj_s
Saito19asj_sSaito19asj_s
Saito19asj_s
 
Une18apsipa
Une18apsipaUne18apsipa
Une18apsipa
 
Saito18sp03
Saito18sp03Saito18sp03
Saito18sp03
 
Saito18asj_s
Saito18asj_sSaito18asj_s
Saito18asj_s
 
Saito17asjA
Saito17asjASaito17asjA
Saito17asjA
 

Dernier

Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |aasikanpl
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...anilsa9823
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoSérgio Sacani
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisDiwakar Mishra
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsAArockiyaNisha
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡anilsa9823
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )aarthirajkumar25
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTSérgio Sacani
 

Dernier (20)

CELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdfCELL -Structural and Functional unit of life.pdf
CELL -Structural and Functional unit of life.pdf
 
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
Call Us ≽ 9953322196 ≼ Call Girls In Mukherjee Nagar(Delhi) |
 
Engler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomyEngler and Prantl system of classification in plant taxonomy
Engler and Prantl system of classification in plant taxonomy
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Isotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on IoIsotopic evidence of long-lived volcanism on Io
Isotopic evidence of long-lived volcanism on Io
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Natural Polymer Based Nanomaterials
Natural Polymer Based NanomaterialsNatural Polymer Based Nanomaterials
Natural Polymer Based Nanomaterials
 
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service  🪡
CALL ON ➥8923113531 🔝Call Girls Kesar Bagh Lucknow best Night Fun service 🪡
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )Recombination DNA Technology (Nucleic Acid Hybridization )
Recombination DNA Technology (Nucleic Acid Hybridization )
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 

fujii22apsipa_asc

  • 1. AVATAR SYMBIOTIC SOCIETY Adaptive End-to-End Text-to-Speech Synthesis Based on Error Correction Feedback from Humans Kazuki Fujii, Yuki Saito, Hiroshi Saruwatari The University of Tokyo, Japan APSIPA ASC 2022 ThPM1-3.5 7 - 10 Nov. 2022・Chiang Mai, Thailand 1 / 17
  • 2. Research background 2 / 17 • End-to-end (E2E) Text-to-Speech (TTS)[Wang+17] ・High quality but low controllability ・Difficulty in correcting accent errors (cause of miscommunication, especially in a tonal language e.g., Japanese) High Low E2E TTS model a m e a m e E2E TTS model Text a me a me Prosody modeling and control in E2E TTS are a very important and challenging task. High Low
  • 3. Improving controllability using linguistic features 3 / 17 • Examples of linguistic features ・Full-context labels [Okamoto+19], [Okamoto+20] ・Phonetic and prosodic labels [Kurihara+21] • High controllability but need for prerequisite expertise “Adaptability”, the ability to easily correct mistakes, is not taken into consideration. Text analysis E2E TTS Input text → → linguistic feature → → Synthesized speech
  • 4. Outline of this talk 4 / 17 • Goal: Improve both “controllability” and “adaptability” of E2E TTS ・To make it easy for users to correct accent errors in synthetic speech • Proposed: ・E2E TTS with a prosody predictor (improve “controllability” ) ・Human-in-the-Loop (HITL) framework (improve “adaptability”) • Results: ・Our HITL framework successfully corrects accent errors. ・Our method successfully achieves the same quality of synthetic speech as the conventional method.
  • 5. Overview of proposed TTS model 5 / 17 • Backbone TTS model: FastSpeech2 [Ren+21] with trainable prosody predictor
  • 6. Backbone TTS model: FastSpeech2
    • FastSpeech2 [Ren+21]
      ・Modules for generating a mel-spectrogram of speech from a phoneme sequence
      ・Stable learning and inference
    [Figure labels: Variance Adaptor; Duration/pitch/energy predictor]
  • 7. Proposed method
    • Prosody predictor: DNN to estimate pitch changes per syllable (mora)
    • Input: Phoneme embedding + word embedding from BERT [Devlin+19]
    • Output: One of three prosodic symbols ("[" / "]": raising / lowering pitch, "_": keeping accent unchanged)
    • Loss function: MSE between predicted symbols and ground truth (text analysis results)
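The loss on this slide can be illustrated with a small sketch. It assumes each mora's target symbol is one-hot encoded and the predictor emits three scores per mora; the slides do not state the encoding, so `one_hot`, `mse_loss`, `decode`, and the example scores are hypothetical.

```python
# Minimal sketch, NOT the authors' implementation.
# Assumption: targets are one-hot vectors over the three prosodic symbols.
SYMBOLS = ["[", "]", "_"]  # raise pitch, lower pitch, keep accent unchanged


def one_hot(symbol):
    """Encode a prosodic symbol as a 3-dimensional one-hot vector."""
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]


def mse_loss(predicted, target_symbols):
    """Mean squared error between per-mora score vectors and one-hot targets."""
    total, count = 0.0, 0
    for scores, symbol in zip(predicted, target_symbols):
        for p, t in zip(scores, one_hot(symbol)):
            total += (p - t) ** 2
            count += 1
    return total / count


def decode(predicted):
    """Pick the highest-scoring symbol for each mora."""
    return [
        SYMBOLS[max(range(len(SYMBOLS)), key=lambda i: scores[i])]
        for scores in predicted
    ]


# Hypothetical predictor scores for two morae; targets are "_" then "[".
predicted = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
target = ["_", "["]
loss = mse_loss(predicted, target)
decoded = decode(predicted)
```

Here `decoded` matches `target`, and `loss` shrinks as the scores approach the one-hot targets.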
  • 8. Overview of proposed HITL framework
    • Motivation: Improve the adaptability of synthetic speech
    • Approach: Involve multiple listeners in the process of correcting accent errors
    [Figure: multiple listeners give accent corrections for the synthesized word "a me"]
  • 9. HITL accent error correction: Feedback aggregation
    • Challenge: Listeners differ in their ability to correct errors
      ・Simple way: Choose one of the multiple accent annotations (a random selector may pick a bad accent, which yields low-quality speech)
      ・Proposed: Aggregate the annotations in the following ways (the actual accent is Low (L), High (H), H, H)

      Source             Annotated accent   Ability estimated by MACE
      Listener 1         L L L L            Low ability
      Listener 2         L L L L            Low ability
      Listener 3         L H H H            High ability
      Mode               L L L L
      MACE [Hovy+13]     L H H H
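The table above can be reproduced with a short sketch. `aggregate_mode` is a per-mora majority vote. `aggregate_weighted` is only a simplified stand-in for MACE: it takes listener reliabilities as given, whereas MACE [Hovy+13] estimates them from the annotations with a probabilistic model. The reliability values are hypothetical.

```python
from collections import Counter


def aggregate_mode(annotations):
    """Per-mora majority vote over the listeners' accent sequences."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]


def aggregate_weighted(annotations, reliability):
    """Per-mora vote in which each listener's label counts with a fixed weight.

    Simplified stand-in for MACE: the weights are supplied by the caller
    instead of being inferred from the data.
    """
    result = []
    for column in zip(*annotations):
        votes = {}
        for label, weight in zip(column, reliability):
            votes[label] = votes.get(label, 0.0) + weight
        result.append(max(votes, key=votes.get))
    return result


# Annotations from the slide; the true accent is L H H H.
annotations = [list("LLLL"), list("LLLL"), list("LHHH")]
reliability = [0.2, 0.2, 0.9]  # hypothetical ability estimates

mode_result = aggregate_mode(annotations)
weighted_result = aggregate_weighted(annotations, reliability)
```

With these inputs the majority vote follows the two low-ability listeners, while the weighted vote recovers the actual accent. This is the behavior the slide attributes to Mode and MACE respectively.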
  • 10. Experimental evaluation
    • Evaluation targets
      ・TTS model
      ・HITL framework
    • Objective evaluation criterion
      ・RMSE of the logF0 between synthetic and natural speech
    • Subjective evaluation criteria
      ・Mean Opinion Score (MOS) test
        ・Listeners rated the naturalness of each sample on a 5-point scale (1: very poor, 5: very good).
        ・The number of listeners was 50, and each listened to 20 speech samples.
      ・Preference AB test
        ・Listeners evaluated 10 pairs of speech samples synthesized by a specific method pair.
        ・The number of listeners for each AB test was 25.
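The objective criterion can be sketched as follows. The sketch assumes the two F0 contours are already frame-aligned and that unvoiced frames are marked with 0, so only frames voiced in both contours are compared. The slides do not describe alignment or voicing handling, and `log_f0_rmse` is a hypothetical helper.

```python
import math


def log_f0_rmse(f0_synth, f0_natural):
    """RMSE of log F0 over frames where both contours are voiced (F0 > 0)."""
    squared_errors = [
        (math.log(s) - math.log(n)) ** 2
        for s, n in zip(f0_synth, f0_natural)
        if s > 0 and n > 0
    ]
    if not squared_errors:
        raise ValueError("no frame is voiced in both contours")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical F0 values in Hz; 0 marks an unvoiced frame.
f0_synth = [0.0, 200.0, 220.0, 0.0]
f0_natural = [0.0, 200.0, 200.0, 150.0]
rmse = log_f0_rmse(f0_synth, f0_natural)
```

Only the second and third frames contribute here, because the other two are unvoiced in at least one contour.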
  • 11. Experimental setting for TTS model
    • Experimental conditions for TTS model
      ・TTS model: FastSpeech2 [Ren+21]
      ・Train/eval. data of TTS model and prosody predictor: JSUT corpus [Takamichi+20], BASIC5000 subset, 4,488 / 512 sentences
    • Compared methods for TTS model
      ・FS2
      ・FS2+Symbol
      ・FS2+Predictor (target)
      ・FS2+Predictor … Proposed
  • 12-13. Result of objective evaluation
    • Objective evaluation of TTS model
    [Figure: objective evaluation results per method on an axis labeled Good/Bad, annotated "worsen" and "comparable"]
      ・"worsen": The prediction error of the prosody predictor significantly degrades quality.
      ・"comparable": Natural speech can be synthesized if the correct accent is obtained through feedback.
  • 14-16. Result of subjective evaluation
    • Subjective evaluation of TTS model
      ・Preference score for prosody naturalness of synthetic speech (on the slide, bold denotes a significant difference between the two methods; the bold marking is lost in this transcript)

      Compared methods                           Preference score
      FS2 vs. FS2+Predictor                      0.552 vs. 0.448
      FS2 vs. FS2+Predictor (target)             0.268 vs. 0.732
      FS2+Symbol vs. FS2+Predictor               0.768 vs. 0.232
      FS2+Symbol vs. FS2+Predictor (target)      0.520 vs. 0.480
      FS2 vs. FS2+Symbol                         0.248 vs. 0.752

    [Slide annotations: "better" (twice), "No significant difference"]
    • These results suggest that our method improves the prosodic naturalness of synthetic speech.
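One way to check whether a preference score differs from chance is an exact two-sided binomial test. The slides do not say which significance test was used, so this is an assumption. The sketch also pools the 25 listeners × 10 pairs into 250 judgments and treats them as independent, which is a simplification. `binomial_two_sided_p` is a hypothetical helper.

```python
from math import comb


def binomial_two_sided_p(wins, trials):
    """Exact two-sided p-value under the null hypothesis of no preference (p = 0.5)."""
    k = max(wins, trials - wins)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1)) / 2 ** trials
    return min(1.0, 2 * upper_tail)


TRIALS = 25 * 10  # listeners x pairs per AB test, as stated on the evaluation slide

# Preference scores taken from the table, converted to counts of judgments.
p_fs2_vs_predictor = binomial_two_sided_p(round(0.552 * TRIALS), TRIALS)
p_symbol_vs_predictor = binomial_two_sided_p(round(0.768 * TRIALS), TRIALS)
p_symbol_vs_target = binomial_two_sided_p(round(0.520 * TRIALS), TRIALS)
```

Under these assumptions, a p-value below 0.05 would correspond to a pair the slide marks as significantly different.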
  • 17. Experimental setting for HITL framework
    • Experimental conditions for HITL framework
      ・Listener hiring platform: Lancers (crowdsourcing platform)
      ・Dataset, participants: Japanese female speaker (JSUT corpus; data not used for training the TTS models and prosody predictor), 100 sentences, 15 persons per sentence (= 1,500 crowdworkers)
      ・HITL framework interface: Radio buttons are initialized with text-analysis-derived accent information
    • Compared methods for HITL framework
      ・Mode, MACE, Best, Median, Worst
    [Figure: the input accents of the 15 crowdworkers (e.g., L H H H L L …, H H H H L L …, L H H H H L …) are each fed to the E2E TTS model; the outputs are sorted by logF0 RMSE against the ground truth from good to bad, and the Best, Median, and Worst entries are selected]
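The Best / Median / Worst selection described in the figure can be sketched as a sort over per-listener results. The pair layout and the RMSE values below are hypothetical; in the experiment each sentence has 15 candidates.

```python
def select_best_median_worst(candidates):
    """Rank (accent_sequence, logf0_rmse) pairs by RMSE and pick three of them.

    A lower RMSE against the ground truth is better, so the first entry after
    sorting is "Best" and the last is "Worst".
    """
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return {
        "Best": ranked[0][0],
        "Median": ranked[len(ranked) // 2][0],
        "Worst": ranked[-1][0],
    }


# Hypothetical candidates for one sentence: accent input and resulting logF0 RMSE.
candidates = [
    ("HHHHLL", 0.31),
    ("LHHHLL", 0.12),
    ("LHHHHL", 0.20),
]
selection = select_best_median_worst(candidates)
```

With an odd number of candidates such as 15, the index `len(ranked) // 2` is the exact middle of the ranking.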
  • 18-20. Result of objective evaluation
    • Objective evaluation of HITL framework
    [Figure: objective evaluation results per method on an axis labeled Good/Bad, with the methods using HITL marked and the annotations "worsen" and "comparable"]
    • These results suggest that error correction feedback can improve TTS quality.
  • 21-24. Result of subjective evaluation
    • Subjective evaluation of HITL framework
      ・MOS for prosody naturalness of synthetic speech (on the slide, bold marks scores higher than that of FS2+Predictor; the bold marking is lost in this transcript)

      Method                    MOS     95% confidence interval
      FS2+Predictor             2.87    0.163
      FS2+Predictor (target)    3.54    0.156
      Best                      3.62    0.151
      Median                    3.38    0.156
      Worst                     2.84    0.174
      Mode                      3.63    0.159
      MACE                      3.49    0.161

    [Slide annotations: "obviously lower", "significantly higher"]
    • "obviously lower": These results suggest that the prediction error of the prosody predictor degrades the naturalness of speech.
    • "significantly higher": These results suggest that our method improves the prosodic naturalness of synthetic speech.
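A MOS and its interval can be computed as in the sketch below. It assumes the "95% confidence interval" column reports the half-width of a normal-approximation interval, which the slides do not state. The overlap check is only a rough reading aid, not the authors' significance test, and the ratings are hypothetical.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and the half-width of its 95% interval."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True if the intervals mean_a ± half_a and mean_b ± half_b overlap."""
    return abs(mean_a - mean_b) <= half_a + half_b


# Hypothetical 5-point ratings for one method.
mos, half_width = mos_with_ci([3, 4, 4, 5, 2, 3, 4, 3])

# Values from the table: FS2+Predictor vs. Mode, and FS2+Predictor vs. Worst.
predictor_vs_mode = intervals_overlap(2.87, 0.163, 3.63, 0.159)
predictor_vs_worst = intervals_overlap(2.87, 0.163, 2.84, 0.174)
```

Under this reading, the intervals of FS2+Predictor and Mode do not overlap, while those of FS2+Predictor and Worst do.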
  • 25. Summary
    • Goal: Improve both the "controllability" and the "adaptability" of E2E TTS
      ・To make it easy for users to correct accent errors in synthetic speech
    • Proposed:
      ・E2E TTS with a prosody predictor (improves "controllability")
      ・Human-in-the-Loop (HITL) framework (improves "adaptability")
    • Results:
      ・Our HITL framework successfully corrected accent errors.
      ・Our method achieved the same quality of synthetic speech as the conventional method.
    • Future work
      ・User interface for feedback
      ・How to integrate the obtained prosodic sequences
    Thank you for your attention!