AVATAR SYMBIOTIC
SOCIETY
Adaptive End-to-End Text-to-Speech Synthesis
Based on Error Correction Feedback from Humans
Kazuki Fujii, Yuki Saito, Hiroshi Saruwatari
The University of Tokyo, Japan
APSIPA ASC 2022 ThPM1-3.5
7 - 10 Nov. 2022・Chiang Mai, Thailand
1 / 17
Research background
2 / 17
• End-to-end (E2E) Text-to-Speech (TTS) [Wang+17]
・High quality but low controllability
・Difficulty in correcting accent errors
(a cause of miscommunication, especially in a pitch-accent language such as Japanese)
[Figure: E2E TTS model converting the text "a me" into speech; the pitch (High/Low) of the moras "a" and "me" is shown]
Prosody modeling and control in E2E TTS is an important and challenging task.
Improving controllability using linguistic features
3 / 17
• Examples of linguistic features
・Full-context labels [Okamoto+19], [Okamoto+20]
・Phonetic and prosodic labels [Kurihara+21]
• High controllability but need for prerequisite expertise
“Adaptability”, the ability to easily correct mistakes,
is not taken into consideration.
Input text → Text analysis → Linguistic features → E2E TTS → Synthesized speech
Outline of this talk
4 / 17
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrects accent errors.
・Our method successfully achieves the same quality of synthetic speech
as the conventional method.
Overview of proposed TTS model
5 / 17
• Backbone TTS model: FastSpeech2 [Ren+21] with trainable prosody predictor
Backbone TTS model: FastSpeech2
6 / 17
• FastSpeech2 [Ren+21]
・Modules for generating a mel-spectrogram of speech from a phoneme sequence
・Stable learning and inference
[Figure: FastSpeech2 architecture, including the variance adaptor with duration/pitch/energy predictors]
Proposed method
7 / 17
• Prosody predictor: DNN to estimate pitch changes per syllable (mora)
• Input: Phoneme embedding + Word embedding from BERT [Devlin+19]
• Output: One of three prosodic symbols ("[" / "]": raising / lowering pitch, "_": keeping accent unchanged)
• Loss function: MSE between predicted symbols and ground truth (text analysis results)
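As a rough illustration of the loss described on this slide, the sketch below computes an MSE between per-mora prediction scores and one-hot ground-truth symbols. The one-hot encoding, the symbol order, and all function names are assumptions made for this example; the slides do not specify how the symbols are encoded or how the predictor is implemented.

```python
# Hypothetical encoding of the three prosodic symbols named on the slide.
SYMBOLS = ["[", "]", "_"]  # raise pitch, lower pitch, keep unchanged


def one_hot(symbol):
    """Return a one-hot vector for a prosodic symbol (assumed encoding)."""
    return [1.0 if s == symbol else 0.0 for s in SYMBOLS]


def mse_loss(predicted, target_symbols):
    """Mean squared error between per-mora scores and one-hot targets.

    predicted: list of 3-element score lists, one per mora.
    target_symbols: list of ground-truth symbols, one per mora.
    """
    if len(predicted) != len(target_symbols):
        raise ValueError("predictions and targets must have the same length")
    total, count = 0.0, 0
    for scores, symbol in zip(predicted, target_symbols):
        for p, t in zip(scores, one_hot(symbol)):
            total += (p - t) ** 2
            count += 1
    return total / count


def decode(predicted):
    """Pick the highest-scoring symbol for each mora."""
    return [
        SYMBOLS[max(range(len(SYMBOLS)), key=lambda i: scores[i])]
        for scores in predicted
    ]


# Toy scores for two moras; the ground truth is "[" followed by "_".
scores = [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6]]
loss = mse_loss(scores, ["[", "_"])
symbols = decode(scores)
```

In a real system the scores would come from the DNN and the loss would be minimized by gradient descent; this sketch only shows how the target and the error term could be formed.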
Overview of proposed HITL framework
8 / 17
• Motivation: Improve the adaptability of synthetic speech
• Approach: Involve multiple listeners in the process of correcting accent errors
[Figure: multiple listeners give accent corrections for the moras "a" and "me" of synthesized speech]
HITL accent error correction: Feedback aggregation
9 / 17
• Challenge: Listeners differ in their error-correction abilities
・Simple way: Choose one from multiple accent annotations
[Figure: a random selector can pick a bad accent annotation, resulting in low-quality speech]
・Proposed: Aggregate in the following ways (the actual accent is Low (L), High (H), H, H)
                 Annotated accent   Estimated ability by MACE
Listener 1       L L L L            Low ability
Listener 2       L L L L            Low ability
Listener 3       L H H H            High ability
Mode             L L L L
MACE [Hovy+13]   L H H H
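The difference between the two aggregation rows can be sketched as follows. The first function is a plain per-mora majority vote (Mode). The second is an ability-weighted vote used here only as a simplified stand-in: it is not MACE itself, which estimates listener ability from the annotations with a probabilistic model. The ability weights and function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def aggregate_mode(annotations):
    """Majority vote per mora position across listeners."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]


def aggregate_weighted(annotations, abilities):
    """Ability-weighted vote per mora (a simplification, not MACE)."""
    result = []
    for column in zip(*annotations):
        votes = defaultdict(float)
        for label, weight in zip(column, abilities):
            votes[label] += weight
        result.append(max(votes, key=votes.get))
    return result


# Annotations from the slide's table: two low-ability listeners, one high-ability.
annotations = [list("LLLL"), list("LLLL"), list("LHHH")]
abilities = [0.1, 0.1, 0.9]  # assumed weights, not values from the slides

mode_result = "".join(aggregate_mode(annotations))
weighted_result = "".join(aggregate_weighted(annotations, abilities))
```

With these inputs the majority vote follows the two low-ability listeners, while the weighted vote follows the high-ability listener, which mirrors the Mode and MACE rows of the table.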
Experimental evaluation
10 / 17
• Evaluation targets
・TTS model
・HITL framework
• Objective evaluation criterion
・RMSE of the logF0 between synthetic and natural speech
• Subjective evaluation criterion
・Mean Opinion Score (MOS) test
・Listeners rated the naturalness of each sample on a 5-point scale (1: very poor--5: very good).
・The number of listeners was 50, and each listened to 20 speech samples.
・Preference AB test
・Listeners evaluated 10 pairs of speech samples synthesized by a specific method-pair.
・The number of listeners for each AB test was 25.
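A minimal sketch of the objective criterion, RMSE of logF0, is given below. It assumes the two F0 contours are already time-aligned and equal in length, and it skips frames that are unvoiced (F0 of 0) in either contour; the slides state neither the alignment method nor how unvoiced frames were handled, so both choices are assumptions, as is the function name.

```python
import math


def log_f0_rmse(f0_synth, f0_natural):
    """RMSE of log F0 over frames that are voiced (F0 > 0) in both contours."""
    if len(f0_synth) != len(f0_natural):
        raise ValueError("contours must be time-aligned and equal in length")
    squared_errors = [
        (math.log(s) - math.log(n)) ** 2
        for s, n in zip(f0_synth, f0_natural)
        if s > 0 and n > 0
    ]
    if not squared_errors:
        raise ValueError("no frames are voiced in both contours")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy contours in Hz; the second frame is unvoiced and is ignored.
identical = log_f0_rmse([120.0, 0.0, 150.0], [120.0, 0.0, 150.0])
octave_off = log_f0_rmse([200.0, 0.0], [100.0, 0.0])
```

A lower value means the synthetic pitch contour is closer to that of natural speech.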
Experimental setting for TTS model
11 / 17
• Experimental conditions for TTS model
TTS model: FastSpeech2 [Ren+21]
Train/eval. data of TTS model and prosody predictor: JSUT corpus [Takamichi+20], BASIC5000 subset, 4,488 / 512 sentences
• Compared methods for TTS model
・FS2
・FS2+Symbol
・FS2+Predictor (target)
・FS2+Predictor …Proposed
Result of objective evaluation
12 / 17
• Objective evaluation of TTS model
[Figure: objective evaluation result for the compared methods, on an axis running from Good to Bad]
・Callout "worse": This means that the prediction error of the prosody predictor significantly degrades quality.
・Callout "comparable": This suggests that natural speech can be synthesized if the correct accent is obtained through feedback.
Result of subjective evaluation
13 / 17
• Subjective evaluation of TTS model
・Preference score for prosody naturalness of synthetic speech
…Bold denotes a significant difference between the two methods
Compared method                          Preference score
FS2 vs. FS2+Predictor                    0.552 vs. 0.448
FS2 vs. FS2+Predictor (target)           0.268 vs. 0.732
FS2+Symbol vs. FS2+Predictor             0.768 vs. 0.232
FS2+Symbol vs. FS2+Predictor (target)    0.520 vs. 0.480
FS2 vs. FS2+Symbol                       0.248 vs. 0.752
[Slide callouts: "better" (twice); "No significant difference"]
These results suggest that our method improves the prosodic naturalness of synthetic speech.
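One way to check whether a preference score differs from chance is an exact two-sided binomial test, sketched below. The slides do not say which significance test was used, so this is an illustration rather than the authors' procedure. The count of 250 judgments per comparison is an assumption derived from the setup described earlier (25 listeners, 10 pairs each), and the function name is hypothetical.

```python
from math import comb


def binomial_two_sided_p(wins, total):
    """Exact two-sided binomial test of a preference count against p = 0.5."""
    k = max(wins, total - wins)
    tail = sum(comb(total, i) for i in range(k, total + 1)) / 2 ** total
    return min(1.0, 2 * tail)


TOTAL = 25 * 10  # assumed: 25 listeners x 10 pairs per comparison

# Preference scores of the second-named method, taken from the table.
p_predictor = binomial_two_sided_p(round(0.448 * TOTAL), TOTAL)
p_predictor_target = binomial_two_sided_p(round(0.732 * TOTAL), TOTAL)
p_symbol_vs_target = binomial_two_sided_p(round(0.480 * TOTAL), TOTAL)
```

Under these assumptions, a score of 0.732 is far from chance at the 5% level, while 0.448 and 0.480 are not distinguishable from 0.5.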
Experimental setting for HITL framework
14 / 17
• Experimental conditions for HITL framework
Listener hiring platform: Lancers (crowdsourcing platform)
Dataset, participants: Japanese female speaker (JSUT corpus; data not used for training the TTS models and prosody predictor), 100 sentences, 15 people per sentence (= 1,500 crowdworkers)
HITL framework interface: Radio buttons initialized with accent information derived from text analysis
• Compared methods for HITL framework
・Mode, MACE, Best, Median, Worst
[Figure: accent annotations from 15 crowdworkers (e.g., L H H H L L ....) are each input to the E2E TTS model; the synthesized samples are sorted by logF0 RMSE against the ground truth, and the Best, Median, and Worst input accents are selected]
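The Best, Median, and Worst conditions can be read as a sort-and-pick step over the crowdworkers' annotations. The sketch below assumes each annotation has already been synthesized and scored by logF0 RMSE against the ground truth; the RMSE values, the six-mora accent strings, and the function name are made up for illustration.

```python
def pick_best_median_worst(scored):
    """Select annotations by rank; scored is a list of (annotation, rmse) pairs.

    Lower RMSE is better. For an even number of entries the upper median
    is returned; with 15 annotations the index 7 is the exact median.
    """
    if not scored:
        raise ValueError("at least one scored annotation is required")
    ranked = sorted(scored, key=lambda pair: pair[1])
    return {
        "Best": ranked[0][0],
        "Median": ranked[len(ranked) // 2][0],
        "Worst": ranked[-1][0],
    }


# Toy input: three annotations with hypothetical logF0 RMSE values.
scored = [("LHHHLL", 0.12), ("HHHHLL", 0.31), ("LHHHHL", 0.20)]
selection = pick_best_median_worst(scored)
```

Because the ranking needs the ground-truth speech, these three conditions serve as reference points for the Mode and MACE aggregation methods rather than as deployable methods.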
Result of objective evaluation
15 / 17
• Objective evaluation of HITL framework
[Figure: objective evaluation result for each method, on an axis running from Good to Bad, with the methods using HITL marked; callouts: "worse", "comparable"]
These results suggest that error correction feedback can improve TTS quality.
Result of subjective evaluation
16 / 17
• Subjective evaluation of HITL framework
・MOS for prosody naturalness of synthetic speech
…Bold values are higher than that of FS2+Predictor
Method                    MOS    95% confidence interval
FS2+Predictor             2.87   0.163
FS2+Predictor (target)    3.54   0.156
Best                      3.62   0.151
Median                    3.38   0.156
Worst                     2.84   0.174
Mode                      3.63   0.159
MACE                      3.49   0.161
[Slide callouts: "obviously lower"; "significantly higher"]
These results suggest that the prediction error of the prosody predictor degrades the naturalness of speech.
These results suggest that our method improves the prosodic naturalness of synthetic speech.
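For reference, a MOS and its 95% confidence interval can be computed from raw ratings as sketched below, using a normal approximation. The ratings here are toy values, and reading the table's confidence-interval column as a half-width (mean ± value) is an assumption; the slides do not say how the interval was computed.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return the mean opinion score and a normal-approximation 95% half-width."""
    mean = statistics.mean(ratings)
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def intervals_overlap(mean_a, half_a, mean_b, half_b):
    """True if the intervals mean_a ± half_a and mean_b ± half_b overlap."""
    return mean_a - half_a <= mean_b + half_b and mean_b - half_b <= mean_a + half_a


# Toy ratings on the 5-point scale (not data from the experiment).
toy_mean, toy_half_width = mos_with_ci([3, 4, 4, 5, 2, 3, 4, 3])

# Values from the table, with the CI column read as a half-width (assumed).
predictor_vs_mode = intervals_overlap(2.87, 0.163, 3.63, 0.159)
target_vs_best = intervals_overlap(3.54, 0.156, 3.62, 0.151)
```

Under this reading, the intervals of FS2+Predictor and Mode do not overlap, while those of FS2+Predictor (target) and Best do. Non-overlapping intervals indicate a clear difference, but overlapping intervals alone do not prove that there is none.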
Summary
• Goal: Improve both “controllability” and “adaptability” of E2E TTS
・To make it easy for users to correct accent errors in synthetic speech
• Proposed:
・E2E TTS with a prosody predictor (improve “controllability”)
・Human-in-the-Loop (HITL) framework (improve “adaptability”)
• Results:
・Our HITL framework successfully corrected accent errors.
・Our method successfully achieved the same quality of synthetic speech
as the conventional method.
• Future work
・User interface of feedback
・How to integrate the obtained prosodic sequences.
17 / 17
Thank you for your attention!

Contenu connexe

Similaire à fujii22apsipa_asc

The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015Shinnosuke Takamichi
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdfRamya Nellutla
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdfsimonp16
 
Evaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsEvaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsSajeed Mahaboob
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH WarNik Chow
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Kotaro Hara
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for TranslationRIILP
 
Selecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenSelecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenTomoyuki Kajiwara
 
Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15sekizawayuuki
 
A Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis SystemA Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis Systemiosrjce
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Reviewinscit2006
 
Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010woransa
 
Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...Kosuke Futamata
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...Lifeng (Aaron) Han
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]Sagar Ahire
 

Similaire à fujii22apsipa_asc (20)

The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015The NAIST Text-to-Speech System for Blizzard Challenge 2015
The NAIST Text-to-Speech System for Blizzard Challenge 2015
 
Deep network notes.pdf
Deep network notes.pdfDeep network notes.pdf
Deep network notes.pdf
 
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
07-Effect-Of-Machine-Translation-In-Interlingual-Conversation.pdf
 
Selecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenSelecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for Children
 
Evaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutionsEvaluation of hindi english mt systems, challenges and solutions
Evaluation of hindi english mt systems, challenges and solutions
 
The NLP Muppets revolution!
The NLP Muppets revolution!The NLP Muppets revolution!
The NLP Muppets revolution!
 
2010 INTERSPEECH
2010 INTERSPEECH 2010 INTERSPEECH
2010 INTERSPEECH
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
Effect of Machine Translation in Interlingual Conversation: Lessons from a Fo...
 
13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation13. Constantin Orasan (UoW) Natural Language Processing for Translation
13. Constantin Orasan (UoW) Natural Language Processing for Translation
 
Selecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for ChildrenSelecting Proper Lexical Paraphrase for Children
Selecting Proper Lexical Paraphrase for Children
 
Selecting proper lexical paraphrase for children
Selecting proper lexical paraphrase for childrenSelecting proper lexical paraphrase for children
Selecting proper lexical paraphrase for children
 
Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15Emnlp読み会@2017 02-15
Emnlp読み会@2017 02-15
 
A Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis SystemA Marathi Hidden-Markov Model Based Speech Synthesis System
A Marathi Hidden-Markov Model Based Speech Synthesis System
 
Improvement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A ReviewImprovement in Quality of Speech associated with Braille codes - A Review
Improvement in Quality of Speech associated with Braille codes - A Review
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010Pbsmt presenation waleed_oransa_29_april2010
Pbsmt presenation waleed_oransa_29_april2010
 
Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...Phrase break prediction with bidirectional encoder representations in Japanes...
Phrase break prediction with bidirectional encoder representations in Japanes...
 
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
MT SUMMIT13.Language-independent Model for Machine Translation Evaluation wit...
 
NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]NLP Asignment Final Presentation [IIT-Bombay]
NLP Asignment Final Presentation [IIT-Bombay]
 

Plus de Yuki Saito

hirai23slp03.pdf
hirai23slp03.pdfhirai23slp03.pdf
hirai23slp03.pdfYuki Saito
 
Interspeech2022 参加報告
Interspeech2022 参加報告Interspeech2022 参加報告
Interspeech2022 参加報告Yuki Saito
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUSYuki Saito
 
Neural text-to-speech and voice conversion
Neural text-to-speech and voice conversionNeural text-to-speech and voice conversion
Neural text-to-speech and voice conversionYuki Saito
 
Nishimura22slp03 presentation
Nishimura22slp03 presentationNishimura22slp03 presentation
Nishimura22slp03 presentationYuki Saito
 
Nakai22sp03 presentation
Nakai22sp03 presentationNakai22sp03 presentation
Nakai22sp03 presentationYuki Saito
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)Yuki Saito
 
Saito21asj Autumn Meeting
Saito21asj Autumn MeetingSaito21asj Autumn Meeting
Saito21asj Autumn MeetingYuki Saito
 
Interspeech2020 reading
Interspeech2020 readingInterspeech2020 reading
Interspeech2020 readingYuki Saito
 
Saito20asj_autumn
Saito20asj_autumnSaito20asj_autumn
Saito20asj_autumnYuki Saito
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020Yuki Saito
 
Saito20asj s slide_published
Saito20asj s slide_publishedSaito20asj s slide_published
Saito20asj s slide_publishedYuki Saito
 
Saito19asjAutumn_DeNA
Saito19asjAutumn_DeNASaito19asjAutumn_DeNA
Saito19asjAutumn_DeNAYuki Saito
 
Deep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generationDeep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generationYuki Saito
 

Plus de Yuki Saito (20)

hirai23slp03.pdf
hirai23slp03.pdfhirai23slp03.pdf
hirai23slp03.pdf
 
Interspeech2022 参加報告
Interspeech2022 参加報告Interspeech2022 参加報告
Interspeech2022 参加報告
 
saito22research_talk_at_NUS
saito22research_talk_at_NUSsaito22research_talk_at_NUS
saito22research_talk_at_NUS
 
Neural text-to-speech and voice conversion
Neural text-to-speech and voice conversionNeural text-to-speech and voice conversion
Neural text-to-speech and voice conversion
 
Nishimura22slp03 presentation
Nishimura22slp03 presentationNishimura22slp03 presentation
Nishimura22slp03 presentation
 
Nakai22sp03 presentation
Nakai22sp03 presentationNakai22sp03 presentation
Nakai22sp03 presentation
 
GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)GAN-based statistical speech synthesis (in Japanese)
GAN-based statistical speech synthesis (in Japanese)
 
Saito21asj Autumn Meeting
Saito21asj Autumn MeetingSaito21asj Autumn Meeting
Saito21asj Autumn Meeting
 
Saito2103slp
Saito2103slpSaito2103slp
Saito2103slp
 
Interspeech2020 reading
Interspeech2020 readingInterspeech2020 reading
Interspeech2020 reading
 
Saito20asj_autumn
Saito20asj_autumnSaito20asj_autumn
Saito20asj_autumn
 
ICASSP読み会2020
ICASSP読み会2020ICASSP読み会2020
ICASSP読み会2020
 
Saito20asj s slide_published
Saito20asj s slide_publishedSaito20asj s slide_published
Saito20asj s slide_published
 
Saito19asjAutumn_DeNA
Saito19asjAutumn_DeNASaito19asjAutumn_DeNA
Saito19asjAutumn_DeNA
 
Deep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generationDeep learning for acoustic modeling in parametric speech generation
Deep learning for acoustic modeling in parametric speech generation
 
Saito19asj_s
Saito19asj_sSaito19asj_s
Saito19asj_s
 
Une18apsipa
Une18apsipaUne18apsipa
Une18apsipa
 
Saito18sp03
Saito18sp03Saito18sp03
Saito18sp03
 
Saito18asj_s
Saito18asj_sSaito18asj_s
Saito18asj_s
 
Saito17asjA
Saito17asjASaito17asjA
Saito17asjA
 

Dernier

Development of a Questionnaire for Identifying Personal Values in Driving
Development of a Questionnaire for Identifying Personal Values in DrivingDevelopment of a Questionnaire for Identifying Personal Values in Driving
Development of a Questionnaire for Identifying Personal Values in Drivingstudiotelon
 
Skin: Structure and function of the skin
Skin: Structure and function of the skinSkin: Structure and function of the skin
Skin: Structure and function of the skinheenarahangdale01
 
INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...
INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...
INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...Ajay kamboj
 
Phagocytosis Pinocytosis detail presentation
Phagocytosis Pinocytosis detail presentationPhagocytosis Pinocytosis detail presentation
Phagocytosis Pinocytosis detail presentationdhaduknevil1
 
Cultivating various strains of Duckweed Syllabus.pdf
Cultivating various strains of Duckweed Syllabus.pdfCultivating various strains of Duckweed Syllabus.pdf
Cultivating various strains of Duckweed Syllabus.pdfHaim R. Branisteanu
 
Theory of indicators: Ostwald's and Quinonoid theories
Theory of indicators: Ostwald's and Quinonoid theoriesTheory of indicators: Ostwald's and Quinonoid theories
Theory of indicators: Ostwald's and Quinonoid theoriesChimwemweGladysBanda
 
Pests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest Management
Pests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest ManagementPests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest Management
Pests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest ManagementPirithiRaju
 
The deconstructed Standard Model equation _ - symmetry magazine.pdf
The deconstructed Standard Model equation _ - symmetry magazine.pdfThe deconstructed Standard Model equation _ - symmetry magazine.pdf
The deconstructed Standard Model equation _ - symmetry magazine.pdfSOCIEDAD JULIO GARAVITO
 
dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...
dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...
dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...dkNET
 
RHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITY
RHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITYRHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITY
RHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITYDnyandaBopche
 
AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...
AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...
AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...Thane Heins
 
Science9 Quarter 3:Latitude and altitude.pptx
Science9 Quarter 3:Latitude and altitude.pptxScience9 Quarter 3:Latitude and altitude.pptx
Science9 Quarter 3:Latitude and altitude.pptxteleganne21
 
Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...
Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...
Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...Naomi Baes
 
Introduction to Green chemistry ppt.pptx
Introduction to Green chemistry ppt.pptxIntroduction to Green chemistry ppt.pptx
Introduction to Green chemistry ppt.pptxMuskan219429
 
The GIS Capability Maturity Model (2013)
The GIS Capability Maturity Model (2013)The GIS Capability Maturity Model (2013)
The GIS Capability Maturity Model (2013)GregBabinski
 
20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...
20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...
20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...Sharon Liu
 
Introduction about protein and General method of analysis of protein
Introduction about protein and General method of analysis of proteinIntroduction about protein and General method of analysis of protein
Introduction about protein and General method of analysis of proteinSowmiya
 
Zoogeographical regions In the World.pptx
Zoogeographical regions In the World.pptxZoogeographical regions In the World.pptx
Zoogeographical regions In the World.pptx2019n04898
 
Preparation of enterprise budget for integrated fish farming
Preparation of enterprise budget for integrated fish farmingPreparation of enterprise budget for integrated fish farming
Preparation of enterprise budget for integrated fish farmingbhanilsaa
 

Dernier (20)

Development of a Questionnaire for Identifying Personal Values in Driving
Development of a Questionnaire for Identifying Personal Values in DrivingDevelopment of a Questionnaire for Identifying Personal Values in Driving
Development of a Questionnaire for Identifying Personal Values in Driving
 
Skin: Structure and function of the skin
Skin: Structure and function of the skinSkin: Structure and function of the skin
Skin: Structure and function of the skin
 
INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...
INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...
INFLUENCE OF PREHARVEST PRACTICES, ENZYMATIC AND TEXTURAL CHANGES, RESPIRATIO...
 
Phagocytosis Pinocytosis detail presentation
Phagocytosis Pinocytosis detail presentationPhagocytosis Pinocytosis detail presentation
Phagocytosis Pinocytosis detail presentation
 
Cultivating various strains of Duckweed Syllabus.pdf
Cultivating various strains of Duckweed Syllabus.pdfCultivating various strains of Duckweed Syllabus.pdf
Cultivating various strains of Duckweed Syllabus.pdf
 
Theory of indicators: Ostwald's and Quinonoid theories
Theory of indicators: Ostwald's and Quinonoid theoriesTheory of indicators: Ostwald's and Quinonoid theories
Theory of indicators: Ostwald's and Quinonoid theories
 
Pests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest Management
Pests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest ManagementPests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest Management
Pests of Maize_Dr.UPR_Identification, Binomics, Integrated Pest Management
 
The deconstructed Standard Model equation _ - symmetry magazine.pdf
The deconstructed Standard Model equation _ - symmetry magazine.pdfThe deconstructed Standard Model equation _ - symmetry magazine.pdf
The deconstructed Standard Model equation _ - symmetry magazine.pdf
 
dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...
dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...
dkNET Webinar "The Multi-Omic Response to Exercise Training Across Rat Tissue...
 
RHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITY
RHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITYRHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITY
RHEOLOGY MODIFIERS: ENHANCING PERFORMANCE AND FUNCTIONALITY
 
AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...
AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...
AI Published & MIT Validated Perpetual Motion Machine Breakthroughs (2 New EV...
 
Science9 Quarter 3:Latitude and altitude.pptx
Science9 Quarter 3:Latitude and altitude.pptxScience9 Quarter 3:Latitude and altitude.pptx
Science9 Quarter 3:Latitude and altitude.pptx
 
Proof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptx
Proof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptxProof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptx
Proof-of-Concept Publicly Accessible Data Dashboards from the US-EPA.pptx
 
Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...
Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...
Naomi Baes's PhD Confirmation Presentation: A Multidimensional Framework for ...
 
Introduction to Green chemistry ppt.pptx
Introduction to Green chemistry ppt.pptxIntroduction to Green chemistry ppt.pptx
Introduction to Green chemistry ppt.pptx
 
The GIS Capability Maturity Model (2013)
The GIS Capability Maturity Model (2013)The GIS Capability Maturity Model (2013)
The GIS Capability Maturity Model (2013)
 
20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...
20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...
20240315 ACMJ Diagrams Set 2.docx . With light, motor, coloured light, and se...
 
Introduction about protein and General method of analysis of protein
Introduction about protein and General method of analysis of proteinIntroduction about protein and General method of analysis of protein
Introduction about protein and General method of analysis of protein
 
Zoogeographical regions In the World.pptx
Zoogeographical regions In the World.pptxZoogeographical regions In the World.pptx
Zoogeographical regions In the World.pptx
 
Preparation of enterprise budget for integrated fish farming
Preparation of enterprise budget for integrated fish farmingPreparation of enterprise budget for integrated fish farming
Preparation of enterprise budget for integrated fish farming
 

fujii22apsipa_asc

  • 1. AVATAR SYMBIOTIC SOCIETY Adaptive End-to-End Text-to-Speech Synthesis Based on Error Correction Feedback from Humans Kazuki Fujii, Yuki Saito, Hiroshi Saruwatari The University of Tokyo, Japan APSIPA ASC 2022 ThPM1-3.5 7 - 10 Nov. 2022・Chiang Mai, Thailand 1 / 17
  • 2. Research background 2 / 17 • End-to-end (E2E) Text-to-Speech (TTS)[Wang+17] ・High quality but low controllability ・Difficulty in correcting accent errors (cause of miscommunication, especially in a tonal language e.g., Japanese) High Low E2E TTS model a m e a m e E2E TTS model Text a me a me Prosody modeling and control in E2E TTS are a very important and challenging task. High Low
  • 3. Improving controllability using linguistic features 3 / 17 • Examples of linguistic features ・Full-context labels [Okamoto+19], [Okamoto+20] ・Phonetic and prosodic labels [Kurihara+21] • High controllability but need for prerequisite expertise “Adaptability”, the ability to easily correct mistakes, is not taken into consideration. Text analysis E2E TTS Input text → → linguistic feature → → Synthesized speech
  • 4. Outline of this talk
    • Goal: Improve both “controllability” and “adaptability” of E2E TTS
      ・To make it easy for users to correct accent errors in synthetic speech
    • Proposed:
      ・E2E TTS with a prosody predictor (improve “controllability”)
      ・Human-in-the-Loop (HITL) framework (improve “adaptability”)
    • Results:
      ・Our HITL framework successfully corrects accent errors.
      ・Our method successfully achieves the same quality of synthetic speech as the conventional method.
  • 5. Overview of proposed TTS model
    • Backbone TTS model: FastSpeech2 [Ren+21] with a trainable prosody predictor
  • 6. Backbone TTS model: FastSpeech2
    • FastSpeech2 [Ren+21]
      ・Modules for generating a mel-spectrogram of speech from a phoneme sequence
      ・Stable learning and inference
    • (Figure labels: Variance Adaptor; Duration/pitch/energy predictor)
  • 7. Proposed method
    • Prosody predictor: DNN to estimate pitch changes per syllable (mora)
    • Input: Phoneme embedding + Word embedding from BERT [Devlin+19]
    • Output: One of three prosodic symbols ("[" / "]": raising / lowering pitch, "_": keeping accent unchanged)
    • Loss function: MSE between predicted symbols and ground truth (text analysis results)
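The three prosodic symbols on slide 7 can be read as per-mora pitch instructions. The following is a minimal Python sketch of that reading, not the authors' implementation. It assumes that each symbol applies to its own mora and that an utterance starts at Low; neither detail is stated on the slide, and the function name `symbols_to_levels` is hypothetical.

```python
def symbols_to_levels(symbols, start="L"):
    """Turn per-mora prosodic symbols into High/Low pitch levels.

    Assumed reading: "[" raises pitch at this mora, "]" lowers it,
    "_" keeps the previous level. The starting level is an assumption.
    """
    levels = []
    current = start
    for symbol in symbols:
        if symbol == "[":
            current = "H"
        elif symbol == "]":
            current = "L"
        elif symbol != "_":
            raise ValueError(f"unknown prosodic symbol: {symbol!r}")
        levels.append(current)
    return levels


print(symbols_to_levels(["_", "[", "_", "_"]))  # ['L', 'H', 'H', 'H']
```

Under these assumptions, correcting an accent error amounts to changing one symbol in the sequence, which is what makes the representation easy for a listener to edit.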
  • 8. Overview of proposed HITL framework
    • Motivation: Improve the adaptability of synthetic speech
    • Approach: Involve multiple listeners in the process of correcting accent errors
    • (Figure: several listeners give accent feedback on "a me".)
  • 9. HITL accent error correction: Feedback aggregation
    • Challenge: Listeners differ in their error correction abilities
      ・Simple way: Choose one from multiple accent annotations (a random selector may pick a bad accent, giving low-quality speech)
      ・Proposed: Aggregate in the following ways (actual accent is Low (L) High (H) H H)
    • Annotated accent and ability estimated by MACE
      ・Listener 1: L L L L (low ability)
      ・Listener 2: L L L L (low ability)
      ・Listener 3: L H H H (high ability)
    • Aggregated accent
      ・Mode: L L L L
      ・MACE [Hovy+13]: L H H H
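The difference between the two aggregation rows on slide 9 can be sketched in Python using the three listeners' annotations. `aggregate_by_mode` is a plain per-mora majority vote. `aggregate_weighted` is only a stand-in that shows how per-listener reliability can change the result: it is not MACE, and the weights passed to it are invented for illustration. Both function names are hypothetical.

```python
from collections import Counter, defaultdict


def aggregate_by_mode(annotations):
    """Majority vote per mora over equal-length accent annotations."""
    return [Counter(column).most_common(1)[0][0] for column in zip(*annotations)]


def aggregate_weighted(annotations, weights):
    """Weighted vote per mora. NOT MACE: a stand-in showing how a
    reliability estimate per listener can change the aggregated accent."""
    result = []
    for column in zip(*annotations):
        score = defaultdict(float)
        for label, weight in zip(column, weights):
            score[label] += weight
        result.append(max(score, key=score.get))
    return result


# Listeners 1-3 from the slide; the weights below are made up.
annotations = [list("LLLL"), list("LLLL"), list("LHHH")]
print(aggregate_by_mode(annotations))                    # ['L', 'L', 'L', 'L']
print(aggregate_weighted(annotations, [0.2, 0.2, 0.9]))  # ['L', 'H', 'H', 'H']
```

The mode follows the two low-ability listeners, while a vote that trusts listener 3 more recovers the actual accent, which is the behaviour the slide attributes to MACE.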
  • 10. Experimental evaluation
    • Evaluation targets
      ・TTS model
      ・HITL framework
    • Objective evaluation criterion
      ・RMSE of the logF0 between synthetic and natural speech
    • Subjective evaluation criteria
      ・Mean Opinion Score (MOS) test: listeners rated the naturalness of each sample on a 5-point scale (1: very poor to 5: very good). The number of listeners was 50, and each listened to 20 speech samples.
      ・Preference AB test: listeners evaluated 10 pairs of speech samples synthesized by a specific method pair. The number of listeners for each AB test was 25.
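The objective criterion on slide 10 can be illustrated with a small Python sketch. It assumes that the two F0 sequences are already time-aligned, are given in Hz, and use 0 for unvoiced frames, and that only frames voiced in both sequences are scored. None of these details are stated on the slide, and `log_f0_rmse` is a hypothetical name.

```python
import math


def log_f0_rmse(synth_f0, natural_f0):
    """RMSE of log F0 over frames that are voiced in both sequences.

    Assumes time-aligned sequences in Hz with 0 marking unvoiced frames.
    """
    if len(synth_f0) != len(natural_f0):
        raise ValueError("F0 sequences must have the same length")
    squared = [
        (math.log(s) - math.log(n)) ** 2
        for s, n in zip(synth_f0, natural_f0)
        if s > 0 and n > 0
    ]
    if not squared:
        raise ValueError("no frame is voiced in both sequences")
    return math.sqrt(sum(squared) / len(squared))


natural = [120.0, 0.0, 150.0, 180.0]
synth = [120.0, 110.0, 150.0, 180.0]
# The only mismatch falls on a frame that is unvoiced in the natural speech.
print(log_f0_rmse(synth, natural))  # 0.0
```

Working in the log domain means a pitch error is measured as a ratio, so an octave error (a factor of 2) costs the same at any base frequency.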
  • 11. Experimental setting for TTS model
    • Experimental conditions for TTS model
      ・TTS model: FastSpeech2 [Ren+21]
      ・Train/eval. data of TTS model and prosody predictor: JSUT corpus [Takamichi+20], BASIC5000 subset, 4,488 / 512 sentences
    • Compared methods for TTS model
      ・FS2
      ・FS2+Symbol
      ・FS2+Predictor (target)
      ・FS2+Predictor … Proposed
  • 12. Result of objective evaluation
    • Objective evaluation of TTS model
      ・(Figure: results on an axis running from Bad to Good; annotation: "worsen")
      ・This means that the prediction error of the prosody predictor significantly degrades quality.
  • 13. Result of objective evaluation
    • Objective evaluation of TTS model
      ・(Figure: results on an axis running from Bad to Good; annotation: "comparable")
      ・This suggests that natural speech can be synthesized if the correct accent is obtained through feedback.
  • 14. Result of subjective evaluation
    • Subjective evaluation of TTS model
      ・Preference score for prosody naturalness of synthetic speech (bold denotes a significant difference between the two methods)
    • Compared method: Preference score
      ・FS2 vs. FS2+Predictor: 0.552 vs. 0.448
      ・FS2 vs. FS2+Predictor (target): 0.268 vs. 0.732
      ・FS2+Symbol vs. FS2+Predictor: 0.768 vs. 0.232
      ・FS2+Symbol vs. FS2+Predictor (target): 0.520 vs. 0.480
      ・FS2 vs. FS2+Symbols: 0.248 vs. 0.752
    • Annotations: "better", "better"
  • 16. Result of subjective evaluation
    • Subjective evaluation of TTS model
      ・Preference score for prosody naturalness of synthetic speech (bold denotes a significant difference between the two methods)
      ・Annotation: "No significant difference"
      ・These results suggest that our method improves the prosodic naturalness of synthetic speech.
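The slides do not say which significance test was used for the AB preference scores. As one plausible check, the Python sketch below runs an exact two-sided binomial test against a 50/50 preference. It assumes the 25 listeners × 10 pairs are pooled into 250 independent judgments per comparison, which is an assumption, although every reported score is consistent with a whole number of votes out of 250. The function name `two_sided_binomial_p` is hypothetical.

```python
from math import comb


def two_sided_binomial_p(wins, trials):
    """Exact two-sided binomial test against a 50/50 preference."""
    k = max(wins, trials - wins)  # fold onto the upper tail (the null is symmetric)
    upper_tail = sum(comb(trials, i) for i in range(k, trials + 1))
    return min(1.0, 2 * upper_tail / 2**trials)


TRIALS = 25 * 10  # listeners x pairs, assumed to be pooled
for name, share in [
    ("FS2 vs. FS2+Predictor", 0.552),
    ("FS2 vs. FS2+Predictor (target)", 0.268),
]:
    wins = round(share * TRIALS)  # votes for the first method
    print(name, wins, two_sided_binomial_p(wins, TRIALS))
```

Under this assumption, a 0.552 vs. 0.448 split does not reach the 5% level, while a 0.268 vs. 0.732 split does by a wide margin.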
  • 17. Experimental setting for HITL framework
    • Experimental conditions for HITL framework
      ・Listener hiring platform: Lancers (crowdsourcing platform)
      ・Dataset, participants: Japanese female speaker (JSUT corpus; data not used for training the TTS models and prosody predictor), 100 sentences, 15 persons / sentence (= 1,500 crowdworkers)
      ・HITL framework interface: Radio buttons initialized with text-analysis-derived accent information
    • Compared methods for HITL framework
      ・Mode, MACE, Best, Median, Worst
    • (Figure: the accents input by the 15 crowdworkers, e.g. "L H H H L L ....", "H H H H L L ....", "L H H H H L ....", are each passed to the E2E TTS; the results are sorted from Good to Bad by logF0 RMSE against the ground truth, and Best, Median, and Worst are taken from the sorted list.)
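The Best, Median, and Worst conditions on slide 17 can be sketched as a sort over per-annotation scores. The Python sketch below assumes that each crowdworker's annotation has already been synthesized and scored by logF0 RMSE against the ground truth. The function name `pick_best_median_worst` and the RMSE values in the example are made up for illustration.

```python
def pick_best_median_worst(candidates):
    """Select annotations by rank of their logF0 RMSE (lower is better).

    candidates: list of (accent_annotation, logf0_rmse) pairs.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    ranked = sorted(candidates, key=lambda pair: pair[1])
    return {
        "Best": ranked[0][0],
        "Median": ranked[len(ranked) // 2][0],
        "Worst": ranked[-1][0],
    }


# Illustrative values only; the slides do not report RMSE per annotation.
candidates = [("L H H H", 0.12), ("H H H H", 0.31), ("L L L L", 0.20)]
print(pick_best_median_worst(candidates))
# {'Best': 'L H H H', 'Median': 'L L L L', 'Worst': 'H H H H'}
```

With 15 annotations per sentence, `len(ranked) // 2` selects the eighth-ranked annotation, which is the true median of an odd-sized list.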
  • 18. Result of objective evaluation
    • Objective evaluation of HITL framework
      ・(Figure: results on an axis running from Bad to Good, with the methods using HITL marked; annotation: "worsen")
  • 20. Result of objective evaluation
    • Objective evaluation of HITL framework
      ・(Figure: results on an axis running from Bad to Good, with the methods using HITL marked; annotation: "comparable")
      ・These results suggest that error correction feedback can improve TTS quality.
  • 21. Result of subjective evaluation
    • Subjective evaluation of HITL framework
      ・MOS for prosody naturalness of synthetic speech (bold denotes scores higher than that of FS2+Predictor)
    • Method: MOS, 95% confidence interval
      ・FS2+Predictor: 2.87, 0.163
      ・FS2+Predictor (target): 3.54, 0.156
      ・Best: 3.62, 0.151
      ・Median: 3.38, 0.156
      ・Worst: 2.84, 0.174
      ・Mode: 3.63, 0.159
      ・MACE: 3.49, 0.161
    • Annotation: "obviously lower"
  • 22. Result of subjective evaluation
    • Subjective evaluation of HITL framework
      ・Annotation: "obviously lower"
      ・These results suggest that the prediction error of the prosody predictor degrades the naturalness of speech.
  • 24. Result of subjective evaluation
    • Subjective evaluation of HITL framework
      ・Annotation: "significantly higher"
      ・These results suggest that our method improves the prosodic naturalness of synthetic speech.
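One rough way to read the MOS table reported for the HITL evaluation is to check which methods have a confidence interval lying entirely above that of FS2+Predictor. The Python sketch below assumes the "95% confidence interval" column gives the half-width of each interval, which the slides do not confirm. Interval separation is a conservative check and is not necessarily the authors' significance test; `clearly_higher_than` is a hypothetical name.

```python
# (method, MOS, assumed half-width of the 95% confidence interval)
MOS_TABLE = [
    ("FS2+Predictor", 2.87, 0.163),
    ("FS2+Predictor (target)", 3.54, 0.156),
    ("Best", 3.62, 0.151),
    ("Median", 3.38, 0.156),
    ("Worst", 2.84, 0.174),
    ("Mode", 3.63, 0.159),
    ("MACE", 3.49, 0.161),
]


def clearly_higher_than(baseline_name, table):
    """Methods whose interval lies entirely above the baseline's interval."""
    stats = {name: (mean, half) for name, mean, half in table}
    base_mean, base_half = stats[baseline_name]
    base_upper = base_mean + base_half
    return [
        name
        for name, mean, half in table
        if name != baseline_name and mean - half > base_upper
    ]


print(clearly_higher_than("FS2+Predictor", MOS_TABLE))
# ['FS2+Predictor (target)', 'Best', 'Median', 'Mode', 'MACE']
```

Under this reading, every condition except Worst separates from FS2+Predictor, including both aggregation methods (Mode and MACE).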
  • 25. Summary
    • Goal: Improve both “controllability” and “adaptability” of E2E TTS
      ・To make it easy for users to correct accent errors in synthetic speech
    • Proposed:
      ・E2E TTS with a prosody predictor (improve “controllability”)
      ・Human-in-the-Loop (HITL) framework (improve “adaptability”)
    • Results:
      ・Our HITL framework successfully corrected accent errors.
      ・Our method successfully achieved the same quality of synthetic speech as the conventional method.
    • Future work
      ・User interface for feedback
      ・How to integrate the obtained prosodic sequences
    Thank you for your attention!