This document discusses prosody modeling for Mandarin Chinese speech. It begins with an introduction to prosody and its importance in communication. Prosody can be measured acoustically using features like fundamental frequency, duration, intensity, and pause. A prosodic hierarchy for Mandarin is proposed with different levels like syllable, prosodic word, phrase, and breath group. Unsupervised joint prosody labeling and modeling is introduced as an approach that models observed prosodic features to determine prosodic tags without human perception. Parameters and a hierarchical model are used to represent prosodic structures and model relationships between linguistic information and prosodic-acoustic features.
2. 2
An Example - Talking Twin Babies - PART 2 -
OFFICIAL VIDEO
2016/7/17
https://www.youtube.com/watch?v=_JmA2ClUvUY
3. 3
It isn’t what you said; it’s how you said it!
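• This sentence is really about prosody (韻律).
• In short: the falling and rising, pausing and breaking, lightness and heaviness, slowness and quickness of speech (抑,揚,頓,挫,輕,重,緩,急)
Extreme examples of fast, fluent speech and slow, halting speech
Read speech
Different speaking styles
Spontaneous speech (usually conversation)
Human to machine
Reading the above sentence in a read-aloud style (odd)
Saying the above sentence in a natural way (much more natural, and it communicates!)
嘿!芭樂!請你過來看一看。 ("Hey! Guava! Please come over and take a look." Is "Guava" a person?)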
4. 4
It isn’t what you said; it’s how you said it!
• Thomas Sheridan (Irish stage actor) pointed out the importance of prosody more than 200 years ago:
– Children are taught to read sentences, which they do not understand; and as it is impossible to lay the emphasis right, without perfectly comprehending the meaning of what one reads, they get a habit either of reading in a monotone, or if they attempt to distinguish one word from the rest, as the emphasis falls at random, the sense is usually perverted, or changed into nonsense.
5. 5
Physical Quantification and Measurement of Prosody
• Prosody can be measured by the following prosodic-acoustic features (韻律聲學參數):
– Fundamental frequency (基頻), also called F0
– Duration (時長)
– Intensity or energy (能量)
– Pause or silence (靜音)
• Prosodic-acoustic features can be measured over the following units:
– utterance (語句), discourse (語段), sentence (句子), clause (子句), phrase (片語), word (詞), syllable (音節), initial/final (聲母/韻母), phoneme (音素)
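As a rough illustration of how the four features listed above might be measured, the sketch below frames a signal, computes frame energy in dB, estimates F0 with a plain autocorrelation peak pick, and counts low-energy frames as pause time. The function names, the 40 ms frame and 10 ms hop, the -40 dB silence threshold, and the synthetic two-tone test signal are all assumptions made for this example and do not come from the slides; practical systems rely on more robust pitch trackers.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Slice a 1-D signal into overlapping frames (one frame per row)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def energy_db(frames, floor=1e-10):
    """Frame energy (mean square) in dB; the floor avoids log(0) in silence."""
    return 10.0 * np.log10(np.mean(frames ** 2, axis=1) + floor)

def f0_autocorr(frame, fs, fmin=80.0, fmax=400.0):
    """Estimate F0 in Hz from the autocorrelation peak between fmin and fmax."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    return fs / lag

def measure_prosody(x, fs, frame_s=0.04, hop_s=0.01, silence_db=-40.0):
    """Return per-frame energy and F0 plus total speech and pause time."""
    frame_len, hop = int(round(frame_s * fs)), int(round(hop_s * fs))
    frames = frame_signal(x, frame_len, hop)
    e = energy_db(frames)
    active = e > silence_db                      # crude speech/pause decision
    f0 = np.where(active, [f0_autocorr(f, fs) for f in frames], np.nan)
    return {
        "energy_db": e,
        "f0_hz": f0,                             # NaN in frames judged silent
        "speech_s": float(active.sum()) * hop_s,
        "pause_s": float((~active).sum()) * hop_s,
    }

# Hypothetical test signal: 200 Hz tone, 0.2 s of silence, 150 Hz tone.
fs = 16000
t = np.arange(4800) / fs                         # 0.3 s per tone
signal = np.concatenate([
    0.5 * np.sin(2 * np.pi * 200 * t),
    np.zeros(3200),
    0.5 * np.sin(2 * np.pi * 150 * t),
])
features = measure_prosody(signal, fs)
```

Because each frame is 40 ms long, frames that straddle a tone and the silence still count as active, so the frame-based pause estimate comes out somewhat shorter than the true 0.2 s gap.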
7. 7
Important Characteristics of Mandarin Chinese (1/2)
• A tonal language (four lexical tones (聲調) and one neutral tone)
• The tonality of a monosyllable is mainly characterized by the shape of its fundamental frequency (F0) contour. Tone notation proposed by Yuen Ren Chao (趙元任).
• Tones disambiguate word meanings: 媽 (mā), 麻 (má), 馬 (mǎ), 罵 (mà), 嘛 (ma); 買 (mǎi, to buy) vs. 賣 (mài, to sell); 主投 (zhǔtóu) vs. 豬頭 (zhūtóu)
Audio demos: original speech signal; synthesis without tone
8. 8
Important Characteristics of Mandarin Chinese (2/2)
• A syllable-based language, where each syllable carries a lexical tone (聲調).
– 411 base syllables; combined with tones, about 1,300 distinct tonal syllables.
• A syllable-timed language
– Syllables take approximately equal amounts of time to pronounce.
– Syllable structure of Chinese: Initial (聲母) + Final (韻母)
– Initial = consonant; Final = [medial] + nucleus + [coda]
• English, by contrast, is a stress-timed language, where there is approximately the same amount of time between stressed syllables.
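The Initial + Final structure can be illustrated with a small splitter. This is only a sketch under stated assumptions: it works on toneless Hanyu Pinyin spellings (the slides do not name a romanization), the `INITIALS` list and the `split_syllable` helper are hypothetical names, and syllables written with "y" or "w" are treated as having no initial.

```python
# The 21 consonant initials in Hanyu Pinyin spelling; two-letter initials
# come first so that "zh" is matched before "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(pinyin):
    """Split a toneless pinyin syllable into (initial, final)."""
    for initial in INITIALS:
        if pinyin.startswith(initial):
            return initial, pinyin[len(initial):]
    return "", pinyin  # zero-initial syllable, e.g. "an"

examples = {s: split_syllable(s) for s in ["ma", "zhuang", "shi", "an"]}
```

The further decomposition of the final into medial, nucleus and coda is left out here, since it needs a table of finals rather than simple prefix matching.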
Audio demos: native English speaker; non-native English speaker
9. 9
Tone and Intonation
• Ripples on the waves (Yuen Ren Chao, 趙元任), or superposition
– Synthesis with tone + intonation
– Synthesis without intonation
10. 10
Prosody Hierarchy for Mandarin (Tseng, 2005)
Chiu-yu Tseng et al., “Fluent speech prosody: framework and modeling,” Speech Communication, vol. 46, issues 3-4, Special Issue on Quantitative Prosody Modeling for Natural Speech Description and Generation, pp. 284-309, July 2005.
Prosodic Phrase Group
Breath Group
Prosodic Phrase
Prosodic Word
Syllable
11. 11
A Modified Prosodic Structure for Mandarin
B4: Boundary of a Breath Group (BG)/Prosodic Phrase Group (PG)
B3: Boundary of a Prosodic Phrase (PPh)
B2: Prosodic Word (PW) boundary
B2-1: pitch reset
B2-2: short pause
B2-3: duration lengthening
B1: Normal syllabic boundary
B0: Tightly coupled syllabic boundary
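To make the role of these break types concrete, the sketch below groups a syllable sequence into prosodic words, prosodic phrases, or breath/phrase groups from the break tag at each syllable juncture. It assumes, as the hierarchy suggests, that a higher-level break also closes every lower-level constituent. The names `RANK`, `LEVEL_MIN_RANK` and `segment`, and the B1 tags in the example, are hypothetical.

```python
# Rank of each break tag; B2-1, B2-2 and B2-3 all mark a prosodic word boundary.
RANK = {"B0": 0, "B1": 1, "B2-1": 2, "B2-2": 2, "B2-3": 2, "B3": 3, "B4": 4}

# Lowest break rank that closes a constituent of each level.
LEVEL_MIN_RANK = {"PW": 2, "PPh": 3, "BG/PG": 4}

def segment(syllables, breaks, level):
    """Group syllables into constituents of the given level.

    breaks[i] is the break tag at the juncture after syllables[i].
    """
    if len(syllables) != len(breaks):
        raise ValueError("need exactly one break tag per syllable")
    min_rank = LEVEL_MIN_RANK[level]
    groups, current = [], []
    for syllable, tag in zip(syllables, breaks):
        current.append(syllable)
        if RANK[tag] >= min_rank:   # this juncture closes the constituent
            groups.append(current)
            current = []
    if current:                     # trailing syllables without a closing break
        groups.append(current)
    return groups

# Illustrative tagging: B2-2 after 份 and B3 after 日; the B1 tags are assumed.
syllables = list("十月份一到二十日")
breaks = ["B1", "B1", "B2-2", "B1", "B1", "B1", "B1", "B3"]
prosodic_words = segment(syllables, breaks, "PW")
prosodic_phrases = segment(syllables, breaks, "PPh")
```

With this tagging the eight syllables form two prosodic words inside a single prosodic phrase.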
14. 14
Examples of English Prosodic Structure (ToBI)
• BU Radio f1ajrlp4.sph
• Hennessy * is the S.J.C.'s | thirty-second * chief justice. / Holding the court system * on the course * he has set / and plotting | its future agenda / won't be an easy job / for his successor. /
17. 17
Functions of Prosody
• Grammar
– It is believed that prosody assists listeners in parsing continuous speech and in the recognition of
words, providing cues to syntactic structure, grammatical boundaries and sentence type.
• Why did you hit Joe? Why did you hit PAUSE Joe?
• Focus
– Intonation and stress work together to highlight important words or syllables for contrast and focus.
• Discourse
– Prosody plays a role in the regulation of conversational interaction and in signaling discourse structure.
• Emotion
– Prosody is also important in signaling emotions and attitudes.
• In short: prosody is the communication protocol between people!
18. 18
Issues and Applications
• Issues concerned in prosody modeling
– Labeling of important prosodic cues
– Construction of prosody hierarchy
– Modeling of syntax-prosody relationship
– Prediction of prosodic phrase boundary (break) from text, etc.
• Applications
– Automatic Speech Recognition (ASR)
• Important prosodic cues can be explored from the input utterance to assist in both
acoustic and linguistic decoding
– Text-to-Speech (TTS)
• A good prosody model can be used to generate appropriate prosodic features from
the input text
19. 19
The Role of Prosody in Spoken Language Processing
[Diagram labels: Human, Computer; Input, Output; Generation, Understanding; Recognition (speech, text); Synthesis (speech, text); Meaning]
20. 20
Prosody Modeling
• y = f(x) prosody generation for TTS
– x: input information
– y: prosodic-acoustic features (pitch, duration, energy, pause)
21. 21
Prosody Modeling
• y = f(x) recognition of information carried by prosody
– x: prosodic-acoustic features (pitch, duration, energy, pause)
– y: information carried by prosody, including
22. 22
Direct or Indirect Prosody Modeling
Linguistic/para-linguistic/non-linguistic features
Prosodic-acoustic features
Engineers use pattern recognition tools to establish the relationship between the two (no extensive linguistic knowledge required)
23. 23
Direct or Indirect Prosody Modeling
Linguistic/para-linguistic/non-linguistic features
Prosodic-acoustic features
Prosody tags (abstract representation of prosody)
Linguists can interpret their physical and linguistic meaning; the tags can generalize more broadly to all languages (nature?)
24. 24
Influential Factors on Prosody
Fujisaki, H., “Information, prosody, and modeling – with emphasis on tonal
features of speech,” Proc. Speech Prosody 2004, Nara, Japan, pp. 1-10, 2004.
25. 25
Conventional Schemes
Fig. 1. Prosody modeling via intermediate abstract phonological categories
Fig. 2. Direct modeling of target classes
[Diagram labels: speech corpus; feature extraction; prosodic-acoustic features; target class (lexical tone, word boundary, etc.); training of pattern classifier; parameters of pattern classifiers (GMM, DT, NN, ME, etc.)]
26. 26
Proposed Scheme – Unsupervised Prosody Labeling and Modeling (PLM)
Basic Idea
– Prosody modeling and labeling are conducted jointly using an unlabeled speech database.
– The observed features are modeled properly, and the modeled features then determine the prosodic tags objectively, rather than relying on human perception.
Design of the Hierarchical Prosodic Model
1. Representation of the prosody hierarchy by break types and prosodic states
2. Realizing patterns of prosodic constituents
– Prosodic state model
– Syllable prosodic-acoustic model
3. Exploring the relationship between prosodic tags or boundary types and the acoustic features surrounding junctures
– Syllable juncture prosodic-acoustic model
4. Relationship between prosodic structure and syntactic structure
– Break-syntax model
[Diagram label: prosody-labeled database]
27. 27
Prosody Hierarchical Structure and Prosody Tags
• Break types of syllable junctures
– demarcate prosodic constituents, i.e. syllable (SYL), prosodic word (PW),
prosodic phrase (PPh), breath group (BG) and prosodic phrase group
(PG).
• Prosodic states of syllables
– represent syllable pitch contour, duration and energy level variations
resulting from high-level prosodic constituents (>=PW).
– a substitution for the effects from high-level linguistic features, such as a
word, a phrase or a syntactic tree.
28. 28
A Modified Prosodic Structure for Mandarin
B4: Boundary of a Breath Group (BG)/Prosodic Phrase Group (PG)
B3: Boundary of a Prosodic Phrase (PPh)
B2: Prosodic Word (PW) boundary
B2-1: pitch reset
B2-2: short pause
B2-3: duration lengthening
B1: Normal syllabic boundary
B0: Tightly coupled syllabic boundary
C.-Y. Tseng, S.-H. Pin, Y.-L. Lee, H.-M. Wang, and Y.-C. Chen, “Fluent speech prosody: Framework and modeling,” Speech Commun., special issue on quantitative prosody modeling for natural speech description and generation, 46, 284–309 (2005).
29. 29
Unsupervised Joint Prosody Labeling and Modeling by Hierarchical Prosodic Model
Chen-Yu Chiang, Sin-Horng Chen, Hsiu-Min Yu, and Yih-Ru Wang, “Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech,” J. Acoust. Soc. Am., vol. 125, no. 2, pp. 1164-1183, Feb. 2009.
Chen-Yu Chiang, Sin-Horng Chen, and Yih-Ru Wang, “Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and Its Application to Prosody Generation for TTS,” in Proc. Interspeech 2009, Brighton, UK, Sept. 2009, pp. 504-507.
30. 30
Features and Parameters Used in the Hierarchical Prosodic Model
• T: prosodic tag
– B: break type = {B0, B1, B2-1, B2-2, B2-3, B3, B4}
– PS: prosodic state (p: pitch prosodic state; q: duration prosodic state; r: energy prosodic state)
• A: prosodic feature
– X: syllable prosodic feature (sp: syllable pitch contour; sd: syllable duration; se: syllable energy level)
– Y: inter-syllabic prosodic feature (pd: pause duration; ed: energy-dip level)
– Z: differential prosodic features (pj: normalized pitch jump; dl: normalized duration lengthening factor 1; df: normalized duration lengthening factor 2)
• L: linguistic feature
– l: reduced linguistic feature set
– t: syllable tone sequence
– s: base-syllable type sequence
– f: final type sequence
– u: utterance sequence
31. 31
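Parameterization of Syllable Pitch Contour (in logHz)
• Discrete orthogonal polynomial
– Basis Functions (Discrete Legendre Polynomials):
\[
\phi_0\!\left(\tfrac{i}{M}\right) = 1
\]
\[
\phi_1\!\left(\tfrac{i}{M}\right) = \left[\frac{12M}{M+2}\right]^{1/2} \left[\frac{i}{M} - \frac{1}{2}\right]
\]
\[
\phi_2\!\left(\tfrac{i}{M}\right) = \left[\frac{180M^{3}}{(M-1)(M+2)(M+3)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^{2} - \frac{i}{M} + \frac{M-1}{6M}\right]
\]
\[
\phi_3\!\left(\tfrac{i}{M}\right) = \left[\frac{2800M^{5}}{(M-1)(M-2)(M+2)(M+3)(M+4)}\right]^{1/2} \left[\left(\frac{i}{M}\right)^{3} - \frac{3}{2}\left(\frac{i}{M}\right)^{2} + \frac{6M^{2}-3M+2}{10M^{2}}\,\frac{i}{M} - \frac{(M-1)(M-2)}{20M^{2}}\right]
\]
for $0 \le i \le M$ and $M \ge 3$.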
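The use of these basis functions can be sketched in code: a log-F0 contour sampled at M+1 points is projected onto the four discrete Legendre polynomials to give four coefficients, and the coefficients rebuild a smoothed contour. The closed forms used are the standard discrete Legendre basis for degrees 0 to 3 with 0 ≤ i ≤ M and M ≥ 3. The function names and the 1/(M+1) inner-product normalization are assumptions of this sketch, and the example contour is synthetic.

```python
import numpy as np

def legendre_basis(M):
    """Return a (4, M+1) array with phi_0..phi_3 sampled at i/M, i = 0..M."""
    if M < 3:
        raise ValueError("M must be at least 3")
    x = np.arange(M + 1) / M
    phi0 = np.ones(M + 1)
    phi1 = np.sqrt(12 * M / (M + 2)) * (x - 0.5)
    phi2 = np.sqrt(180 * M**3 / ((M - 1) * (M + 2) * (M + 3))) * (
        x**2 - x + (M - 1) / (6 * M))
    phi3 = np.sqrt(2800 * M**5 / ((M - 1) * (M - 2) * (M + 2) * (M + 3) * (M + 4))) * (
        x**3 - 1.5 * x**2
        + (6 * M**2 - 3 * M + 2) / (10 * M**2) * x
        - (M - 1) * (M - 2) / (20 * M**2))
    return np.vstack([phi0, phi1, phi2, phi3])

def encode_contour(logf0):
    """Project M+1 log-F0 samples onto the basis; returns four coefficients."""
    logf0 = np.asarray(logf0, dtype=float)
    M = len(logf0) - 1
    return legendre_basis(M) @ logf0 / (M + 1)

def decode_contour(coeffs, M):
    """Rebuild an (M+1)-point contour from four coefficients."""
    return np.asarray(coeffs) @ legendre_basis(M)

# Synthetic falling contour (values are illustrative only).
M = 10
x = np.arange(M + 1) / M
contour = 5.0 + 0.2 * x - 0.3 * x**2
coeffs = encode_contour(contour)
rebuilt = decode_contour(coeffs, M)
```

Under this normalization the first coefficient is the mean log-F0 of the syllable, while the remaining three describe slope, curvature and a cubic shape term; any contour that is itself a polynomial of degree three or less is rebuilt exactly.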
32. 32
The Design of the Four Models
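Syllable prosodic features X; inter-syllable prosodic feature Y; differential prosodic features Z
Reduced linguistic feature set l; tone t, syllable type s, final f
Prosodic state PS; break type B
\[
P(\mathbf{T}, \mathbf{A} \mid \mathbf{L}) = P(\mathbf{A} \mid \mathbf{T}, \mathbf{L})\, P(\mathbf{T} \mid \mathbf{L}) = P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L})\, P(\mathbf{B}, \mathbf{PS} \mid \mathbf{L})
\]
General prosodic feature model, split into the syllable prosodic-acoustic model and the syllable juncture prosodic-acoustic model:
\[
P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L}) \approx P(\mathbf{X} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L})\, P(\mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{L})
\]
General prosody-syntax model, split into the prosodic state model and the break-syntax model:
\[
P(\mathbf{B}, \mathbf{PS} \mid \mathbf{L}) \approx P(\mathbf{PS} \mid \mathbf{B})\, P(\mathbf{B} \mid \mathbf{L})
\]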
34. 34
Syllable Pitch Contour Model (2/3)
Figure 2.5: The (a) forward and (b) onset coarticulation patterns, $\beta^{f}_{B,tp}$ and $\beta^{f}_{B_b,t}$. Here tp = (i, j) and t = i or j.
[Figure labels: Tone 1 + Tone 3; high-low mismatch compensation; B0, B1, B4]
35. 35
Syllable Pitch Contour Model (3/3)
Figure 2.5: The (c) backward and (d) offset coarticulation patterns, $\beta^{b}_{B,tp}$ and $\beta^{b}_{B_e,t}$. Here tp = (i, j) and t = i or j.
[Figure labels: Tone 3 + Tone 3; a tone sandhi example]
36. 36
Patterns of Prosodic Constituents
Figure 3.13: The log-F0 patterns of BG/PG, PPh and PW.
[Three panels: PG/BG, PPh, PW. x-axis: length in syllables (0 to 30); y-axis: log-F0 (-0.15 to 0.15). Symbols shown with the figure: $pm_n$, $pm_n^{r}$, $\beta_{PW_n}$, $\beta_{PPh_n}$, $\beta_{BG_n/PG_n}$]
37. 37
Patterns of Prosodic Constituents
Figure 3.14: The syllable duration patterns of BG/PG, PPh and PW.
[Three panels: PG/BG, PPh, PW. x-axis: length in syllables (0 to 30); y-axis: seconds (-0.02 to 0.06). Symbols shown with the figure: $dm_n$, $dm_n^{r}$, with subscripts PW, PPh, BG/PG]
38. 38
Patterns of Prosodic Constituents
Figure 3.15: The energy level patterns of BG/PG, PPh and PW.
[Three panels: PG/BG, PPh, PW. x-axis: length in syllables (0 to 30); y-axis: dB (-5 to 5). Symbols shown with the figure: $em_n$, $em_n^{r}$, with subscripts PW, PPh, BG/PG]
39. 39
Comparison between Human Labeling and Machine Labeling (1/2)
Human Labeling Tags:
b1: non-break
b2: prosodic word boundary
b3: minor break
b4: major break
Machine Labeling Tags:
B4: Boundary of a Breath Group (BG)/Prosodic Phrase Group (PG)
B3: Boundary of a Prosodic Phrase (PPh)
B2: Prosodic Word (PW) boundary
B2-1: pitch reset
B2-2: short pause
B2-3: duration lengthening
B1: Normal syllabic boundary
B0: Tightly coupled syllabic boundary
41. 41
Application to ASR (read speech)
Sin-Horng Chen, Jyh-Her Yang, Chen-Yu Chiang, Ming-Chieh Liu and Yih-Ru Wang, “A New Prosody-Assisted Mandarin ASR System,” IEEE Trans. on Audio, Speech and Language Processing, vol. 20, no. 6, pp. 1669-1684, Aug. 2012.
43. 43
Experimental Settings
• Database for the ASR experiments
– TCC300: a large Mandarin read speech database
– Training: 274 speakers, 23 hours for acoustic model and prosodic model
– Test: 19 speakers, 2 hours
• Acoustic model
– 411 Syllable HMM (8 states) + silence model + short pause model
– MMI training
– Trained from TCC300 training set (274 speakers, 23 hours)
• Factored LM
– NTCIR + Sinica + Panorama, about 1.2 billion words
– 60000-word lexicon
• Prosodic model
– Trained from the subset of TCC300 training set (164 speakers, 8.3 hours)
44. 44
Experimental Results
Recognition Performances of the Baseline Scheme, Scheme 1, and Scheme 2 (%)
                          WER     CER     SER
Baseline scheme           24.4    18.1    12.0
Break                     21.3    15.0    10.2
Break + Prosodic state    20.7    14.4     9.6

EXPERIMENTAL RESULTS OF POS DECODING (%)
                          Precision   Recall   F-measure
Baseline scheme           93.4        76.4     84.0
Break + Prosodic state    93.4        80.0     86.2

EXPERIMENTAL RESULTS OF PM DECODING (%)
                          Precision   Recall   F-measure
Baseline scheme           55.2        37.8     44.8
Break + Prosodic state    61.2        53.0     56.8
45. 45
TABLE VIII. EXPERIMENTAL RESULTS OF TONE DECODING (%)
                          Precision   Recall   F-measure
Baseline scheme           87.9        87.5     87.7
Break + Prosodic state    91.9        91.6     91.7
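The reported scores can be sanity-checked with a few lines of code. The sketch below recomputes each F-measure as the harmonic mean of precision and recall, and computes the relative error-rate reduction of the "Break + Prosodic state" scheme over the baseline. The helper names and dictionary layout are assumptions of this example; the numbers are copied from the tables above.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    return 2.0 * precision * recall / (precision + recall)

def relative_reduction(baseline, proposed):
    """Relative reduction of an error rate, in percent of the baseline."""
    return 100.0 * (baseline - proposed) / baseline

# (precision, recall, reported F-measure) from the POS, PM and tone tables.
REPORTED = {
    "POS baseline": (93.4, 76.4, 84.0),
    "POS proposed": (93.4, 80.0, 86.2),
    "PM baseline": (55.2, 37.8, 44.8),
    "PM proposed": (61.2, 53.0, 56.8),
    "Tone baseline": (87.9, 87.5, 87.7),
    "Tone proposed": (91.9, 91.6, 91.7),
}

recomputed = {name: f_measure(p, r) for name, (p, r, _) in REPORTED.items()}

# Baseline versus "Break + Prosodic state" error rates from the first table.
ERROR_RATES = {"WER": (24.4, 20.7), "CER": (18.1, 14.4), "SER": (12.0, 9.6)}
reductions = {k: relative_reduction(b, p) for k, (b, p) in ERROR_RATES.items()}
```

The recomputed F-measures agree with the reported ones to within rounding (under 0.1 point), and the relative WER reduction comes out at roughly 15%.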
An example of recognition results for a partial paragraph. Eight panels represent, respectively, waveform, prosodic state AP+global
mean of syllable log-F0 level, syllable duration, and syllable energy level, break type (B), reference transcription (R), result of
baseline scheme (F) and proposed system (P).
46. 46
Application to ASR (spontaneous speech)
Cheng-Hsien Lin, Meng-Chian Wu, Chung-Long You, Chen-Yu Chiang, Yih-Ru Wang, Sin-Horng Chen, “Prosody Modeling of Spontaneous Mandarin Speech and Its Application to Automatic Speech Recognition,” Speech Prosody 2016, accepted.
47. 47
Experimental Settings
• Database for the ASR experiments
– MCDC: 8 hours of dialogues from 16 speakers; texts with PU tags were transcribed and annotated by expert linguists
• Acoustic Model
– Seed tri-phone HMM models are trained from TCC300 and adapted using 80% of MCDC.
– CI models for PU: 6 particles (HO, EI, HAN, HEN, HEIN, and MHM) + 2 fillers (unrecognized/foreign speech)
– CI models for paralinguistic events: BREATHE, CLEAR_THROAT, LAUGH, NOISE, SMACK, and SWALLOW
• Factored LM
– A corpus of about 440 million words merged from 5 corpora; words/POS are tagged by an in-house CRF tagger
– Adapted using 90% of the MCDC corpus
– 60000-word lexicon, including all particles and markers, selected by word frequency
49. 49
Speaking Rate Dependent Hierarchical Prosodic Model
Sin-Horng Chen, Chiao-Hua Hsieh, Chen-Yu Chiang, Hsi-Chun Hsiao, Yih-Ru Wang, Yuan-Fu Liao and Hsiu-Min Yu, “Modeling of Speaking Rate Influences on Mandarin Speech Prosody and Its Application to Speaking Rate-controlled TTS,” IEEE Trans. on Audio, Speech and Language Processing, vol. 22, no. 7, pp. 1158-1171, July 2014.
50. 50
Introduction
• Speaking rate is a prosodic feature that influences many
phenomena such as
– Syllable duration
– Pitch contour
– Pause duration
– Occurrence frequency of pause
• Modeling the effects of speaking rate is an important research
issue in
– Automatic speech recognition (ASR)
– Text-to-speech system (TTS)
51. 51
• Objective
– Modeling the influence of speaking rate on speech prosody based on the
PLM method
• The proposed approach
– We take speaking rate as a continuous variable and construct a single HPM using the same four corpora
– In this study, the speaking rate (SR) of each utterance is defined as its average duration per syllable uttered, disregarding all inter-syllable pauses
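The SR definition above can be written out directly. In this sketch an utterance is a list of (label, duration) segments, where the labels "syl" and "pause" and the durations are hypothetical; pauses are dropped before averaging, and the reciprocal gives the rate in syllables per second.

```python
def speaking_rate(segments):
    """Return (seconds per syllable, syllables per second) for one utterance.

    segments is a list of (label, duration_in_seconds) pairs; only segments
    labeled "syl" are used, so inter-syllable pauses are disregarded.
    """
    durations = [dur for label, dur in segments if label == "syl"]
    if not durations:
        raise ValueError("utterance contains no syllables")
    mean_duration = sum(durations) / len(durations)
    return mean_duration, 1.0 / mean_duration

# Hypothetical utterance: four syllables and two pauses.
utterance = [
    ("syl", 0.18), ("pause", 0.30), ("syl", 0.22),
    ("syl", 0.20), ("pause", 0.05), ("syl", 0.20),
]
sec_per_syl, syl_per_sec = speaking_rate(utterance)
```

Here the four syllables average 0.20 s each, i.e. about 5 syllables per second, regardless of how long the pauses are.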
52. 52
Experimental Database
• SR-Treebank database:
– Read speech
– The corpus contains four parallel speech datasets uttered by a female professional announcer at fast, normal, median and slow speaking rates.
– All utterances are short paragraphs. There are in total 1478 utterances consisting of 203,746 syllables.
53. 53
Break Labeling Examples for Four Parallel Utterances with Various SR
Note: only pause-related break types, i.e. B4 (@), B3 (/) and B2-2 (*), are displayed
Fast SR:
依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口及進口金額 / 比起去年同期 * 均有增加 @,
Normal SR:
依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,
Median SR:
依據 * 行政院主計處的統計 @,十月份 / 一到 * 二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,
Slow SR:
依據 / 行政院 * 主計處的統計 @,十月份 / 一 * 到 * 二十日 @,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,
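The trend in these four labelings can be made explicit by counting the break symbols. The sketch below maps @, / and * to B4, B3 and B2-2 and tallies them per utterance; the strings are the four examples above with wrapped lines joined, and the names `SYMBOL_TO_BREAK`, `UTTERANCES` and `count_breaks` are assumptions of this example.

```python
from collections import Counter

SYMBOL_TO_BREAK = {"@": "B4", "/": "B3", "*": "B2-2"}

UTTERANCES = {
    "fast": "依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口及進口金額 / 比起去年同期 * 均有增加 @,",
    "normal": "依據行政院主計處的統計 @,十月份 * 一到二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,",
    "median": "依據 * 行政院主計處的統計 @,十月份 / 一到 * 二十日 /,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,",
    "slow": "依據 / 行政院 * 主計處的統計 @,十月份 / 一 * 到 * 二十日 @,我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @,",
}

def count_breaks(text):
    """Count the pause-related break symbols in one labeled utterance."""
    counts = Counter(SYMBOL_TO_BREAK[ch] for ch in text if ch in SYMBOL_TO_BREAK)
    return {tag: counts[tag] for tag in ("B4", "B3", "B2-2")}

break_counts = {rate: count_breaks(text) for rate, text in UTTERANCES.items()}
totals = {rate: sum(c.values()) for rate, c in break_counts.items()}
```

The total number of pause-related breaks grows from 6 at the fast rate to 7, 9 and 11 at the normal, median and slow rates, which matches the observation that slower speech inserts more pauses.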
55. 55
Cross-Dialect and -Speaker Adaptation of SR-HPM
Chen-Yu Chiang, “A Study on Adaptation of Speaking Rate-Dependent Hierarchical Prosodic Model for Chinese Dialect TTS,” in Proc. O-COCOSDA 2015, Shanghai, China, Oct. 2015. (Best Paper Award)
Chen-Yu Chiang, Hsiu-Min Yu, Sin-Horng Chen, “On Cross-Dialect and -Speaker Adaptation of Speaking Rate-Dependent Hierarchical Prosodic Model for a Hakka Text-to-Speech System,” Speech Prosody 2016, accepted.
I-Bin Liao, Chen-Yu Chiang, Sin-Horng Chen, “Structural Maximum a Posteriori Speaker Adaptation of Speaking Rate-Dependent Hierarchical Prosodic Model for,” accepted by ICASSP 2016
56. 56
Experimental Databases
• Mandarin (for background model)
– 1,478 utterances with 183,795 syllables
– A wider SR range of 3.4-6.8 syl/sec
• Min (for adaptation):
– 21,143 syllables for adaptation and the test set of 2,488 syllables
– Speaking rate: 4.5-6.8 syl/sec
• Hakka (for adaptation):
– 15,009 syllables for adaptation and test set of 3,711 syllables
– Speaking rate: 3.8-5.1 syl/sec
59. 59
Application to Prosody Coding
Chen-Yu Chiang, Jyh-Her Yang, Ming-Chieh Liu, Yih-Ru Wang, Yuan-Fu Liao and Sin-Horng Chen, “A New Model-based Mandarin-speech Coding System,” in Proc. Interspeech 2011, Florence, Italy, Aug. 2011, pp. 2561-2564.
Chen-Yu Chiang, Yu-Ping Hung, Sin-Horng Chen, and Yih-Ru Wang, “A New Model-Based Prosody Coder for Mandarin Speech,” in Proc. of IIHMSP 2013, Beijing, China, Oct. 2013, pp. 60-63.
61. 61
Experimental Database
• Treebank Corpus
– Read speech
– 425 utterances with 56,237 syllables uttered by a female professional announcer.
– Average syllable duration = 0.19 sec
– Associated texts: short paragraphs composed of several sentences selected from the Sinica Treebank Version 3.0.
– Training set: 379 utterances with 52,192 syllables.
– Test set: 46 utterances with 4,801 syllables.
65. 65
Ongoing Tasks and Future Works
• Transfer Learning:
– Take the SR-HPM for Mandarin as a base model (prior) to construct
prosodic models for English
• Voice Bank
– Prosody bank: modeling prosodies of various speakers, emotions,
styles…
– Voice font bank: modeling spectra of various speakers, emotions, styles…
66. 66
Acknowledgements
• We would like to thank
– Academia Sinica, Taiwan for providing the Tree-Bank
text corpus
– Dr. Chiu-yu TSENG (鄭秋豫博士) of Academia Sinica, Taiwan for providing the Sinica COSPRO Corpus and the on-line word segmentation system
– Dr. Shu-Chuan TSENG (曾淑娟博士) of Academia
Sinica, Taiwan for providing the Mandarin
Conversational Dialogue Corpus (MCDC)
– Prof. Ho-Hsien PAN (潘荷仙教授) of Phonetics
Laboratory, Department of Foreign Languages and
Literatures of National Chiao Tung University, Taiwan
for her generous and helpful assistance in manually
labeling our experimental