江振宇/It's Not What You Say: It's How You Say It!

1
Chen-Yu CHIANG (江振宇)
Department of Communication Engineering, National Taipei University
Speech and Multimedia Signal Processing Laboratory
It isn’t what you said;
it’s how you said it!
2016/7/17

2
An Example - Talking Twin Babies - PART 2 -
OFFICIAL VIDEO
https://www.youtube.com/watch?v=_JmA2ClUvUY

3
It isn’t what you said; it’s how you said it!
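• This saying is really about prosody (韻律)
• In short: the falls and rises, pauses and turns, light and heavy stress, and slow and fast pacing of speech (抑、揚、頓、挫、輕、重、緩、急)
Extreme examples: fast and fluent vs. slow and halting
Read speech
Different speaking styles
Spontaneous speech (usually dialogue)
Human-to-machine speech
Reading the above sentences in a read-aloud style (sounds odd)
Saying the above sentences naturally (much more natural, and it actually communicates!)
嘿!芭樂!請你過來看一看。 ("Hey! Guava! Please come over and take a look." Is "Guava" a person?)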

4
It isn’t what you said; it’s how you said it!
• Thomas Sheridan (Irish stage actor) pointed
out the importance of prosody more than 200
years ago:
– Children are taught to read sentences, which they
do not understand; and as it is impossible to lay
the emphasis right, without perfectly
comprehending the meaning of what one reads,
they get a habit either of reading in a monotone,
or if they attempt to distinguish one word from the
rest, as the emphasis falls at random, the sense
is usually perverted, or changed into nonsense.

5
Physical, Quantitative Measurement of Prosody
• Prosody can be measured by the following prosodic-acoustic features (韻律聲學參數)
– Fundamental frequency (基頻), also called F0
– Duration (時長)
– Intensity or energy (能量)
– Pause or silence (靜音)
• Prosodic-acoustic features can be measured over the following units
– utterance (語句), discourse (語段), sentence (句子), clause (子句), phrase (片語), word (詞), syllable (音節), initial/final (聲母/韻母), phoneme (音素)
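As a rough illustration of two of the features listed above (energy and pause), the following sketch computes short-time energy and locates silent stretches in a synthetic signal. The frame sizes, the -40 dB threshold, the 100 ms minimum pause length and the function names are all assumptions chosen for the example, not values from the talk.

```python
import numpy as np

def frame_energy_db(signal, sr, frame_ms=25.0, hop_ms=10.0):
    """Short-time energy (dB) of each analysis frame."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame) // hop)
    energies = np.empty(n_frames)
    for k in range(n_frames):
        seg = signal[k * hop : k * hop + frame]
        # Small constant avoids log(0) on digital silence.
        energies[k] = 10.0 * np.log10(np.mean(seg ** 2) + 1e-12)
    return energies

def find_pauses(energy_db, hop_ms=10.0, threshold_db=-40.0, min_pause_ms=100.0):
    """Return (start_s, end_s) spans whose energy stays below the threshold."""
    pauses, start = [], None
    # A trailing False closes a pause that runs to the end of the signal.
    for k, silent in enumerate(list(energy_db < threshold_db) + [False]):
        if silent and start is None:
            start = k
        elif not silent and start is not None:
            if (k - start) * hop_ms >= min_pause_ms:
                pauses.append((start * hop_ms / 1000, k * hop_ms / 1000))
            start = None
    return pauses

# Synthetic "utterance": 0.5 s of a 200 Hz tone, 0.3 s of silence, 0.5 s of tone.
sr = 16000
t = np.arange(int(1.3 * sr)) / sr
x = 0.5 * np.sin(2 * np.pi * 200.0 * t)
x[int(0.5 * sr):int(0.8 * sr)] = 0.0

energy = frame_energy_db(x, sr)
pauses = find_pauses(energy)
```

Because frames overlap the edges of the silence, the detected pause is slightly shorter than the true 0.3 s gap; real pause detection would also need to cope with background noise.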

6
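Example of Prosodic-Acoustic Feature Measurement
科學家愈來愈相信在生化學上而言眼睛與胃之間必定有密切的關聯
("Scientists increasingly believe that, biochemically speaking, there must be a close link between the eyes and the stomach.")
Waveform
Spectrogram
F0
Energy
Time
Silence
Syllable segmentation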

7
Important Characteristics of Mandarin Chinese (1/2)
• A tonal language (Four lexical tones (聲調), one neutral tone)
• The tonality of a monosyllable is mainly characterized by the shape of its
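fundamental frequency (F0) contour. Tone notation proposed by Yuen Ren Chao (趙元任)
• To disambiguate word meanings: 媽/麻/馬/罵/嘛 (mā, má, mǎ, mà, ma), 買/賣 (mǎi "buy", mài "sell"), 主投/豬頭 (zhǔtóu "starting pitcher", zhūtóu "pig head")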
Original speech signal
Synthesis without tone

8
Important Characteristics of Mandarin Chinese (2/2)
• A syllable-based language, where each syllable carries a lexical tone (聲調).
– 411 base syllables combined with the tones → 1,300 distinct tonal syllables.
• A syllable-timed language
– syllables take approximately equal amounts of time to pronounce.
• Syllable structure of Chinese
– Initial (聲母) + Final (韻母)
– Initial = consonant; Final = [medial] + nucleus + [coda]
• English - a stress-timed language, where there is approximately the same amount of time
between stressed syllables.
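The initial + final structure above can be sketched with a small splitter for tone-numbered Pinyin. This is a simplified illustration: the Pinyin input format, the treatment of `y`/`w` spellings as part of the final, and the function name are assumptions made for the example. It also shows why the tonal inventory is far below the naive upper bound: 411 base syllables times 5 tones would give 2,055 combinations, but only about 1,300 are listed as distinct tonal syllables.

```python
# Two-letter initials come first so that "zh" is not matched as "z".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

def split_syllable(pinyin_with_tone):
    """Split e.g. 'zhuang1' into (initial, final, tone)."""
    tone = int(pinyin_with_tone[-1])      # trailing digit: 1-4 lexical, 5 neutral
    base = pinyin_with_tone[:-1]
    for ini in INITIALS:
        if base.startswith(ini):
            return ini, base[len(ini):], tone
    return "", base, tone                 # zero initial, e.g. 'ai4'

# Naive upper bound if every base syllable took every tone.
MAX_TONAL = 411 * 5

examples = [split_syllable(s) for s in ["zhuang1", "ma3", "ai4"]]
```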
Native English Speaker Non-Native English Speaker

9
Tone and Intonation
• Ripples on the waves (Yuen Ren Chao, 趙元任) or superposition
– Synthesis with tone+intonation
– Synthesis without intonation
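The "ripples on waves" picture can be sketched as an additive model in the log-F0 domain: a slow phrase-level trend (the wave) plus a local tone shape per syllable (the ripple). The tone shapes below use the standard Chao tone-letter values for the four Mandarin lexical tones (55, 35, 214, 51); the base pitch, declination slope, tone scale and function names are assumptions for illustration only.

```python
import numpy as np

# Chao tone letters for Mandarin tones 1-4 on a 1 (low) to 5 (high) scale.
CHAO = {1: (5, 5), 2: (3, 5), 3: (2, 1, 4), 4: (5, 1)}

def tone_contour(tone, n_points=10):
    """Interpolate tone-letter targets into a local contour centered on level 3."""
    targets = np.array(CHAO[tone], dtype=float) - 3.0
    x = np.linspace(0.0, 1.0, len(targets))
    return np.interp(np.linspace(0.0, 1.0, n_points), x, targets)

def synthesize_logf0(tones, base=np.log(200.0), declination=-0.02,
                     tone_scale=0.08, n_points=10):
    """Superimpose per-syllable tone ripples on a declining phrase-level wave."""
    contours = []
    for k, tone in enumerate(tones):
        wave = base + declination * k                       # intonation component
        ripple = tone_scale * tone_contour(tone, n_points)  # lexical tone component
        contours.append(wave + ripple)
    return np.concatenate(contours)

with_both = synthesize_logf0([1, 4, 3, 2])
without_intonation = synthesize_logf0([1, 1], declination=0.0)
without_tone = synthesize_logf0([1, 4, 3, 2], tone_scale=0.0)
```

Setting `declination=0.0` or `tone_scale=0.0` mimics, in a very crude way, the "without intonation" and "without tone" synthesis demos mentioned on the slides.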

10
Prosody Hierarchy for Mandarin (Tseng, 2005)
Chiu-yu Tseng et al., “Fluent speech prosody: framework and modeling,” Speech
Communication, vol. 46, issues 3-4, Special Issue on Quantitative Prosody Modeling for
Natural Speech Description and Generation, pp. 284-309, July 2005.
Prosodic Phrase Group
Breath Group
Prosodic Phrase
Prosodic Word
Syllable

11
A Modified Prosodic Structure for Mandarin
B4: Boundary of a Breath Group (BG)/Prosodic Phrase Group (PG)
B3: Boundary of a Prosodic Phrase (PPh)
B2: Prosodic Word (PW) boundary
B2-1: pitch reset
B2-2: short pause
B2-3: duration lengthening
B1: Normal syllabic boundary
B0: Tightly coupled syllabic boundary

12
Prosody and Syntax
Sentence with part-of-speech tags:
科學家/Na 愈來愈/Dfa 相信/VK 在/P 生化學/Na 上/Ng 而言/Ng 眼睛/Na 與/Caa 胃/Na 之間/Ng 必定/D 有/V_2 密切/VH 的/DE 關聯/Na 。
Syntactic structure: phrase nodes PP, GP, NP, V-的 and S over the tagged words
Prosodic structure: 11 PW, grouped into 4 PPh, forming 1 PG/BG
Audio: original speech signal

13
Arbitrary Prosody
Sentence with part-of-speech tags:
科學家/Na 愈來愈/Dfa 相信/VK 在/P 生化學/Na 上/Ng 而言/Ng 眼睛/Na 與/Caa 胃/Na 之間/Ng 必定/D 有/V_2 密切/VH 的/DE 關聯/Na 。
Syntactic structure: phrase nodes PP, GP, NP, V-的 and S over the tagged words
Prosodic structure: 10 PW, grouped into 2 PPh, forming 1 PG/BG
Audio: poor-prosody speech signal

14
Examples of English Prosodic Structure (ToBI)
• BU Radio f1ajrlp4.sph
• Hennessy * is the S.J.C.'s | thirty-second * chief justice. / Holding the court
system * on the course * he has set / and plotting | its future agenda / won't
be an easy job / for his successor. /

15
Prosody Labeling Example

16
Labeling Example
• 謝謝 B2-1 主持人 B2-2 今天的 B2-3 監察人 B3 林委員 B2-2 我們
B2-3 主持人 B2-3 陳委員 B3 還有 B2-3 朱 B2-1 主席 B2-3 宋主席
B2-2 各位 B2-2 在場的 B2-2 朋友 B2-2 還有我們 B2-2 電視機
B2-1 前面 B3 的國人 B2-3 同胞 B4 父老兄弟 B2-3 姐妹 B2-3 朋友
B2-1 們 B2-2 大家 B2-1 晚安 B2-2 大家好 Be
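A labeled string like the one above can be turned into prosodic phrases by cutting at the higher-level breaks. The sketch below is an assumption-laden illustration: it treats B3, B4 and the final `Be` tag as phrase-level boundaries (reading `Be` as an utterance-end marker is a guess), and the function name is hypothetical.

```python
import re

# Break tags: B0..B4, optionally with a B2 subtype suffix, or the final "Be".
BREAK = re.compile(r"^B(?:[0-4](?:-[1-3])?|e)$")
PHRASE_LEVEL = {"B3", "B4", "Be"}   # assumed to close a prosodic phrase

def group_prosodic_phrases(labeled):
    """Group words into lists, cutting at phrase-level break tags."""
    phrases, current = [], []
    for token in labeled.split():
        if BREAK.match(token):
            if token in PHRASE_LEVEL and current:
                phrases.append(current)
                current = []
        else:
            current.append(token)
    if current:                      # trailing words without a closing break
        phrases.append(current)
    return phrases

sample = "謝謝 B2-1 主持人 B2-2 今天的 B2-3 監察人 B3 林委員 B2-2 我們 B2-3 主持人 B2-3 陳委員 B3"
phrases = group_prosodic_phrases(sample)
```

The lower-level B2-x tags are kept inside a phrase here; a fuller parser would also build the prosodic-word layer from them.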

17
Functions of Prosody
• Grammar
– It is believed that prosody assists listeners in parsing continuous speech and in the recognition of
words, providing cues to syntactic structure, grammatical boundaries and sentence type.
• Why did you hit Joe? Why did you hit PAUSE Joe?
• Focus
– Intonation and stress work together to highlight important words or syllables for contrast and focus.
• Discourse
– Prosody plays a role in the regulation of conversational interaction and in signaling discourse
structure.
• Emotion
– Prosody is also important in signaling emotions and attitudes.
• In short: prosody is the communication protocol of human-to-human interaction!

18
Issues and Applications
• Issues concerned in prosody modeling
– Labeling of important prosodic cues
– Construction of prosody hierarchy
– Modeling of syntax-prosody relationship
– Prediction of prosodic phrase boundary (break) from text, etc.
• Applications
– Automatic Speech Recognition (ASR)
• Important prosodic cues can be explored from the input utterance to assist in both
acoustic and linguistic decoding
– Text-to-Speech (TTS)
• A good prosody model can be used to generate appropriate prosodic features from
the input text

19
The Role of Prosody in Spoken Language Processing
Human
Computer
Input Output
Generation Understanding
Speech
Text
Recognition
Speech
Text
Synthesis
Meaning

20
Prosody Modeling
• y = f(x) → prosody generation for TTS
– x: input information
– y: prosodic-acoustic features (pitch, duration, energy, pause)

21
Prosody Modeling
• y = f(x) → recognition of information carried by prosody
– x: prosodic-acoustic features (pitch, duration, energy, pause)
– y: information carried by prosody, including

22
Direct or Indirect Prosody Modeling
Linguistic/para-linguistic/non-linguistic features
Prosodic-acoustic features
Engineers build the relationship between the two with pattern recognition tools (little linguistic knowledge required)

23
Direct or Indirect Prosody Modeling
Linguistic/para-linguistic/non-linguistic features
Prosodic-acoustic features
Prosody tags (abstract representation of prosody)
Linguists can interpret their physical and linguistic meaning, and the tags may generalize more broadly across all languages (nature?)

24
Influential Factors on Prosody
Fujisaki, H., “Information, prosody, and modeling – with emphasis on tonal
features of speech,” Proc. Speech Prosody 2004, Nara, Japan, pp. 1-10, 2004.

25
Conventional Schemes
Training of pattern classifier
Speech corpus
Feature extraction
Prosodic-acoustic features
Target class:
lexical tone, word
boundary, etc.
Parameters of pattern classifiers
(GMM, DT, NN, ME, etc.)
Fig.1. prosody modeling via intermediate
abstract phonological categories
Fig. 2. Direct modeling of target classes

26
Proposed Scheme – Unsupervised Prosody
Labeling and Modeling (PLM)
• Basic Idea
– Prosody modeling and labeling are jointly conducted
using an unlabeled speech database.
– To properly model the observed features and then let
the modeled features objectively determine the prosodic
tags by themselves rather than by human perception.
• Design of the Hierarchical Prosodic Model
1. Representation of the prosody hierarchy by break types and
prosodic states
2. Realizing patterns of prosodic constituents
– Prosodic state model
– Syllable prosodic-acoustic model
3. Exploring the relationship between prosodic tags or
boundary types and the acoustic features surrounding
junctures.
– Syllable juncture prosodic-acoustic model
4. Relationship between prosodic structure and syntactic
structure.
– Break-syntax model
Prosody-labeled
database

27
Prosody Hierarchical Structure and Prosody Tags
• Break types of syllable junctures
– demarcate prosodic constituents, i.e. syllable (SYL), prosodic word (PW),
prosodic phrase (PPh), breath group (BG) and prosodic phrase group
(PG).
• Prosodic states of syllables
– represent syllable pitch contour, duration and energy level variations
resulting from high-level prosodic constituents (>=PW).
– a substitution for the effects from high-level linguistic features, such as a
word, a phrase or a syntactic tree.

28
A Modified Prosodic Structure for Mandarin
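B4: Boundary of a Breath Group (BG)/Prosodic Phrase Group (PG)
B3: Boundary of a Prosodic Phrase (PPh)
B2: Prosodic Word (PW) boundary
B2-1: pitch reset
B2-2: short pause
B2-3: duration lengthening
B1: Normal syllabic boundary
B0: Tightly coupled syllabic boundary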
C.-Y. Tseng, S.-H. Pin, Y.-L. Lee, H.-M. Wang, and Y.-C.
Chen, “Fluent speech prosody: Framework and modeling,”
Speech Commun. special issue on quantitative prosody
modeling for natural speech description and generation, 46,
284–309 (2005).

29
Unsupervised Joint Prosody Labeling
and Modeling by Hierarchical
Prosodic Model
Chen-Yu Chiang, Sin-Horng Chen, Hsiu-Min Yu, and Yih-Ru Wang,
“Unsupervised Joint Prosody Labeling and Modeling for Mandarin
Speech,” J. Acoust. Soc. Am., vol. 125, No. 2, pp. 1164-1183, Feb, 2009.
Chen-Yu Chiang, Sin-Horng Chen and Yih-Ru Wang, “Advanced
Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech
and Its Application to Prosody Generation for TTS,” in Proc. Interspeech
2009, Brighton, UK, Sept. 2009, pp. 504-507.

30
Features and Parameters Used in the Hierarchical Prosodic
Model
T: prosodic tag
    B: break type = {B0, B1, B2-1, B2-2, B2-3, B3, B4}
    PS: prosodic state
        p: pitch prosodic state
        q: duration prosodic state
        r: energy prosodic state
A: prosodic feature
    X: syllable prosodic feature
        sp: syllable pitch contour
        sd: syllable duration
        se: syllable energy level
    Y: inter-syllabic prosodic feature
        pd: pause duration
        ed: energy-dip level
    Z: differential prosodic features
        pj: normalized pitch jump
        dl: normalized duration lengthening factor 1
        df: normalized duration lengthening factor 2
L: linguistic feature
    l: reduced linguistic feature set
    t: syllable tone sequence
    s: base-syllable type sequence
    f: final type sequence
    u: utterance sequence

31
Parameterization of Syllable Pitch Contour (in logHz)
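• Discrete orthogonal polynomial
– Basis functions (discrete Legendre polynomials), for $0 \le i \le M$ and $M \ge 3$:

$$\phi_0\!\left(\tfrac{i}{M}\right) = 1$$

$$\phi_1\!\left(\tfrac{i}{M}\right) = \left[\frac{12M}{M+2}\right]^{1/2}\left[\frac{i}{M} - \frac{1}{2}\right]$$

$$\phi_2\!\left(\tfrac{i}{M}\right) = \left[\frac{180M^3}{(M-1)(M+2)(M+3)}\right]^{1/2}\left[\left(\frac{i}{M}\right)^2 - \frac{i}{M} + \frac{M-1}{6M}\right]$$

$$\phi_3\!\left(\tfrac{i}{M}\right) = \left[\frac{2800M^5}{(M-1)(M-2)(M+2)(M+3)(M+4)}\right]^{1/2}\left[\left(\frac{i}{M}\right)^3 - \frac{3}{2}\left(\frac{i}{M}\right)^2 + \frac{6M^2-3M+2}{10M^2}\,\frac{i}{M} - \frac{(M-1)(M-2)}{20M^2}\right]$$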
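The parameterization can be sketched numerically as follows: a syllable's log-F0 samples at i = 0..M are projected onto the four discrete Legendre basis functions, giving four coefficients per syllable. The function names and the normalization by 1/(M+1) in `encode` are assumptions for this example (that normalization is the one under which the basis is orthonormal).

```python
import numpy as np

def legendre_basis(M):
    """Rows are phi_0..phi_3 sampled at i = 0..M (requires M >= 3)."""
    x = np.arange(M + 1) / M
    phi0 = np.ones(M + 1)
    phi1 = np.sqrt(12 * M / (M + 2)) * (x - 0.5)
    phi2 = np.sqrt(180 * M**3 / ((M - 1) * (M + 2) * (M + 3))) * (
        x**2 - x + (M - 1) / (6 * M))
    phi3 = np.sqrt(2800 * M**5 / ((M - 1) * (M - 2) * (M + 2) * (M + 3) * (M + 4))) * (
        x**3 - 1.5 * x**2 + (6 * M**2 - 3 * M + 2) / (10 * M**2) * x
        - (M - 1) * (M - 2) / (20 * M**2))
    return np.vstack([phi0, phi1, phi2, phi3])

def encode(logf0):
    """Project a log-F0 contour (M+1 samples) onto the four basis functions."""
    M = len(logf0) - 1
    return legendre_basis(M) @ logf0 / (M + 1)

def decode(coeffs, M):
    """Rebuild the smoothed contour from four coefficients."""
    return coeffs @ legendre_basis(M)

# A falling, slightly curved contour around log(200 Hz), sampled at 11 points.
M = 10
x = np.arange(M + 1) / M
contour = np.log(200.0) - 0.3 * x + 0.1 * x**2
coeffs = encode(contour)
rebuilt = decode(coeffs, M)
gram = legendre_basis(M) @ legendre_basis(M).T / (M + 1)
```

Under this normalization the first coefficient is the mean log-F0 of the syllable, and the remaining three describe slope and curvature; any contour that is a cubic in i/M is reconstructed exactly.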

32
The Design of the Four Models
Syllable prosodic features X Inter-syllable prosodic feature Y
Differential prosodic features Z
Reduced linguistic feature set l
Prosodic state PS Break type B
Tone t, syllable type s, final f
General prosodic feature model General prosody-
syntax model
Syllable prosodic-acoustic
model
Syllable juncture prosodic-
acoustic model
Prosodic state model Break-syntax model
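$$P(\mathbf{T}, \mathbf{A} \mid \mathbf{L}) = P(\mathbf{A} \mid \mathbf{T}, \mathbf{L})\, P(\mathbf{T} \mid \mathbf{L}) = P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L})\, P(\mathbf{B}, \mathbf{PS} \mid \mathbf{L})$$
$$P(\mathbf{X}, \mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L}) \approx P(\mathbf{X} \mid \mathbf{B}, \mathbf{PS}, \mathbf{L})\, P(\mathbf{Y}, \mathbf{Z} \mid \mathbf{B}, \mathbf{L})$$
$$P(\mathbf{B}, \mathbf{PS} \mid \mathbf{L}) \approx P(\mathbf{PS} \mid \mathbf{B})\, P(\mathbf{B} \mid \mathbf{L})$$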

33
Syllable Pitch Contour Model (1/3)
Covariance of observed log-F0:

$$\mathbf{R}_{sp} = 10^{-4} \times \begin{bmatrix} 565.4 & 23.9 & -25.6 & -0.5 \\ 23.9 & 90.5 & 9.7 & -8.2 \\ -25.6 & 9.7 & 17.8 & -0.9 \\ -0.5 & -8.2 & -0.9 & 5.0 \end{bmatrix}$$

Covariance of residual log-F0:

$$\mathbf{R}^{r}_{sp} = 10^{-4} \times \begin{bmatrix} 3.5 & 0.2 & -0.2 & 0.0 \\ 0.2 & 31.9 & 2.6 & -1.5 \\ -0.2 & 2.6 & 11.1 & 0.6 \\ 0.0 & -1.5 & 0.6 & 3.7 \end{bmatrix}$$

Figure 2.4: The APs of five tones
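One way to read the two covariance matrices on this slide is to compare their traces, which gives the share of total variance left in the residual. The sketch below uses only the matrix entries shown on the slide; the common scale factor is taken to be 1e-4 (an assumption, since the exponent sign is not legible), but it cancels in every ratio.

```python
import numpy as np

SCALE = 1e-4  # assumed common factor; it cancels in the ratios below

R_observed = SCALE * np.array([
    [565.4,  23.9, -25.6, -0.5],
    [ 23.9,  90.5,   9.7, -8.2],
    [-25.6,   9.7,  17.8, -0.9],
    [ -0.5,  -8.2,  -0.9,  5.0],
])
R_residual = SCALE * np.array([
    [ 3.5,  0.2, -0.2,  0.0],
    [ 0.2, 31.9,  2.6, -1.5],
    [-0.2,  2.6, 11.1,  0.6],
    [ 0.0, -1.5,  0.6,  3.7],
])

# Share of total variance that remains unexplained (trace ratio).
residual_share = np.trace(R_residual) / np.trace(R_observed)

# Per-coefficient share of variance removed by the model (diagonal entries).
explained_per_coeff = 1.0 - np.diag(R_residual) / np.diag(R_observed)
```

With these numbers the residual keeps roughly 7% of the total variance, and the first coefficient (the mean level of the syllable contour) is the one whose variance shrinks the most.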

34
Figure 2.5: The (a) forward coarticulation pattern $\beta^{f}_{B,tp}$ and (b) onset coarticulation pattern $\beta^{f}_{B_b,t}$. Here tp = (i, j) and t = i or j.
Example: Tone 1 + Tone 3, high-low mismatch compensation, shown for break types B0, B1 and B4.

35
Figure 2.5: The (c) backward coarticulation pattern $\beta^{b}_{B,tp}$ and (d) offset coarticulation pattern $\beta^{b}_{B_e,t}$. Here tp = (i, j) and t = i or j.
Example: Tone 3 + Tone 3, a tone sandhi example.

36
Patterns of Prosodic Constituents
Figure 3.13: The log-F0 patterns of BG/PG, PPh and PW. (Three panels: PG/BG, PPh, PW; x-axis: length in syllables, 0 to 30; y-axis: log-F0, -0.15 to 0.15.)
$$pm_n = pm_n^{r} + \beta_{PW_n} + \beta_{PPh_n} + \beta_{BG/PG_n}$$

37
Figure 3.14: The syllable duration patterns of BG/PG, PPh and PW. (Three panels: PG/BG, PPh, PW; x-axis: length in syllables, 0 to 30; y-axis: seconds, -0.02 to 0.06.)
Decomposition: dm_n = dm_n^r + PW, PPh and BG/PG duration pattern terms

38
Figure 3.15: The energy level patterns of BG/PG, PPh and PW. (Three panels: PG/BG, PPh, PW; x-axis: length in syllables, 0 to 30; y-axis: dB, -5 to 5.)
Decomposition: em_n = em_n^r + PW, PPh and BG/PG energy pattern terms

39
Comparison between Human Labeling and Machine Labeling (1/2)
Human Labeling Tags:
b1: non-break
b2: prosodic word boundary
b3: minor break
b4: major break
Machine Labeling Tags:
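B4: Boundary of a Breath Group (BG)/Prosodic Phrase Group (PG)
B3: Boundary of a Prosodic Phrase (PPh)
B2: Prosodic Word (PW) boundary
B2-1: pitch reset
B2-2: short pause
B2-3: duration lengthening
B1: Normal syllabic boundary
B0: Tightly coupled syllabic boundary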

40
Comparison between Human Labeling and Machine Labeling (2/2)

41
Application to ASR (read speech)
Sin-Horng Chen, Jyh-Her Yang, Chen-Yu Chiang, Ming-Chieh Liu and
Yih-Ru Wang, "A New Prosody-Assisted Mandarin ASR System", IEEE
Trans. on Audio, Speech and Language Processing, vol.20, no.6,
pp.1669-1684, Aug. 2012.

42
Proposed Two-Stage Prosody-Assisted ASR

43
Experimental Settings
• Database for the ASR experiments
– TCC300: a large Mandarin read speech database
– Training: 274 speakers, 23 hours for acoustic model and prosodic model
– Test: 19 speakers, 2 hours
• Acoustic model
– 411 Syllable HMM (8 states) + silence model + short pause model
– MMI training
– Trained from TCC300 training set (274 speakers, 23 hours)
• Factored LM
– NTCIR + Sinica + Panorama, about 1.2 billion words
– 60000-word lexicon
• Prosodic model
– Trained from the subset of TCC300 training set (164 speakers, 8.3 hours)

44
Experimental Results
Recognition Performances of the Baseline Scheme, Scheme 1, and Scheme 2 (%)
Scheme                    WER    CER    SER
Baseline scheme           24.4   18.1   12.0
Break                     21.3   15.0   10.2
Break + Prosodic state    20.7   14.4    9.6
EXPERIMENTAL RESULTS OF POS DECODING (%)
Precision Recall F-measure
EXPERIMENTAL RESULTS OF PM DECODING (%)
Precision Recall F-measure
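To put the recognition table on this slide in perspective, the relative error reduction over the baseline can be computed directly from the listed numbers. This is plain arithmetic on the slide's figures; the dictionary layout and function name are just for the example.

```python
# Error rates (%) copied from the recognition-performance table.
results = {
    "Baseline scheme":        {"WER": 24.4, "CER": 18.1, "SER": 12.0},
    "Break":                  {"WER": 21.3, "CER": 15.0, "SER": 10.2},
    "Break + Prosodic state": {"WER": 20.7, "CER": 14.4, "SER": 9.6},
}

def relative_reduction(baseline, improved):
    """Relative error reduction in percent."""
    return 100.0 * (baseline - improved) / baseline

baseline = results["Baseline scheme"]
reductions = {
    scheme: {metric: round(relative_reduction(baseline[metric], value), 1)
             for metric, value in scores.items()}
    for scheme, scores in results.items()
    if scheme != "Baseline scheme"
}
```

Using break information alone gives relative reductions of about 12.7% (WER), 17.1% (CER) and 15.0% (SER); adding prosodic states raises these to about 15.2%, 20.4% and 20.0%.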

45
TABLE VIII. EXPERIMENTAL RESULTS OF TONE DECODING (%)
Precision Recall F-Measure
An example of recognition results for a partial paragraph. Eight panels represent, respectively, waveform, prosodic state AP+global
mean of syllable log-F0 level, syllable duration, and syllable energy level, break type (B), reference transcription (R), result of
baseline scheme (F) and proposed system (P).

46
Application to ASR
(spontaneous speech)
Cheng-Hsien Lin, Meng-Chian Wu, Chung-Long You, Chen-Yu Chiang,
Yih-Ru Wang, Sin-Horng Chen, “Prosody Modeling of Spontaneous
Mandarin Speech and Its Application to Automatic Speech
Recognition,” Speech Prosody 2016, accepted.

47
Experimental Settings
• Database for the ASR experiments
– MCDC 8-hour dialogues from 16 speakers, texts with PU tags are transcribed and
annotated by linguist experts
• Acoustic Model
– Seed tri-phone HMM models are trained from TCC300[3] and adapted using 80% of MCDC.
– CI models for PU: 6 particles (HO, EI, HAN, HEN, HEIN, and MHM)+ 2 fillers (unrecognized/foreign
speech)
– CI models for paralinguistic: BREATHE, CLEAR_THROAT, LAUGH, NOISE, SMACK, and
SWALLOW
• Factored LM
– About 440 million words corpus merged from 5 corpora, words/POS are tagged by in-house CRF
tagger
– Adapted by 90% MCDC corpus
– 60000-word lexicon, including all particles and markers, selected by their word frequencies

48

49
Speaking Rate Dependent Hierarchical Prosodic
Model
Sin-Horng Chen, Chiao-Hua Hsieh, Chen-Yu Chiang, Hsi-Chun Hsiao, Yih-Ru Wang, Yuan-Fu
Liao and Hsiu-Min Yu, “Modeling of Speaking Rate Influences on Mandarin Speech Prosody
and Its Application to Speaking Rate-controlled TTS,” IEEE Trans. on Audio, Speech and
Language Processing, vol. 22, no. 7, pp. 1158-1171, July 2014.

50
Introduction
• Speaking rate is a prosodic feature that influences many
phenomena such as
– Syllable duration
– Pitch contour
– Pause duration
– Occurrence frequency of pause
• Modeling the effects of speaking rate is an important research
issue in
– Automatic speech recognition (ASR)
– Text-to-speech system (TTS)

51
• Objective
– Modeling the influence of speaking rate on speech prosody based on the
PLM method
• The proposed approach
– We take speaking rate as a continuous variable and construct a single
HPM using the same four corpora
– In this study, the speaking rate(SR) in each utterance is defined as its
average duration per syllable uttered disregarding all inter-syllable pauses
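The speaking-rate definition above (average duration per syllable, with inter-syllable pauses disregarded) can be written out as a small function. The segment representation, the example durations and the function name are assumptions for illustration.

```python
def speaking_rate(segments):
    """Average syllable duration in seconds, ignoring inter-syllable pauses.

    `segments` is a list of (kind, duration_s) pairs, where kind is
    "syl" for a syllable or "pause" for an inter-syllable pause.
    """
    syllable_durations = [d for kind, d in segments if kind == "syl"]
    if not syllable_durations:
        raise ValueError("utterance has no syllables")
    return sum(syllable_durations) / len(syllable_durations)

# Hypothetical utterance: four syllables with one 0.3 s pause in the middle.
utterance = [("syl", 0.18), ("syl", 0.22), ("pause", 0.30),
             ("syl", 0.20), ("syl", 0.20)]

sr = speaking_rate(utterance)          # seconds per syllable
syllables_per_second = 1.0 / sr        # the reciprocal, in syl/sec
```

Here the pause does not affect the result: the utterance averages 0.2 s per syllable, or 5 syllables per second.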

52
Experimental Database
• SR-Treebank database:
– Read speech
– The corpus contains four parallel speech datasets uttered by a female
professional announcer with fast, normal, median and slow speaking rate.
– All utterances are short paragraphs. There are in total 1478 utterances
consisting of 203,746 syllables.

53
Break Labeling Examples for Four Parallel
Utterances with Various SR
Note: only pause-related break type, i.e. B4(@), B3 (/) and B2-2(*) are displayed
Fast SR：
依據行政院主計處的統計 @，十月份 * 一到二十日 / ，我國出口及進口金額 /
比起去年同期 * 均有增加 @，
Normal SR：
依據行政院主計處的統計 @，十月份 * 一到二十日 /，我國出口 * 及進口金額
/ 比起去年同期 * 均有增加@，
Median SR：
依據 * 行政院主計處的統計 @，十月份 / 一到 * 二十日 /，我國出口 * 及進口
金額 / 比起去年同期 * 均有增加 @，
Slow SR：
依據 / 行政院 * 主計處的統計 @，十月份 / 一 * 到 * 二十日 @，我國出口 * 及
進口金額 / 比起去年同期 * 均有增加 @，
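Counting the break marks in the four parallel utterances above makes the trend explicit: the slower the speaking rate, the more pause-related breaks are inserted. The sketch below only tallies the `@`, `/` and `*` marks in the strings as printed on the slide; the variable and function names are made up for the example.

```python
from collections import Counter

# Mark-to-break-type mapping taken from the note on the slide.
MARKS = {"@": "B4", "/": "B3", "*": "B2-2"}

utterances = {
    "Fast":   "依據行政院主計處的統計 @，十月份 * 一到二十日 / ，我國出口及進口金額 / 比起去年同期 * 均有增加 @，",
    "Normal": "依據行政院主計處的統計 @，十月份 * 一到二十日 /，我國出口 * 及進口金額 / 比起去年同期 * 均有增加@，",
    "Median": "依據 * 行政院主計處的統計 @，十月份 / 一到 * 二十日 /，我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @，",
    "Slow":   "依據 / 行政院 * 主計處的統計 @，十月份 / 一 * 到 * 二十日 @，我國出口 * 及進口金額 / 比起去年同期 * 均有增加 @，",
}

def count_breaks(text):
    """Count each pause-related break type marked in the text."""
    counts = Counter(ch for ch in text if ch in MARKS)
    return {MARKS[mark]: counts[mark] for mark in MARKS}

break_counts = {rate: count_breaks(text) for rate, text in utterances.items()}
totals = {rate: sum(c.values()) for rate, c in break_counts.items()}
```

The totals come out as 6, 7, 9 and 11 breaks for the fast, normal, median and slow readings respectively.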

54
Examples of Synthesized Speech
Original
proposed
baseline
Slower / Faster

55
Cross-Dialect and -Speaker Adaptation of SR-HPM
Chen-Yu Chiang, “A Study on Adaptation of Speaking Rate-Dependent
Hierarchical Prosodic Model for Chinese Dialect TTS,” in Proc.
O-COCOSDA 2015, Shanghai, China, Oct. 2015. (Best Paper Award)
Chen-Yu Chiang, Hsiu-Min Yu, Sin-Horng Chen, “On Cross-Dialect and
-Speaker Adaptation of Speaking Rate-Dependent Hierarchical Prosodic
Model for a Hakka Text-to-Speech System,” Speech Prosody 2016,
accepted.
I-Bin Liao, Chen-Yu Chiang, Sin-Horng Chen, “Structural Maximum a
Posteriori Speaker Adaptation of Speaking Rate-Dependent Hierarchical
Prosodic Model for,” accepted by ICASSP 2016

56
Experimental Databases
• Mandarin (for background model)
– 1,478 utterances with 183,795 syllables
– A wider SR range of 3.4-6.8 syl/sec
• Min (for adaptation):
– 21,143 syllables for adaptation and the test set of 2,488 syllables
– Speaking rate: 4.5-6.8 syl/sec
• Hakka (for adaptation):
– 15,009 syllables for adaptation and test set of 3,711 syllables
– Speaking rate: 3.8-5.1 syl/sec

57
Results for Hakka
客家人在歷史項，輒常
分人看做「人客」。從
歷史個文獻資料來看，
客家人經過幾下擺個大
遷徙；見擺遷徙，就去
到別人既經早就先到個
所在；高不將先向山區
安身，先定疊下來，正
定定仔對外發展。台灣
個客家人，大部分對廣
東梅州、惠州，少部分
對福建永定、詔安遷徙
過來。

58
Results for Min
原來春枝迷著歌仔戲。呣敢siau
想看規齣，有通看一te te仔就teh
癮囉。自按呢日昇配合伊，調
整送貨時間、路線，撥工載伊
去看戲尾仔。彼khui仔是日昇一
世人上樂暢的時陣，送貨的
khang khoe，做著嘛加偌thiau iat
咧。一下chiap去，連顧戲口的
查某gin仔看in來，一句赫晏ouh？
就知愛放戲尾仔啦。春枝知影
日昇愛食甜，不時偷me烏糖互
伊，日昇上愛那看那chng，chng
甲規嘴hoe sa sa。一擺戲齣做到
娘子落難，沿途奔波討食，尾
仔煞真正跪ti台仔頂teh即時台仔
腳看戲的銀角仔四界tan去lih。

59
Application to Prosody Coding
Chen-Yu Chiang, Jyh-Her Yang, Ming-Chieh Liu, Yih-Ru Wang, Yuan-Fu Liao and Sin-Horng
Chen, “A New Model-based Mandarin-speech Coding System,” in Proc. Interspeech
2011, Florence, Italy, Aug. 2011, pp. 2561-2564.
Chen-Yu Chiang, Yu-Ping Hung, Sin-Horng Chen, and Yih-Ru Wang, “A New Model-
Based Prosody Coder for Mandarin Speech,” in Proc. of IIHMSP 2013, Beijing, China, Oct.
2013, pp. 60-63.

61
Experimental Database
• Treebank Corpus
– Read speech
– 425 utterances with 56,237 syllables uttered by a
female professional announcer.
– Average syllable duration = 0.19 sec
– Associated texts - short paragraphs composed of
several sentences selected from the Sinica Treebank
Version 3.0.
– Training set - 379 utterances with 52,192 syllables.
– Test set - 46 utterances with 4,801 syllables.

62

63

64

65
Ongoing Tasks and Future Works
• Transfer Learning:
– Take the SR-HPM for Mandarin as a base model (prior) to construct
prosodic models for English
• Voice Bank
– Prosody bank: modeling prosodies of various speakers, emotions,
styles…
– Voice font bank: modeling spectra of various speakers, emotions, styles…

66
Acknowledgements
• We would like to thank
– Academia Sinica, Taiwan for providing the Tree-Bank
text corpus
– Dr. Chiu-yu TSENG (鄭秋豫博士) of Academia Sinica,
Taiwan for providing the Sinica COSPRO Corpus
and the on-line word segmentation system
– Dr. Shu-Chuan TSENG (曾淑娟博士) of Academia
Sinica, Taiwan for providing the Mandarin
Conversational Dialogue Corpus (MCDC)
– Prof. Ho-Hsien PAN (潘荷仙教授) of Phonetics
Laboratory, Department of Foreign Languages and
Literatures of National Chiao Tung University, Taiwan
for her generous and helpful assistance in manually
labeling our experimental

67
Thank You for Your Attention!
Contact:
cychiang@mail.ntpu.edu.tw
http://cychiang.tw
