5. Audio clips of synthetic speech illustrating the history of the art and technology of synthetically produced human speech:
http://www.cs.indiana.edu/rhythmsp/ASA/Contents.html
http://www.cs.indiana.edu/rhythmsp/ASA/highlights.html
http://www.humnet.ucla.edu/humnet/linguistics/faciliti/demos/vocalfolds/vocalfolds.htm
6.
7. Formant Synthesis
This may also be referred to as synthesis-by-rule, although rules of one sort or another are common to all synthesis systems. For a formant synthesis system the output of the high-level component typically consists of a sequence of allophones together with their duration and pitch, e.g.
  DH  7  34
  I   5  34
  S   8  33
Duration is measured in 10 ms frames. Pitch is coded into the range 1-63.
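As a rough illustration of how the allophone, duration and pitch sequence above might be read by a low-level component, here is a minimal Python sketch. The function names and the flat "name duration pitch" text layout are assumptions made for this example, not part of any particular synthesiser.

```python
FRAME_MS = 10  # frame length stated on the slide


def parse_allophones(text):
    """Split 'NAME DURATION PITCH ...' into (name, frames, pitch) tuples."""
    tokens = text.split()
    if len(tokens) % 3 != 0:
        raise ValueError("expected groups of three tokens")
    triples = []
    for i in range(0, len(tokens), 3):
        name = tokens[i]
        frames = int(tokens[i + 1])
        pitch = int(tokens[i + 2])
        if not 1 <= pitch <= 63:  # pitch range stated on the slide
            raise ValueError(f"pitch {pitch} outside 1-63")
        triples.append((name, frames, pitch))
    return triples


def total_duration_ms(triples):
    """Total duration in milliseconds, given durations in 10 ms frames."""
    return sum(frames for _, frames, _ in triples) * FRAME_MS
```

For the three allophones in the example, the durations add up to 20 frames, i.e. 200 ms of speech.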
8. The low-level component uses this input to produce a sequence of frames, each frame containing a set of parameters referring to formant frequencies, formant amplitudes, voicing, fundamental pitch, etc., e.g.
  Fn   alf  f1   a1  f2    a2  f3    a3  ahf  s   f0
  250  33   280  33  1300  32  2680  34  41   36  34
  250  37   280  37  1300  36  2680  38  45   36  34
  250  40   280  40  1300  39  2680  41  48   36  34
  250  42   280  42  1300  41  2680  43  50   36  34
This information is then fed into a formant synthesiser, which uses it to generate the appropriate audio output. The formant synthesiser may be implemented in hardware or software. An example of a formant synthesis system is DECtalk.
10. Klatt synthesiser A combined serial/parallel formant synthesiser. A serial, or cascade, synthesiser is a better model for the production of vowel and vowel-like sounds whereas a parallel synthesiser is better suited to producing nasals, fricatives and stops. The serial synthesiser specifies the formant centre frequencies and bandwidths. The parallel synthesiser specifies the formant levels (peak amplitudes) also.
11. Waveform Concatenation Synthesis
With this system the low-level component generates a speech output file by concatenating units of previously recorded speech. Information about the duration and pitch of these units is again supplied by the high-level component. The size of the units is clearly an important consideration, both in terms of the amount of storage required and the difficulty of joining them together (more about this later). An example of a waveform concatenation synthesis system is the Lernout and Hauspie TTS system.
14. words
How many? ~300,000. Perhaps 2000 for the most frequently used words.
Formant synthesis
  1 word ~ 0.5 sec
  12 bytes per 10 msec frame
  1 word = 0.5/0.010 frames = 50 frames = 600 bytes
  300,000 words ~ 180 Mbyte
Waveform concatenation
  1 word ~ 0.5 sec
  Sampling rate = 16 kHz
  1 word = 8K samples = 16K bytes at 2 bytes per sample
  300,000 words = 4.8 Gbyte
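The storage estimates above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for the example, and it assumes decimal units (1 Mbyte = 10^6 bytes), which is how the slide figures work out.

```python
def formant_bytes(seconds, bytes_per_frame=12, frame_seconds=0.010):
    """Bytes needed to store a unit as formant parameter frames."""
    frames = round(seconds / frame_seconds)
    return frames * bytes_per_frame


def waveform_bytes(seconds, sample_rate=16_000, bytes_per_sample=2):
    """Bytes needed to store a unit as sampled audio."""
    return round(seconds * sample_rate) * bytes_per_sample


WORDS = 300_000
formant_total = WORDS * formant_bytes(0.5)    # 600 bytes per word
waveform_total = WORDS * waveform_bytes(0.5)  # 16,000 bytes per word

print(formant_total / 1e6, "Mbyte")
print(waveform_total / 1e9, "Gbyte")
```

The same two functions can be reused for smaller units by changing the duration, e.g. 0.1 sec for a phoneme.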
15. morphemes
Basic meaningful units that make up words, essentially roots, prefixes and suffixes, e.g.
  sail, travel, -ed, -s => sail, travel; sailed, travelled; sails, travels
~10,000-30,000 entries
Formant synthesis: 30,000 entries ~ 10-15 Mbyte
Waveform concatenation: 30,000 entries ~ 120 Mbyte
17. phonemes
How many? About 40 for English.
Formant synthesis
  1 phoneme ~ 10 frames = 120 bytes
  40 phonemes ~ 5 Kbytes
  Actually need about 70-80 allophones, giving ~10 Kbytes
Waveform concatenation
  1 phoneme ~ 100 msec = 0.1 sec = 1.6K samples = 3.2K bytes
  Total ~ 256 Kbytes (80 allophones at 3.2K bytes each)
18. demi-syllables
  s u m t ie m z (sometimes)
  s u m - t ie m z
These are units of speech obtained by making cuts in the middle of the vowel part of the syllable. The reason for doing this is that coarticulation effects are minimal in the middle of the vowel. The number of demi-syllables is about 4000-5000, made up of about 1500 initial demi-syllables and 3000 final demi-syllables.
20. diphones
  k u n uu (canoe)
  qk | ku | un | nuu | uuq    (q = silence)
These are units of speech obtained by going from the middle of one phone to the middle of the next. The reason for doing this is that coarticulation effects are minimal in the middle of the sound. Theoretically there are about 40 x 40 = 1600 diphones, but in practice the number is about 1200.
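The canoe example can be reproduced with a short Python sketch that pads the phone sequence with silence and pairs each phone with its successor. The function name is hypothetical; the symbol q for silence is taken from the slide.

```python
SILENCE = "q"  # silence symbol used on the slide


def to_diphones(phones):
    """Return the diphone names for a phone sequence, padded with silence."""
    padded = [SILENCE] + list(phones) + [SILENCE]
    return [a + b for a, b in zip(padded, padded[1:])]


print(" | ".join(to_diphones(["k", "u", "n", "uu"])))
```

A sequence of n phones always yields n + 1 diphones, because of the silence at each end.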
22. Words versus phonemes
Trade-off between storage and processing: the larger the unit, the more storage space it requires, but in compensation less effort is required in joining the units together.
23. [Figure: block diagram of the text-to-speech system]
Data flow: Input Text -> Restricted Text -> Phonemic Text -> Broad Phonetic Text -> Narrow Phonetic Representation -> Control Parameters -> Speech
Tasks: Conversion Task, Pronunciation Task, Allophone Task, Prosody Task, Lower Phonetic Task
Tables and rules: Exceptions Dictionary, Affix Tables, Pronunciation Rules, Phonotactic Tables, Allophonic Rules, Prosody Table, Lower Phonetic Table
25. CONVERSION TASK
This task converts unrestricted text to restricted text. Unrestricted text may contain non-English words, abbreviations (e.g. Dr.), unusual pronunciations, words to be spelt out (e.g. BBC), etc. This task also gets the phonetic form of words that are in the dictionary and deletes any redundant white space.
  The date is 1/10/97. … the first of October, nineteen ninety seven.
  I can’t do it. … I cant do it, or I cannot do it.
  The price is £23.99. … twenty three pounds ninety nine pence
  St. George St. … Saint George Street
  well-behaved … well behaved
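A few of the conversions above can be sketched in Python. This is a deliberately small illustration under stated assumptions, not the conversion task itself: the abbreviation table holds a single illustrative entry, the rule that "St." before a capitalised word reads as "Saint" is a guess that happens to fit the example, and every hyphen is simply replaced by a space. Dates, prices and spelt-out words are not handled.

```python
ABBREVIATIONS = {"Dr.": "Doctor"}  # illustrative entry only, not the system's table


def expand_st(tokens, i):
    """Assumed heuristic: 'St.' before a capitalised word is 'Saint', else 'Street'."""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    return "Saint" if nxt[:1].isupper() and nxt != "St." else "Street"


def to_restricted(text):
    """Convert a small subset of unrestricted text to restricted text."""
    text = text.replace("-", " ")  # well-behaved -> well behaved
    tokens = text.split()          # splitting also drops redundant white space
    out = []
    for i, tok in enumerate(tokens):
        if tok == "St.":
            out.append(expand_st(tokens, i))
        else:
            out.append(ABBREVIATIONS.get(tok, tok))
    return " ".join(out)
```

A real conversion task would need far more context than the next token, which is part of why an exceptions dictionary is used alongside rules.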
26.
27. 1. Look up the entire word in the dictionary. If found then exit from the pronunciation task.
  sentence   "sen-tans
  sensitive  "sen'si-tiv
  transport  "traans-poat
  cough      "kof
Conversion of words to phonemes is far from regular and is very context dependent in English. Some languages are better than others in this respect. George Bernard Shaw is quoted as saying that fish could be spelt GHOTI (gh as in rough, o as in women, ti as in nation).
The advantage of using a dictionary is that it can include information on syllables, stress, syntactic types, etc. This is much more easily obtained from a lookup table than from algorithms; for example, stress markers and syllable boundaries are more easily identified. But, of course, there will always be some words which are not in the dictionary, however large one makes it.
28.
29. Example of function for -s removal (not definitive)
  lastchar = S
  if (prevchar = S) then return
  else if (prevtwochars = IO) then remove S
  else if (prevchar = vowel but not E) then return
  else if (prevchar = E) then
    remove S
    if (prevprevchar = I) then replace IE by Y
    else if (prevprevchar = H) and (prevprevprevchar is not T) then delete E
    else if ((prevprevchar = S) and (prevprevprevchar = S)) or ((prevprevchar = Z) and (prevprevprevchar = Z)) then delete E
  Ending               Example
  SS                   loss
  IOS                  folios
  AS                   alias
  IES -> Y             parties
  HES but not THES     batches/bathes
  SSES/ZZES            losses/buzzes
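The -s removal function above translates fairly directly into Python. The sketch below follows the slide's conditions in the same order; the final fall-through (consonant + S, as in "cats") is not covered on the slide and is an assumption added here so that the function always returns something. The function name is hypothetical.

```python
VOWELS = set("AEIOU")


def strip_s(word):
    """Remove a final -S following the slide's (non-definitive) rules."""
    w = word.upper()
    if len(w) < 3 or not w.endswith("S"):
        return w
    prev = w[-2]
    if prev == "S":                      # SS: loss
        return w
    if w[-3:-1] == "IO":                 # IOS: folios -> folio
        return w[:-1]
    if prev in VOWELS and prev != "E":   # AS: alias
        return w
    if prev == "E":
        stem = w[:-1]                    # S removed, stem still ends in E
        if len(stem) >= 2 and stem[-2] == "I":
            return stem[:-2] + "Y"       # IES -> Y: parties -> party
        if len(stem) >= 3 and stem[-2] == "H" and stem[-3] != "T":
            return stem[:-1]             # HES but not THES: batches -> batch
        if len(stem) >= 3 and stem[-3:-1] in ("SS", "ZZ"):
            return stem[:-1]             # SSES/ZZES: losses -> loss
        return stem                      # e.g. bathes -> bathe
    return w[:-1]                        # assumption: consonant + S, cats -> cat
```

Like the original, this is not definitive: it will mangle words such as "this" or "bus" unless they are caught first by the dictionary lookup.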
30.
31. The JSRU system has a list of 39 suffixes, e.g. ED, NT, FUL, OUS, ALIC, IBLE, EN, etc. After a suffix has been removed, the stem is checked to see if it is long enough (at least 3 letters) and whether the final consonant cluster is pronounceable. If the suffix could have replaced a final 'e' then this 'e' must be added. The algorithm for deciding whether or not an 'e' has been removed is not simple, e.g.
  alternation -> alternate + ion
  interaction -> interact + ion
The rule here is that if the stem ends in vowel + consonant then an 'e' should be added. What about taxable? It is handled correctly when 'x' is replaced by 'ks'.
Having removed a suffix, the reduced word is looked up in the dictionary. If it’s not found then an attempt is made to remove another suffix, e.g. wond - er - ful - ly
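The 'e' restoration rule can be sketched in Python as follows. This covers only the simple vowel + consonant rule and the 'x' as 'ks' trick described above; the pronounceability check on the final consonant cluster and the dictionary lookup are omitted. The function names, and the assumption that only a vowel-initial suffix can have replaced a final 'e', are choices made for this example rather than details of the JSRU system.

```python
VOWELS = set("aeiou")
MIN_STEM = 3  # minimum stem length stated on the slide


def restore_e(stem):
    """Add 'e' when the stem ends in vowel + consonant ('x' counted as 'ks')."""
    probe = stem.replace("x", "ks")
    if len(probe) >= 2 and probe[-2] in VOWELS and probe[-1] not in VOWELS:
        return stem + "e"
    return stem


def strip_suffix(word, suffix):
    """Return the stem, or None if the suffix is absent or the stem too short."""
    if not suffix or not word.endswith(suffix):
        return None
    stem = word[: -len(suffix)]
    if len(stem) < MIN_STEM:
        return None
    # Assumption: only a vowel-initial suffix can have replaced a final 'e'.
    return restore_e(stem) if suffix[0] in VOWELS else stem
```

The rule over-generates on stems such as "visit" (from "visited"), which is consistent with the remark that the real decision is not simple.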
32.
33.
34. For example,
  {"A", "", "consY>", "ai"}    lazy
  {"OUGH", "", "", "ou"}       bough
  {"AUGH", "", "", "aw"}       daughter
  {"C", "", "E", "S"}          lace
  {"PH", "", "", "F"}          phase
What about laughter? What about rough, cough, though?
35. The order of the rules is important, e.g. the following rules only work when applied in this order:
  {"QU", "", "", "Q"}          fre qu ent -> fre q ent
  {"E", "cons", "Q", "EE"}     fr e qent -> fr ee qent
  {"Q", "", "", "KW"}          free q ent -> free kw ent
N.B. Every rule is tried at every position in the word. Hence, the time taken depends very much on the number of rules.
Special procedures need to be called in some cases:
  {"vowel", "", "consE", "@001"}    Magic 'e' procedure
  {"E", "", ">", "@002"}            final 'e' procedure
  {"@003", "", "", ""}              double letter procedure
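To make the effect of rule order concrete, here is a minimal Python sketch of a context-sensitive rewrite engine applied to "frequent". It assumes the four rule fields are the letters to match, the left context, the right context and the output, and it supports only three kinds of context: empty, "cons" (one adjacent consonant) and a literal string. Each rule is applied over the whole word before the next rule is tried, which is a simplification; the special procedures (@001 and so on) and the ">" end-of-word marker are not modelled.

```python
VOWELS = set("aeiou")


def context_ok(ctx, text, is_left):
    """'' matches anything, 'cons' one adjacent consonant, otherwise a literal."""
    if ctx == "":
        return True
    if ctx == "cons":
        ch = text[-1:] if is_left else text[:1]
        return ch.isalpha() and ch not in VOWELS
    return text.endswith(ctx) if is_left else text.startswith(ctx)


def apply_rule(word, rule):
    """Try one rule at every position of the word, scanning left to right."""
    target, left, right, output = rule
    result, i = [], 0
    while i < len(word):
        rest = word[i + len(target):]
        if (word.startswith(target, i)
                and context_ok(left, word[:i], True)
                and context_ok(right, rest, False)):
            result.append(output)
            i += len(target)
        else:
            result.append(word[i])
            i += 1
    return "".join(result)


def apply_rules(word, rules):
    for rule in rules:  # order matters: earlier rules feed later ones
        word = apply_rule(word, rule)
    return word


RULES = [
    ("qu", "", "", "q"),
    ("e", "cons", "q", "ee"),
    ("q", "", "", "kw"),
]

print(apply_rules("frequent", RULES))
```

Running the same rules in reverse order gives a different result, because the Q rule then fires before QU has been collapsed and the E rule never sees a following Q.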
36. 5. Replace the prefixes and suffixes, in their phonetic form of course, which can be obtained from lookup tables. This may result in some adjustments, which are handled by the second set of rules, e.g.
  {"c", "", "i", "s"}       precious
  {"c", "", "", "k"}        practically
  {"ig", "", "n-", "ie"}    assignment
37. 6. Perform stress assignment by applying the stress rules. Make adjustments to the final pronunciation, e.g. reduction of some unstressed vowels.
The stress rules are based on the MIT rules and are quite complicated (From Text to Speech: The MITalk System, Allen et al.).
In the first (cyclic) phase several rules are applied in sequence, first to the stem, then to the stem + suffixes taken one at a time. Prefixes are considered as part of the stem. Some affixes (e.g. -ING) do not affect the stress pattern so rules are omitted, while others (e.g. -ION) force stress onto a particular vowel, usually the one before the affix.
In the second (non-cyclic) phase one vowel is selected for main stress and all others are reduced to secondary or no stress.
38. Stress Rules
Main Stress Rule (cyclic)
1. V -> [1-stress] / X - C_0 {[short V] C_0^1 / V} {[short V] C_0 / V}
where
  X contains all prefixes and the symbol '-' indicates the position of the vowel to be stressed
  C_0^1 matches zero or one consonant
  C_0 matches any number of consonants (including none)
  {..} denotes a list of alternative patterns separated by slashes /
Assign 1-stress (primary stress) to the vowel in a syllable which precedes a weak syllable followed by a morph-final syllable containing a short vowel and zero or more consonants, e.g.
  difficult -> d"ifikalt    (matched as X - C_0 {..} {..})
39. V -> [1-stress] / X - C_0 {[short V] C_0^1 / V} {[short V] C_0 / V}
where X contains all prefixes and the symbol '-' indicates the position of the vowel to be stressed
Assign 1-stress to the vowel in a syllable preceding a vowel followed by a morph-final syllable containing a short vowel and zero or more consonants, e.g.
  secretariat -> sekret"eir ee aat    (matched as X - C_0 {..} {..})
Assign 1-stress to the vowel in a syllable preceding a vowel followed by a morph-final vowel, e.g.
  oratorio -> orat"oar ee oa    (matched as X - C_0 {..} {..})
40. 2. V -> [1-stress] / X - C_0 {[short V] C_0 / V}
where X contains all prefixes
Assign 1-stress to the vowel in a syllable preceding a short vowel and zero or more consonants, e.g.
  edit -> "ed it
  bitumen -> bity"uum en
  agenda -> aj"en da
3. V -> [1-stress] / X - C_0
where X contains all prefixes
Assign 1-stress to the vowel in the last syllable, e.g.
  stand -> st"aand
  parole -> paar"oal
42. After the stress assignment a third set of rules is applied, e.g.
  {"41r", "", "cons", "41"}    far gone versus far away (141 = 'schwa')
  {"ir", "", "cons", "er"}     dirt versus direct
and some unstressed vowels are reduced, e.g.
  aa -> a, ai -> i, u -> a, o -> a
  bottom => "bot-am
43.
44. PROSODY TASK
This task adds intonation and timing to the phonetic text. The task processes complete breath groups. The output consists of a list of phonemes with corresponding pitch and duration.
45. In No vem ber the reg ion’s wea ther was un us ually dry . Do you want to tra vel to Lon don? Who is the Prime Min ister of the Ba ha mas? Lift the safe ty cov er and press the red but ton.
46.
47. Sentence stress
This refers to the way in which one word is singled out in a sentence as the focus, or nuclear stress.
Examples
  What time is the next train to London?
  What time is the next train to London?
  What time is the next train to London?
  There’s John cycling down the road.
48. Rhythm
Rhythm is about how we time the delivery of the sentence. In English we are supposed to time the stressed syllables so that they occur equidistantly in time. English is said to be a stress-timed language.
  The stressed syll-a-bles in Eng-lish o-ccur e-qui-dis-tant-ly in time
French is said to be a syllable-timed language.
  Les encyclopedies electroniques sont pleines d’informations
  Le sen cy clo pe die se lec tro niques sont pleines d’in for ma tions
(D Crystal 1995)
49.
50. Projects represent a substantial part of your marks and you should spend some time choosing a project which will show you in your best light.
Projects represent // a substantial part // of your marks // and you should spend // some time choosing // a project // which will show // you in your best light.
"Pro-jects rep-res-"ent // a sub-"stant-ial "part // of your "marks // and you should "spend // some "time "choos-ing // a "pro-ject // which will "show // you in your "best "light.
  Unstressed syllable   weight 1
  Secondary stress      weight 2
  Primary stress        weight 3
  Emphatic stress       weight 4
53. Baseline
When all syllable patterns have been calculated they are superimposed on a falling baseline. The value at the end of the phrase (indicated by the ,) is half that at the beginning, and the starting value for the next phrase is 1.2 times the final value of the previous one.
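The baseline rule can be sketched numerically in Python. The slide only fixes the start and end values of each phrase, so the linear shape of the fall used here is an assumption, as are the function names and the idea of measuring phrase length in frames.

```python
def baseline(start, n_frames):
    """Assumed linear fall from start to start/2 over n_frames points (n_frames >= 2)."""
    end = start / 2
    step = (end - start) / (n_frames - 1)
    return [start + step * i for i in range(n_frames)]


def phrase_baselines(start, frame_counts, reset=1.2):
    """One falling baseline per phrase; each phrase restarts at reset * previous end."""
    contours = []
    for n in frame_counts:
        contour = baseline(start, n)
        contours.append(contour)
        start = reset * contour[-1]
    return contours
```

Starting from 100, a first phrase ends at 50, the second starts at 60 and ends at 30, so successive phrases drift downwards overall even though each one is reset upwards.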
54.
55. LOWER PHONETIC TASK
This task converts the output from PROSODY TASK into control parameters for the hardware synthesiser. The table for each phonetic element contains information relating to how transitions between target values are calculated. For each parameter in the table there are entries for
  Its target value
  The proportion of the target of the dominated element used in deriving the boundary value
  A fixed contribution to the boundary value
  Transition duration within the dominant element
  Transition duration within the dominated element
  Element  Target  Fixed Contribution  Proportion  Internal Duration  External Duration  Rank
  W        760     380                 50          4                  4                  8
  E        1900    950                 50          3                  3                  2
56. It sends a frame of parameters to the synthesiser every 10 msec. The boundary value is calculated as
  Fixed Contribution of dominant element + Target of dominated element * Proportion of dominant element
i.e. for W followed by E: 380 + 1900 * 0.5 = 1330
[Figure: transition from the target value for W (760 Hz) through the boundary value (1330 Hz) to the target value for E (1900 Hz); internal duration 4 frames, external duration 4 frames]
  Element  Target  Fixed Contribution  Proportion  Internal Duration  External Duration  Rank
  W        760     380                 50          4                  4                  8
  E        1900    950                 50          3                  3                  2
57. Worked example for the sequence DH A W E
Boundary values
  DH|A   650 + 1420 * 0.5 = 1360
  A|W    380 + 1420 * 0.5 = 1090
  W|E    380 + 1900 * 0.5 = 1330
Target values
  DH   1300
  A    1420
  W    760
  Element  Target  Fixed Contribution  Proportion  Internal Duration  External Duration  Rank
  DH       1300    650                 50          2                  5                  6
  A        1420    710                 50          3                  3                  2
  W        760     380                 50          4                  4                  8
  E        1900    950                 50          3                  3                  2
  L        880     440                 50          0                  6                  11
  Z        1720    860                 50          2                  5                  6
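The boundary calculation can be checked with a short Python sketch using the DH, A, W and E entries from the table above. The element with the higher rank is taken as dominant, which is what the three worked values imply; how ties are broken is not stated, so treating the left element as dominant on equal rank is an assumption, as are the names used here.

```python
# element: (target in Hz, fixed contribution, proportion in percent, rank)
TABLE = {
    "DH": (1300, 650, 50, 6),
    "A": (1420, 710, 50, 2),
    "W": (760, 380, 50, 8),
    "E": (1900, 950, 50, 2),
}


def boundary(left, right):
    """Boundary value between two adjacent elements."""
    a, b = TABLE[left], TABLE[right]
    # Assumption: on equal rank the left element is treated as dominant.
    dominant, dominated = (a, b) if a[3] >= b[3] else (b, a)
    _, fixed, proportion, _ = dominant
    return fixed + dominated[0] * proportion / 100


for pair in [("DH", "A"), ("A", "W"), ("W", "E")]:
    print(pair[0] + "|" + pair[1], boundary(*pair))
```

Note that for A|W the dominant element is the one on the right, so the fixed contribution comes from W while the target comes from A.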
58. Tomorrow will be starting off grey and rather murky. :"to-ma-rou /wil /bey :"star-ting /of :"grai /and :"raa-dha :"mer-kee.
59. T 9 37 TY 2 37 TZ 3 37 O 10 37 M 9 35 A 4 35 R 4 34 OU 7 34 OB 7 33 W 4 34 I 5 34 LP 4 32 B 5 33 BY 1 33 EY 5 33 S 11 36 T 6 35 TY 2 35 AR 14 35 T 6 31 TY 2 31 I 5 31 NG 9 30 O 6 32 F 6 31 G 6 34 GY 2 34 R 2 34 AI 12 34 AJ 9 32 A 4 32 N 6 31 D 2 30 DY 1 30 R 7 33 AA 10 32 DH 7 31 A 5 30 M 11 32 ER 14 32 K 6 25 KY 2 23 I 10 23 QQ 51 22 Q 42 21