Ever since computers came into use, people have dreamt of interacting with them in a natural way, using spoken language. Hardly any science fiction flick gets by without humans talking to computers or even androids. With Siri, Apple has brought this capability to iOS. Many developers hoped to use Siri in their apps, but so far Apple hasn’t provided us with an API. In this talk, I will show how to use speech recognition and speech synthesis in native iOS applications without having to jailbreak your device. We will take a look at some libraries (both open source and commercial) that allow us to build speech-enabled apps with little effort.
7. Speech synthesis: History
1769: Speaking machine, by Wolfgang von Kempelen (he also developed the
famous Mechanical Turk)
Functional representation of the human vocal tract.
http://www.youtube.com/watch?v=zYRVqrfY3tQ
1939: Vocoder (Vocal Encoder), developed by Homer Dudley for Bell Labs.
Needed to be played (using a keyboard) by a trained operator.
Exhibited at the 1939 World’s Fair.
http://www.youtube.com/watch?v=CyaK22DMfF0
1970: Vocoder, custom built for Kraftwerk.
http://www.youtube.com/watch?v=w-Jq7BHtQMA
8. Speech Synthesis
Most modern speech synthesis systems use electronic or computerized approaches.
9. Text to speech (TTS)
[Diagram: Text -> Front end -> Back end -> Speech]
In modern TTS systems, speech synthesis is a
multi-step process that is divided into two
main parts:
1) Front end (analysis)
2) Back end (synthesis)
10. Text to speech (TTS)
[Diagram: Text -> Text analysis -> Words -> Linguistic analysis (phrasing, intonation, duration) -> Phonemes -> Waveform generation -> Speech]
Front end: text analysis and linguistic analysis. Back end: waveform generation.
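The text analysis step of the front end can be illustrated with a small normalizer. This is a minimal sketch in plain C (usable as-is from an Objective-C file). The function name `normalize_text` and its rule set (lowercase letters, spell out digits one by one, treat everything else as a word separator) are assumptions made for illustration, not the behavior of any SDK covered in this talk.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical front-end helper: turns raw text into a plain word sequence.
 * Letters are lowercased, each digit is spelled out, and every other
 * character acts as a word separator.
 * Returns 0 on success, -1 if `out` is too small. */
int normalize_text(const char *in, char *out, size_t cap)
{
    static const char *const digit_words[10] = {
        "zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"
    };
    size_t len = 0;
    int need_space = 0;   /* a separator is pending before the next letter */

    assert(in != NULL && out != NULL && cap > 0);

    for (; *in != '\0'; ++in) {
        unsigned char ch = (unsigned char)*in;
        if (isalpha(ch)) {
            if (need_space && len > 0) {
                if (len + 1 >= cap) return -1;
                out[len++] = ' ';
            }
            need_space = 0;
            if (len + 1 >= cap) return -1;
            out[len++] = (char)tolower(ch);
        } else if (isdigit(ch)) {
            const char *word = digit_words[ch - '0'];
            size_t wlen = strlen(word);
            if (len > 0) {               /* a digit always starts a new word */
                if (len + 1 >= cap) return -1;
                out[len++] = ' ';
            }
            if (len + wlen >= cap) return -1;
            memcpy(out + len, word, wlen);
            len += wlen;
            need_space = 1;
        } else {
            need_space = 1;
        }
    }
    out[len] = '\0';
    return 0;
}
```

A real front end would also expand abbreviations, dates and multi-digit numbers before handing words to the linguistic analysis step; this sketch only shows where that kind of work sits in the pipeline.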
16. TTS: Concatenative synthesis
Base strategy: Concatenate segments of recorded speech
Unit selection synthesis: uses phones, diphones, half-phones, syllables,
morphemes, words, phrases and sentences. Best results, often
indistinguishable from human speech. Requires a huge amount of
prerecorded data.
Diphone synthesis: uses a minimal database containing all diphones of a
natural language (English: 800 diphones, German: 2500 diphones).
Disadvantage: sonic glitches. Still used commercially, but on the decline.
Domain-specific synthesis: concatenates prerecorded words and
sentences. Used in transport schedule announcements, weather reports, ...
Simple to implement. High level of naturalness.
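The base strategy, joining recorded segments, can be sketched in a few lines of plain C. The function name `concat_crossfade` is hypothetical, and the linear crossfade at the join is an assumption added here as one common way to soften the glitches mentioned for diphone synthesis; the slides do not say how any particular engine smooths its joins.

```c
#include <assert.h>
#include <stddef.h>

/* Joins two recorded units `a` and `b` into `out`, blending the last
 * `overlap` samples of `a` with the first `overlap` samples of `b`
 * using a linear crossfade. `out` must hold na + nb - overlap samples.
 * Returns the number of samples written. */
size_t concat_crossfade(const float *a, size_t na,
                        const float *b, size_t nb,
                        size_t overlap, float *out)
{
    size_t i, n = 0;

    assert(a != NULL && b != NULL && out != NULL);
    assert(overlap <= na && overlap <= nb);

    /* Part of `a` that is copied unchanged. */
    for (i = 0; i + overlap < na; ++i)
        out[n++] = a[i];

    /* Blended region: the weight moves from `a` towards `b`. */
    for (i = 0; i < overlap; ++i) {
        float w = (float)(i + 1) / (float)(overlap + 1);
        out[n++] = (1.0f - w) * a[na - overlap + i] + w * b[i];
    }

    /* Remainder of `b`. */
    for (i = overlap; i < nb; ++i)
        out[n++] = b[i];

    return n;
}
```

With `overlap` set to 0 the function degenerates to plain concatenation, which is essentially what domain-specific synthesis does with whole prerecorded words and sentences.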
17. TTS: Formant synthesis
Formant: spectral peak of the sound spectrum of the voice.
Reproducing the first two (of four) formants is sufficient to make vowels
distinguishable.
Can be implemented quite easily, but produces a rather artificial sound
(“computer voice”).
English:
Vowel   Formant f1   Formant f2
i       240 Hz       2400 Hz
e       390 Hz       2300 Hz
o       360 Hz       640 Hz
German:
Vowel   Formant f1   Formant f2
i       320 Hz       3200 Hz
e       500 Hz       2300 Hz
o       500 Hz       1000 Hz
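The claim that two formants are enough to tell vowels apart can be checked with a tiny lookup. The sketch below, in plain C, is not a synthesizer: it takes a measured (f1, f2) pair and returns the closest vowel from the first set of values in the table above (240/2400, 390/2300, 360/640 Hz). The names `VowelFormants` and `nearest_vowel`, and the choice of squared Euclidean distance in Hz, are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    char vowel;
    long f1_hz;
    long f2_hz;
} VowelFormants;

/* First set of values from the table above. */
static const VowelFormants kVowels[] = {
    { 'i', 240, 2400 },
    { 'e', 390, 2300 },
    { 'o', 360,  640 },
};

/* Returns the vowel whose (f1, f2) pair is closest to the measured pair,
 * using squared Euclidean distance in Hz. */
char nearest_vowel(long f1_hz, long f2_hz)
{
    size_t i, best = 0;
    long best_dist = -1;

    assert(f1_hz > 0 && f2_hz > 0);

    for (i = 0; i < sizeof kVowels / sizeof kVowels[0]; ++i) {
        long d1 = f1_hz - kVowels[i].f1_hz;
        long d2 = f2_hz - kVowels[i].f2_hz;
        long dist = d1 * d1 + d2 * d2;
        if (best_dist < 0 || dist < best_dist) {
            best_dist = dist;
            best = i;
        }
    }
    return kVowels[best].vowel;
}
```

A formant synthesizer works in the opposite direction: it shapes a source signal so that its spectrum peaks at the chosen f1 and f2.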
18. TTS: Synthesis
                Concatenative                 Formant
Advantages      High level of naturalness     No large database required;
                                              very intelligible, also at high speeds
Disadvantages   Requires large database       Low level of naturalness (“robotic” sound)
20. TTS SDKs
• Siri
• iOS Voice Services
• Flite
• OpenEars (based on Flite)
• iSpeech
• Nuance
• AT&T
• Google TTS
• Bing TTS
22. Using iOS Voice Service
Private API: Not safe for the App Store - use at your own risk!
// VSSpeechSynthesizer is a private class, so it is looked up by name at runtime.
VSSpeechSynthesizer *speech =
    [[NSClassFromString(@"VSSpeechSynthesizer") alloc] init];
[speech setRate:1.0f];
[speech startSpeakingString:@"Hello world, how are you"];
23. OpenEars SDK
URL: http://www.politepix.com/openears/
Shared Source
Based on CMU Pocketsphinx, CMU Flite, and CMU-CLMTK
Works offline, both for recognition and synthesis
Currently only supports English
Synthetic sound (diphone voice synthesis)
Pricing: free, with additional paid voices
24. iSpeech SDK
URL: http://www.ispeech.org
Commercial, free access for testing
Needs a server connection
Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese,
Japanese, Danish, Finnish, Italian, German, Russian, ...
Synthetic sound (diphone voice synthesis)
Pricing:
pay per use ($0.02 per transaction)
pay per install ($0.25 per install, minimum 10,000 installs)
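To compare the two iSpeech plans, the arithmetic can be written down as two small functions in plain C. Prices are taken from the slide and kept in US cents to avoid floating-point rounding. The function names are hypothetical, and the reading of "minimum 10,000 installs" as a billing floor (fewer installs are still charged as 10,000) is an assumption.

```c
#include <assert.h>

/* Prices from the slide, in US cents. */
enum {
    CENTS_PER_TRANSACTION = 2,      /* $0.02 per transaction */
    CENTS_PER_INSTALL     = 25,     /* $0.25 per install */
    MIN_BILLED_INSTALLS   = 10000   /* assumed billing floor */
};

/* Cost of the pay-per-use plan for a number of transactions. */
long pay_per_use_cents(long transactions)
{
    assert(transactions >= 0);
    return transactions * CENTS_PER_TRANSACTION;
}

/* Cost of the pay-per-install plan. Assumption: fewer than 10,000
 * installs are still billed as 10,000. */
long pay_per_install_cents(long installs)
{
    assert(installs >= 0);
    if (installs < MIN_BILLED_INSTALLS)
        installs = MIN_BILLED_INSTALLS;
    return installs * CENTS_PER_INSTALL;
}
```

Under these assumptions the per-install plan costs at least $2,500, the price of 125,000 pay-per-use transactions. Per install, the two plans break even at 12.5 transactions.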
25. AT&T Speech SDK
URL: http://developer.att.com
Commercial, free trial access for 90 days
Pricing: USD 99 / year grants 1,000,000 API calls per month
TTS API:
Web Service:
send text, get WAV back
Voices:
US English (male / female)
US Spanish (male / female)
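Since the web service returns WAV data, a client will usually want to sanity-check the response before playing it. The sketch below, in plain C, reads the channel count, sample rate and bit depth from a canonical RIFF/WAVE header. The names `WavInfo` and `wav_parse_header` are hypothetical, and the code assumes the common layout in which the "fmt " chunk directly follows the RIFF header; the slides do not specify the exact format the service sends, and real files may carry extra chunks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    unsigned channels;
    unsigned long sample_rate_hz;
    unsigned bits_per_sample;
} WavInfo;

/* Reads an unsigned little-endian integer of `nbytes` bytes. */
static unsigned long read_le(const unsigned char *p, size_t nbytes)
{
    unsigned long v = 0;
    while (nbytes-- > 0)
        v = (v << 8) | p[nbytes];
    return v;
}

/* Returns 0 and fills `info` if `data` starts with a canonical
 * RIFF/WAVE header whose "fmt " chunk comes first; -1 otherwise. */
int wav_parse_header(const unsigned char *data, size_t len, WavInfo *info)
{
    assert(data != NULL && info != NULL);

    if (len < 36) return -1;
    if (memcmp(data, "RIFF", 4) != 0) return -1;
    if (memcmp(data + 8, "WAVE", 4) != 0) return -1;
    if (memcmp(data + 12, "fmt ", 4) != 0) return -1;

    info->channels        = (unsigned)read_le(data + 22, 2);
    info->sample_rate_hz  = read_le(data + 24, 4);
    info->bits_per_sample = (unsigned)read_le(data + 34, 2);
    return 0;
}
```

In an iOS app the bytes of an `NSData` response could be passed to such a check before the audio is handed to a player.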
26. Nuance
URL: http://dragonmobile.nuancemobiledeveloper.com/
Commercial, free access for testing
Needs a server connection
Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese,
Japanese, Danish, Finnish, Italian, German, Russian, ...
Rather natural sound
Pricing:
Several Service Levels (Silver, Gold, Emerald)
Silver:
Up to 20 transactions per device per day, max 500,000 devices
Gold:
Pay per device ($0.24 per install)
Pay per transaction ($0.009 per transaction)
Pre-payment of at least $3000
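Which Gold option is cheaper depends on how often each device talks to the service. The following plain C sketch compares the two prices from the slide in tenths of a cent, so the math stays in integers. The names are hypothetical, the per-device fee is assumed to be a one-time charge, and the $3000 pre-payment is left out of the comparison.

```c
#include <assert.h>

/* Gold-tier prices from the slide, in tenths of a cent. */
enum {
    MILLS_PER_DEVICE      = 240,  /* $0.24 per install */
    MILLS_PER_TRANSACTION = 9     /* $0.009 per transaction */
};

typedef enum { PLAN_PER_TRANSACTION, PLAN_PER_DEVICE } GoldPlan;

/* Picks the cheaper Gold option for an expected number of
 * transactions per device over the device's lifetime. */
GoldPlan cheaper_gold_plan(long transactions_per_device)
{
    assert(transactions_per_device >= 0);
    return transactions_per_device * MILLS_PER_TRANSACTION < MILLS_PER_DEVICE
               ? PLAN_PER_TRANSACTION
               : PLAN_PER_DEVICE;
}
```

With these numbers, paying per transaction is cheaper up to 26 transactions per device; from 27 transactions on, the per-device price wins.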
29. Speech recognition: History
1952: “Audrey” developed at Bell Labs. Could recognize digits spoken by a
single voice.
1962: “Shoebox” by IBM, demonstrated at the World’s Fair. Could recognize 16
words spoken in English.
http://sysrun.haifa.il.ibm.com/ibm/history/exhibits/specialprod1/specialprod1_7.html
1970s: DARPA Speech Understanding Research program. “Harpy”, developed at
Carnegie Mellon University (could understand 1011 words).
http://www.youtube.com/watch?v=N3i6NoUZsSw
1980s: By using statistical models (Hidden Markov Models), ASR vocabularies
grew from a few hundred words to several thousand words and, potentially,
an unlimited number of words. Still, discrete dictation was required.
1990s: Dragon NaturallySpeaking (originally at $9000) supports continuous
speech recognition.
33. Speech Recognition SDKs
• Siri
• iOS Voice Services
• Flite
• OpenEars (based on Flite)
• iSpeech
• Nuance
• AT&T
• Google TTS
• Bing TTS
35. OpenEars SDK
URL: http://www.politepix.com/openears/
Shared Source
Based on CMU Pocketsphinx, CMU Flite, and CMU-CLMTK
Works offline, both for recognition and synthesis
Vocabulary: needs to be provided by developer
Currently only supports English
Pricing: free, with additional paid voices
36. iSpeech SDK
URL: http://www.ispeech.org
Commercial, free access for testing
Needs a server connection
Supports several languages: English (US, UK, m/f), Spanish (m/f), Chinese,
Japanese, Danish, Finnish, Italian, German, Russian, ...
Pricing:
pay per use ($0.02 per transaction)
pay per install ($0.25 per install, minimum 10,000 installs)
37. AT&T Speech SDK
URL: http://developer.att.com
Commercial, free trial access for 90 days
Pricing: USD 99 / year grants 1,000,000 API calls per month
Supports several recognition contexts:
Gaming, Social Media, Web Search, Business Search, Voicemail to Text,
SMS, Question and Answer, TV, Generic
Support for command mode:
provide the set of commands that are allowed in your app. Supports 19
languages (including English, German, Mandarin, Japanese, French,
Italian)
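On the app side, command mode boils down to checking a recognized phrase against the commands the app allows. The sketch below shows that check in plain C. The function names and the sample commands in the usage note are hypothetical, and the case-insensitive exact match is an assumption about how an app might treat the returned text, not part of the AT&T API.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Case-insensitive comparison of two C strings (ASCII only). */
static int equals_ignore_case(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b))
            return 0;
        ++a;
        ++b;
    }
    return *a == '\0' && *b == '\0';
}

/* Returns the index of `hypothesis` in `commands`, or -1 if the
 * recognized text is not one of the allowed commands. */
int match_command(const char *hypothesis,
                  const char *const *commands, size_t count)
{
    size_t i;

    assert(hypothesis != NULL && commands != NULL);

    for (i = 0; i < count; ++i)
        if (equals_ignore_case(hypothesis, commands[i]))
            return (int)i;
    return -1;
}
```

For example, with the allowed commands "play", "pause" and "next track", a recognized "Pause" maps to index 1, while "rewind" is rejected with -1. The same check applies to OpenEars, where the developer also supplies the vocabulary.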
38. Nuance
URL: http://dragonmobile.nuancemobiledeveloper.com/
Commercial, free access for testing
Needs a server connection
Supports several languages: English (US, UK), Spanish, Chinese,
Japanese, Danish, Finnish, Italian, German, Russian, ...
Pricing:
Several Service Levels (Silver, Gold, Emerald)
Silver:
Up to 20 transactions per device per day, max 500,000 devices
Gold:
Pay per device ($0.24 per install)
Pay per transaction ($0.009 per transaction)
Pre-payment of at least $3000