This document summarizes the history and development of speech recognition technology. It discusses how early systems used statistical methods and large computational resources to train models. It outlines key applications and areas like dictation, command and control, telephony, automobiles, and mobile devices. The document describes the basic process of how speech recognition works including feature extraction, labeling, and classification. It highlights challenges for the future like improving recognition in noisy environments, better integration across devices and languages, and combining modalities like voice, text, and sensors.
1. Voice: The New UI for Mobile Devices
Jan Šedivý
WORLD USIBILITY DAY – 2012
1
2. Fred Jelinek (1932-2010)
During 21 years at IBM Research and
nearly two decades at Johns
Hopkins, he has pioneered the
statistical methods that enable modern
computers to understand spoken
language.
“He envisioned applying the mathematics of
probability to the problem of processing speech
and language,” said Sanjeev Khudanpur, a Johns
Hopkins associate
2
4. Speech reco benefits
Speech •
•
Speech is much richer then two mouse buttons
Disambiguation, dialog
• Show me all emails from David about Linux server
is rich • “Call David”, David Smith or Stone? Home or cell?
Text • Speech expresses not only text entry but C&C,
search, URI entry
• Speech entry is part of the keyboard
entry • “command box”, general source of information
WYSIWYG == What You Say
Is What You Get
4
5. Elements of success
• Access to huge content: Internet, YouTube,
maps, music, pictures, SMS, email…
• Train on all available data: contact, location
Best names addresses, email, documents content,
history, personalization and other sensors: GPS,
accuracy: accelerometers, camera, compass
• Computationally expensive - huge clusters of
computers to speed up training
• speech reco must not introduce any friction to
the interface
• keyboard, touch screen, multi-touch, keyboard,
Great UI speaker, microphone
• OS control, part of the OS, noise reduction, AD
design: converter
• Use all sensors available on the phone to inject
extra information to app
5
7. Speech recognition areas
Command Creation of
Telephony
control, digit texts, dictati
IVR
dictation on
Mobile Voice
Automotive
devices search
Speech is the most natural
way
we communicate
7
8. The main areas in time perspective
PC – C&C, dictation
Telephony
Automotive
Mobile devices
UI
1995 2000 2005
8
9. Little more history
1993 IBM Personal Dictation System IBM PC, audio adapter card
1996 VoiceType (Win 95, dictation, isolated words, email, …)
1996 Nuance deployed its first commercial speech application
1997 Dragon Systems unveiled its Naturally Speaking
1999 VoiceXML
2000 Telephony applications, IVR
2002 Car control (control car equipment, make a phone call, select
music, dictate address to navigation)
2003 Microsoft includes speech to Office 2003
2007 Growth of mobile phones/devices
2008 Google launches speech to Search iPhone
2009 Nuance Acquires IBM's patents Speech Technology rights
2011 iOS 5, Siri
9
11. Speech recognition – high level
Digitize audio
AD convertor
FFT, Non-lin,
DFFT Front End
feature extraction
Application
API
Labeling
triphones, prototy
pes
Text output Search
LM, HMM, Viter Back End
bi classification
11
13. IBM speech recognition – the early days
Large vocabulary, dictation (1990…)
Office correspondence task – Tangora
Written in Fortran
IBM RISC System/6000, AIX, Tangora
Albert Tangora (July 2, 1903 – April
7, 1978) set the world speed record for
sustained typing on a manual keyboard for
one hour, 147 words per minute, on
13
October 22, 1923.
14. How to get reco running on PC -1994
• Add-on board with ASIC
Front End • Integer version on CPU
• Input - 39 dim cepstrum coeffs feature vector
Hierarchical each 10 ms
• Output - 100 most likely prototypes out of
labeler 30k, diagonal Gaussians
• Statistical LM – high compression, log,
Search • Viterbi search, Hidden Markov Models
14
15. How get reco running on Embedded 1999
• Resource efficient speech recognition engine
Easy Port to • Written in C/C++
• Integer implementation, GCC compiler
Embedded • Simple API to customize for any platform
• Grammar support for command control
applications
Basic reco • Special emphasis on digit recognition
• Robust front end for noisy environments
• Command control
Cars • Digit and name dialing
• Navigation control
applications: • On-board entertainment control
15
21. Factors accelerating better mobile apps
Basic phone
More powerful CPU more memory
Connectivity, Internet
Much better UI, multi-
touch screen
Rapid growth of mobile phones/devices is
driving the adoption of speech recognition
21
22. Why is reco so important for mobile?
Small screen
Limited keyboard
Difficult text entry
Difficult to navigate
Slow, not reliable connectivity (latency)
Speech is fundamentally
changing the mobile user
22
experience
27. iOS Siri versus Google search
Siri are "natural language
processing" apps that use statistical
Siri is deep in iOS, start apps,
make calls, set meetings
Google is deep in the search engine
Can't launch apps with Google, you
can dictate an email or a text message.
Google is faster (much faster)
Future – combination of AI and
different UI
27
29. Future challenges
Better recognition, ROBUSTNES (noisy conditions,
dictation)
Better UI integration (speech button)
Multiple languages (how would a German native search for
an address in France?)
Switching between multiple languages
UI combining multiple
modalities, (voice, text, video, sensors)
Work on dictated text correction
Better integration of speech reco to special applications
29 ECSS 2010, 10/12/2010