Aplikace pro rozpoznávání řeči - Jan Šedivý

Voice: The New UI for Mobile Devices

Jan Šedivý
WORLD USIBILITY DAY – 2012

1

Fred Jelinek (1932-2010)

During 21 years at IBM Research and
nearly two decades at Johns
Hopkins, he has pioneered the
statistical methods that enable modern
computers to understand spoken
language.

“He envisioned applying the mathematics of
probability to the problem of processing speech
and language,” said Sanjeev Khudanpur, a Johns
Hopkins associate
2

WHY SPEECH RECOGNITION?

3

Speech reco benefits

Speech •
•
Speech is much richer then two mouse buttons
Disambiguation, dialog
• Show me all emails from David about Linux server
is rich • “Call David”, David Smith or Stone? Home or cell?

Text • Speech expresses not only text entry but C&C,
search, URI entry
• Speech entry is part of the keyboard
entry • “command box”, general source of information

WYSIWYG == What You Say
Is What You Get
4

Elements of success
• Access to huge content: Internet, YouTube,
maps, music, pictures, SMS, email…
• Train on all available data: contact, location
Best names addresses, email, documents content,
history, personalization and other sensors: GPS,
accuracy: accelerometers, camera, compass
• Computationally expensive - huge clusters of
computers to speed up training

• speech reco must not introduce any friction to
the interface
• keyboard, touch screen, multi-touch, keyboard,
Great UI speaker, microphone
• OS control, part of the OS, noise reduction, AD
design: converter
• Use all sensors available on the phone to inject
extra information to app

5

WHERE IS SPEECH RECOGNITION
USEFUL?

6

Speech recognition areas

Command Creation of
Telephony
control, digit texts, dictati
IVR
dictation on

Mobile Voice
Automotive
devices search

Speech is the most natural
way
we communicate
7

The main areas in time perspective
PC – C&C, dictation
Telephony
Automotive
Mobile devices
UI
1995 2000 2005

8

Little more history
 1993 IBM Personal Dictation System IBM PC, audio adapter card
 1996 VoiceType (Win 95, dictation, isolated words, email, …)
 1996 Nuance deployed its first commercial speech application
 1997 Dragon Systems unveiled its Naturally Speaking
 1999 VoiceXML
 2000 Telephony applications, IVR
 2002 Car control (control car equipment, make a phone call, select
music, dictate address to navigation)
 2003 Microsoft includes speech to Office 2003
 2007 Growth of mobile phones/devices
 2008 Google launches speech to Search iPhone
 2009 Nuance Acquires IBM's patents Speech Technology rights
 2011 iOS 5, Siri

9

HOW SPEECH RECOGNITION
WORKS

10

Speech recognition – high level
Digitize audio
AD convertor

FFT, Non-lin,
DFFT Front End
feature extraction
Application

API
Labeling
triphones, prototy
pes

Text output Search
LM, HMM, Viter Back End
bi classification

11

APPLICATIONS DEVELOPMENT
CHRONOLOGICALLY

12

IBM speech recognition – the early days
 Large vocabulary, dictation (1990…)
 Office correspondence task – Tangora
 Written in Fortran
 IBM RISC System/6000, AIX, Tangora

Albert Tangora (July 2, 1903 – April
7, 1978) set the world speed record for
sustained typing on a manual keyboard for
one hour, 147 words per minute, on
13
October 22, 1923.

How to get reco running on PC -1994

• Add-on board with ASIC
Front End • Integer version on CPU

• Input - 39 dim cepstrum coeffs feature vector
Hierarchical each 10 ms
• Output - 100 most likely prototypes out of
labeler 30k, diagonal Gaussians

• Statistical LM – high compression, log,
Search • Viterbi search, Hidden Markov Models

14

How get reco running on Embedded 1999
• Resource efficient speech recognition engine
Easy Port to • Written in C/C++
• Integer implementation, GCC compiler
Embedded • Simple API to customize for any platform

• Grammar support for command control
applications
Basic reco • Special emphasis on digit recognition
• Robust front end for noisy environments

• Command control
Cars • Digit and name dialing
• Navigation control
applications: • On-board entertainment control

15

7 billion people

Over 5.3 billion people or
77% of the world’s
population are now on
mobile.
according to
WIPRO
17

Mobile operating system preferences

 Sparkwiz 2012
18

Factors accelerating better mobile apps
Basic phone

More powerful CPU more memory

Connectivity, Internet
Much better UI, multi-
touch screen
Rapid growth of mobile phones/devices is
driving the adoption of speech recognition

21

Why is reco so important for mobile?
Small screen

Limited keyboard

Difficult text entry

Difficult to navigate

Slow, not reliable connectivity (latency)
Speech is fundamentally
changing the mobile user
22
experience

LATEST APPLICATIONS

23

Google Now, Google search

Some Android
phones: two
mics
24

Poor performance in the Czech Rep.

26

iOS Siri versus Google search
 Siri are "natural language
processing" apps that use statistical
 Siri is deep in iOS, start apps,
make calls, set meetings
 Google is deep in the search engine
 Can't launch apps with Google, you
can dictate an email or a text message.
 Google is faster (much faster)

Future – combination of AI and
different UI
27

Future challenges
Better recognition, ROBUSTNES (noisy conditions,
dictation)
Better UI integration (speech button)
Multiple languages (how would a German native search for
an address in France?)
Switching between multiple languages
UI combining multiple
modalities, (voice, text, video, sensors)
Work on dictated text correction

Better integration of speech reco to special applications

29 ECSS 2010, 10/12/2010

QUESTIONS & THANK YOU

30

Aplikace pro rozpoznávání řeči - Jan Šedivý

Recommandé

Recommandé

Contenu connexe

Tendances

Tendances (20)

En vedette

En vedette (20)

Similaire à Aplikace pro rozpoznávání řeči - Jan Šedivý

Similaire à Aplikace pro rozpoznávání řeči - Jan Šedivý (20)

Plus de Asociace UX (Prague ACM SIGCHI)

Plus de Asociace UX (Prague ACM SIGCHI) (6)

Aplikace pro rozpoznávání řeči - Jan Šedivý