Voice Recognition and Natural Language - Dallas TechFest 2016
© 2016 Versay Solutions LLC
2. Crispin Reedy
• Voice User Interface Designer
• 10 years in the field
• Former coder; got interested in UX
• President of the Association for Voice Interaction Design
• Consultant for Versay Solutions
@crispinTX
crispinreedy.com
4. Disclaimers
This Session Is About:
• What is speech recognition anyway?
• Should I speech-enable X? How?
• In general, how does it work?
– What technologies should I consider?
– What skills are important?
• What are the design considerations?
It’s NOT About:
• Detailed code
• In-depth how-tos
• Deep technical knowledge
• Advanced ASR
7. How does this new modality enable or enhance what I want to do on this platform?
9. Terms & Technologies
• Speech Recognition
• Natural Language Understanding
• Text to Speech
• Voice Verification (Biometrics)
10. Speech Recognition
• Also known as “ASR”
– “Speech to Text”?
[Diagram: spoken language → machine-readable format, e.g. “See the cat.”]
11. Natural Language Understanding
• Extracting meaning from natural text
– Not necessarily tied to speech recognition
“Hello, yes, I’d like to pay my water bill. Can you help me with that?”
→ Action = BillPay, BillType = Water
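The water-bill example can be sketched as a toy keyword matcher. This only illustrates the idea of turning text into Action / BillType tokens; production NLU is statistical and trained on tagged utterances. The function name `extract_meaning` and the `BILL_TYPES` list are assumptions made for this sketch.

```python
from typing import Dict

# Hypothetical list of bill types this toy example knows about.
BILL_TYPES = ("water", "electric", "gas")


def extract_meaning(utterance: str) -> Dict[str, str]:
    """Map free text to Action / BillType tokens using keyword spotting."""
    text = utterance.lower()
    meaning: Dict[str, str] = {}
    # Filler such as "Hello, yes" is simply ignored.
    if "pay" in text and "bill" in text:
        meaning["Action"] = "BillPay"
    for bill_type in BILL_TYPES:
        if bill_type in text:
            meaning["BillType"] = bill_type.capitalize()
            break
    return meaning
```

Note that the input is plain text, so the same function would work on typed input as well as on ASR output.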
12. Text to Speech
• Speech Synthesis
– Used to convert text to spoken words
13. Voice Verification
• Also called voiceprints, biometrics, voice authentication, etc.
• Recognizes a person, not necessarily what they are saying.
– You can have ASR without Voice Verification
– And vice versa
“My voice is my password.” → “Authenticated. Welcome, Mr. Smith.” ✓
14. Uses: Separate Applications
Speech Recognition
• Hands-free command / control
• Dictation
• Input text
• Small form factor device, etc.
Text To Speech
• Output text dynamically
• Respond to input
• Useful when no display is available
Natural Language Understanding
• Necessary at some level for all language-based input
• Also used to parse large volumes of text
Voice Verification
• Security
15. Uses: Combined
[Diagram: an Application and its Data, connected to ASR, NLU, TTS, and Voiceprint Verification. Labeled steps: Sign-In, Interaction, Request, Action, Meaning, Access Data, Output.]
16. True Multimodality
[Diagram: the same components as “Uses: Combined” (Application, Data, ASR, NLU, TTS, Voiceprint Verification; steps Sign-In, Interaction, Request, Action, Meaning, Access Data, Output), plus Touch, Keyboard, and Visual channels. The application must also manage I/O modality and determine meaning in context. Context!]
19. Output: Text to Speech
• (Somewhat) mature technology
• (Fairly) easy to understand and use
– Note: “Create TTS audio” is not the same as having a TTS engine
21. TTS Engine
• Text in, speech out
• May do some text pre-processing
– “St. James St.” → “Saint James Street”
– Punctuation
– If the engine doesn’t do this, you’ll have to do it yourself.
• Grapheme to phoneme transcription
• Identify intonation patterns
– Assign the correct lexical stress to the words
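If the engine does no pre-processing, the “St. James St.” case can be handled with a small normalization pass before the text is sent to TTS. This is a minimal sketch with one deliberately naive rule (an assumption for illustration, not how any particular engine works): “St.” before a capitalized word is read as “Saint”, and any other “St.” as “Street”.

```python
import re


def expand_st(text: str) -> str:
    """Expand the ambiguous abbreviation "St." before handing text to TTS."""
    # "St." directly before a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining "St." is read as "Street".
    return re.sub(r"\bSt\.", "Street", text)
```

The heuristic fails on input like “Main St. Dallas”, which is why addresses and names tend to need real exception handling.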
22. What Makes Good TTS?
• Phonemes change based on location
– “Cat”
– “Alligator”
• Elision
– “I’m. Awaiting. You.”
– “I’m awaiting you.”
• Intonation
– “Do you want coffee?”
– “Do you want soda, tea, or coffee?”
23. SSML
• XML-based W3C standard for Speech Synthesis Markup
– Not universally supported by vendors.
• Tags for marking up text to produce more natural-sounding output.
– Emphasis
– Break
– Voice
– Prosody
– Pitch (set through the prosody tag)
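A small SSML fragment using a few of these tags might look like the string below. The sentence content is made up for illustration, and whether a given vendor honors each tag is not guaranteed. The helper only checks that the markup is well-formed XML and lists the tags it uses; it does not synthesize audio.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical prompt text; the tags and attribute values come from SSML 1.0.
SSML = """<speak version="1.0" xml:lang="en-US">
  Your flight leaves at <emphasis level="strong">six</emphasis> a.m.
  <break time="500ms"/>
  <prosody rate="slow" pitch="low">Please arrive two hours early.</prosody>
</speak>"""


def tags_used(ssml: str) -> List[str]:
    """Parse the markup (raises ParseError if malformed) and list its tags."""
    root = ET.fromstring(ssml)
    return sorted({element.tag for element in root.iter()})
```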
25. When To Use It
• When high-quality audio is not a consideration
– TTS has improved considerably, but synthetic speech is still noticeable
• When you have a lot of dynamic data
– If you just need to say a few things, it may be overkill
26. Other Considerations
• More phonemes = higher quality voice
– Also means a bigger download and install (if on device)
• Exceptions (addresses, names) can be iffy
– May require a lot of work to handle well
• Your data needs to be clean and ready to voice back
– Acronyms, incomplete sentences will not sound good
• Some applications may have other acoustic limitations
– Telephony
• It is possible to build a custom voice
– But it takes a lot of work!
27. Where To Find It
• Many commercial products available
– Most languages and dialects, e.g. American English, British English, etc.
– Many different voices
– Nuance, Cepstral, Ivona
– Some open source
– Some APIs
• Chrome https://developer.chrome.com/apps/tts
29. ASR and NLU: Topics
• Complications of speech
– Why is it so hard?
• How it works: overview
• Early commercial adoptions
– IVR
• Design considerations
• Speech today
– Different vendors
• Should I voice-enable X?
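31. Credit: The Voice in the Machine (Pieraccini)
[Diagram: layers of knowledge, top to bottom: World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics. Disciplines: Linguistics, Physiology. Units: Concepts, Phrases, Words, Phonemes, Sounds. Technology labels: ASR, NLU.]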
32. Speech Is Ambiguous
• Speech is never stationary
– Coarticulation
• Noisy environments
• Accents
• Different speakers have voices with different acoustic qualities
– Goats
• Challenges vary depending on what you are going to recognize
– Spelling (short utterances) can be difficult even for humans
– Phonetic alphabet (Military)
33. Language Is Ambiguous
• Humans can deduce meaning from context and unknown words
“How can I help you?”
I’m having a problem with my account.
I’d like that one. No, not the green one, the red one.
Time flies like an arrow.
Fruit flies like a banana.
34. Everything Is Ambiguous
• All modern speech recognition is probabilistic
– GUI: Button clicked? true / false
– VUI: There is an 85% chance that button was clicked
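One common way to act on a probabilistic result is to compare the recognizer's confidence against thresholds. The sketch below assumes a confidence score between 0 and 1; the threshold values 0.80 and 0.45 are placeholders chosen for illustration and would need tuning against real call data.

```python
def decide(confidence: float, accept: float = 0.80, confirm: float = 0.45) -> str:
    """Turn a recognition confidence score into a dialog decision."""
    if confidence >= accept:
        return "accept"    # act on the result without asking
    if confidence >= confirm:
        return "confirm"   # e.g. "You said checking, right?"
    return "reprompt"      # too uncertain; ask again
```

With these placeholder thresholds, the 85% example from the slide would be accepted outright.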
35. Three Dimensions of Speech Problems
Credit: The Voice in the Machine (Pieraccini)
[Diagram: three axes, each running from easier to harder.
Speaker Independence: Speaker Dependent → Multiple Speakers → Speaker Independent
Isolated Words → Connected Words → Natural Speech
Vocabulary Size: 10 words → 1000 words → 100,000 words → Unlimited
The far corner is labeled “Humanlike.”]
36. History of Speech Recognition
• AUDREY: Davis, Biddulph, and Balashek - Bell Labs 1952
• Analog
• Isolated digit recognition
– Pause between digits
• Speaker-dependent
41. 1970’s: Template Matching
• Template matching approach
– “Brute force” model
– Quantized spectrograms
– What about duration?
• Dynamic time warping
• Endpoint detection
– Difficult to do
• Feature extraction
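Template matching with dynamic time warping can be sketched in a few lines. This assumes each utterance has already been reduced to one number per frame; real systems of the era compared quantized spectral feature vectors. The function names and the toy templates in the usage note are illustrative only.

```python
from typing import Dict, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cost of the cheapest alignment of a and b, allowing frames to stretch."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = frame_cost + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def recognize(sample: Sequence[float], templates: Dict[str, Sequence[float]]) -> str:
    """Brute force: return the word whose stored template is closest."""
    return min(templates, key=lambda word: dtw_distance(sample, templates[word]))
```

For example, `recognize([1, 1, 3, 3, 1], {"one": [1, 3, 1], "two": [2, 2, 5]})` picks "one": the sample is the same shape spoken more slowly, which is the duration problem that time warping addresses.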
42. 1980’s: The Power of Statistics
• The recognition of connected speech becomes a search for the best path in a large network
– Problem of finding the probabilities
• Statistical Language Models
– Not all sequences of words are equally probable
– Rank all permissible sentences in terms of probability
• “Correct” grammar is not applicable
• Restricted by domain
• Hidden Markov Models (HMM)
– Unified probabilistic model for speech
43. Hidden Markov Model Example
Image: “HiddenMarkovModel” by Tdunningvectorization (Wikimedia)
X: states
y: possible observations
a: state transition probabilities
b: output probabilities
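Using the same symbols (states X, observations y, transition probabilities a, output probabilities b), the “search for the best path” can be sketched with the Viterbi algorithm. The two phoneme-like states, the three observation symbols, and every probability in the usage note are invented for illustration; a real recognizer estimates them from training data.

```python
from typing import Dict, List, Sequence, Tuple


def viterbi(
    y: Sequence[str],
    states: Sequence[str],
    start: Dict[str, float],
    a: Dict[str, Dict[str, float]],
    b: Dict[str, Dict[str, float]],
) -> Tuple[float, List[str]]:
    """Return (probability, state path) of the most likely hidden path."""
    # best[s] holds the best (probability, path) that ends in state s.
    best = {s: (start[s] * b[s][y[0]], [s]) for s in states}
    for obs in y[1:]:
        best = {
            s: max(
                (
                    (prob * a[prev][s] * b[s][obs], path + [s])
                    for prev, (prob, path) in best.items()
                ),
                key=lambda candidate: candidate[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda candidate: candidate[0])


# Toy model (all numbers are made up).
STATES = ("S", "IY")
START = {"S": 0.6, "IY": 0.4}
A = {"S": {"S": 0.7, "IY": 0.3}, "IY": {"S": 0.4, "IY": 0.6}}
B = {
    "S": {"hiss": 0.5, "mix": 0.4, "tone": 0.1},
    "IY": {"hiss": 0.1, "mix": 0.3, "tone": 0.6},
}
```

Calling `viterbi(["hiss", "mix", "tone"], STATES, START, A, B)` returns the path S, S, IY for this toy model.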
44. You’re Only As Good As What You’re Trained On
• Corpora
– Collection of speech used to train a recognizer
– Acoustic and/or Pronunciation Model
• Associates sounds with symbols and words.
• Created from a general speech corpus and a phonetic and orthographic transcription
– Statistical Language Model (SLM)
• A probability distribution over sequences of words
• Created from a domain-specific speech corpus and a tagged transcription to extract meaning
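“A probability distribution over sequences of words” can be made concrete with a tiny bigram model. This sketch uses unsmoothed maximum-likelihood counts and a three-sentence made-up corpus, so any unseen word pair gets probability zero; real SLMs are trained on thousands of transcribed utterances and use smoothing.

```python
from collections import Counter
from typing import Iterable, Tuple


def train_bigrams(sentences: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count how often each word occurs as a context and each word pair occurs."""
    contexts, pairs = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(words[:-1])
        pairs.update(zip(words, words[1:]))
    return contexts, pairs


def sentence_prob(sentence: str, contexts: Counter, pairs: Counter) -> float:
    """Probability of a sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if contexts[prev] == 0:
            return 0.0
        prob *= pairs[(prev, word)] / contexts[prev]
    return prob


# Hypothetical domain corpus for a bill-pay application.
CORPUS = ["pay my bill", "pay my water bill", "check my balance"]
CONTEXTS, PAIRS = train_bigrams(CORPUS)
```

With this corpus, “pay my bill” scores higher than the same words in the order “bill my pay”, which is the sense in which not all word sequences are equally probable.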
45. Training
[Diagram: the Speech Recognition Engine draws on the Acoustic Model, the Pronunciation Model, and the SLM and/or Grammar.]
46. Language Model vs. Grammar
• SLM
– Has to be trained against collected utterances
– Large potential set of what the caller can say
– Tagged with the meanings of what they can say
• Grammar (GrXML)
– More tightly constrained than an SLM
– Easier to create
– Not “trained” in the same way
– System will only recognize what is in the grammar
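The “only recognizes what is in the grammar” behavior can be illustrated with a lookup table. This is not GrXML and not how an ASR engine matches audio; it is a sketch, with made-up phrases and token names, of the contract a grammar gives the application: in-grammar phrases return a token, and everything else is a no-match.

```python
from typing import Optional

# Hypothetical grammar for an "account type" dialog state.
ACCOUNT_GRAMMAR = {
    "checking": "CHECKING",
    "checking account": "CHECKING",
    "savings": "SAVINGS",
    "savings account": "SAVINGS",
    "money market": "MONEY_MARKET",
}


def match_grammar(utterance: str) -> Optional[str]:
    """Return the token for an in-grammar phrase, or None for a no-match."""
    phrase = utterance.lower().strip(" .!?")
    return ACCOUNT_GRAMMAR.get(phrase)
```

An SLM, by contrast, is expected to cope with phrasings nobody listed in advance, at the cost of collecting and tagging utterances.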
47. [Diagram: recognition pipeline]
Utterance (Noise levels? Barge-in?) → Feature Extraction, Endpointing → Speech Recognition Engine (Grammar or SLM; probabilities) → Recognition Event (n-best list, literal return, tokens)
48. Natural Language Understanding
• Parsing input to extract meaning
• Covers a large field
– Commands
– Automatic classification of emails
– Newspaper articles, large chunks of text
• Lexicon
• Parser
• Grammar rules
• New tools / APIs
49. Levels of Meaning
Prompt: “How can I help you?”
• Too Broad / Ambiguous: “I’m having a problem with my account.”
• Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…”
• Just Right: “I’m seeing an unusual charge on my bill.”
50. Multi-Token Utterances
• “I’d like to transfer $50 from my checking account to my savings account.”
– ACTION = Transfer
– FROM_ACCOUNT = Checking
– TO_ACCOUNT = Savings
– AMOUNT = $50
• Unfortunately, people don’t often naturally produce these kinds of utterances.
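Pulling several tokens out of one utterance can be sketched with a regular expression. The pattern below is an assumption written to fit the example sentence exactly; it returns nothing for the partial or reordered phrasings callers more often produce, which is why real systems fill missing slots with follow-up questions.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern covering only the example phrasing.
TRANSFER = re.compile(
    r"transfer \$?(?P<amount>\d+(?:\.\d{2})?) "
    r"from my (?P<src>checking|savings) account "
    r"to my (?P<dst>checking|savings) account",
    re.IGNORECASE,
)


def parse_transfer(utterance: str) -> Optional[Dict[str, str]]:
    """Extract all four tokens, or return None if the utterance does not fit."""
    match = TRANSFER.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match["src"].capitalize(),
        "TO_ACCOUNT": match["dst"].capitalize(),
        "AMOUNT": "$" + match["amount"],
    }
```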
51. Early Commercial Adoption
• IVR
– Touchtone / DTMF
• “For checking, press 1. For savings, press 2.”
– Directed Dialog (Grammar-based ASR)
• “Which account? Just say ‘checking,’ ‘savings,’ or ‘money market.’”
– Natural Language (SLM-based ASR)
• “From which account?”
• SpeechWorks / Nuance technology
• Voice XML / GrXML
54. Typical IVR Architecture
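[Diagram: calls arrive over PSTN / VoIP (SIP) at the Voice Browser, which runs the VUI as VXML fetched over HTTP from the App Server. The App Server has a data connection to back-end Data. The Voice Browser reaches the ASR Server and TTS Server over MRCP.]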
55. Anatomy of a VUI + NLU project
• Voice User Interface Design
– High level design
• Design style, sound and feel, IA
– Detailed design
• Prompts (recorded)
• Grammars for directed dialog states
• Data I/O
• SLM Creation
– Utterance capture
– Transcription
– Tagging
– Compiling and deployment
59. Design Considerations
• Types of Speech User Interfaces
– Command and Control
– Dictation
– Dialog-based
• Speech is a linear, time-based interface
– Multimodality introduces additional complications
60. Design Considerations
• If the recognizer doesn’t get something, you have to reprompt.
• Don’t say “sorry.”
“Where are you traveling today?”
I’m going to…. <noise>
“What city was that?”
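The reprompt pattern can be sketched as a prompt selector keyed on how many times recognition has failed in the current dialog state. The first reprompt is the one from the slide; the second, more explicit one is an assumption added for illustration. Neither apologizes.

```python
from typing import Sequence

INITIAL_PROMPT = "Where are you traveling today?"
CITY_REPROMPTS = [
    "What city was that?",
    # Hypothetical second-level reprompt with more guidance.
    "Please say just the name of the city you’re traveling to.",
]


def next_prompt(initial: str, reprompts: Sequence[str], failures: int) -> str:
    """Pick the prompt to play after a given number of failed recognitions."""
    if failures <= 0:
        return initial
    # Stay on the last reprompt once the list is exhausted.
    index = min(failures, len(reprompts)) - 1
    return reprompts[index]
```

A real application would normally cap the number of failures and then transfer the caller or fall back to another input method.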
61. Design Considerations
• Speech is interruptible
– Main Menu: Choose from: “Beverages,” “Sandwiches,” “Sides,” “Salads,” or “Alcoholic Drinks.”
63. Design Considerations
• Input must be limited *after* it is provided
– Can’t check the box on the client side to only allow input of valid amounts
– “Sorry, you’re only allowed to transfer up to $500.”
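Because a caller can say any amount, the limit has to be checked after recognition rather than enforced by the input control. A minimal sketch, assuming the amount has already been recognized and parsed; the function name and the follow-up question for a non-positive amount are made up for this example.

```python
from decimal import Decimal
from typing import Optional


def validate_transfer(amount: Decimal, limit: Decimal = Decimal("500")) -> Optional[str]:
    """Return the prompt to play if the amount is invalid, else None."""
    if amount <= 0:
        return "How much would you like to transfer?"
    if amount > limit:
        return f"Sorry, you’re only allowed to transfer up to ${limit}."
    return None
```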
64. Design Considerations
• Avoid using the word “Help” as a global command.
• Instead, if there is a need to give additional information, supply it in the first or second reprompts.
– Or use specific keywords other than “help”
• “You can also say ‘instructions.’”
• “Or, say ‘It’s something else.’”
65. User Centered Design Techniques
• A set of techniques designed to keep the focus on the user during the design process
• May include but are not limited to:
– Conversations
• Specific to VUI design
– Read Aloud
• Specific to VUI design
– Card Sorts
• Used to construct an IA
– Personas
• Used in all modalities
– Usability Testing
• Used in all modalities
– A/B Testing
• Useful for applications that are already in production
70. What’s the Use Case For Speech?
• Enabling application
– User can’t do it any other way
– New tasks
• Enhancing application
– User can do it now
– But speech makes it better
• Faster
• Safer
Credit: Bruce Ballentine, EIG
71. How Hard Is It To Do?
• What do you need it for?
• What kind of device will you be running it on?
– Connectivity? Can you use cloud-based ASR?
– Do you have to download it? If so, how much space do you have?
• How much control do you need over the application / user interface?
72. Possibilities
• Write an app (skill) for an agent such as Cortana / Alexa
• Use cloud APIs to add ASR to your app / device / page / gadget
• Download an ASR and use full-featured capabilities for more robust recognition
• Build your own
74. Today’s Cloud-Based Speech APIs
• Distributed speech recognition
– Collection and compression of speech is on the device
– The language models are typically on the network
– Phone can be speaker-dependent
• Trains itself on your voice and on the acoustic environments you are in most often
– Many companies are providing APIs to use their speech recognition
75. AVS vs. Amazon Echo
• Could use AVS with the Amazon Echo, or with your own device
79. Alexa “Skills”
• “Alexa, ask Yelp to find me a restaurant.”
– Cortana has similar integration
• Register your skill with Amazon and publish it
80. Cloud vs. Downloadable / Embedded
Cloud
• Microsoft
– Cortana integration
– Project Oxford API
• Google API
• Amazon
• Several recent startups
– Api.ai, Capio.ai, Speechmatics, iSpeech
Downloadable / Embedded
• Microsoft
– Windows 10 Speech APIs
– Microsoft Speech Server
• Nuance
– the 800 pound gorilla in the room
• Interactions
– IBM Watson
81. Cloud vs. Downloadable / Embedded
Cloud
• Easy to get started
• Lightweight
• Not much specialized knowledge
Downloadable / Embedded
• Customizable
• Probably better recognition
• Can be device-specific
• More features
• Higher powered
• Will require specialized knowledge
• Speech scientist
82. Today’s NLU APIs
• Microsoft LUIS (part of Project Oxford)
• Api.ai
83. Open Source ASR
• CMU Sphinx
– pocketsphinx
• Kaldi
– http://kaldi-asr.org/
• Github
• New updates include some pretty interesting stuff (DNN)
• Requires:
– Corpus
– Tech know-how
84. Who May You Need On Your Team
• Speech Scientist
• VUI Designer
86. Should I Speech-Enable X?
Desktop App / Website
• Easy to get started with API-based ASR
• But the use case may not be as powerful
Tablet / Mobile
• Stronger use case
• But will the network be available for APIs?
Industrial Device
• Great use case esp. with multimodal
• But this is harder to do and probably will be custom
Gadget
• Decent use case
• APIs are tailored for this
• Will they do everything you need?
• Will the extra modality be a plus or just a “silly add-on?”
Car
• Safety considerations are high here
• Need better user interfaces & more robust
IVR
• Touchtone can still be good for a lot of applications
• Speech is good for complex call routing and input
87. Resources
• The Voice in the Machine: Building Computers that Understand Speech – Roberto Pieraccini
• YouTube video: “Open the Pod Bay Doors, Siri”
• Best Practices in VUI Design: AVIxD Wiki
– http://videsign.wikispaces.com/
• AVIxD: Quarterly Brown Bags
88. Thanks!
@crispinTX
crispinreedy.com
creedy@versay.com
Editor's notes
DO NOT FORGET TO BRING THE MINI-SPEAKERS!!!
Siri, Alexa, Cortana: Voice recognition is hot. This session will give an overview of the voice recognition and natural language ecosystem and the technologies behind the experience. For example: What is the difference between voice recognition and natural language understanding? What are some of the common technologies in the market today? What are design considerations around these types of interfaces? This is an introductory session designed for people interested in exploring the possibilities of voice and conversational user interfaces, especially when considering the Internet of Things.
Computers: Apps and webpages. Consoles: Gaming / Connectivity
Mobile and Tablet
Industrial devices – especially something task driven
Gadgets
Cars
The phone
Essentially what we’re coming to terms with here is a new input modality. It’s one that doesn’t always work very well – for reasons we’ll get into later. But, it can be a very powerful one when it does work well. It’s also a lot harder to figure out how to properly combine speech with everything else that is going on in your environment.
Not going to discuss this one in a lot of detail today but it’s important that you understand the difference between these technologies.
Human voice talent
Hundreds of hours of recording
Digitized
Phonemes:
Concatenated speech synthesis
World Knowledge: Concepts of the world around us, e.g. tables have four legs, what is left and right, what is a car, etc. This is the level before language.
Semantics: The first level of language. Knowledge can be represented in structured meaningful elements. Example: semantics of a party invitation.
Syntax: The rules that govern putting words together to form meaningful units.
Lexicon: What words mean.
Morphology: How words change their form to perform differently in a language, e.g. horse / horses.
Phonetics: Phonemes and how words are built.
Acoustics: What phonemes sound like and how to create them.
Waveforms show the variation in overall intensity (decibels) over time.
Spectrograms show the variation of individual frequency components.
Observations to make: Represents the entirety of a VUI experience
Placement of Spanish prompt would vary depending on type of call.
Confirmation is variable
Confirmation prompt is general