Voice Recognition and Natural Language - Dallas TechFest 2016

Voice Recognition and
Natural Language
Dallas TechFest
January 29, 2016
Crispin Reedy @crispinTX
#DallasTechFest16

© 2016 Versay Solutions LLC
• Voice User Interface Designer
• 10 years in the field
• Former coder; got interested in UX
• President of the Association for Voice
Interaction Design
• Consultant for Versay Solutions
@crispinTX
crispinreedy.com

Disclaimers
This Session Is About:
• What is speech
recognition anyway?
• Should I speech-enable
X? How?
• In general, how does it
work?
– What technologies should I
consider?
– What skills are important?
• What are the design
considerations?
It’s NOT About:
• Detailed code
• In-depth how-tos
• Deep technical
knowledge
• Advanced ASR

What IS X?

How does this new modality
enable or enhance what I want
to do on this platform?

What IS X?

Terms & Technologies
• Speech Recognition
• Natural Language Understanding
• Text to Speech
• Voice Verification (Biometrics)

Speech Recognition
• Also known as “ASR”
– “Speech to Text” ?
“See the cat.”
Spoken
language
Machine-
readable
format

Natural Language Understanding
• Extracting meaning from natural text
– Not necessarily tied to speech recognition
“Hello, yes,
I’d like to
pay my
water bill.
Can you
help me with
that?”
Action =
BillPay
BillType =
Water

Text to Speech
• Speech Synthesis
– Used to convert text to spoken words

Voice Verification
• Also called voiceprints, biometrics, voice
authentication, etc.
• Recognizes a person, not necessarily
what they are saying.
– You can have ASR without Voice Verification
– And vice versa
“My voice is
my password.”
“Authenticated.
Welcome, Mr.
Smith.”
✓

Uses: Separate Applications
Speech Recognition
• Hands-free command / control
• Dictation
• Input text
• Small form factor device, etc.
Text To Speech
• Output text dynamically
• Respond to input
• Useful when no display is available
Natural Language Understanding
• Necessary at some level for all language-based input
• Also used to parse large volumes of text
Voice Verification
• Security

Uses: Combined
[Diagram components: ASR, NLU, TTS, Voice Verification (voiceprints), Application, Data]
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output
True Multimodality
[Diagram components: ASR, NLU, TTS, Voice Verification (voiceprints), Touch, Keyboard, Visual, Application, Data]
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output
• Manage I/O Modality
• Determine Meaning in Context
• Context!
Output: Text to Speech
• (Somewhat) mature technology
• (Fairly) easy to understand and use
– Note: “Create TTS audio” is not the same as
having a TTS engine

How it Works

TTS Engine
• Text in, speech out
• May do some text pre-processing
– Expanding abbreviations: “St. James St.” becomes “Saint James Street”
– Punctuation
– If it doesn’t do this, you’ll have to do it yourself.
• Grapheme to phoneme transcription
• Identify intonation patterns
– Assign the correct lexical stress to the words
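
A minimal sketch of the text pre-processing step, assuming you have to do it yourself before handing text to the engine. The two rules below are hypothetical and cover only the "St." case from the slide; a real normalizer would need far more context.

```python
import re

def normalize(text: str) -> str:
    """Expand the abbreviation "St." before sending text to a TTS engine."""
    # "St." directly before a capitalized word is read as "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any remaining "St." (typically after a name) is read as "Street".
    text = re.sub(r"\bSt\.", "Street", text)
    return text

print(normalize("St. James St."))
```

The order of the two rules matters: the more specific "Saint" rule has to run before the catch-all "Street" rule.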

What Makes Good TTS?
• Phonemes change based on location
– “Cat”
– “Alligator”
• Elision
– “I’m. Awaiting. You.”
– “I’m awaiting you.”
• Intonation
– “Do you want coffee?”
– “Do you want soda, tea, or coffee?”

SSML
• XML-based W3C standard for Speech Synthesis Markup
– Not universally supported by vendors.
• Tags for marking up text to produce a
more natural quality output.
– Emphasis
– Break
– Voice
– Prosody (including pitch)

SSML Example
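
A short sketch of what SSML markup can look like, wrapped in Python so the markup can be checked for well-formedness. The sentence content is illustrative and not from the talk; the elements used (speak, emphasis, break, prosody) come from the W3C SSML specification, and vendor support for each may vary.

```python
import xml.etree.ElementTree as ET

# Illustrative prompt text; only the element and attribute names are standard SSML.
ssml = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  Your balance is <emphasis level="strong">fifty dollars</emphasis>.
  <break time="500ms"/>
  <prosody rate="slow" pitch="high">Please pay by Friday.</prosody>
</speak>"""

# Parsing confirms the markup is well-formed XML before it is sent to an engine.
root = ET.fromstring(ssml)
tags = [element.tag.split("}")[1] for element in root.iter()]
print(tags)
```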

When To Use It
• When high quality audio is not a
consideration
– TTS has improved considerably, but is still noticeably synthetic
• When you have a lot of dynamic data
– If you just need to say a few things, it may be
overkill

Other Considerations
• More phonemes = higher quality voice
– Also means a bigger download and install (if on
device)
• Exceptions (addresses, names) can be iffy
– May require a lot of work to handle well
• Your data needs to be clean and ready to voice
back
– Acronyms, incomplete sentences will not sound good
• Some applications may have other acoustic
limitations
– Telephony
• It is possible to build a custom voice
– But it takes a lot of work!

Where To Find It
• Many commercial products available
– Most languages and dialects, e.g. American English, British English, etc.
– Many different voices
– Nuance, Cepstral, Ivona
– Some open source
– Some APIs
• Chrome https://developer.chrome.com/apps/tts

ASR and NLU: Topics
• Complications of speech
– Why is it so hard?
• How it works: overview
• Early commercial adoptions
– IVR
• Design considerations
• Speech today
– Different vendors
• Should I voice-enable X?

(The Speech Chain, Bell Labs, 1963)

The Voice in the Machine: Pieraccini
World
Knowledge
Semantics
Syntax
Lexicon
Morphology
Phonetics
Acoustics
Linguistics
Physiology
Concepts
Phrases
Words
Phonemes
Sounds
ASR
NLU

Speech Is Ambiguous
• Speech is never stationary
– Coarticulation
• Noisy environments
• Accents
• Different speakers have voices with different
acoustic qualities
– “Goats”: speakers a recognizer consistently has trouble with
• Challenges vary depending on what you are
going to recognize
– Spelling (short utterances) can be difficult even
for humans
– Phonetic alphabet (Military)

Language Is Ambiguous
• Humans can deduce meaning from context
and unknown words
“How can I help you?”
I’m having a problem with my account.
I’d like that one. No, not the green one, the red
one.
Time flies like an arrow.
Fruit flies like a banana.

Everything Is Ambiguous
• All modern speech recognition is
probabilistic
– GUI: Button clicked? true / false
– VUI: There is an 85% chance that button was
clicked
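
One way to picture the GUI vs. VUI difference: instead of a true/false event, the application receives a confidence score and has to decide what to do with it. A minimal sketch, with hypothetical threshold values that would normally be tuned per application:

```python
# Hypothetical thresholds, not taken from any particular recognizer.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.45

def next_step(confidence: float) -> str:
    """Map a recognizer confidence score (0.0 to 1.0) to a dialog action."""
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"      # treat the input as heard correctly
    if confidence >= CONFIRM_THRESHOLD:
        return "confirm"     # e.g. "Did you say checking?"
    return "reprompt"        # too uncertain; ask again

print(next_step(0.85))
```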

Three Dimensions of Speech Problems
(The Voice in the Machine: Pieraccini)
• Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent
• Isolated Words, Connected Words, Natural Speech
• Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited
• Humanlike

History of Speech Recognition
• AUDREY: Davis, Biddulph, and Balashek -
Bell Labs 1952
• Analog
• Isolated digit recognition
– Pause between digits
• Speaker-dependent

Sampling
• The start of being able to digitally
manipulate audio

[Figure: Spectrogram vs. Waveform]

1970’s: Template Matching
• Template matching approach
– “Brute force” model
– Quantized spectrograms
– What about duration?
• Dynamic time warping
• Endpoint detection
– Difficult to do
• Feature extraction
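
A minimal sketch of dynamic time warping, which addresses the duration question by letting one frame of the template align with several frames of the utterance. For brevity it assumes each frame is a single number rather than a spectral feature vector, and the templates are made up.

```python
def dtw_distance(a, b):
    """Cost of the best alignment between two sequences of numbers."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Hypothetical one-dimensional templates for two words.
templates = {"one": [1, 3, 4, 3, 1], "two": [5, 5, 2, 1, 1]}
utterance = [1, 1, 3, 4, 4, 3, 1]  # "one", spoken more slowly

best = min(templates, key=lambda word: dtw_distance(utterance, templates[word]))
print(best)
```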

1980’s: The Power of Statistics
• The recognition of connected speech
becomes a search for the best path in a large
network
– Problem of finding the probabilities
• Statistical Language Models
– Not all sequences of words are equally probable
– Rank all permissible sentences in terms of
probability
• “Correct” grammar is not applicable
• Restricted by domain
• Hidden Markov Models (HMM)
– Unified probabilistic model for speech
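
A minimal sketch of a statistical language model: a bigram model estimated from a tiny, made-up, domain-restricted corpus. Real SLMs are trained on far more data and use smoothing, which this sketch omits.

```python
from collections import Counter

# Made-up training utterances for a bill-pay domain.
corpus = ["pay my water bill", "pay my gas bill", "check my balance"]

context_counts, bigram_counts = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()   # <s> marks the start of a sentence
    context_counts.update(words[:-1])
    bigram_counts.update(zip(words, words[1:]))

def bigram_prob(prev, word):
    """P(word | prev), estimated by relative frequency (no smoothing)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]

def sentence_prob(sentence):
    """Probability of a word sequence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split()
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= bigram_prob(prev, word)
    return p

print(sentence_prob("pay my water bill"))   # likely in this domain
print(sentence_prob("bill water my pay"))   # same words, improbable order
```

Ranking candidate sentences by this probability is what lets the recognizer prefer word sequences that are plausible in the domain.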

Hidden Markov Model Example
"HiddenMarkovModel" by Tdunningvectorization (Wikimedia)
X — states
y — possible
observations
a — state transition
probabilities
b — output
probabilities
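
A minimal sketch of how such a model is used: the Viterbi algorithm finds the most probable sequence of hidden states X for a sequence of observations y, given transition probabilities a and output probabilities b. The two-state model and all of its numbers are made up for illustration; the talk does not say which decoding algorithm a given engine uses.

```python
def viterbi(obs, states, start, a, b):
    """Return (most probable state path, its probability)."""
    # Each column maps state -> (best path probability so far, previous state).
    columns = [{s: (start[s] * b[s][obs[0]], None) for s in states}]
    for o in obs[1:]:
        prev = columns[-1]
        col = {}
        for s in states:
            p, back = max((prev[r][0] * a[r][s], r) for r in states)
            col[s] = (p * b[s][o], back)
        columns.append(col)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: columns[-1][s][0])
    prob = columns[-1][last][0]
    path = [last]
    for col in reversed(columns[1:]):
        path.append(col[path[-1]][1])
    path.reverse()
    return path, prob

states = ["X1", "X2"]
start = {"X1": 0.6, "X2": 0.4}                      # initial state probabilities
a = {"X1": {"X1": 0.7, "X2": 0.3},                  # state transition probabilities
     "X2": {"X1": 0.4, "X2": 0.6}}
b = {"X1": {"y1": 0.1, "y2": 0.4, "y3": 0.5},       # output probabilities
     "X2": {"y1": 0.6, "y2": 0.3, "y3": 0.1}}

path, prob = viterbi(["y1", "y2", "y3"], states, start, a, b)
print(path, prob)
```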

You’re Only As Good As What You’re
Trained On
• Corpora
– Collection of speech used to train a
recognizer
– Acoustic and/or Pronunciation Model
• Associates sounds with symbols and words.
• Created from a general speech corpus and a phonetic and orthographic transcription
– Statistical Language Model (SLM)
• A probability distribution over sequences of words
• Created from a domain-specific speech corpus and a tagged transcription to extract meaning

Training
Speech
Recognition
Engine
Acoustic
Model
SLM and/or
Grammar
Pronunciation
Model

Language Model vs. Grammar
• SLM
– Has to be trained against collected utterances
– Large potential set of what the caller can say
– Tagged with the meanings of what they can
say
• Grammar (GrXML)
– More tightly constrained than an SLM
– Easier to create
– Not “trained” in the same way
– System will only recognize what is in the
grammar

Utterance
Noise
Levels?
Barge-In?
Feature
Extraction
Endpointing
Speech
Recognition
Engine
Grammar or SLM
Probabilities
n-best list
Literal return
Tokens
Recognition Event
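
A minimal sketch of what an application might do with a recognition event, assuming the event carries an n-best list in which each entry has a literal return, tokens, and a probability. The field names and the ACCOUNT token are hypothetical, not any vendor's API.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    literal: str        # the words the recognizer believes it heard
    tokens: dict        # meaning returned by the grammar or SLM
    confidence: float   # 0.0 to 1.0

def pick_hypothesis(nbest, valid_accounts):
    """Return the most confident hypothesis that makes sense for this caller."""
    for hyp in sorted(nbest, key=lambda h: h.confidence, reverse=True):
        if hyp.tokens.get("ACCOUNT") in valid_accounts:
            return hyp
    return None

nbest = [
    Hypothesis("savings", {"ACCOUNT": "Savings"}, 0.48),
    Hypothesis("checking", {"ACCOUNT": "Checking"}, 0.46),
]
# This caller only has a checking account, so the second entry wins.
choice = pick_hypothesis(nbest, valid_accounts={"Checking"})
print(choice.literal)
```

Using application knowledge to choose among close hypotheses is one reason to ask for an n-best list rather than only the top result.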

Natural Language Understanding
• Parsing input to extract meaning
• Covers a large field
– Commands
– Automatic classification of emails
– Newspaper articles, large chunks of text
• Lexicon
• Parser
• Grammar rules
• New tools / APIs

Levels of Meaning
“How can I help you?”
• Too Broad / Ambiguous: “I’m having a problem with my account.”
• Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…”
• Just Right: “I’m seeing an unusual charge on my bill.”

Multi-Token Utterances
• “I’d like to transfer $50 from my checking
account to my savings account.”
– ACTION = Transfer
– FROM_ACCOUNT = Checking
– TO_ACCOUNT = Savings
– AMOUNT = $50
• Unfortunately, people don’t often naturally
produce these kinds of utterances.
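
A minimal sketch of pulling the four tokens above out of recognized text with a single regular expression. It assumes the recognizer has already produced text with the amount written as "$50"; the pattern is hypothetical and deliberately rigid.

```python
import re

TRANSFER = re.compile(
    r"transfer \$(?P<amount>\d+(?:\.\d{2})?) "
    r"from my (?P<src>\w+) account to my (?P<dst>\w+) account",
    re.IGNORECASE,
)

def parse_transfer(utterance):
    """Return the transfer tokens, or None if the utterance does not match."""
    match = TRANSFER.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match.group("src").capitalize(),
        "TO_ACCOUNT": match.group("dst").capitalize(),
        "AMOUNT": "$" + match.group("amount"),
    }

print(parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
))
print(parse_transfer("I want to move some money"))
```

The second call returns None, which is the practical problem: when callers do not produce the full utterance, the dialog has to collect the missing tokens one at a time.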

Early Commercial Adoption
• IVR
– Touchtone / DTMF
• “For checking, press 1. For savings, press 2.”
– Directed Dialog (Grammar-based ASR)
• “Which account? Just say ‘checking,’ ‘savings,’ or
‘money market.’”
– Natural Language (SLM-based ASR)
• “From which account?”
• SpeechWorks / Nuance technology
• VoiceXML / GrXML

Typical IVR Architecture
Voice Browser
VUI
VXML
PSTN /
VOIP
HTTP
App Server
/ Data
Connection
Data
SIP
MRCP
ASR
Server
TTS
Server

Anatomy of a VUI + NLU project
• Voice User Interface
Design
– High level design
• Design style, sound and feel, IA
– Detailed design
• Prompts (recorded)
• Grammars for directed
dialog states
• Data I/O
• SLM Creation
– Utterance capture
– Transcription
– Tagging
– Compiling and
deployment

VUI Design Doc – Detailed Example

Corpora Documentation Example

Design Considerations
• Types of Speech User Interfaces
– Command and Control
– Dictation
– Dialog-based
• Speech is a linear, time-based interface
– Multimodality introduces additional
complications

• If the recognizer doesn’t get something,
you have to reprompt.
• Don’t say “sorry.”
“Where are you traveling today?”
I’m going to…. <noise>
“What city was that?”

• Speech is interruptible
– Main Menu: Choose from: “Beverages,”
“Sandwiches,” “Sides,” “Salads,” or “Alcoholic
Drinks.”

• Prompts imply more than choices
– Would you like chocolate or vanilla?
• Yes
• Both

• Input must be limited *after* it is provided
– Can’t check the box on the client side to only
allow input of valid amounts
– “Sorry, you’re only allowed to transfer up to
$500.”
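
A minimal sketch of validating after the fact: the caller can say any amount, so the limit is checked once the amount has been recognized. The $500 limit and the message come from the slide; the function name and return shape are assumptions.

```python
MAX_TRANSFER = 500  # limit from the slide's example

def check_transfer_amount(amount):
    """Return (ok, prompt): validation happens after recognition, not before."""
    if amount > MAX_TRANSFER:
        return False, f"Sorry, you're only allowed to transfer up to ${MAX_TRANSFER}."
    return True, None

print(check_transfer_amount(750))
```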

• Avoid using the word “Help” as a global
command.
• Instead, if there is a need to give
additional information, supply it in the first
or second reprompts.
– Or use specific keywords
– Other than “help”
• “You can also say ‘instructions.’”
• “Or, say ‘It’s something else.’”

User Centered Design Techniques
• A set of techniques designed to keep the focus on
the user during the design process
• May include but are not limited to:
– Conversations
• Specific to VUI design
– Read Aloud
• Specific to VUI design
– Card Sorts
• Used to construct an IA
– Personas
• Used in all modalities
– Usability Testing
• Used in all modalities
– A/B Testing
• Useful for applications that are already in production

Usability Testing

What IS X?

What’s the Use Case For Speech?
• Enabling application
– User can’t do it any other way
– New tasks
• Enhancing application
– User can do it now
– But speech makes it better
• Faster
• Safer
Credit: Bruce Ballentine, EIG

How Hard Is It To Do?
• What do you need it for?
• What kind of device will you be running it
on?
– Connectivity? Can you use cloud based ASR?
– Do you have to download it? If so, how much
space do you have?
• How much control do you need over the
application / user interface?

Possibilities
Write an app (skill) for
an agent such as
Cortana / Alexa
Use cloud APIs to add
ASR to your app / device
/ page / gadget
Download an ASR and
use full-featured
capabilities for more
robust recognition
Build your own

Distributed: Today’s Speech Agents
• Siri
• Cortana
• Google Now
• Amazon Echo (Alexa)

Today’s Cloud-Based Speech APIs
• Distributed speech recognition
– Collection and compression of speech is on
the device
– The language models are typically on the
network
– Phone can be speaker-dependent
• Trains itself on your voice and on the acoustic
environments you are in most often
– Many companies are providing APIs to use
their speech recognition

AVS vs. Amazon Echo
• Could use AVS with the Amazon Echo, or
with your own device

Speech API Example: Alexa Voice
Services

Alexa Skill Example
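
A minimal sketch of the back end of a skill, written as a Python function in the style of an AWS Lambda handler. The request and response shapes follow the Alexa Skills Kit JSON interface as I understand it; the intent name FindRestaurantIntent and the Cuisine slot are hypothetical and would have to match whatever you register for your own skill.

```python
def lambda_handler(event, context=None):
    """Turn an Alexa request into a spoken plain-text response."""
    request = event["request"]
    if (request["type"] == "IntentRequest"
            and request["intent"]["name"] == "FindRestaurantIntent"):
        slots = request["intent"].get("slots", {})
        cuisine = slots.get("Cuisine", {}).get("value", "any")
        text = f"Looking for {cuisine} restaurants near you."
    else:
        text = "You can ask me to find a restaurant."
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": True,
        },
    }

# Simulated request, as the service might send it after NLU.
sample = {"request": {"type": "IntentRequest",
                      "intent": {"name": "FindRestaurantIntent",
                                 "slots": {"Cuisine": {"name": "Cuisine",
                                                       "value": "Thai"}}}}}
print(lambda_handler(sample)["response"]["outputSpeech"]["text"])
```

Note that the skill never sees audio: ASR and NLU happen in the service, and the handler receives an intent and slots.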

Alexa “Skills”
• “Alexa, ask Yelp to find me a restaurant.”
– Cortana has similar integration
• Register your skill with Amazon and
publish it

Cloud vs. Downloadable / Embedded
Cloud
• Microsoft
– Cortana integration
– Project Oxford API
• Google API
• Amazon
• Several recent startups
– Api.ai, Capio.ai, Speechmatics, iSpeech
Downloadable / Embedded
• Microsoft
– Windows 10 Speech APIs
– Microsoft Speech Server
• Nuance
– the 800 pound gorilla in the room
• Interactions
– IBM Watson
Cloud vs. Downloadable / Embedded
Cloud
• Easy to get started
• Lightweight
• Not much specialized knowledge
Downloadable / Embedded
• Customizable
• Probably better recognition
• Can be device-specific
• More features
• Higher powered
• Will require specialized knowledge
• Speech scientist
Today’s NLU APIs
• Microsoft LUIS (part of Project Oxford)
• Api.ai

Open Source ASR
• CMU Sphinx
– pocketsphinx
• Kaldi
– http://kaldi-asr.org/
• GitHub
• New updates include some pretty interesting stuff
(DNN)
• Requires:
– Corpus
– Tech know-how

Who You May Need On Your Team
• Speech Scientist
• VUI Designer

Should I Speech-Enable X?

Should I Speech-Enable X?
Desktop App / Website
• Easy to get started with
API-based ASR
• But the use case may
not be as powerful
Tablet / Mobile
• Stronger use case
• But will the network be
available for APIs?
Industrial Device
• Great use case esp. with
multimodal
• But this is harder to do
and probably will be
custom
Gadget
• Decent use case
• APIs are tailored for this
• Will they do everything
you need?
• Will the extra modality
be a plus or just a “silly
add-on?”
Car
• Safety considerations
are high here
• Need better user
interfaces & more
robust
IVR
• Touchtone can still be
good for a lot of
applications
• Speech is good for
complex call routing and
input

Resources
• The Voice in the Machine: Building
Computers that Understand Speech –
Roberto Pieraccini
• YouTube video: “Open the Pod Bay Doors,
Siri”
• Best Practices in VUI Design: AVIxD Wiki
– http://videsign.wikispaces.com/
• AVIxD: Quarterly Brown Bags

Thanks!
@crispinTX
crispinreedy.com
creedy@versay.com
