Voice Recognition and
Natural Language
Dallas TechFest
January 29, 2016
Crispin Reedy @crispinTX
#DallasTechFest16
© 2016 Versay Solutions LLC
• Voice User Interface Designer
• 10 years in the field
• Former coder; got interested in UX
• President of the Association for Voice
Interaction Design
• Consultant for Versay Solutions
@crispinTX
crispinreedy.com
Disclaimers
This Session Is About:
• What is speech recognition anyway?
• Should I speech-enable X? How?
• In general, how does it work?
– What technologies should I consider?
– What skills are important?
• What are the design considerations?
It’s NOT About:
• Detailed code
• In-depth how-tos
• Deep technical knowledge
• Advanced ASR
Should I Speech-Enable X?
What IS X?
How does this new modality enable or enhance what I want to do on this platform?
What IS X?
Terms & Technologies
• Speech Recognition
• Natural Language Understanding
• Text to Speech
• Voice Verification (Biometrics)
Speech Recognition
• Also known as “ASR”
– “Speech to Text” ?
“See the cat.”
Spoken language → Machine-readable format
Natural Language Understanding
• Extracting meaning from natural text
– Not necessarily tied to speech recognition
“Hello, yes, I’d like to pay my water bill. Can you help me with that?”
Action = BillPay
BillType = Water
Text to Speech
• Speech Synthesis
– Used to convert text to spoken words
Voice Verification
• Also called voiceprints, biometrics, voice
authentication, etc.
• Recognizes a person, not necessarily
what they are saying.
– You can have ASR without Voice Verification
– And vice versa
“My voice is my password.”
“Authenticated. Welcome, Mr. Smith.” ✓
Uses: Separate Applications
Speech Recognition
• Hands-free command / control
• Dictation
• Input text
• Small form factor device, etc.
Text To Speech
• Output text dynamically
• Respond to input
• Useful when no display is available
Natural Language Understanding
• Necessary at some level for all language-based input
• Also used to parse large volumes of text
Voice Verification
• Security
Uses: Combined
[Diagram] ASR, NLU, TTS, Voiceprint Verification, Application, Data
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output
True Multimodality
[Diagram] ASR, NLU, TTS, Voiceprint Verification, Application, Data
• Sign-In
• Interaction
• Request
• Action
• Meaning
• Access Data
• Output
Touch
Keyboard
Manage I/O Modality
Determine Meaning in
Context
Visual
Context!
Credit: Jon Bloom
Let’s Talk Speech!
Output: Text to Speech
• (Somewhat) mature technology
• (Fairly) easy to understand and use
– Note: “Create TTS audio” is not the same as
having a TTS engine
How it Works
TTS Engine
• Text in, speech out
• May do some text pre-processing
– St. James St.
– Saint James Street
– Punctuation
– If it doesn’t do this, you’ll have to do it yourself.
• Grapheme to phoneme transcription
• Identify intonation patterns
– Assign the correct lexical stress to the words
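The pre-processing step above can be sketched in a few lines of Python. This is a deliberately naive illustration, not how any particular TTS engine works: the function name `expand_st` and the rule "St. before a capitalized word means Saint, otherwise Street" are assumptions made for this example.

```python
import re

def expand_st(text: str) -> str:
    """Expand the abbreviation "St." using only its position (naive heuristic)."""
    # "St." followed by a capitalized word is read as "Saint" (St. James).
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Any "St." that is left is read as "Street" (James St.).
    text = re.sub(r"\bSt\.", "Street", text)
    return text

print(expand_st("St. James St."))
```

A rule this simple breaks easily (for example, "Main St. Tomorrow ..."), which is why clean, well-normalized input data matters so much for TTS.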
What Makes Good TTS?
• Phonemes change based on location
– “Cat”
– “Alligator”
• Elision
– “I’m. Awaiting. You.”
– “I’m awaiting you.”
• Intonation
– “Do you want coffee?”
– “Do you want soda, tea, or coffee?”
SSML
• XML-based W3C standard: Speech Synthesis Markup Language
– Not universally supported by vendors.
• Tags for marking up text to produce a
more natural quality output.
– Emphasis
– Break
– Voice
– Prosody (pitch, rate, volume)
SSML Example
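The original slide showed its example as an image. As a stand-in, here is a minimal sketch that builds a small SSML document in Python using the standard `emphasis`, `break`, and `prosody` elements. The prompt wording and the function name `build_ssml` are made up for illustration, and since vendor support varies, check what your TTS engine accepts.

```python
import xml.etree.ElementTree as ET

def build_ssml() -> str:
    """Return a small SSML document as a string."""
    speak = ET.Element("speak", {
        "version": "1.0",
        "xmlns": "http://www.w3.org/2001/10/synthesis",
        "xml:lang": "en-US",
    })
    speak.text = "Your order is "
    # Stress one word.
    emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
    emphasis.text = "ready"
    emphasis.tail = "."
    # Insert a half-second pause.
    ET.SubElement(speak, "break", {"time": "500ms"})
    # Slow down and lower the pitch for the closing phrase.
    prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "low"})
    prosody.text = "Thank you for calling."
    return ET.tostring(speak, encoding="unicode")

print(build_ssml())
```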
When To Use It
• When high-quality audio is not a consideration
– TTS has improved considerably, but it is still noticeably synthetic
• When you have a lot of dynamic data
– If you just need to say a few things, it may be
overkill
Other Considerations
• More phonemes = higher quality voice
– Also means a bigger download and install (if on
device)
• Exceptions (addresses, names) can be iffy
– May require a lot of work to handle well
• Your data needs to be clean and ready to voice
back
– Acronyms, incomplete sentences will not sound good
• Some applications may have other acoustic
limitations
– Telephony
• It is possible to build a custom voice
– But it takes a lot of work!
Where To Find It
• Many commercial products available
– Most languages and dialects, e.g. American English, British English, etc.
– Many different voices
– Nuance, Cepstral, Ivona
– Some open source
– Some APIs
• Chrome https://developer.chrome.com/apps/tts
ASR and NLU
ASR and NLU: Topics
• Complications of speech
– Why is it so hard?
• How it works: overview
• Early commercial adoptions
– IVR
• Design considerations
• Speech today
– Different vendors
• Should I voice-enable X?
(The Speech Chain, Bell Labs, 1963)
The Voice in the Machine: Pieraccini
World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics, Linguistics, Physiology
Concepts, Phrases, Words, Phonemes, Sounds
ASR
NLU
Speech Is Ambiguous
• Speech is never stationary
– Coarticulation
• Noisy environments
• Accents
• Different speakers have voices with different acoustic qualities
– “Goats”: speakers a recognizer handles particularly poorly
• Challenges vary depending on what you are
going to recognize
– Spelling (short utterances) can be difficult even
for humans
– Phonetic alphabet (Military)
Language Is Ambiguous
• Humans can deduce meaning from context
and unknown words
“How can I help you?”
I’m having a problem with my account.
I’d like that one. No, not the green one, the red
one.
Time flies like an arrow.
Fruit flies like a banana.
Everything Is Ambiguous
• All modern speech recognition is
probabilistic
– GUI: Button clicked? true / false
– VUI: There is an 85% chance that button was
clicked
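Because every recognition result comes with a probability, a voice application usually decides what to do based on confidence rather than a true/false event. Here is a minimal sketch. The threshold values and the names `Hypothesis` and `decide` are assumptions for illustration, and real recognizers report confidence in their own formats.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a recognizer

ACCEPT_THRESHOLD = 0.80   # assumed value; tune per application
CONFIRM_THRESHOLD = 0.45  # assumed value; tune per application

def decide(hyp: Hypothesis) -> str:
    """Map a probabilistic result onto a dialog action."""
    if hyp.confidence >= ACCEPT_THRESHOLD:
        return "accept"      # act on the result
    if hyp.confidence >= CONFIRM_THRESHOLD:
        return "confirm"     # "I think you said 'checking'. Is that right?"
    return "reprompt"        # ask again

print(decide(Hypothesis("checking", 0.85)))
```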
Three Dimensions of Speech Problems
The Voice in the Machine: Pieraccini
Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent
Isolated Words, Connected Words, Natural Speech
Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited
Humanlike
History of Speech Recognition
• AUDREY: Davis, Biddulph, and Balashek -
Bell Labs 1952
• Analog
• Isolated digit recognition
– Pause between digits
• Speaker-dependent
Sampling
• The start of being able to digitally
manipulate audio
Spectrogram vs. Waveform
1970’s: Template Matching
• Template matching approach
– “Brute force” model
– Quantized spectrograms
– What about duration?
• Dynamic time warping
• Endpoint detection
– Difficult to do
• Feature extraction
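Dynamic time warping answers the "what about duration?" question: it aligns two sequences that say the same thing at different speeds. Below is a minimal sketch of template matching with DTW. The one-dimensional number sequences stand in for real spectrogram frames, and the template values are invented for illustration.

```python
def dtw_distance(a, b):
    """Cost of the best alignment between sequences a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Stretch a, stretch b, or advance both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def recognize(utterance, templates):
    """Return the label of the template closest to the utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

templates = {"one": [1, 3, 4, 3, 1], "two": [5, 5, 2, 1, 1]}
slow_one = [1, 1, 3, 4, 4, 3, 1]  # the same shape as "one", spoken more slowly
print(recognize(slow_one, templates))
```

Every utterance is compared against every stored template, which is why the slide calls this a "brute force" model.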
1980’s: The Power of Statistics
• The recognition of connected speech
becomes a search for the best path in a large
network
– Problem of finding the probabilities
• Statistical Language Models
– Not all sequences of words are equally probable
– Rank all permissible sentences in terms of
probability
• “Correct” grammar is not applicable
• Restricted by domain
• Hidden Markov Models (HMM)
– Unified probabilistic model for speech
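To make "not all sequences of words are equally probable" concrete, here is a minimal sketch of a bigram language model with add-one smoothing. The three training sentences are an invented toy corpus, and a production SLM would be trained on many thousands of real utterances from the domain.

```python
from collections import Counter

def train_bigrams(sentences):
    """Count word contexts and word pairs, with sentence boundary markers."""
    contexts, bigrams, vocab = Counter(), Counter(), set()
    for s in sentences:
        words = ["<s>"] + s.lower().split() + ["</s>"]
        contexts.update(words[:-1])
        bigrams.update(zip(words, words[1:]))
        vocab.update(words[1:])
    return contexts, bigrams, len(vocab)

def sentence_probability(sentence, contexts, bigrams, vocab_size):
    """Probability of a sentence under the bigram model (add-one smoothing)."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(words, words[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
    return p

corpus = ["pay my bill", "pay my water bill", "check my balance"]
model = train_bigrams(corpus)
print(sentence_probability("pay my bill", *model))
print(sentence_probability("bill my pay", *model))
```

The model has no notion of "correct" grammar. It only knows that "pay my bill" looks much more like its training data than "bill my pay" does, and that knowledge is restricted to the domain it was trained on.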
Hidden Markov Model Example
"HiddenMarkovModel" by Tdunningvectorization (Wikimedia)
X — states
y — possible
observations
a — state transition
probabilities
b — output
probabilities
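Using the same symbols as the figure (states X, observations y, transition probabilities a, output probabilities b), the "search for the best path" can be sketched with the Viterbi algorithm. The two-state model below for the word "see" and all of its probabilities are invented for illustration; real acoustic models are far larger.

```python
def viterbi(observations, states, start_p, a, b):
    """Return the most likely state sequence and its probability."""
    # best[t][s] = probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * b[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, best[t - 1][r] * a[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = p * b[s][observations[t]]
            back[t][s] = prev
    # Trace the best path backwards from the most likely final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

states = ("s", "iy")  # hidden states: the two phonemes of "see"
start_p = {"s": 0.8, "iy": 0.2}
a = {"s": {"s": 0.5, "iy": 0.5}, "iy": {"s": 0.1, "iy": 0.9}}
b = {"s": {"hiss": 0.9, "tone": 0.1}, "iy": {"hiss": 0.2, "tone": 0.8}}

path, prob = viterbi(["hiss", "hiss", "tone", "tone"], states, start_p, a, b)
print(path, prob)
```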
You’re Only As Good As What You’re
Trained On
• Corpora
– Collections of speech used to train a recognizer
– Acoustic and/or Pronunciation Model
• Associates sounds with symbols and words.
• Created from a general speech corpus and a phonetic and orthographic transcription
– Statistical Language Model (SLM)
• A probability distribution over sequences of words
• Created from a domain-specific speech corpus and a tagged transcription to extract meaning
Training
Speech Recognition Engine
Acoustic Model
SLM and/or Grammar
Pronunciation Model
Language Model vs. Grammar
• SLM
– Has to be trained against collected utterances
– Large potential set of what the caller can say
– Tagged with the meanings of what they can
say
• Grammar (GrXML)
– More tightly constrained than an SLM
– Easier to create
– Not “trained” in the same way
– System will only recognize what is in the
grammar
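The key property of a grammar, that the system only recognizes what is in it, can be illustrated with a plain lookup table. This is only a sketch of the idea: real grammars are written in GrXML and matched by the recognizer against audio, not text, and the phrases and token names below are assumptions for this example.

```python
import re
from typing import Optional

# Hypothetical directed-dialog grammar: phrase -> returned token
ACCOUNT_GRAMMAR = {
    "checking": "CHECKING",
    "checking account": "CHECKING",
    "savings": "SAVINGS",
    "savings account": "SAVINGS",
    "money market": "MONEY_MARKET",
}

def match_grammar(utterance: str, grammar: dict) -> Optional[str]:
    """Return the token for an in-grammar phrase, or None if out of grammar."""
    words = re.findall(r"[a-z']+", utterance.lower())
    return grammar.get(" ".join(words))

print(match_grammar("Savings.", ACCOUNT_GRAMMAR))
print(match_grammar("The one I use for groceries", ACCOUNT_GRAMMAR))
```

The second utterance is perfectly reasonable English, but it returns nothing because it is out of grammar. Covering that kind of input is what an SLM trained on collected utterances is for.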
Utterance
Noise Levels?
Barge-In?
Feature Extraction
Endpointing
Speech Recognition Engine
Grammar or SLM
Probabilities
N-best list
Literal return
Tokens
Recognition Event
Natural Language Understanding
• Parsing input to extract meaning
• Covers a large field
– Commands
– Automatic classification of emails
– Newspaper articles, large chunks of text
• Lexicon
• Parser
• Grammar rules
• New tools / APIs
Levels of Meaning
“How can I help you?”
Too Broad / Ambiguous: “I’m having a problem with my account.”
Too Much: “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…”
Just Right: “I’m seeing an unusual charge on my bill.”
Multi-Token Utterances
• “I’d like to transfer $50 from my checking
account to my savings account.”
– ACTION = Transfer
– FROM_ACCOUNT = Checking
– TO_ACCOUNT = Savings
– AMOUNT = $50
• Unfortunately, people don’t often naturally
produce these kinds of utterances.
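The slot filling above can be sketched with a few regular expressions over the recognized text. This is an illustration only: the patterns and function names are assumptions for this example, and a production NLU would use a trained model or a proper grammar. The `missing_slots` helper shows what the dialog has to ask for when the caller does not say everything at once.

```python
import re

ACCOUNTS = r"(checking|savings|money market)"
REQUIRED = ["ACTION", "FROM_ACCOUNT", "TO_ACCOUNT", "AMOUNT"]

def parse_transfer(utterance: str) -> dict:
    """Extract whatever transfer slots are present in the utterance."""
    text = utterance.lower()
    slots = {}
    if "transfer" in text:
        slots["ACTION"] = "Transfer"
    amount = re.search(r"\$\s?(\d+(?:\.\d{2})?)", text)
    if amount:
        slots["AMOUNT"] = "$" + amount.group(1)
    source = re.search(r"\bfrom (?:my )?" + ACCOUNTS, text)
    if source:
        slots["FROM_ACCOUNT"] = source.group(1).title()
    target = re.search(r"\bto (?:my )?" + ACCOUNTS, text)
    if target:
        slots["TO_ACCOUNT"] = target.group(1).title()
    return slots

def missing_slots(slots: dict) -> list:
    """Slots the dialog still needs to collect with follow-up questions."""
    return [name for name in REQUIRED if name not in slots]

full = parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
)
print(full)
print(missing_slots(parse_transfer("I want to transfer money")))
```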
Early Commercial Adoption
• IVR
– Touchtone / DTMF
• “For checking, press 1. For savings, press 2.”
– Directed Dialog (Grammar-based ASR)
• “Which account? Just say ‘checking,’ ‘savings,’ or
‘money market.’”
– Natural Language (SLM-based ASR)
• “From which account?”
• SpeechWorks / Nuance technology
• Voice XML / GrXML
Typical IVR Architecture
Voice Browser
VUI
VXML
PSTN / VOIP
HTTP
App Server / Data Connection
Data
SIP
MRCP
ASR Server
TTS Server
Anatomy of a VUI + NLU project
• Voice User Interface
Design
– High level design
• Design style, sound and feel, IA
– Detailed design
• Prompts (recorded)
• Grammars for directed
dialog states
• Data I/O
• SLM Creation
– Utterance capture
– Transcription
– Tagging
– Compiling and
deployment
VUI Design Doc – Detailed Example
Corpora Documentation Example
Design Considerations
• Types of Speech User Interfaces
– Command and Control
– Dictation
– Dialog-based
• Speech is a linear, time-based interface
– Multimodality introduces additional
complications
Design Considerations
• If the recognizer doesn’t get something,
you have to reprompt.
• Don’t say “sorry.”
“Where are you traveling today?”
I’m going to…. <noise>
“What city was that?”
Design Considerations
• Speech is interruptible
– Main Menu: Choose from: “Beverages,”
“Sandwiches,” “Sides,” “Salads,” or “Alcoholic
Drinks.”
Design Considerations
• Prompts imply more than choices
– Would you like chocolate or vanilla?
• Yes
• Both
Design Considerations
• Input must be limited *after* it is provided
– Can’t check the box on the client side to only
allow input of valid amounts
– “Sorry, you’re only allowed to transfer up to
$500.”
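Since there is no form field to constrain what the caller says, validation has to happen after recognition. Here is a minimal sketch, assuming the $500 limit from the example above. The function name and prompt wording are made up for illustration.

```python
from typing import Tuple

TRANSFER_LIMIT = 500.00  # limit taken from the example above

def check_transfer_amount(amount: float) -> Tuple[bool, str]:
    """Validate a recognized amount and return (ok, next prompt)."""
    if amount <= 0:
        return False, "How much would you like to transfer?"
    if amount > TRANSFER_LIMIT:
        return False, (
            f"You can transfer up to ${TRANSFER_LIMIT:.0f}. "
            "How much would you like to transfer?"
        )
    return True, f"Transferring ${amount:.2f}."

print(check_transfer_amount(750.0))
print(check_transfer_amount(50.0))
```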
Design Considerations
• Avoid using the word “Help” as a global
command.
• Instead, if there is a need to give
additional information, supply it in the first
or second reprompts.
– Or use specific keywords
– Other than “help”
• “You can also say ‘instructions.’”
• “Or, say ‘It’s something else.’”
User Centered Design Techniques
• A set of techniques designed to keep the focus on
the user during the design process
• May include but are not limited to:
– Conversations
• Specific to VUI design
– Read Aloud
• Specific to VUI design
– Card Sorts
• Used to construct an IA
– Personas
• Used in all modalities
– Usability Testing
• Used in all modalities
– A/B Testing
• Useful for applications that are already in production
Usability Testing
Should I Speech-Enable X?
What IS X?
What’s the Use Case For Speech?
• Enabling application
– User can’t do it any other way
– New tasks
• Enhancing application
– User can do it now
– But speech makes it better
• Faster
• Safer
Credit: Bruce Balentine, EIG
How Hard Is It To Do?
• What do you need it for?
• What kind of device will you be running it
on?
– Connectivity? Can you use cloud based ASR?
– Do you have to download it? If so, how much
space do you have?
• How much control do you need over the
application / user interface?
Possibilities
Write an app (skill) for an agent such as Cortana / Alexa
Use cloud APIs to add ASR to your app / device / page / gadget
Download an ASR and use full-featured capabilities for more robust recognition
Build your own
Distributed: Today’s Speech Agents
• Siri
• Cortana
• Google Now
• Amazon Echo (Alexa)
Today’s Cloud-Based Speech APIs
• Distributed speech recognition
– Collection and compression of speech is on
the device
– The language models are typically on the
network
– Phone can be speaker-dependent
• Trains itself on your voice and on the acoustic
environments you are in most often
– Many companies are providing APIs to use
their speech recognition
AVS vs. Amazon Echo
• Could use AVS with the Amazon Echo, or
with your own device
Speech API Example: Alexa Voice Service
Alexa Skill Example
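The original slide showed its example as an image. As a stand-in, here is a minimal sketch of a skill back end written as an AWS Lambda-style handler in Python. The request and response dictionaries follow the general JSON shape that Alexa custom skills exchange, but the intent name `FindRestaurantIntent`, the `Cuisine` slot, and the spoken text are hypothetical. Check the current Alexa Skills Kit documentation before relying on any field.

```python
def build_response(text: str, end_session: bool = True) -> dict:
    """Wrap plain text in the response envelope a skill returns."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "PlainText", "text": text},
            "shouldEndSession": end_session,
        },
    }

def lambda_handler(event: dict, context=None) -> dict:
    """Route an incoming skill request to a spoken response."""
    request = event["request"]
    if request["type"] == "LaunchRequest":
        return build_response("What kind of food are you looking for?", end_session=False)
    if request["type"] == "IntentRequest":
        intent = request["intent"]
        if intent["name"] == "FindRestaurantIntent":  # hypothetical intent
            cuisine = intent.get("slots", {}).get("Cuisine", {}).get("value")
            if cuisine:
                return build_response(f"Looking for {cuisine} restaurants near you.")
            return build_response("What kind of food would you like?", end_session=False)
    return build_response("Goodbye.")

sample_event = {
    "request": {
        "type": "IntentRequest",
        "intent": {
            "name": "FindRestaurantIntent",
            "slots": {"Cuisine": {"name": "Cuisine", "value": "thai"}},
        },
    }
}
print(lambda_handler(sample_event))
```

Note that the skill never sees audio. The platform does the ASR and NLU and hands the skill an intent name plus slot values.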
Alexa “Skills”
• “Alexa, ask Yelp to find me a restaurant.”
– Cortana has similar integration
• Register your skill with Amazon and
publish it
Cloud vs. Downloadable / Embedded
Cloud
• Microsoft
– Cortana integration
– Project Oxford API
• Google API
• Amazon
• Several recent startups
– Api.ai, Capio.ai, Speechmatics, iSpeech
Downloadable / Embedded
• Microsoft
– Windows 10 Speech APIs
– Microsoft Speech Server
• Nuance
– the 800 pound gorilla in the room
• Interactions
– IBM Watson
Cloud vs. Downloadable / Embedded
Cloud
• Easy to get started
• Lightweight
• Not much specialized knowledge
Downloadable / Embedded
• Customizable
• Probably better recognition
• Can be device-specific
• More features
• Higher powered
• Will require specialized knowledge
• Speech scientist
Today’s NLU APIs
• Microsoft LUIS (part of Project Oxford)
• Api.ai
Open Source ASR
• CMU Sphinx
– pocketsphinx
• Kaldi
– http://kaldi-asr.org/
• Github
• New updates include some pretty interesting stuff
(DNN)
• Requires:
– Corpus
– Tech know-how
Who You May Need On Your Team
• Speech Scientist
• VUI Designer
Should I Speech-Enable X?
Should I Speech-Enable X?
Desktop App / Website
• Easy to get started with API-based ASR
• But the use case may not be as powerful
Tablet / Mobile
• Stronger use case
• But will the network be available for APIs?
Industrial Device
• Great use case, esp. with multimodal
• But this is harder to do and probably will be custom
Gadget
• Decent use case
• APIs are tailored for this
• Will they do everything you need?
• Will the extra modality be a plus or just a “silly add-on”?
Car
• Safety considerations are high here
• Need better user interfaces & more robust
IVR
• Touchtone can still be good for a lot of applications
• Speech is good for complex call routing and input
Resources
• The Voice in the Machine: Building
Computers that Understand Speech –
Roberto Pieraccini
• YouTube video: “Open the Pod Bay Doors,
Siri”
• Best Practices in VUI Design: AVIxD Wiki
– http://videsign.wikispaces.com/
• AVIxD: Quarterly Brown Bags
Thanks!
@crispinTX
crispinreedy.com
creedy@versay.com

🐬 The future of MySQL is Postgres 🐘
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 

Voice Recognition and Natural Language - Dallas TechFest 2016

  • 1. Voice Recognition and Natural Language Dallas TechFest January 29, 2016 Crispin Reedy @crispinTX #DallasTechFest16
  • 2. 2© 2016 Versay Solutions LLC • Voice User Interface Designer • 10 years in the field • Former coder; got interested in UX • President of the Association for Voice Interaction Design • Consultant for Versay Solutions @crispinTX crispinreedy.com
  • 3.
  • 4. Disclaimers This Session Is About: • What is speech recognition anyway? • Should I speech-enable X? How? • In general, how does it work? – What technologies should I consider? – What skills are important? • What are the design considerations? It’s NOT About: • Detailed code • In depth how-tos • Deep technical knowledge • Advanced ASR
  • 6. What IS X? 6© 2016 Versay Solutions LLC
  • 7. How does this new modality enable or enhance what I want to do on this platform?
  • 8. What IS X? 8© 2016 Versay Solutions LLC
  • 9. Terms & Technologies • Speech Recognition • Natural Language Understanding • Text to Speech • Voice Verification (Biometrics) 9© 2016 Versay Solutions LLC
  • 10. Speech Recognition • Also known as “ASR” – “Speech to Text” ? 10© 2016 Versay Solutions LLC “See the cat.” Spoken language Machine- readable format
  • 11. Natural Language Understanding • Extracting meaning from natural text – Not necessarily tied to speech recognition 11© 2016 Versay Solutions LLC “Hello, yes, I’d like to pay my water bill. Can you help me with that? Action = BillPay BillType = Water
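To make the slide's example concrete, here is a minimal sketch of turning that utterance into the `Action` / `BillType` pairs shown. The keyword tables and the `extract_meaning` function are hypothetical illustrations; a production NLU engine would normally be trained or use a much richer rule set.

```python
import re

# Hypothetical keyword tables, invented for this illustration only.
ACTION_PATTERNS = {
    "BillPay": re.compile(r"\bpay\b.*\bbill\b", re.IGNORECASE),
}
BILL_TYPES = ("water", "electric", "gas")


def extract_meaning(utterance: str) -> dict:
    """Return the slots found in the utterance (empty dict if none)."""
    meaning = {}
    for action, pattern in ACTION_PATTERNS.items():
        if pattern.search(utterance):
            meaning["Action"] = action
            break
    lowered = utterance.lower()
    for bill_type in BILL_TYPES:
        if re.search(rf"\b{bill_type}\b", lowered):
            meaning["BillType"] = bill_type.capitalize()
            break
    return meaning


result = extract_meaning(
    "Hello, yes, I'd like to pay my water bill. Can you help me with that?"
)
print(result)
```

Note that the input here is plain text, which matches the slide's point that NLU is not necessarily tied to speech recognition.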
  • 12. Text to Speech • Speech Synthesis – Used to convert text to spoken words 12© 2016 Versay Solutions LLC
  • 13. Voice Verification • Also called voiceprints, biometrics, voice authentication, etc. • Recognizes a person, not necessarily what they are saying. – You can have ASR without Voice Verification – And vice versa 13© 2016 Versay Solutions LLC “My voice is my password.” “Authenticated. Welcome, Mr. Smith.” ✓
  • 14. Uses: Separate Applications • Speech Recognition – Hands-free command / control – Dictation – Input text – Small form factor device, etc. • Text To Speech – Output text dynamically – Respond to input – Useful when no display is available • Natural Language Understanding – Necessary at some level for all language-based input – Also used to parse large volumes of text • Voice Verification – Security
  • 15. Uses: Combined 15© 2016 Versay Solutions LLC ASR Application Data • Sign-In • Interaction • Request • Action • Meaning • Access Data • Output TTS NLU Voice prints Verifi- cation
  • 16. True Multimodality 16© 2016 Versay Solutions LLC ASR Application Data • Sign-In • Interaction • Request • Action • Meaning • Access Data • Output TTS NLU Voice prints Verifi- cation Touch Keyboard Manage I/O Modality Determine Meaning in Context Visual Context!
  • 19. Output: Text to Speech • (Somewhat) mature technology • (Fairly) easy to understand and use – Note: “Create TTS audio” is not the same as having a TTS engine 19© 2016 Versay Solutions LLC
  • 20. How it Works 20© 2016 Versay Solutions LLC
  • 21. TTS Engine • Text in, speech out • May do some text pre-processing – St. James St. – Saint James Street – Punctuation – If it doesn’t do this, you’ll have to yourself. • Grapheme to phoneme transcription • Identify intonation patterns – Assign the correct lexical stress to the words 21© 2016 Versay Solutions LLC
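If the engine does not pre-process text, a rough sketch of doing the "St. James St." expansion yourself might look like the following. The `expand_st` function and its rule (a "St." before a capitalized word is read as "Saint", otherwise as "Street") are assumptions made for illustration, not how any particular TTS engine works.

```python
import re


def expand_st(text: str) -> str:
    """Expand the abbreviation "St." before handing text to a TTS engine."""
    # Assumption: "St." directly before a capitalized word means "Saint".
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Assumption: any remaining "St." means "Street".
    text = re.sub(r"\bSt\.", "Street", text)
    return text


print(expand_st("St. James St."))
```

This heuristic is easy to fool (for example, "Main St. Dallas" would come out wrong), which is one reason addresses and names can take a lot of work to handle well.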
  • 22. What Makes Good TTS? • Phonemes change based on location – “Cat” – “Alligator” • Elision – “I’m. Awaiting. You.” – “I’m awaiting you.” • Intonation – “Do you want coffee?” – “Do you want soda, tea, or coffee?” 22© 2016 Versay Solutions LLC
  • 23. SSML • XML-based W3C standard for Speech Synthesis Markup – Not universally supported by vendors. • Tags for marking up text to produce a more natural quality output. – Emphasis – Break – Voice – Prosody – Pitch
  • 24. SSML Example 24© 2016 Versay Solutions LLC
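The slide's own example did not survive in this transcript. As a stand-in, here is a small sketch that builds an SSML string in Python and checks that it is well-formed XML. The sentence content is invented; the `emphasis`, `break`, and `prosody` elements are from the W3C SSML specification, and whether a given vendor honors each of them is something to verify against that vendor's documentation.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Invented example text; only the element and attribute names come from SSML.
ssml = (
    f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">'
    'Your balance is <emphasis level="strong">fifty dollars</emphasis>.'
    '<break time="500ms"/>'
    '<prosody rate="slow" pitch="low">Is there anything else?</prosody>'
    "</speak>"
)

# Parsing confirms the markup is well-formed before it is sent to an engine.
root = ET.fromstring(ssml)
tags = [child.tag.split("}")[1] for child in root]
print(tags)
```

In SSML, pitch is expressed as an attribute of `prosody` rather than as its own element, which is how it is used above.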
  • 25. When To Use It • When high quality audio is not a consideration – TTS has improved considerably, but is still noticeable • When you have a lot of dynamic data – If you just need to say a few things, it may be overkill 25© 2016 Versay Solutions LLC
  • 26. Other Considerations • More phonemes = higher quality voice – Also means a bigger download and install (if on device) • Exceptions (addresses, names) can be iffy – May require a lot of work to handle well • Your data needs to be clean and ready to voice back – Acronyms, incomplete sentences will not sound good • Some applications may have other acoustic limitations – Telephony • It is possible to build a custom voice – But it takes a lot of work! 26© 2016 Versay Solutions LLC
  • 27. Where To Find It • Many commercial products available – Most languages and dialects, e.g. American English, British English, etc. – Many different voices – Nuance, Cepstral, Ivona – Some open source – Some APIs • Chrome https://developer.chrome.com/apps/tts
  • 29. ASR and NLU: Topics • Complications of speech – Why is it so hard? • How it works: overview • Early commercial adoptions – IVR • Design considerations • Speech today – Different vendors • Should I voice-enable X? 29© 2016 Versay Solutions LLC
  • 30. 30(The Speech Chain, Bell Labs, 1963)
  • 31. 31The Voice in the Machine: Pieraccini World Knowledge Semantics Syntax Lexicon Morphology Phonetics Acoustics Linguistics Physiology Concepts Phrases Words Phonemes Sounds ASR NLU
  • 32. Speech Is Ambiguous • Speech is never stationary – Coarticulation • Noisy environments • Accents • Different speakers have voices with different acoustic qualities – Goats • Challenges vary depending on what you are going to recognize – Spelling (short utterances) can be difficult even for humans – Phonetic alphabet (Military) 32© 2016 Versay Solutions LLC
  • 33. Language Is Ambiguous • Humans can deduce meaning from context and unknown words “How can I help you?” I’m having a problem with my account. I’d like that one. No, not the green one, the red one. Time flies like an arrow. Fruit flies like a banana. 33© 2016 Versay Solutions LLC
  • 34. Everything Is Ambiguous • All modern speech recognition is probabilistic – GUI: Button clicked? true / false – VUI: There is an 85% chance that button was clicked 34© 2016 Versay Solutions LLC
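One practical consequence of probabilistic results is that the application has to decide what to do at each confidence level. The sketch below shows one common pattern, assuming hypothetical thresholds (`accept_at`, `confirm_at`) that would need tuning per application.

```python
def handle_recognition(hypothesis: str, confidence: float,
                       accept_at: float = 0.80,
                       confirm_at: float = 0.45) -> str:
    """Map a recognizer confidence score to a dialog action.

    The threshold values are illustrative assumptions, not vendor defaults.
    """
    if confidence >= accept_at:
        return f"accept:{hypothesis}"    # act on it directly
    if confidence >= confirm_at:
        return f"confirm:{hypothesis}"   # "Did you say ...?"
    return "reprompt"                    # too uncertain; ask again


print(handle_recognition("checking", 0.85))
print(handle_recognition("checking", 0.60))
print(handle_recognition("checking", 0.20))
```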
  • 35. Three Dimensions of Speech Problems (The Voice in the Machine: Pieraccini) • Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent • Isolated Words, Connected Words, Natural Speech • Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited • Humanlike
  • 36. History of Speech Recognition • AUDREY: Davis, Biddulph, and Balashek - Bell Labs 1952 36© 2016 Versay Solutions LLC • Analog • Isolated digit recognition – Pause between digits • Speaker-dependent
  • 37.
  • 38.
  • 39. Sampling • The start of being able to digitally manipulate audio 39© 2016 Versay Solutions LLC
  • 40. 40© 2016 Versay Solutions LLC 0 db frequency Spectrogram vs. Waveform
  • 41. 1970’s: Template Matching • Template matching approach – “Brute force” model – Quantized spectrograms – What about duration? • Dynamic time warping • Endpoint detection – Difficult to do • Feature extraction
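Dynamic time warping is what answers the "What about duration?" question: it aligns two sequences that differ in speed. Below is a minimal sketch on one-dimensional toy sequences; real template matchers compared sequences of spectral feature vectors, and the numbers here are invented.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best alignment cost of a[:i] against b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


template = [1, 2, 3, 2, 1]
spoken_slowly = [1, 1, 2, 2, 3, 3, 2, 2, 1, 1]  # same shape, twice as long
different_word = [3, 3, 3, 3, 3]

print(dtw_distance(template, spoken_slowly))
print(dtw_distance(template, different_word))
```

The slow rendition aligns to the template at zero cost even though it is twice as long, while the differently shaped sequence does not.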
  • 42. 1980’s: The Power of Statistics • The recognition of connected speech becomes a search for the best path in a large network – Problem of finding the probabilities • Statistical Language Models – Not all sequences of words are equally probable – Rank all permissible sentences in terms of probability • “Correct” grammar is not applicable • Restricted by domain • Hidden Markov Models (HMM) – Unified probabilistic model for speech 42© 2016 Versay Solutions LLC
  • 43. Hidden Markov Model Example 43"HiddenMarkovModel" by Tdunningvectorization (Wikimedia) X — states y — possible observations a — state transition probabilities b — output probabilities
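To show how the pieces in that diagram (states, observations, transition probabilities, output probabilities) are used together, here is a small sketch of the Viterbi algorithm, which finds the most likely hidden state sequence for a series of observations. The two states, three observation symbols, and all probabilities are invented toy values; a real recognizer works with phoneme-level states and acoustic feature vectors.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden state path for the observations."""
    # best[t][s] = (probability of best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p)
                for p in states
            )
        best.append(column)
    # Trace back from the most probable final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Toy model (all numbers are assumptions for illustration).
states = ("silence", "speech")
start_p = {"silence": 0.6, "speech": 0.4}
trans_p = {
    "silence": {"silence": 0.7, "speech": 0.3},
    "speech": {"silence": 0.4, "speech": 0.6},
}
emit_p = {
    "silence": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "speech": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

path = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
print(path)
```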
  • 44. You’re Only As Good As What You’re Trained On • Corpora – Collections of speech used to train a recognizer – Acoustic and/or Pronunciation Model • Associates sounds with symbols and words. • Created from a general speech corpus and a phonetic and orthographic transcription – Statistical Language Model (SLM) • A probability distribution over sequences of words • Created from a domain-specific speech corpus and a tagged transcription to extract meaning
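"A probability distribution over sequences of words" can be illustrated with a tiny bigram model. This is a sketch under strong assumptions: the three-sentence corpus is invented, there is no smoothing, and commercial SLMs are trained on far larger domain-specific corpora.

```python
from collections import Counter


def train_bigram(sentences):
    """Count unigrams and bigrams, with sentence start/end markers."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words[:-1])          # every word that has a successor
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_probability(sentence, unigrams, bigrams):
    """Unsmoothed bigram probability of a whole sentence."""
    words = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0
        prob *= bigrams[(prev, word)] / unigrams[prev]
    return prob


# Invented toy corpus standing in for transcribed caller utterances.
corpus = ["pay my bill", "pay my water bill", "check my balance"]
unigrams, bigrams = train_bigram(corpus)

likely = sentence_probability("pay my bill", unigrams, bigrams)
unlikely = sentence_probability("bill my pay", unigrams, bigrams)
print(likely, unlikely)
```

The model ranks a word order it has seen above one it has not, which is the sense in which the recognizer is only as good as what it was trained on.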
  • 45. Training 45© 2016 Versay Solutions LLC Speech Recognition Engine Acoustic Model SLM and/or Grammar Pronunciation Model
  • 46. Language Model vs. Grammar • SLM – Has to be trained against collected utterances – Large potential set of what the caller can say – Tagged with the meanings of what they can say • Grammar (GrXML) – More tightly constrained than an SLM – Easier to create – Not “trained” in the same way – System will only recognize what is in the grammar 46© 2016 Versay Solutions LLC
  • 47. 47© 2016 Versay Solutions LLC Utterance Noise Levels? Barge-In? Feature Extraction Endpointing Speech Recognition Engine Grammar or SLM Probabilities n:best list Literal return Tokens Recognition Event
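The n-best list in that diagram gives the application more than one hypothesis to work with. One way to use it, sketched below with invented hypotheses and scores, is to skip any candidate the user has already rejected and fall back to the next best. The `(text, confidence)` tuple layout is an assumption; real recognition events carry more fields.

```python
def pick_hypothesis(n_best, rejected=()):
    """Return the highest-confidence hypothesis not already rejected."""
    ranked = sorted(n_best, key=lambda hyp: hyp[1], reverse=True)
    for text, confidence in ranked:
        if text not in rejected:
            return text, confidence
    return None  # nothing usable; the dialog should reprompt


# Invented n-best list for a "Which city?" prompt.
n_best = [("Austin", 0.62), ("Boston", 0.58), ("Houston", 0.31)]

first_try = pick_hypothesis(n_best)
after_user_says_no = pick_hypothesis(n_best, rejected={"Austin"})
print(first_try, after_user_says_no)
```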
  • 48. Natural Language Understanding • Parsing input to extract meaning • Covers a large field – Commands – Automatic classification of emails – Newspaper articles, large chunks of text • Lexicon • Parser • Grammar rules • New tools / APIs 48© 2016 Versay Solutions LLC
  • 49. Levels of Meaning 49© 2016 Versay Solutions LLC Too Broad / Ambiguous Too MuchJust Right “I’m having a problem with my account.” “Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw…” “I’m seeing an unusual charge on my bill.” “How can I help you?”
  • 50. Multi-Token Utterances • “I’d like to transfer $50 from my checking account to my savings account.” – ACTION = Transfer – FROM_ACCOUNT = Checking – TO_ACCOUNT = Savings – AMOUNT = $50 • Unfortunately, people don’t often naturally produce these kinds of utterances. 50© 2016 Versay Solutions LLC
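As a rough sketch of pulling all four tokens out of that one utterance, the following uses a single hand-written regular expression. The pattern and the `parse_transfer` function are hypothetical; a deployed system would rely on a grammar or a tagged SLM rather than one regex.

```python
import re

# Hypothetical pattern covering only the phrasing on the slide.
TRANSFER = re.compile(
    r"transfer\s+\$?(?P<amount>\d+(?:\.\d{2})?)\s+"
    r"from\s+my\s+(?P<src>\w+)\s+account\s+"
    r"to\s+my\s+(?P<dst>\w+)\s+account",
    re.IGNORECASE,
)


def parse_transfer(utterance: str):
    """Return the transfer slots, or None if the utterance does not match."""
    match = TRANSFER.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match["src"].capitalize(),
        "TO_ACCOUNT": match["dst"].capitalize(),
        "AMOUNT": f"${match['amount']}",
    }


slots = parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
)
print(slots)
```

Because people rarely produce such complete utterances, a dialog typically has to ask follow-up questions for whichever slots come back missing.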
  • 51. Early Commercial Adoption • IVR – Touchtone / DTMF • “For checking, press 1. For savings, press 2.” – Directed Dialog (Grammar-based ASR) • “Which account? Just say ‘checking,’ ‘savings,’ or ‘money market.’” – Natural Language (SLM-based ASR) • “From which account?” • SpeechWorks / Nuance technology • Voice XML / GrXML 51© 2016 Versay Solutions LLC
  • 52.
  • 53. 53© 2016 Versay Solutions LLC
  • 54. Typical IVR Architecture 54© 2016 Versay Solutions LLC Voice Browser VUI VXML PSTN / VOIP HTTP App Server / Data Connection Data SIP MRCP ASR Server TTS Server
  • 55. Anatomy of a VUI + NLU project • Voice User Interface Design – High level design • Design style, sound and feel, IA – Detailed design • Prompts (recorded) • Grammars for directed dialog states • Data I/O • SLM Creation – Utterance capture – Transcription – Tagging – Compiling and deployment
  • 56. 56© 2016 Versay Solutions LLC
  • 57. VUI Design Doc – Detailed Example 57© 2016 Versay Solutions LLC
  • 58. Corpora Documentation Example 58© 2016 Versay Solutions LLC
  • 59. Design Considerations • Types of Speech User Interfaces – Command and Control – Dictation – Dialog-based • Speech is a linear, time-based interface – Multimodality introduces additional complications 59© 2016 Versay Solutions LLC
  • 60. Design Considerations • If the recognizer doesn’t get something, you have to reprompt. • Don’t say “sorry.” “Where are you traveling today?” I’m going to…. <noise> “What city was that?” 60© 2016 Versay Solutions LLC
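The reprompt pattern on this slide can be sketched as a small loop with escalating prompts. The first two prompts come from the slide; the third prompt, the `collect_city` function, and the `recognize` callback (which stands in for a real recognizer and returns `None` on a no-match) are assumptions for illustration.

```python
PROMPTS = [
    "Where are you traveling today?",
    "What city was that?",
    # Assumed third-level prompt that adds more guidance.
    "Please say the name of the city you're traveling to, like 'Dallas.'",
]


def collect_city(recognize, max_attempts=3):
    """Ask for a city, reprompting without apologizing when nothing is recognized."""
    for attempt in range(max_attempts):
        prompt = PROMPTS[min(attempt, len(PROMPTS) - 1)]
        result = recognize(prompt)
        if result is not None:
            return result
    return None  # give up; hand off to another strategy


# Simulated recognizer: noise on the first try, then a clean result.
answers = iter([None, "Dallas"])
heard_prompts = []


def fake_recognize(prompt):
    heard_prompts.append(prompt)
    return next(answers)


city = collect_city(fake_recognize)
print(city, heard_prompts)
```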
  • 61. Design Considerations • Speech is interruptible – Main Menu: Choose from: “Beverages,” “Sandwiches,” “Sides,” “Salads,” or “Alcoholic Drinks.” 61© 2016 Versay Solutions LLC
  • 62. Design Considerations • Prompts imply more than choices – Would you like chocolate or vanilla? • Yes • Both 62© 2016 Versay Solutions LLC
  • 63. Design Considerations • Input must be limited *after* it is provided – Can’t check the box on the client side to only allow input of valid amounts – “Sorry, you’re only allowed to transfer up to $500.” 63© 2016 Versay Solutions LLC
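Because a spoken amount cannot be constrained up front the way a form field can, the check has to run after recognition. A minimal sketch, assuming the $500 limit from the slide and hypothetical prompt wording:

```python
def validate_transfer(amount: float, limit: float = 500.0):
    """Validate a recognized amount and return (ok, next_prompt)."""
    if amount <= 0:
        return False, "How much would you like to transfer?"
    if amount > limit:
        return False, (
            f"You can transfer up to ${limit:,.0f}. "
            "How much would you like to transfer?"
        )
    return True, f"Transferring ${amount:,.2f}."


print(validate_transfer(750))
print(validate_transfer(50))
```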
  • 64. Design Considerations • Avoid using the word “Help” as a global command. • Instead, if there is a need to give additional information, supply it in the first or second reprompts. – Or use specific keywords – Other than “help” • “You can also say ‘instructions.’” • “Or, say ‘It’s something else.’” 64© 2016 Versay Solutions LLC
  • 65. User Centered Design Techniques • A set of techniques designed to keep the focus on the user during the design process • May include but are not limited to: – Conversations • Specific to VUI design – Read Aloud • Specific to VUI design – Card Sorts • Used to construct an IA – Personas • Used in all modalities – Usability Testing • Used in all modalities – A/B Testing • Useful for applications that are already in production 65© 2015 Versay Solutions LLC
  • 66. Usability Testing 66© 2016 Versay Solutions LLC
  • 67. 67
  • 69. What IS X? 69© 2016 Versay Solutions LLC
  • 70. What’s the Use Case For Speech? • Enabling application – User can’t do it any other way – New tasks • Enhancing application – User can do it now – But speech makes it better • Faster • Safer 70Credit: Bruce Balentine, EIG
  • 71. How Hard Is It To Do? • What do you need it for? • What kind of device will you be running it on? – Connectivity? Can you use cloud based ASR? – Do you have to download it? If so, how much space do you have? • How much control do you need over the application / user interface? 71© 2016 Versay Solutions LLC
  • 72. Possibilities 72© 2016 Versay Solutions LLC Write an app (skill) for an agent such as Cortana / Alexa Use cloud APIs to add ASR to your app / device / page / gadget Download an ASR and use full-featured capabilities for more robust recognition Build your own
  • 73. Distributed: Today’s Speech Agents • Siri • Cortana • Google Now • Amazon Echo (Alexa) 73© 2016 Versay Solutions LLC
  • 74. Today’s Cloud-Based Speech APIs • Distributed speech recognition – Collection and compression of speech is on the device – The language models are typically on the network – Phone can be speaker-dependent • Trains itself on your voice and on the acoustic environments you are in most often – Many companies are providing APIs to use their speech recognition 74© 2016 Versay Solutions LLC
  • 75. AVS vs. Amazon Echo • Could use AVS with the Amazon Echo, or with your own device 75© 2016 Versay Solutions LLC
  • 76. Speech API Example: Alexa Voice Services 76© 2016 Versay Solutions LLC
  • 77. Alexa Skill Example 77© 2016 Versay Solutions LLC
  • 78. 78© 2016 Versay Solutions LLC
  • 79. Alexa “Skills” • “Alexa, ask Yelp to find me a restaurant.” – Cortana has similar integration • Register your skill with Amazon and publish it 79© 2016 Versay Solutions LLC
  • 80. Cloud vs. Downloadable / Embedded • Microsoft – Cortana integration – Project Oxford API • Google API • Amazon • Several recent startups – Api.ai, Capio.ai, Speechmatics, iSpeech 80© 2016 Versay Solutions LLC • Microsoft – Windows 10 Speech APIs – Microsoft Speech Server • Nuance – the 800-pound gorilla in the room • Interactions – IBM Watson
  • 81. Cloud vs. Downloadable / Embedded • Easy to get started • Lightweight • Not much specialized knowledge 81© 2016 Versay Solutions LLC • Customizable • Probably better recognition • Can be device-specific • More features • Higher powered • Will require specialized knowledge • Speech scientist
  • 82. Today’s NLU APIs • Microsoft LUIS (part of Project Oxford) • Api.ai 82© 2016 Versay Solutions LLC
  • 83. Open Source ASR • CMU Sphinx – pocketsphinx • Kaldi – http://kaldi-asr.org/ • Github • New updates include some pretty interesting stuff (DNN) • Requires: – Corpus – Tech know-how 83© 2016 Versay Solutions LLC
  • 84. Who May You Need On Your Team • Speech Scientist • VUI Designer 84© 2016 Versay Solutions LLC
  • 85. Should I Speech-Enable X? 85© 2016 Versay Solutions LLC
  • 86. Should I Speech-Enable X? 86© 2016 Versay Solutions LLC Desktop App / Website • Easy to get started with API-based ASR • But the use case may not be as powerful Tablet / Mobile • Stronger use case • But will the network be available for APIs? Industrial Device • Great use case esp. with multimodal • But this is harder to do and probably will be custom Gadget • Decent use case • APIs are tailored for this • Will they do everything you need? • Will the extra modality be a plus or just a “silly add-on?” Car • Safety considerations are high here • Need better user interfaces & more robust IVR • Touchtone can still be good for a lot of applications • Speech is good for complex call routing and input
  • 87. Resources • The Voice in the Machine: Building Computers that Understand Speech – Roberto Pieraccini • YouTube video: “Open the Pod Bay Doors, Siri” • Best Practices in VUI Design: AVIxD Wiki – http://videsign.wikispaces.com/ • AVIxD: Quarterly Brown Bags 87© 2016 Versay Solutions LLC
  • 88. 88© 2016 Versay Solutions LLC Thanks! @crispinTX crispinreedy.com creedy@versay.com

Editor's notes

  1. DO NOT FORGET TO BRING THE MINI-SPEAKERS!!!
  2. Siri, Alexa, Cortana: Voice recognition is hot. This session will give an overview of the voice recognition and natural language ecosystem and the technologies behind the experience. For example: What is the difference between voice recognition and natural language understanding? What are some of the common technologies in the market today? What are design considerations around these types of interfaces? This is an introductory session designed for people interested in exploring the possibilities of voice and conversational user interfaces, especially when considering the Internet of Things.
  3. Siri, Alexa, Cortana: Voice recognition is hot. This session will give an overview of the voice recognition and natural language ecosystem and the technologies behind the experience. For example: What is the difference between voice recognition and natural language understanding? What are some of the common technologies in the market today? What are design considerations around these types of interfaces? This is an introductory session designed for people interested in exploring the possibilities of voice and conversational user interfaces, especially when considering the Internet of Things.
  4. Computers: Apps and webpages. Consoles: Gaming / Connectivity Mobile and Tablet Industrial devices – especially something task driven Gadgets Cars The phone
  5. Computers: Apps and webpages. Consoles: Gaming / Connectivity Mobile and Tablet Industrial devices – especially something task driven Gadgets Cars The phone Essentially what we’re coming to terms with here is a new input modality. It’s one that doesn’t always work very well – for reasons we’ll get into later. But, it can be a very powerful one when it does work well. It’s also a lot harder to figure out how to properly combine speech with everything else that is going on in your environment.
  6. Not going to discuss this one in a lot of detail today but it’s important that you understand the difference between these technologies.
  7. Human voice talent Hundreds of hours of recording Digitized Phonemes: Concatenated speech synthesis
  8. World Knowledge: Concepts of the world around us, e.g. tables have four legs, what is left and right, what is a car, etc. This is the level before language. Semantics: The first level of language. Knowledge can be represented in structured meaningful elements. Example: semantics of a party invitation. Syntax: The rules that govern putting words together to form meaningful units. Lexicon: What words mean. Morphology: How words change their form to perform differently in a language, e.g. horse / horses. Phonetics: Phonemes and how words are built. Acoustics: What phonemes sound like and how to create them.
  9. Waveforms show the variation in overall intensity (decibels) over time. Spectrograms show the variation of individual frequency components
  10. Observations to make: Represents the entirety of a VUI experience Placement of Spanish prompt would vary depending on type of call. Confirmation is variable Confirmation prompt is general
  11. DO NOT FORGET TO BRING THE MINI-SPEAKERS!!!