Intelligent Conversational Agents for Ambient Computing
Ruhi Sarikaya
Director, Alexa AI
SIGIR 2022, Madrid
Outline
• Long-range view of fundamental trends and shifts in computing and user experience
• What do IoT and context mean for ambient conversational AI?
• How does conversational AI work?
• Self-learning: implicit and explicit customer feedback based learning
• Q&A
Alexa Devices
[Image slide: the Alexa Echo family of devices launched over the years]
Alexa Everywhere
[Image slide: Alexa beyond the home — senior living, hospitality, healthcare, automotive]
Human Interaction with the Digital World
Human senses: sight, hearing, touch, smell, taste. The perceptions of our senses are created and stored in different parts of the brain.
Computer ‘senses’
• No sight & no hearing (until recently)
• Form of human input: typing & tactile
Gap
• Computers (and backend services) are not yet designed to operate on voice input
Problem
• You need to physically touch computers
• This tethers you to a screen and ‘immobilizes’ you
 Friction!
• Current computing cycle: mobile internet [Meeker, Morgan Stanley, 2014]
• Little room left for growth in connecting people to the internet via smartphones (after 2020)
• What is next?
 IoT and intelligent connected systems & services  Ambient intelligence with conversational AI as the UX layer
The New Computing Cycle
[Chart: 10X computing cycles, 1960–2020, units shipped on a log scale: Mainframe (1M+ units) → MiniComputer (10M+ units) → PC (100M+ units) → Desktop Internet (1B+ units/users) → Mobile Internet (10B+ units) → IoT (100B+ units: mobile phones, tablets, eReaders, MP3 players, telematics, …, any device)]
Each cycle is driven by:
• Increased integration
• Smaller form factor
• Increased power & storage
• Lower costs
• Improved UI
[Chart: global computing device shipments in millions, 2010–2021, for smartphones, PCs+laptops, and tablets]
[Chart: IoT worldwide install base in billions of connected devices, 2015–2025: 15.41, 17.68, 20.35, 23.14, 26.66, 30.73, 35.82, 42.62, 51.51, 62.12, 75.44]
Internet of Things (IoT): Connected Smart Devices with Sensors
• Sensors: smaller, lower power, and cheaper
• An estimated 1 trillion sensors by 2022
• Digital nervous system: location data (GPS), eyes and ears via camera and microphone, sensors (motion, temperature, light, pressure, etc.)
[Diagram: real-time data aggregation → linking → reasoning (AI) → decision making (AI), feeding collective IoT intelligence across industry (machinery, smart cities, transportation, healthcare, factories, automation) and consumers (phones, wearables, TVs, appliances, home automation, home monitoring)]
Smart Home
• Over 90% of our lives is spent inside buildings
• An intelligent & responsive physical environment
• IoT integrates the physical world with the digital world
• The world around us reasons and talks back to us in real time
Examples of Ambient Intelligence: Alexa Hunches and Routines
[Illustration: “It looks like you left the lights on. Would you like me to turn them off?” — lights, locks, appliances]
Why does IoT matter for Conversational AI?
• “Alexa, play hunger games?”
• What is the user’s intent?
• play_music? play_video? play_audiobook?
• “Alexa, what should I do for dinner?”
• What is the user’s intent?
• book_restaurant? order_food? find_recipe?
• Ground truth for a large combination of [person x device x context] data? How do we scale learning?
• “Alexa, order me two towels?”
• What is the user’s intent?
• shopping? room service?
• “Alexa, what is the temperature?”
• What is the user’s intent?
• weather forecast? temperature inside the home? temperature of the oven?
• IoT is increasing the complexity (and opportunity) of the world
• Requires real-time communication with a reasoning environment
• Creates new forms of ‘context’
• Context:
• The set of circumstances/facts that surround a particular event, situation, or entity, which AI systems can use to sense, reason about, and adapt better to the physical and digital world
 Identity & state, device types, physical/digital activity on devices/systems, time, device & user location, state and changes in the environment as measured by sensors, ….
• Why does context matter for conversational AI?
• Contextual ambiguity: users have no ambiguity in mind when they issue a command to an intelligent assistant; the ambiguity is on the system’s side
How Does Conversational AI Work?
[Diagram: “Alexa, what is the weather?” — the user’s utterance goes to ASR (recognize → recognition result), the recognition result goes to NLU (→ N-best interpretations), routing picks an (intent, skill) pair and dispatches the intent to a skill (e.g. Weather), and the skill’s text/SSML response goes to TTS via a “speak” directive, producing Alexa’s voice. The Orchestrator coordinates all of these components.]
The Orchestrator handles:
• Orchestration: ASR, NLU, routing, TTS, application services
• Intent routing to applications
• Session management
• Dialog management: multi-turn interactions
• Abstraction of device features to applications
(A minimal code sketch of this request/response loop follows.)
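The flow above can be summarized in a few lines of code. The sketch below is a hypothetical stand-in, not Alexa's actual API: stub ASR/NLU/TTS components wired together in the orchestration order the diagram describes.

```python
# Minimal orchestration sketch; every component here is a made-up stub.
from dataclasses import dataclass

@dataclass
class Interpretation:
    intent: str
    skill: str
    score: float

class StubASR:
    def recognize(self, audio: bytes) -> str:
        return "what is the weather"            # pretend recognition result

class StubNLU:
    def interpret(self, text: str) -> list:
        # N-best interpretations, with confidence scores
        return [Interpretation("get_weather", "Weather", 0.92),
                Interpretation("get_temperature", "SmartHome", 0.05)]

class StubTTS:
    def speak(self, text: str) -> str:
        return f"<speak>{text}</speak>"          # text/SSML "speak" directive

def orchestrate(audio: bytes) -> str:
    asr, nlu, tts = StubASR(), StubNLU(), StubTTS()
    skills = {"Weather": lambda intent, text: "It is 72 degrees and sunny."}
    text = asr.recognize(audio)                  # user's utterance -> text
    nbest = nlu.interpret(text)                  # text -> N-best interpretations
    top = max(nbest, key=lambda i: i.score)      # routing: pick (intent, skill)
    response = skills[top.skill](top.intent, text)
    return tts.speak(response)                   # -> Alexa's voice

print(orchestrate(b"..."))
```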
Machine Learning Types
(in terms of types of supervision/feedback)
• Supervised learning is the task of learning a prediction function that maps an input to an output based on example input–output pairs: y = f(x) (e.g. DNNs, logistic regression, SVMs). It typically achieves very accurate predictions with sufficient data. (A minimal example follows this list.)
 Training: given a training set of labeled examples {(x1,y1), …, (xN,yN)}, estimate the prediction function f by minimizing the prediction error on the training set
 Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
• Unsupervised learning looks for patterns in input data that has no pre-existing labels. It allows modeling of probability densities over inputs to deduce structure (e.g. k-means, PCA, LDA).
• Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. There are different variants: self-training, co-training, generative methods, graph-based methods, etc.
• Self-supervised learning predicts one part of the input from what it knows about another part, without any human supervision (e.g. BERT, RoBERTa, GPT-3).
• Reinforcement learning (RL) is concerned with how agents should take actions in an environment in order to maximize a notion of cumulative reward.
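A minimal supervised-learning example of y = f(x), using scikit-learn's LogisticRegression as the prediction function (any classifier would do): fit on labeled pairs, then predict on a never-before-seen input.

```python
# Supervised learning in miniature: estimate f from labeled pairs, apply to new x.
from sklearn.linear_model import LogisticRegression

X_train = [[0.0], [1.0], [2.0], [3.0]]   # inputs x1..xN
y_train = [0, 0, 1, 1]                   # labels y1..yN

f = LogisticRegression().fit(X_train, y_train)  # minimize prediction error on the training set
print(f.predict([[2.5]]))                # predicted y for an unseen x -> [1]
```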
Self-Learning for Conversational AI
What do we mean by self-learning?
• A framework that enables learning autonomously from user–system interactions (e.g. barge-ins, reformulations), system signals, and predictive models
• It can be considered a layer that combines supervised learning, semi-supervised learning, and RL
• Zero component-specific manual annotation to train and improve the machine learning models
• Leverages customers’ implicit and explicit feedback and system signals to train and improve ML models across the conversational AI stack, both offline and at runtime
Why self-learning?
• Speed: rapid scenario building and deployment
• Cost: minimizing manual annotation cost
• Ambiguity: customers (unlike annotators) know best what they mean and want
• Privacy: does not require human access to customer data
Customer Feedback Based Automated Ground Truth Generation
• A multi-year initiative to shift Alexa ML model development from a manual-annotation-based approach to a primarily self-learning-based approach by leveraging various forms of feedback:
• Explicit feedback (e.g. Alexa: “Did I answer your question?” User: “Yes”)
• Implicit feedback (e.g. the user barges in on a turn or rephrases their request)
• Unsolicited feedback (e.g. the user says “Alexa, thank you!” or “Alexa, I am not Derek, I am Dan”)
• Mission: automatically generate labels for 100% of Alexa utterances and for all annotation workflows in near-real time by leveraging customers’ interactions and their feedback
• Goals: provide automatically accumulated signals and data to
• Protect user privacy (by removing human reviewers from the loop)
• Improve model accuracy (by providing more personalized labels)
• Reduce annotation cost
Customer Feedback Based Ground Truth Generation Overview
[Diagram: the Alexa runtime system. Production model outputs (ASR 1-best, NLU 1-best, etc.) from the Alexa models (ASR, NLU, etc.) feed a confidence-prediction module. For low-confidence turns, an exploration decider triggers alternative-hypothesis generation and explores the hypotheses either implicitly (directly replacing the 1-best with an alternative hypothesis) or explicitly (presenting multiple hypotheses to the customer, e.g. via voice confirmation or on-screen choices). A feedback collection & understanding module gathers implicit feedback (e.g. barge-in, stop, rephrase), explicit feedback (e.g. “Did I answer your question?” — “Yes”), unsolicited feedback (e.g. “Alexa, thank you”), and system signals (ASR error prediction, dialog success estimation). Multi-task label generation models convert this feedback into feedback-based annotation data, which, together with other data (unlabeled data, existing annotations, etc.), is used to train new ASR and NLU (DC, IC, NER) models and other Alexa modules.]
(A small sketch of the exploration decision follows.)
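A toy sketch of the exploration decision described in the diagram; the confidence threshold and the screen-based policy choice are illustrative assumptions, not the production logic.

```python
# Hypothetical exploration decider: explore only low-confidence turns, and pick
# explicit vs. implicit exploration based on whether the device has a screen.
def decide_exploration(confidence: float, alternatives: list,
                       has_screen: bool, threshold: float = 0.6):
    """Return (mode, hypothesis) for one runtime turn."""
    if confidence >= threshold or not alternatives:
        return ("none", None)            # high confidence: no exploration needed
    if has_screen:
        # explicit exploration: show hypotheses for the customer to choose from
        return ("explicit", alternatives)
    # implicit exploration: silently swap the 1-best for an alternative hypothesis
    return ("implicit", alternatives[0])

print(decide_exploration(0.35, ["play frozen the movie", "play frozen soundtrack"], True))
```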
Model Architecture for Customer Feedback Based Ground Truth Generation
Multi-task Label Generation Model
Features
- Dialogue context (user utterance, Alexa response, previous turns, next turns, etc.)
- System metadata (domain, intent, dialog status, confidence scores, etc.)
Model
- Turn encoder + dialogue-level transformer
- The turn-level textual encoder is RoBERTa
Multi-task learning heads
- Explicit user feedback (e.g. the user says “thank you”)
- Inferred user feedback (e.g. the user plays music for 30 seconds after a voice command)
- Manual annotation
Self-supervised pretraining
- Synthetic contrastive data (i.e. randomly swap in answers from a different dialog as defect samples)
[Diagram: for each turn, the request and response text plus speaker IDs go through a turn-level textual encoder (RoBERTa), while categorical features (domain, intent, dialogue status, …) and real-valued/binary features go through an MLP; the outputs are concatenated in a turn-level layer. A dialogue-level transformer then contextualizes the encoded turns (Turn 1 … Turn n). Task-specific heads predict, for a target turn: E2E defect estimation, intent classification, ASR recognition, named entity recognition, goal evaluation, and goal segmentation. Training data comprises E2E defect annotation, transcriptions, NLU annotation, and dialog goal annotation.]
(A structural code sketch of this architecture follows.)
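Below is a structural sketch (PyTorch) of this architecture under assumed dimensions and task heads; random tensors stand in for per-turn RoBERTa embeddings so the sketch runs standalone.

```python
# Turn-level encoder (text embedding + MLP over metadata, concatenated),
# dialogue-level transformer, and one classification head per task.
import torch
import torch.nn as nn

class MultiTaskLabelGenerator(nn.Module):
    def __init__(self, text_dim=768, meta_dim=32, hidden=256, tasks=None):
        super().__init__()
        tasks = tasks or {"e2e_defect": 2, "intent": 50, "goal_eval": 2}  # assumed sizes
        self.meta_mlp = nn.Sequential(nn.Linear(meta_dim, hidden), nn.ReLU())
        self.turn_proj = nn.Linear(text_dim + hidden, hidden)   # concat layer (turn level)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True)
        self.dialogue_encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.heads = nn.ModuleDict({t: nn.Linear(hidden, n) for t, n in tasks.items()})

    def forward(self, turn_text_emb, turn_meta):
        # turn_text_emb: [batch, turns, text_dim], e.g. one RoBERTa embedding per turn
        # turn_meta:     [batch, turns, meta_dim], categorical/real-valued features
        turns = self.turn_proj(torch.cat([turn_text_emb, self.meta_mlp(turn_meta)], -1))
        ctx = self.dialogue_encoder(turns)        # dialogue-level transformer
        target = ctx[:, -1]                       # predict labels for the target turn
        return {task: head(target) for task, head in self.heads.items()}

model = MultiTaskLabelGenerator()
out = model(torch.randn(2, 5, 768), torch.randn(2, 5, 32))
print({task: logits.shape for task, logits in out.items()})
```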
Automated Ground Truth Generation Results
Goal Segmentation/Evaluation
Table 1. Goal segmentation and evaluation tasks. We compare model prediction accuracy against human (single-pass) annotation accuracy (note that we use 3-pass gold annotation as ground truth). “Single turn” means dialogues with only one turn; “multi turn” means dialogues with multiple turns. “Single-task” denotes models separately fine-tuned on one task at a time, whereas “multi-task” denotes models fine-tuned on multiple tasks together. “Combined accuracy” and “combined weighted F1 score” combine the goal segmentation and evaluation tasks.
Intent Classification / Named Entity Recognition
Table 2. Intent classification. Comparing our model using dialogue context against a RoBERTa-based baseline model on the intent classification task for the Shopping domain (bolded rows show the intents with the largest improvements).
Table 3. Slot tagging. Comparing our model using dialogue context against a RoBERTa + CRF baseline model on the slot tagging task for the Shopping domain (bolded rows show the slot types with the largest improvements).
Publications:
• Gupta, S. et al. “RoBERTaIQ: An efficient framework for automatic interaction quality estimation of dialogue systems”. KDD 2021
• Wang, Z. et al. “Contextual rephrase detection for reducing friction in dialogue systems”. EMNLP 2021
• Park, D. et al. “Large-scale hybrid approach for predicting user satisfaction with conversational agents”. NeurIPS 2020
Defect Correction with the Self-Learning Framework
• Enable self-learning in Alexa to reduce Customer Perceived Defects and enhance its understanding in real time, with context, without any human annotator in the loop
Prevention & correction steps:
1. Detect defects: Customer Perceived Defect (CPD) metric; detection runs daily to real-time.
   Example: “Alexa, play Buddha” → “Buddha Spa from Ama…” → “Alexa, stop” (a customer-perceived defect)
2. Learn corrections: from rephrases, follow-ups, or dialogs; learning & deployment run daily.
   Example: “Alexa, play Boo’d Up” → “Playing Boo’d Up by …” (success)
3. Correct defects: at runtime, generate alternate utterances (aka query rewriting).
   Example: “Alexa, play Buddha” → rewritten as “play Boo’d Up” → success
4. Automatic guardrails: several guardrails to prevent trustbusters/regressions — automatic blocklisting (2 hrs to near-real time), reducing false wakes, handling sensitive utterances.
(A toy sketch of a blocklisting guardrail follows.)
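As a toy illustration of automatic blocklisting, the sketch below blocks a rewrite once its observed win:loss record falls below a bar; the thresholds, counters, and class name are made-up assumptions, not the production guardrails.

```python
# Hypothetical guardrail: stop applying rewrites that keep losing in production.
from collections import defaultdict

class RewriteGuardrail:
    def __init__(self, min_trials=50, min_win_loss=1.0):
        self.stats = defaultdict(lambda: {"win": 0, "loss": 0})
        self.min_trials, self.min_win_loss = min_trials, min_win_loss
        self.blocklist = set()

    def record(self, rewrite: str, success: bool):
        self.stats[rewrite]["win" if success else "loss"] += 1
        s = self.stats[rewrite]
        if s["win"] + s["loss"] >= self.min_trials and \
           s["win"] < self.min_win_loss * s["loss"]:
            self.blocklist.add(rewrite)      # losing rewrite: block it

    def allowed(self, rewrite: str) -> bool:
        return rewrite not in self.blocklist

guard = RewriteGuardrail(min_trials=4)
for ok in [False, False, False, True]:
    guard.record("play buddha -> play buddha spa", ok)
print(guard.allowed("play buddha -> play buddha spa"))  # False: blocklisted
```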
Self-Learning Based Defect Reduction in Large-Scale Conversational AI Agents
There are two general ways to provide rewrites for the reformulation engine:
• Precomputed Rewriting: this pipeline produces request→rewrite key-value pairs offline and loads the pairs at runtime. It takes advantage of offline information (e.g. the user’s own rephrases, offline metrics) and a larger latency budget.
• Online Rewriting: this pipeline leverages rewrite models (e.g. retrieval/ranking models or generation models) and online contextual information (e.g. previous dialog turns, dialog location, time) to produce rewrites online. It enables rewriting for long-tail defect queries.
(A minimal sketch contrasting the two paths follows.)
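A minimal sketch of the two paths: a precomputed request→rewrite table consulted first, with a hypothetical online model as the fallback for long-tail queries. The table entries are illustrative.

```python
# Precomputed pipeline: key-value pairs built offline, cheap lookup at runtime.
PRECOMPUTED = {
    "play buddha": "play boo'd up",
    "full volume": "volume ten",
}

def rewrite(query: str, context: dict, online_rewrite_model) -> str:
    key = query.lower().strip()
    if key in PRECOMPUTED:                 # head queries: offline key-value pairs
        return PRECOMPUTED[key]
    # online pipeline: retrieval/ranking or generation over contextual signals
    return online_rewrite_model(query, context)

print(rewrite("Full volume", {"turns": []}, lambda q, c: q))  # -> "volume ten"
```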
Precomputed Rewriting: Contextual Rephrase Detection in Conversational Agents
[Diagram: a session of query/response turns, ending in a query/rephrase pair, is flattened into a single sequence as model input.]
Example session:
  User: play tyler hero explicit
  Agent: Here’s hypothetical hero, by Tyler Rothrock
  User: play tyler hero explicit by jack harlow
  Agent: Sorry, I can’t find that …
Model input (flattened session): [User] play tyler hero explicit [Agent] Here’s hypothetical hero, by Tyler Rothrock [User] play tyler hero explicit by jack harlow [Agent] Sorry, I can’t find that …
Model output (rephrase candidates with scores): Play tyler hero by jack harlow (0.9); Play tyler hero (0.05)
“Contextual Rephrase Detection for Reducing Friction in Dialogue Systems”, Wang et al., EMNLP 2021
Precomputed Rewriting: Feedback-Based Self-Learning in Conversational AI Agents
[Figure not reproduced: absorbing Markov chains over rephrase sequences, panels (a)–(e)]
• Users provide feedback to Alexa in the form of rephrases.
• Recurring user rephrases like (a), (b), (c) are encoded in absorbing Markov chains.
• By resolving the Markov model as in (d), we surface the rewrite that is most likely to result in success, as in (e). (A compact numerical sketch follows.)
• “Feedback-based self-learning in large-scale conversational AI agents”, Ponnusamy et al., AAAI 2020
• “Self-aware feedback-based self-learning in large-scale conversational AI”, Ponnusamy et al., to appear in NAACL 2022
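A compact numerical sketch of the absorbing-Markov-chain idea, with made-up transition probabilities: utterances are transient states, successful outcomes are absorbing states, and solving B = (I - Q)^-1 R gives each defective utterance's probability of ending in success.

```python
# Absorption probabilities of an absorbing Markov chain over rephrase sessions.
import numpy as np

# States: 0 "play buddha" (transient), 1 "play buddha spa" (transient),
#         2 "play boo'd up" -> success (absorbing).
# Q: transient -> transient transitions; R: transient -> absorbing transitions.
Q = np.array([[0.0, 0.3],      # from "play buddha"
              [0.1, 0.0]])     # from "play buddha spa"
R = np.array([[0.7],           # "play buddha"     -> absorbed in success
              [0.9]])          # "play buddha spa" -> absorbed in success

B = np.linalg.solve(np.eye(2) - Q, R)   # B = (I - Q)^-1 R
print(B)  # per transient utterance: probability of eventually reaching success
```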
Online Rewriting: Search-Based Self-Learning Query Rewriting System
[Diagram: offline, a personalized indexer builds a personalized index from customer signals (interactions with AI devices, purchase history, contact names, routine phrases, …) and a global indexer builds a global index; online, personalized and global retrieval/ranking models each retrieve candidate rewrites for the user query, and rewrite-merging logic selects the final rewrite.]
Example:
  User query: “how’s the weather in Wikeson”
  Global top-1 rewrite: “how’s the weather in Wilkeson Washington”
  Personalized top-1 rewrite: “how’s the weather in Wilkerson California”
  Final rewrite: “how’s the weather in Wilkerson California”
(A sketch of one possible merging rule follows.)
“Personalized Search-based Query Rewrite System for Conversational AI”, Cho et al., NLP4ConvAI 2021
“Search based self-learning query rewrite system in conversational AI”, Fan et al., De-MaL 2021
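One plausible form of the merging logic, consistent with the example above: prefer the personalized candidate when it clears a confidence bar, otherwise fall back to the global one. The scores and threshold below are assumptions, not the published rule.

```python
# Hypothetical rewrite-merging rule for the personalized + global candidates.
from typing import Optional, Tuple

def merge_rewrites(personal: Optional[Tuple[str, float]],
                   global_: Optional[Tuple[str, float]],
                   personal_threshold: float = 0.5) -> Optional[str]:
    if personal and personal[1] >= personal_threshold:
        return personal[0]        # personalized index wins when confident
    if global_:
        return global_[0]
    return None                   # no rewrite: pass the query through unchanged

print(merge_rewrites(("how's the weather in Wilkerson California", 0.8),
                     ("how's the weather in Wilkeson Washington", 0.7)))
```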
Selected Experimental Results for Query Rewriting
• Precomputed Rewriting: deployed the model in [1] across 11 locales spanning 6 languages. Online A/B testing demonstrated a significant reduction (p-value ≤ 0.0001) in defects experienced, with relative defect reductions ranging from 22.73% to 31.22%.
• Online Rewriting: deployed the systems in [2] in en-US. Online A/B testing demonstrated a significant (p-value < 0.001) relative reduction in defect rate (13%). Launching the personalized system on top of the global one led to an additional significant defect rate reduction of 4%.
[1] “Self-aware feedback-based self-learning in large-scale conversational AI”, Ponnusamy et al., to appear in NAACL 2022
[2] “Search based self-learning query rewrite system in conversational AI”, Fan et al., De-MaL 2021
Rewrite Examples
Type                  | Request                          | Rewrite
Global rewrite        | Full volume                      | Volume ten
Global rewrite        | Don’t ever play that song        | Thumbs down this song
Global rewrite        | Play a. b. c.                    | Play the alphabet song
Personalized rewrite  | Open angry sleepy time playlist  | Open avery sleepy time playlist
Personalized rewrite  | Pair with johnson’s iphone       | Pair with john’s iphone
Personalized rewrite  | Play drivers license             | Play the song drivers license by olivia rodrigo
Win:Loss Ratio: 8.5 : 1
Learning Latency: 24 hrs
Teachable AI
• Customers can interactively teach Alexa and instantly adapt her to their personal preferences, such as “I’m a Warriors fan,” “I like Italian restaurants,” or “I prefer Big Sky for my weather,” by
• initiating a conversation with Alexa at any time,
• Alexa proactively sensing a teachable moment (e.g. repeat usage or an unsatisfactory response) and clarifying a preference, or
• initiating a guided Q&A with Alexa with a simple cue like “Alexa, learn my preferences,” and sharing their favorites across topics like sports, food, and weather.
• Personalized Experiences: the next time customers query Alexa on related topics, like their sports update, restaurants nearby, or the weather, Alexa will bear their interests in mind to curate personalized selections.
Preference Teaching
[Image slide]
Failure Point Isolation: Predict Which Component Failed
Figure: component-level architecture of a typical conversational assistant. Color codes correspond to Turn 1 on the next slide (a fatal ASR error and a non-fatal ERR error).
Predicted classes:
• False Wakes (FW)
• ASR errors
• NLU errors
• Entity Resolution errors (ERR)
• Result errors
• Correct (no error)
Failure Point Isolation: Examples
Turn 1
• ASR: Failure Point
• NLU: Correct
• ERR: Wrong but not the Failure Point
• FPI output = {ASR error}
Turn 2
• ASR: Non-fatal error (“the” missing)
• NLU: Correct
• ERR: Correct
• FPI output = {Correct}
Turn 3
• ASR: Correct
• NLU: Correct
• ERR: Correct
• FPI output = {Correct}
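The decision rule these examples imply can be sketched as follows: walk the pipeline in order and report the first fatal error; non-fatal errors alone do not make a turn defective. The verdict encoding is an assumption for illustration, not the model's actual interface.

```python
# Report the first fatal failure along the pipeline, else "Correct".
PIPELINE = ["FW", "ASR", "NLU", "ERR", "Result"]

def failure_point(verdicts: dict) -> str:
    """verdicts maps component -> 'correct' | 'non-fatal' | 'fatal'."""
    for component in PIPELINE:
        if verdicts.get(component, "correct") == "fatal":
            return f"{component} error"
    return "Correct"

print(failure_point({"ASR": "fatal", "NLU": "correct", "ERR": "non-fatal"}))    # ASR error (Turn 1)
print(failure_point({"ASR": "non-fatal", "NLU": "correct", "ERR": "correct"}))  # Correct (Turn 2)
```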
Failure Point Isolation (FPI) Model vs. Human Performance*
• Human F1-score is calculated for a single human against a panel of expert annotators
• The FPI model outperforms humans for Result and Correct cases
• False Wake performance is the weakest, at 71.2%
• Detection of ASR, ERR, and NLU errors is at 90–95% of human performance
* Khaziev et al., “FPI: Failure Point Isolation in Large-scale Conversational Assistants”, NAACL-HLT 2022 Industry Track
It is still Day 1!
Speaker Notes
  1. Ack Gabriella, org committee, SIGIR. Thank you for joining the talk. The summary of the talk is captured in the title itself. We want to enable natural contextual interactions for ambient computing, to do that we need scalable self-learning methods that can handle ambiguity and context for accurately understanding the user’s request and provide the best possible answer. The backdrop for the talk is conversational systems like Alexa. Let us get right to it.
  2. We celebrated Alexa’s 7th birthday on last Saturday, on Nov 6th. Here you see some of the Alexa Echo family of devices starting launched over the years starting with the first generation. Note that of these devices has specific name, Alexa is agent or AI behind it and it lives in the cloud.
  3. Alexa AI and devices are not built just for your homes, as Amazon we partnered with other companies to bring Alexa to senior living communities, hospitality business as in Marriott hotels, hospitals , here you see Boston Children’s Hospital. Timothy Driscoll, Director of Technology Strategy at Boston Children’s Hospital, says, “Boston Children's Hospital is using Amazon Echo devices to provide an array of features to patients including entertainment in the form of music and games, hospital and unit-specific frequently-asked questions, and control of the in-room televisions. Our patients will soon be able to express their needs for things like pain management support - "Alexa, tell my nurse I'm in pain," or general comfort - "Alexa, tell my nurse I need a pillow."  Alexa experiences are augmented with end-point specific experiences as call nurses, hospitals related frequently asked questions as in hospitals. Two weeks ago also launched Alexa in Disney resorts,. Alexa is also integrated into infotainment systems of most of the car manufacturers, There is not a single week that goes by that we do not announce something about Alexa or Alexa integration with partners. Just last week also announced Alexa on Range rovers. What is really happening?
  4. Let us talk about the basics first. How do we sense the world around us? We (as humans) have our senses and we primarily rely on our vision and hearing. These are our primary means of getting input from the outside world. Let us look at computers and how they sense the world? How do they get input from the outside world and particularly from humans? Let’s look at the slightly longer horizon and go to 1946, this is ENIAC the mother of modern day computers. This picture is taken at Upenn. The Apple iPhone is different -- many of the elements of its multi-touch user interface require you to touch multiple points on the screen simultaneously (capacitive screens vs resistive screens) We have been manually pushing our wills into our tools, literally by using our hands. ENIAC programmers Frances Bilas (later Frances Spence) and Betty Jean Jennings (later Jean Bartik) stand at its main control panels.  With ENIAC's 40 panels still under construction, and its 18,000 vacuum tube technology uncertain, the engineers had no time for programming manuals or classes. Bartik and the other women taught themselves ENIAC's operation from its logical and electrical block diagrams, and then figured out how to program it. They created their own flow charts, programming sheets, wrote the program and placed it on the ENIAC using a challenging physical interface, which had hundreds of wires and 3,000 switches. Eniac was impressive: 80 feet long and 8 feet high, weighing 30 tons, with 18,000 vacuum tubes, 70,000 resistors, 10,000 capacitors and 1,500 relays. It had a memory capacity of 20 words and was programmed by setting 6,000 dials and switches, a task that took a crew of workers many hours. The Apple-1 computer, built by hand in 1976 by Steve Wozniak in Apple co-founder Steve Jobs' garage or his sister's bedroom, fetched nearly twice its pre-sale high estimate, Bonhams said. - Nowadays a typical cell-phone has over 40 sensors: accelerometers, gyroscope, thermometers, wifi signals, Bluetooth, RF signal gathering etc.
5. Low-power, cheap sensors are being integrated into devices, physical systems, and buildings across industry verticals, from transportation to healthcare to factories. Likewise, on the consumer side, more and more sensors are integrated into the devices and appliances we use every day. It is estimated that there will be about 1 trillion sensors in the world, roughly 120 sensors per person. These sensors measure changes in the environment or in the user’s state, from temperature, pressure, touch, and light to the user’s heart rate and physical movements. More specifically, with GPS data, cameras, and microphones, coupled with internet connectivity, we are building a digital nervous system for the environment. This paves the way for an intelligent and responsive environment.
6. The fundamental challenge of Verity is that we need to convert customer feedback, which is oftentimes binary or low dimensional (‘yes/no’, ‘defect/non-defect’), into high dimensional predicted labels such as transcriptions or NER annotations. The key strategy is a “hypothesis” → “exploration” → “confirmation” process: the Verity system generates or selects potential alternative hypotheses for the labels we want customer feedback on, injects those hypotheses into the Alexa runtime either through implicit exploration (e.g., query rewriting or n-best ranking) or explicit exploration (e.g., asking the customer “you mean Frozen the movie, right?”), and then collects customer feedback on these hypotheses to generate high-confidence synthetic labels. Verity is composed of two major components: 1) an Exploration Module and 2) a Feedback Collection and Label Generation Module. The exploration module is responsible for: 1) confidence prediction, i.e., predicting the confidence of the current turn in order to decide whether or not to trigger exploration; 2) hypothesis generation, i.e., generating alternative hypotheses for the labels that Verity wants to collect customer feedback on; and 3) exploration, i.e., injecting the hypotheses into the Alexa runtime, either a) implicitly, by directly replacing the existing 1-best hypothesis with an alternative, or b) explicitly, by presenting multiple hypotheses for customers to choose from through mechanisms such as voice confirmation and on-screen dynamic feedback (e.g., presenting a list of candidate transcriptions and asking “which of the following transcriptions matches what you said?”). The feedback collection and label generation module collects and interprets customer feedback, and converts that information into high-quality synthetic labels that can be used for training the respective models. It collects implicit feedback (barge-in, paraphrases), explicit feedback (e.g., “did I answer your question?” “yes”), and unsolicited feedback (e.g., positive: “Alexa, thank you”; negative: “Alexa, that is not what I said”). The multi-task label generation model is a multi-task deep-learning model that labels data leveraging context derived from customer feedback.
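To make the loop concrete, here is a minimal sketch of the hypothesize → explore → confirm flow. All helper names (confidence_model, explorer, label_store) and thresholds are hypothetical illustrations, not actual Alexa APIs.

```python
# Minimal sketch of the hypothesize -> explore -> confirm loop described above.
# All names (confidence_model, explorer, label_store) are hypothetical.

def handle_turn(utterance, asr_nbest, confidence_model, explorer, label_store):
    """Decide whether to explore, inject a hypothesis, and harvest feedback."""
    # 1) Confidence prediction: only explore when the current turn looks risky.
    if confidence_model.score(utterance) > 0.9:
        return  # high confidence; no exploration needed

    # 2) Hypothesis generation: alternatives we want customer feedback on,
    #    e.g., alternative transcriptions from the ASR n-best list.
    hypotheses = asr_nbest[:3]
    best = max(hypotheses, key=lambda h: h.score)

    # 3) Exploration: implicit (silent rewrite / n-best swap) when fairly sure,
    #    explicit (voice or on-screen confirmation) otherwise.
    if best.score > 0.7:
        feedback = explorer.implicit_rewrite(best.text)
    else:
        feedback = explorer.ask_confirmation(f"you mean {best.text}, right?")

    # 4) Confirmation: a binary customer signal becomes a high-confidence
    #    synthetic label for the high-dimensional target (the transcription).
    if feedback.is_positive():
        label_store.add(source=utterance, label=best.text)
```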
7. Verity Label Generation Model architecture: a turn-level encoder plus a dialogue/session-level transformer. The turn-level textual encoder is RoBERTa, with an MLP for categorical/numerical features. Input features include textual features (user utterance, Alexa response, previous turns, next turns), categorical features (domain, intent, dialog status), numerical features (number of tokens, ASR/NLU confidence scores), and raw audio data. Training uses multi-task learning heads, with supervision from explicit/implicit user feedback, manual annotation, and self-supervision (synthetic contrastive learning).
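A hedged sketch of that hierarchy follows; the dimensions, layer counts, and head names are illustrative assumptions, not the production configuration.

```python
import torch.nn as nn
from transformers import RobertaModel

class SessionLabelModel(nn.Module):
    """Turn-level RoBERTa encoder feeding a session-level transformer,
    with multi-task output heads (all sizes/names are assumptions)."""

    def __init__(self, n_categorical=32, n_numerical=8, d_model=768):
        super().__init__()
        self.turn_encoder = RobertaModel.from_pretrained("roberta-base")
        # MLP projecting categorical/numerical features into the text space.
        self.feat_mlp = nn.Sequential(
            nn.Linear(n_categorical + n_numerical, d_model), nn.ReLU())
        session_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=8, batch_first=True)
        self.session_encoder = nn.TransformerEncoder(session_layer, num_layers=2)
        # Multi-task heads, one per label type (hypothetical label sets).
        self.heads = nn.ModuleDict({
            "goal_segmentation": nn.Linear(d_model, 2),
            "goal_evaluation": nn.Linear(d_model, 3),  # success/failure/not actionable
            "intent": nn.Linear(d_model, 100),
        })

    def forward(self, turn_tokens, turn_masks, turn_feats):
        # turn_tokens: (batch, n_turns, seq_len); encode each turn separately.
        b, t, s = turn_tokens.shape
        out = self.turn_encoder(input_ids=turn_tokens.view(b * t, s),
                                attention_mask=turn_masks.view(b * t, s))
        turn_vecs = out.pooler_output.view(b, t, -1)
        turn_vecs = turn_vecs + self.feat_mlp(turn_feats)  # fuse non-text features
        session_vecs = self.session_encoder(turn_vecs)     # cross-turn context
        return {name: head(session_vecs) for name, head in self.heads.items()}
```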
8. Left panel: results for goal segmentation and evaluation. We use single-pass human annotation as the baseline and compare two variants of our model: 1) fine-tuned on each task separately; 2) fine-tuned in a multi-task setting. Note that we use three-pass “gold” annotation as ground truth, produced by professionally trained human annotators (the DA profile) who have passed the annotation quality bar. Results are broken down into two subsets: 1) “single turn,” where each goal contains only one turn, and 2) “multi-turn,” where each goal contains a dialogue of multiple turns. We show three metrics: 1) segmentation accuracy, 2) combined (segmentation and evaluation) accuracy, and 3) combined F1 score (the weighted average of F1 scores for the three goal classes “success,” “failure,” and “not actionable”). “Data support” gives the size of the test dataset. Key takeaways: humans are more accurate on the goal segmentation task, but models outperform humans on the goal evaluation task. Combining goal segmentation and evaluation, the model is slightly better than humans in terms of accuracy but slightly worse on weighted F1 (due to class imbalance, humans have higher precision than the model, which contributes to their higher F1 but lower accuracy). Multi-task generally outperforms single-task. These numbers are on the full dataset; the model outperforms humans on the high-confidence subset (the 50% highest-confidence data). The right panel shows results for the intent classification and slot tagging tasks. Here we use a RoBERTa model without dialogue context as the baseline and compare our model (RoBERTa with dialogue context) against it; for slot tagging, the baseline is RoBERTa + CRF. Key takeaways: the context-based model outperforms the baseline on all intents and slot types, with significant improvements on specific ones, e.g., a 12% relative improvement for “CheckOrderStatusIntent” and a 7% relative improvement for “ShoppingListType.” This data has only single-pass annotation, so we do not compare against human performance.
9. This slide describes how the rewrite system (FLARE) fits into the spoken dialog system and coordinates with other parts to achieve defect reduction. We introduced two pipelines to deliver rewrites within the “Reformulation Engine” (FLARE): (1) precompute (offline modeling) and (2) online. The text covers the concept of each pipeline along with their pros and cons. Precompute — pros: (1) no latency constraints, so complex models can be utilized; (2) rich offline signals that are not available online, e.g., user follow-up turns, user rephrases, Alexa responses, and metrics such as CPDR, Music30sPrecision, and Video-Click-Through-Rate. Cons: (1) cannot utilize online contextual information, e.g., previous turns, screen text, etc.; (2) cannot correct long-tail queries. Online — pros: (1) better utilization of online contextual signals to improve rewrite quality, e.g., previous turns, screen text, ASR n-best, etc.; (2) captures rewrite opportunities for long-tail queries. Cons: (1) latency constraints; (2) absence of offline information, e.g., NLU hypotheses, CPDR metrics, Alexa’s response to the user query, etc. Guardrails and metrics (CPDR) are not covered in this slide. Existing publications from Alexa that can be cited (already in production) — precompute: “Self-aware feedback-based self-learning in large-scale conversational AI,” Ponnusamy et al., to appear in NAACL 2022; “Contextual Rephrase Detection for Reducing Friction in Dialogue Systems,” Wang et al., EMNLP 2021; “Feedback-based self-learning in large-scale conversational AI agents,” Ponnusamy et al., IAAI 2020. Runtime: “Personalized Search-based Query Rewrite System for Conversational AI,” Cho et al., NLP4ConvAI 2021; “Search based self-learning query rewrite system in conversational AI,” Fan et al., De-MaL 2021.
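A minimal sketch of how the two pipelines might be combined at serving time; the table lookup, model interface, and score threshold below are assumptions for illustration, not the production FLARE logic.

```python
# Illustrative sketch (not production FLARE code) of combining the
# precompute and online rewrite pipelines at serving time.
def get_rewrite(query, context, precomputed_table, online_model,
                min_online_score=0.8):
    # Precompute pipeline: offline-mined (query -> rewrite) pairs validated
    # against offline metrics (e.g., defect reduction), served as a lookup.
    hit = precomputed_table.get(query)
    if hit is not None:
        return hit
    # Online pipeline: can use runtime context (previous turns, screen text,
    # ASR n-best) the offline pipeline never sees; covers long-tail queries.
    candidate, score = online_model.rewrite(query, context)
    return candidate if score >= min_online_score else query
```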
10. This slide presents a neural approach to extracting rephrases from session data. The left side shows an example session where a rephrase can be extracted; the right side shows the model architecture: how it encodes the session information and outputs a span prediction indicating the rephrase. Example (left side): the first turn (“play tyler hero explicit”) and the second turn (“play tyler hero explicit by jack harlow”) are defects (marked in red). The first turn is a defect because the user is rephrasing; the second is a defect because Alexa responds “Sorry, I can’t find that.” The third turn (“play tyler hero by jack harlow”) is a non-defect (marked in green), and the first and third turns constitute a rephrase pair that can be used for rewriting. The fourth turn (“play record year by eric church”) is also a non-defect (green); we added this turn to illustrate that a session contains both rephrases and non-rephrases (e.g., the user switches topics), and the model needs to learn to differentiate them. Model architecture (right side): we consider a dataset D of M multi-turn dialogue sessions, D = {S_i}, i = 1..M, where every session S is an ordered set of N turns, S = {(Q_i, R_i)}, i = 1..N. Here i indicates the turn index, and each turn i consists of a pair (Q_i, R_i), where Q_i is the user’s query and R_i is the agent’s response to Q_i. Any two successive turns have a time gap of less than a minute. Given a dialogue session S and a source turn, i.e., an input pair of query and response (Q_i, R_i), the goal of the model is to predict whether Q_i is rephrased in any of the following turns (Q_j, R_j), i < j ≤ N. If so, the model should predict the span of Q_j, and return null otherwise. Input: the input is truncated to a maximum of 512 tokens and a maximum of 10 turns. Time-bins encode the time intervals between turns: for a source turn t_src = (Q_src, R_src) with timestamp ω_src, we calculate the time difference ∆_i = ω_i − ω_src, where ω_i is the timestamp of turn t_i, for all turns in the session; each ∆_i, i ∈ [1, n], is then mapped to its respective time-bin token. These time-bin tokens represent equal-sized intervals over ∆’s range of [−60, 60] seconds. The model is listed as BERT, but it can also be RoBERTa. Output: we cast rephrase detection as a span prediction problem, predicting the probability of start and end span locations at each token position using the embedding output of the final BERT layer. We introduce a start vector W_S and an end vector W_E (both trainable parameters). With T_i the final hidden vector for the i-th input token, the score of a candidate span from position i to position j is s_ij = W_S · T_i + W_E · T_j, where i < j. We use s_none = W_S · T_CLS + W_E · T_CLS as the score of the no-rephrase span, and set a threshold τ to decide whether to predict no-rephrase: if max_{j>i} s_ij > s_none + τ, we take the maximum-scoring span as the rephrase span, and null otherwise. Some heuristic post-filtering generates the final rewrite pairs launched in production.
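The decision rule above translates almost directly into code. A minimal sketch, assuming T holds the final-layer token embeddings with T[0] the [CLS] vector:

```python
import torch

def predict_rephrase_span(T, W_S, W_E, tau):
    """Return (start, end) token positions of the rephrase span, or None.
    T: (seq_len, d) final-layer embeddings; T[0] is assumed to be [CLS]."""
    start_scores = T @ W_S                 # s_i^start = W_S . T_i
    end_scores = T @ W_E                   # s_j^end   = W_E . T_j
    seq_len = T.shape[0]
    # s_ij = W_S . T_i + W_E . T_j for every candidate span with i < j.
    span_scores = start_scores[:, None] + end_scores[None, :]
    valid = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    span_scores = span_scores.masked_fill(~valid, float("-inf"))
    # No-rephrase score from the [CLS] position, plus margin tau.
    s_none = start_scores[0] + end_scores[0]
    if span_scores.max() > s_none + tau:
        flat = int(span_scores.argmax())
        return flat // seq_len, flat % seq_len
    return None
```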
11. A Markov-based query rewriting model that learns from recurring customer rephrase patterns.
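As a toy illustration only (the production approach is described in the Ponnusamy et al. papers cited above), such a model can be approximated by counting utterance-to-utterance transitions in session logs and keeping the follow-ups that tend to succeed:

```python
# Toy counting sketch, not the production algorithm: mine (defective query ->
# successful follow-up) pairs from session logs as rewrite candidates.
from collections import defaultdict, Counter

def build_rewrite_table(sessions, min_support=10):
    """sessions: list of ordered [(utterance, was_successful), ...] turns."""
    transitions = defaultdict(Counter)
    succeeded = set()
    for session in sessions:
        for (utt, _), (next_utt, next_ok) in zip(session, session[1:]):
            transitions[utt][next_utt] += 1   # observed rephrase transition
            if next_ok:
                succeeded.add(next_utt)
    rewrites = {}
    for utt, nexts in transitions.items():
        # Keep the most frequent follow-up that has led to success, provided
        # the pattern recurs often enough to trust (min_support is assumed).
        candidates = [(count, u) for u, count in nexts.items()
                      if count >= min_support and u in succeeded]
        if candidates:
            rewrites[utt] = max(candidates)[1]
    return rewrites
```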
12. This slide describes how the DFS system works (using retrieval/ranking models to generate rewrites instead of relying on the customer’s own rephrasing). The ‘routine’ work is reflected in the “customer routine phrase” injection, which is used to build the personalized index. Motivation: enable rewrites for long-tail traffic. We generate rewrites based on the customer’s habitual usage patterns with the agent. The global layer is added (1) to avoid over-indexing on personalized cases (e.g., a user with a strong affinity for ‘we don’t talk anymore’ might still be interested in exploring the recent new song ‘We Don’t Talk About Bruno’), and (2) to provide rewrites that have not appeared in the user’s interaction history but are popular queries among global users. The example on the top right shows a user query that is ambiguous and has ASR errors; the system finds two possible rewrites, one from the global layer and one from the personalized layer, and chooses the personalized one in the end. The retrieval model uses Dense Passage Retrieval (DPR) models that extract embeddings for the index and for the query respectively, and uses a similarity measure to compute the rewrite score. The ranking model combines fuzzy matching (e.g., through a single-encoder structure) with various metadata (e.g., impressions, CPDR, etc.) to make the reranking decision.
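A hedged sketch of the two-layer retrieval step; the index layout, dot-product scoring, and top-k merge are illustrative assumptions, and the production ranker additionally folds in the metadata mentioned above.

```python
import numpy as np

def retrieve_rewrites(query_vec, personal_index, global_index, top_k=5):
    """Score rewrite candidates from both layers by embedding similarity.
    Each index maps rewrite_text -> np.ndarray embedding (DPR-style)."""
    def score(index, layer):
        return [(float(np.dot(query_vec, emb)), text, layer)
                for text, emb in index.items()]
    # Candidates from both layers compete; a downstream ranker combines this
    # similarity with metadata (impressions, defect rates, ...) for the final pick.
    candidates = score(personal_index, "personal") + score(global_index, "global")
    return sorted(candidates, reverse=True)[:top_k]
```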
13. Key takeaway: customers can instantly personalize Alexa in the moment by teaching her (as opposed to waiting for Alexa’s ML models to become smarter over time).
14. Modern dialog assistants are complex systems that process user requests in multiple stages (see Figure 1). First, a voice trigger (or wake word) model determines whether the user is speaking to the assistant. Following the trigger component, an ASR module converts the user’s audio stream into text. This text is sent to the NLU component, which determines the user’s request. The ER system recognizes and resolves entities, and the system generates the best possible response (the Result stage) using several sub-systems specific to each dialog assistant. Finally, the response is rendered into human-like speech using a text-to-speech (TTS) system. To keep improving such assistants, it is important to identify defects at scale; manual analysis to identify defects and their root causes is infeasible at large traffic volumes. In our system we are not only detecting defects, but also identifying which component of the dialog assistant is responsible for each defect. It is important to note that an error in an upstream component (e.g., ASR) can propagate through the system to the final response, in which case multiple components are likely to fail. Thus, we focus on the first component that fails in a way that is irrecoverable, which we call the “failure point.” In this work we recognize five failure points as well as a “correct” class (meaning no component failed): FW (errors in the voice trigger), ASR errors (errors in transcribing user speech), NLU errors (IC+DC errors, for example incorrectly routing “play Harry Potter” to Video or Music), ER errors (entity recognition and resolution), and Result errors (an incorrect result, for example playing the wrong Harry Potter movie).
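The “first irrecoverable failure” notion fits in a few lines. A minimal sketch, assuming per-stage success judgments are already available for each turn (producing those judgments is the actual modeling problem):

```python
# Minimal sketch of the failure-point definition: scan the pipeline in order
# and report the first component whose error was irrecoverable downstream.
PIPELINE = ["FW", "ASR", "NLU", "ER", "Result"]

def failure_point(stage_ok):
    """stage_ok: dict mapping stage name -> bool, True if the stage worked
    (or its error was recovered from) for this turn."""
    for stage in PIPELINE:
        if not stage_ok[stage]:
            return stage   # later stages may also fail, but this is the root
    return "Correct"       # six-way label: five failure points + Correct
```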
15. To better illustrate the failure point problem, let’s examine this multi-turn dialog. In the first turn, the user is trying to open a garage door; however, the conversational assistant didn’t recognize the user’s speech correctly and thought the user wanted to open a “garbage door.” The entity resolution system didn’t recover from this error and also failed, and the dialog assistant ultimately failed to perform the correct action. In this turn, ASR is the failure point. We don’t mark entity resolution as the failure point even though it also failed, because ER might have worked had ASR been correct. In the second turn, the user repeats the request. ASR makes a small error by not recognizing the article “the” in the speech, but the dialog assistant takes the correct action, hence we mark this turn as correct: the ASR error didn’t lead to a system failure. The last turn highlights one of the limitations of our method. The user asks the dialog assistant to make a sandwich, which is an action dialog assistants cannot perform today. All systems worked correctly, yet the user is not satisfied. In our work we do not consider such turns defective, whereas sentiment-based approaches would mark them as unsatisfactory. In theory, the system’s expectation might have been to invoke a pleasantry/joke-style response, in which case this would indeed have been a defect even from the system’s perspective.
16. Our best failure point isolation model achieves close-to-human performance on average across the different categories (>92% vs. human). This model uses extended dialog context, features derived from the assistant’s logs (e.g., ASR confidence), and traces of the decision-making components (e.g., NLU intent). The weakest performance is on the FalseWake (FW) class, at 70%. We outperform humans on Result and Correct class detection; ASR, ER, and NLU are in the 90–95% range.