The presentation covers the following: 1) a long-range view of fundamental trends and shifts in computing and user experience, 2) what IoT and context mean for ambient conversational AI, 3) how conversational AI works, and 4) self-learning: learning based on implicit and explicit customer feedback.
2. Outline
• Long range view of fundamental trends and shifts in computing and User
Experience
• What does IoT and context mean for ambient conversational AI?
• How does Conversational AI work?
• Self-Learning: Implicit and explicit customer feedback based learning
• Q & A
5. Human Interaction with the Digital World
Human senses: sight, hearing, touch, smell, taste
Computer 'senses':
• No sight and no hearing (until recently)
• Form of human input: typing & tactile
Gap
• Computers (and backend services) are not yet designed to operate on voice input
Problem
• You need to physically touch computers
• It tethers you to a screen and 'immobilizes' you
Friction!
• The perceptions of our senses are created and stored in different parts of the brain
6. • Current computing cycle: Mobile internet [Meeker, Morgan Stanley, 2014]
• No room for growth in connecting people to the internet via smartphone (after 2020)
• What is next?
• IoT and intelligent connected systems & services: Ambient Intelligence with Conversational AI as the UX layer
[Figure: "The New Computing Cycle" (log scale, 1960-2020) showing 10X computing cycles: Mainframe (1M+ units), Minicomputer (10M+ units), PC (100M+ units), Desktop Internet (1B+ units/users), Mobile Internet (10B+ units: mobile phones, tablets, eReaders, MP3 players, telematics, ...), IoT (100B+ units: any device). Drivers: increased integration, smaller form factor, increased power & storage, lower costs, improved UI.]
[Figure: Global computing device shipments in millions (smartphone, PC+laptop, tablet), 2010-2021.]
[Figure: IoT worldwide install base in billions of connected devices, 2015-2025, growing from 15.41B (2015) to a projected 75.44B (2025).]
7. Internet of Things (IoT): Connected Smart Devices with Sensors
• Sensors: smaller, lower power, and cheaper
• 1 trillion sensors by 2022
• Digital nervous system: location data (GPS), eyes and ears via camera and microphone, sensors (motion, temperature, light, pressure, etc.)
[Diagram: sensor data flows through data aggregation, linking, reasoning (AI), and decision making (AI) in real time. Industry applications: machinery, smart cities, transportation, healthcare, factories, automation. Consumer applications: phones, wearables, TVs, appliances, home automation, home monitoring. Together these form collective IoT intelligence.]
Smart Home
• Over 90% of our lives is spent inside buildings
• An intelligent & responsive physical environment
• IoT integrates the physical world with the digital world
• The world around us reasons and talks back to us in real time
8. Examples of Ambient Intelligence: Alexa Hunches and Routines
[Screenshot: "It looks like you left the lights on, would you like me to turn them off?" with controls for lights, locks, and appliances]
9. Why does IoT matter for Conversational AI?
• "Alexa, play hunger games"
• What is the user's intent? play_music? play_video? play_audiobook?
• "Alexa, what should I do for dinner?"
• What is the user's intent? book_restaurant? order_food? find_recipe?
• "Alexa, order me two towels"
• What is the user's intent? shopping? room service?
• "Alexa, what is the temperature?"
• What is the user's intent? weather forecast? temperature inside the home? temperature of the oven?
• How do we get ground truth for a large combination of [person x device x context] data? How do we scale learning?
• IoT is increasing the complexity (and opportunity) of the world
• Requires real-time communication with a reasoning environment
• Creates new forms of 'context'
• Context:
• The set of circumstances/facts that surround a particular event, situation, or entity, which AI systems use to sense, reason, and adapt better to the physical and digital world
• Identity & state, device types, physical/digital activity on devices/systems, time, device & user location, state and changes in the environment as measured by sensors, ...
• Why does context matter for conversational AI?
• Contextual ambiguity: users perceive no ambiguity when they issue a command to an intelligent assistant; the ambiguity exists only on the system's side
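The ambiguity examples above can be made concrete with a toy disambiguation rule: the same utterance maps to different intents depending on device context. A minimal sketch (all intent names, scores, and context fields are illustrative, not Alexa's actual schema):

```python
# Toy contextual intent disambiguation: rerank NLU n-best hypotheses
# using simple device-context priors. Purely illustrative.

def disambiguate(nbest, context):
    """nbest: list of (intent, score); context: dict of device facts."""
    boosts = {
        # A screen device makes video more likely; a speaker, music.
        ("play_video", "has_screen"): 0.3,
        ("play_music", "is_speaker"): 0.3,
        # An oven that is on makes "temperature" mean the oven.
        ("get_oven_temp", "oven_on"): 0.5,
    }
    rescored = []
    for intent, score in nbest:
        for (i, feature), boost in boosts.items():
            if i == intent and context.get(feature):
                score += boost
        rescored.append((intent, score))
    return max(rescored, key=lambda p: p[1])[0]

# "Alexa, play hunger games" on a screenless speaker vs. a TV:
nbest = [("play_music", 0.5), ("play_video", 0.45), ("play_audiobook", 0.4)]
print(disambiguate(nbest, {"is_speaker": True}))  # play_music
print(disambiguate(nbest, {"has_screen": True}))  # play_video
```

The point of the sketch is only that context enters the scoring function; a production system would learn these priors rather than hand-code them.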
10. How Does Conversational AI Work?
Example: "Alexa, what is the weather?"
• The user's utterance is sent to the Orchestrator via a recognize event
• ASR returns a recognition result for the utterance
• NLU converts the recognition result into N-best interpretations
• Routing selects (intent, skill) and sends the intent to the matching skill (e.g. Weather)
• The skill returns a "speak" directive with text/SSML
• TTS renders the text/SSML as Alexa's voice, which is played back to the user
The Orchestrator provides:
• Orchestration of ASR, NLU, routing, TTS, and application services
• Intent routing to applications
• Session management
• Dialog management for multi-turn interactions
• Abstraction of device features to applications
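The request flow on this slide can be sketched as a pipeline of stubbed stages; the component names follow the slide, while every implementation below is a stand-in:

```python
# Minimal orchestrator sketch for the ASR -> NLU -> routing -> skill -> TTS
# flow. All stage implementations are stubs for illustration only.

def asr(audio):
    # Stub: a real ASR system returns n-best transcriptions with scores.
    return [("what is the weather", 0.92), ("what is the whether", 0.05)]

def nlu(transcriptions):
    # Stub: map the 1-best transcription to n-best (intent, skill) pairs.
    text, _ = transcriptions[0]
    if "weather" in text:
        return [("get_weather", "WeatherSkill")]
    return [("fallback", "FallbackSkill")]

def route(interpretations):
    # Pick the top (intent, skill) pair.
    return interpretations[0]

def invoke_skill(intent, skill):
    # Stub skill returns a "speak" directive with text/SSML.
    if skill == "WeatherSkill":
        return {"directive": "speak", "ssml": "<speak>It is sunny today.</speak>"}
    return {"directive": "speak", "ssml": "<speak>Sorry, I can't help.</speak>"}

def tts(ssml):
    # Stub: a real TTS engine renders audio; here we just strip the tags.
    return ssml.replace("<speak>", "").replace("</speak>", "")

def orchestrate(audio):
    intent, skill = route(nlu(asr(audio)))
    directive = invoke_skill(intent, skill)
    return tts(directive["ssml"])

print(orchestrate(b"..."))  # It is sunny today.
```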
11. Machine Learning Types
(in terms of types of supervision/feedback)
• Supervised learning is the task of learning a prediction function that maps an input to an output based on example input-output pairs: y = f(x) (e.g. DNN, logistic regression, SVM). Typically achieves very accurate predictions with sufficient data!
Training: given a training set of labeled examples {(x1, y1), ..., (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set
Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
• Unsupervised learning looks for patterns in input data that has no pre-existing labels. It allows modeling of probability densities over inputs to deduce structure (e.g. K-means, PCA, LDA).
• Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. Variants include self-training, co-training, generative methods, graph-based methods, etc.
• Self-supervised learning predicts one part of the input from another part, without any human supervision (e.g. BERT, RoBERTa, GPT-3).
• Reinforcement learning (RL) is concerned with how agents should take actions in an environment in order to maximize a notion of cumulative reward.
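The supervised setup above (estimate f by minimizing training error, then apply it to unseen x) can be illustrated with a tiny logistic regression trained by gradient descent; this is a generic textbook sketch, not any production model:

```python
import math

# Tiny 1-D logistic regression: learn f(x) = sigmoid(w*x + b)
# from labeled pairs {(x_i, y_i)} by stochastic gradient descent.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, lr=0.5, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:  # minimize log-loss on the training set
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Training: labeled examples where the label is 1 when x > 0.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = train(data)
# Testing: apply f to never-before-seen examples.
print(predict(w, b, -3.0), predict(w, b, 3.0))  # 0 1
```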
12. Self Learning for Conversational AI
What do we mean by Self-learning?
• Framework that enables learning autonomously from user-system interactions (e.g.
barge-in, reformulations), system signals, and predictive models
• It can be considered as a layer that combines supervised learning, semi-supervised learning and RL
• Zero component-specific manual annotation to train and improve the machine learning models
• Leverages customers' implicit and explicit feedback and system signals to train and improve ML models in the Conversational AI stack, both offline and at runtime
Why self-learning?
• Speed: rapid scenario building and deployment
• Cost: minimizing manual annotation cost
• Ambiguity: customers (vs. annotators) know best what they mean and want
• Privacy: does not require human access to customer data
13. Customer Feedback Based Automated Ground Truth Generation
• A multi-year initiative to shift Alexa ML model development from a manual-annotation based approach to a primarily self-learning based approach by leveraging various forms of feedback
• Explicit feedback (e.g. Alexa: "Did I answer your question?" User: "Yes")
• Implicit feedback (e.g. the user barges in on a turn or rephrases their request)
• Unsolicited feedback (e.g. the user says "Alexa, thank you!" or "Alexa, I am not Derek, I am Dan")
• Mission: automatically generate labels for 100% of Alexa utterances and for all annotation workflows in near-real time by leveraging customer interactions and their feedback
• Goals: provide automatically accumulated signals and data to
• Protect user privacy (by removing human reviewers from the loop)
• Improve model accuracy (by providing more personalized labels)
• Reduce annotation cost
14. Customer Feedback Based Ground Truth Generation Overview
[Diagram: two major modules inside the Alexa runtime system.]
Exploration Module
• Alexa models (ASR, NLU, etc.) produce production model outputs (ASR 1-best, NLU 1-best, etc.)
• A confidence prediction component feeds an exploration decider: high-confidence turns pass through; low-confidence turns trigger exploration
• Alternative hypotheses generation proposes candidate labels for low-confidence turns
• Implicit exploration: directly replace the 1-best with an alternative hypothesis
• Explicit exploration: present multiple hypotheses to the customer (e.g. voice confirmation, on-screen choices)
Feedback Collection and Label Generation Module
• Feedback collection & understanding gathers implicit feedback (e.g. barge-in, stop, rephrase), explicit feedback (e.g. "Did I answer your question?" / "Yes"), and unsolicited feedback (e.g. "Alexa, thank you")
• Multi-task label generation models (NLU: DC, IC, NER; ASR: error prediction, etc.; dialog success estimation) convert the feedback into feedback-based annotation data
• The annotation data, together with other data (unlabeled data, existing annotations, etc.), is used to train new models, which replace the corresponding Alexa modules
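The confidence-gated exploration step described on this slide can be sketched as a simple decision rule; the thresholds and the screen heuristic are illustrative assumptions, not the production logic:

```python
# Toy exploration decider: high-confidence turns pass through unchanged;
# low-confidence turns trigger implicit or explicit exploration.

def decide_exploration(one_best, alternatives, confidence, has_screen):
    if confidence >= 0.9:
        # High confidence: keep the production 1-best, no exploration.
        return {"strategy": "none", "hypothesis": one_best}
    if has_screen and alternatives:
        # Explicit exploration: show choices for the customer to pick from.
        return {"strategy": "explicit", "choices": [one_best] + alternatives}
    if alternatives:
        # Implicit exploration: silently swap in the top alternative and
        # observe feedback (barge-in, rephrase, engagement).
        return {"strategy": "implicit", "hypothesis": alternatives[0]}
    return {"strategy": "none", "hypothesis": one_best}

print(decide_exploration("play buddha", ["play boo'd up"], 0.55, False))
```

Feedback collected on the explored hypothesis (e.g. the user letting the music play vs. barging in) then becomes the synthetic label.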
15. Model Architecture for Customer Feedback Based Ground Truth Generation
Multi-task Label Generation Model
Features
• Dialogue context (user utterance, Alexa response, previous turns, next turns, etc.)
• System metadata (domain, intent, dialog status, confidence scores, etc.)
Model
• Turn encoder + dialogue-level transformer
• The turn-level textual encoder is RoBERTa
Multi-task learning heads
• Explicit user feedback (e.g. the user says "thank you")
• Inferred user feedback (e.g. the user plays music for 30 seconds after a voice command)
• Manual annotation
Self-supervised pretraining
• Synthetic contrastive data (i.e. randomly swap in answers from a different dialog as defect samples)
[Diagram: Model details. Data: turns 1..n, each with a request, response, and speaker ID. At the turn level, a RoBERTa textual encoding of the request/response is concatenated with an MLP over categorical features (domain, intent, dialogue status, ...) and real-valued/binary features. The turn encodings feed a dialogue-level transformer over the session, focused on a target turn. Tasks: E2E defect estimation/annotation, transcription (ASR recognition), NLU annotation (intent classification, named entity recognition), and dialog goal annotation (goal evaluation, goal segmentation).]
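Structurally, the label generation model can be sketched as below; the encoders are stand-in stubs (a real implementation would use RoBERTa and a dialogue-level transformer), and the head names mirror the slide:

```python
# Structural sketch of the multi-task label generation model:
# turn-level encoding -> dialogue-level encoding -> per-task heads.
# All encoders and heads are toy stubs for illustration.

def encode_turn(turn):
    # Stub turn encoder: concatenate a fake "textual" encoding with
    # categorical and numeric features (domain, ASR confidence, ...).
    textual = [float(len(turn["request"])), float(len(turn["response"]))]
    categorical = [1.0 if turn["domain"] == "Music" else 0.0]
    numeric = [turn["asr_confidence"]]
    return textual + categorical + numeric

def encode_dialogue(turns):
    # Stub dialogue-level encoder: mean-pool the turn encodings
    # (stands in for a dialogue-level transformer).
    encs = [encode_turn(t) for t in turns]
    n = len(encs)
    return [sum(col) / n for col in zip(*encs)]

def multi_task_heads(dialogue_encoding):
    # One stub head per task family from the slide.
    return {
        "e2e_defect": False,
        "intent_classification": "play_music",
        "goal_evaluation": "success",
    }

turns = [
    {"request": "play buddha", "response": "Playing Buddha Spa",
     "domain": "Music", "asr_confidence": 0.6},
    {"request": "alexa stop", "response": "OK",
     "domain": "Music", "asr_confidence": 0.95},
]
labels = multi_task_heads(encode_dialogue(turns))
print(labels["goal_evaluation"])  # success
```

The design point is that one shared dialogue encoding feeds several annotation heads, so feedback signals learned for one task transfer to the others.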
16. Automated Ground Truth Generation Results
Goal Segmentation / Evaluation
Table 1. Goal segmentation and evaluation tasks. We compare model prediction accuracy against human (single-pass) annotation accuracy (note that we use 3-pass Gold annotation as ground truth). "Single turn" means dialogues with only 1 turn; "multi turn" means dialogues with multiple turns. "Single-task" denotes models separately fine-tuned on one task at a time, whereas "multi-task" denotes models fine-tuned on multiple tasks together. "Combined accuracy" and "combined weighted F1 score" combine the goal segmentation and evaluation tasks.
Intent Classification / Named Entity Recognition
Table 2. Intent classification. Comparing our model using dialogue context against a RoBERTa-based baseline model on intent classification for the Shopping domain (bolded rows show the intents with the largest improvements).
Table 3. Slot tagging. Comparing our model using dialogue context against a RoBERTa + CRF baseline model on slot tagging for the Shopping domain (bolded rows show the slot types with the largest improvements).
Publications:
• Gupta, S. et al. "RoBERTaIQ: An efficient framework for automatic interaction quality estimation of dialogue systems". KDD 2021
• Wang, Z. et al. "Contextual rephrase detection for reducing friction in dialogue systems". EMNLP 2021
• Park, D. et al. "Large-scale hybrid approach for predicting user satisfaction with conversational agents". NeurIPS 2020
17. Defect Correction with the Self-Learning Framework
• Enable self-learning in Alexa to reduce customer-perceived defects and enhance its understanding in real time, with context, without any human annotator in the loop (covering both prevention and correction)
1. Detect defects
• Customer Perceived Defect (CPD) metric
• Example: User: "Alexa, play Buddha" / Alexa: "Buddha Spa from Ama..." / User: "Alexa, stop" (a customer-perceived defect!)
• Detection: daily to real time
2. Learn corrections
• From rephrases, follow-ups, or dialogs
• Learning & deployment: daily
3. Correct defects
• At runtime, generate alternate utterances (aka query rewriting)
• Example: "Alexa, play Buddha" is rewritten to "play Boo'd Up"; Alexa: "Playing Boo'd Up by ..." (success!)
4. Automatic guardrails
• Several guardrails to prevent trustbusters/regressions: automatic blocklisting, reducing false wakes, sensitive utterances
• Blocklisting: 2 hrs to near-real time
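Steps 3 and 4 can be sketched as a rewrite lookup guarded by a blocklist; the rewrite table and blocklist contents are invented for illustration:

```python
# Toy runtime defect correction: apply a learned query rewrite unless the
# pair has been blocklisted by a guardrail. Contents are illustrative.

LEARNED_REWRITES = {"play buddha": "play boo'd up"}
BLOCKLIST = set()  # (request, rewrite) pairs that caused regressions

def correct(utterance):
    rewrite = LEARNED_REWRITES.get(utterance)
    if rewrite and (utterance, rewrite) not in BLOCKLIST:
        return rewrite
    return utterance

print(correct("play buddha"))  # play boo'd up
BLOCKLIST.add(("play buddha", "play boo'd up"))  # guardrail kicks in
print(correct("play buddha"))  # play buddha
```

Automatic blocklisting lets a bad rewrite be rolled back far faster (hours) than retraining the model that produced it (daily).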
18. Self-Learning Based Defect Reduction in Large-Scale Conversational AI Agents
Two general ways to provide rewrites for the reformulation engine:
• Precomputed rewriting: this pipeline produces request-rewrite key-value pairs offline and loads the pairs at runtime. It takes advantage of the availability of offline information (e.g. the user's own rephrases, offline metrics) and a larger latency budget.
• Online rewriting: this pipeline leverages rewrite models (e.g. retrieval/ranking models or generation models) and online contextual information (e.g. previous dialog turns, location, time) to produce rewrites online. It enables rewriting for long-tail defect queries.
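The two pipelines can be combined as a cascade: try the precomputed key-value table first, then fall back to an online model for long-tail queries. A minimal sketch (table contents and the stub model's behavior are invented):

```python
# Cascade of the two rewrite pipelines: precomputed key-value lookup
# first, then an online model fallback for long-tail queries.

PRECOMPUTED = {"full volume": "volume ten"}  # built offline, loaded at runtime

def online_rewrite_model(query, context):
    # Stub for a retrieval/ranking or generation model that uses online
    # context (previous turns, etc.). Trivial demo behavior only.
    if context.get("previous_turn") == "sorry, i can't find that":
        return query + " song"
    return None

def rewrite(query, context):
    if query in PRECOMPUTED:  # head traffic with offline evidence
        return PRECOMPUTED[query]
    # Long tail: consult the online model; keep the query if no rewrite.
    return online_rewrite_model(query, context) or query

print(rewrite("full volume", {}))  # volume ten
print(rewrite("play drivers license",
              {"previous_turn": "sorry, i can't find that"}))
```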
19. Precomputed Rewriting: Contextual Rephrase Detection in Conversational Agents
Example session (alternating query/response turns):
User: play tyler hero explicit
Agent: Here's hypothetical hero, by Tyler Rothrock
User: play tyler hero explicit by jack harlow
Agent: Sorry, I can't find that
...
Session input (flattened):
[User] play tyler hero explicit [Agent] Here's hypothetical hero, by Tyler Rothrock [User] play tyler hero explicit by jack harlow [Agent] Sorry, I can't find that ...
Model output (rephrase candidates with scores):
Play tyler hero by jack harlow (0.9)
Play tyler hero (0.05)
"Contextual Rephrase Detection for Reducing Friction in Dialogue Systems", Wang et al., EMNLP 2021
20. Precomputed Rewriting: Feedback-Based Self-Learning in Conversational AI Agents
• Users provide feedback to Alexa in the form of rephrases.
• Recurring user rephrases like (a), (b), (c) are encoded in absorbing Markov chains.
• By resolving the Markov model as in (d), we surface the rewrite that is most likely to result in success, as in (e).
• "Feedback-based self-learning in large-scale conversational AI agents", Ponnusamy et al., AAAI 2020
• "Self-aware feedback-based self-learning in large-scale conversational AI", Ponnusamy et al., to appear in NAACL 2022
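The absorbing-Markov-chain idea can be illustrated with a tiny chain: utterances are transient states, success and failure are absorbing states, and we surface the candidate rewrite with the highest probability of absorbing into success. A toy sketch (all transition counts are invented):

```python
# Toy absorbing Markov chain over rephrase behavior. States are
# utterances plus absorbing SUCCESS/FAIL states; counts are invented.

COUNTS = {
    "play buddha":     {"play boo'd up": 60, "play buddha spa": 20, "FAIL": 20},
    "play boo'd up":   {"SUCCESS": 90, "FAIL": 10},
    "play buddha spa": {"SUCCESS": 30, "FAIL": 70},
}

def p_success(state, depth=10):
    """Probability of eventually absorbing into SUCCESS from `state`."""
    if state == "SUCCESS":
        return 1.0
    if state == "FAIL" or depth == 0:
        return 0.0
    outgoing = COUNTS.get(state, {})
    total = sum(outgoing.values())
    return sum(n / total * p_success(nxt, depth - 1)
               for nxt, n in outgoing.items())

# Surface the rewrite for "play buddha" that maximizes success probability.
candidates = ["play boo'd up", "play buddha spa"]
best = max(candidates, key=p_success)
print(best, round(p_success(best), 2))  # play boo'd up 0.9
```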
21. Online Rewriting: Search-Based Self-Learning Query Rewriting System
[Diagram: offline, a global indexer builds a global index from customer interactions with AI devices, and a personalized indexer builds a personalized index from signals such as customer purchase history, customer contact names, and customer routine phrases. Online, the user query goes to global and personalized retrieval/ranking models over the respective indexes, and rewrite merging logic produces the final rewrite.]
Example:
User query: "how's the weather in Wikeson"
Global top-1 rewrite: "how's the weather in Wilkeson Washington"
Personal top-1 rewrite: "how's the weather in Wilkerson California"
Final rewrite: "how's the weather in Wilkerson California"
• "Personalized Search-based Query Rewrite System for Conversational AI", Cho et al., NLP4ConvAI 2021
• "Search based self-learning query rewrite system in conversational AI", Fan et al., De-MaL 2021
22. Selected Experimental Results for Query Rewriting
• Precomputed rewriting: deployed the model in [1] across 11 locales spanning 6 languages. Online A/B testing demonstrated a significant reduction (p-value ≤ 0.0001) in defects experienced, with relative defect reductions ranging from 22.73% to 31.22%.
• Online rewriting: deployed the systems in [2] in en-US. Online A/B testing demonstrated a significant (p-value < 0.001) relative reduction in defect rate (13%). Launching the personalized system on top of the global one led to an additional significant defect rate reduction of 4%.
• Win:loss ratio: 8.5 : 1; learning latency: 24 hrs
[1] "Self-aware feedback-based self-learning in large-scale conversational AI", Ponnusamy et al., to appear in NAACL 2022
[2] "Search based self-learning query rewrite system in conversational AI", Fan et al., De-MaL 2021
Rewrite examples (type: request → rewrite):
Global rewrite: "Full volume" → "Volume ten"
Global rewrite: "Don't ever play that song" → "Thumbs down this song"
Global rewrite: "Play a. b. c." → "Play the alphabet song"
Personalized rewrite: "Open angry sleepy time playlist" → "Open avery sleepy time playlist"
Personalized rewrite: "Pair with johnson's iphone" → "Pair with john's iphone"
Personalized rewrite: "Play drivers license" → "Play the song drivers license by olivia rodrigo"
23. Teachable AI
• Customers can interactively teach Alexa and instantly adapt her to their personal preferences, such as "I'm a Warriors fan," "I like Italian restaurants," or "I prefer Big Sky for my weather," by
• initiating a conversation with Alexa at any time
• Alexa proactively sensing a teachable moment (e.g. repeat usage or an unsatisfactory response) and clarifying a preference
• initiating a guided Q&A with Alexa with a simple cue like "Alexa, learn my preferences," and sharing their favorites across topics such as sports, food, and weather
• Personalized experiences: the next time customers query Alexa on related topics, like their sports update, restaurants nearby, or the weather, Alexa will bear their interests in mind to curate personalized selections.
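A toy version of the mechanism: store a taught preference, then apply it when answering related queries. The slot names and responses are invented for illustration, not Alexa's actual preference schema:

```python
# Toy teachable-preference store: learn a preference from a teaching
# interaction, then personalize later answers. Slot names are invented.

PREFERENCES = {}

def teach(user_id, slot, value):
    PREFERENCES.setdefault(user_id, {})[slot] = value

def answer_sports_update(user_id):
    team = PREFERENCES.get(user_id, {}).get("favorite_team")
    if team:
        return f"Here's the latest on the {team}."
    return "Which team would you like news about?"

teach("dan", "favorite_team", "Warriors")  # "I'm a Warriors fan"
print(answer_sports_update("dan"))    # Here's the latest on the Warriors.
print(answer_sports_update("derek"))  # Which team would you like news about?
```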
25. Failure Point Isolation: Predict Which Component Failed
[Figure: component-level architecture of a typical conversational assistant. Color codes correspond to Turn 1 on the next slide (a fatal ASR error and a non-fatal ERR error).]
Predicted classes:
• False Wakes (FW)
• ASR errors
• NLU errors
• Entity Resolution errors (ERR)
• Result errors
• Correct (no error)
27. Failure Point Isolation (FPI) Model vs. Human Performance*
• Human F1-score is calculated for a single human against a panel of expert annotators
• The FPI model outperforms humans for Result and Correct cases
• False Wake performance is the weakest, at 71.2%
• Detection of ASR, ERR, and NLU errors is at 90-95% of human performance
* Khaziev et al., "FPI: Failure Point Isolation in Large-scale Conversational Assistants", NAACL-HLT 2022 Industry Track
Acknowledgments: Gabriella, the organizing committee, and SIGIR.
Thank you for joining the talk. The summary of the talk is captured in the title itself. We want to enable natural contextual interactions for ambient computing, to do that we need scalable self-learning methods that can handle ambiguity and context for accurately understanding the user’s request and provide the best possible answer. The backdrop for the talk is conversational systems like Alexa. Let us get right to it.
We celebrated Alexa's 7th birthday last Saturday, Nov 6th. Here you see some of the Alexa Echo family of devices launched over the years, starting with the first generation. Note that none of these devices has a specific name; Alexa is the agent, or AI, behind them, and it lives in the cloud.
Alexa AI and devices are not built just for your homes; at Amazon we have partnered with other companies to bring Alexa to senior living communities, hospitality businesses such as Marriott hotels, and hospitals. Here you see Boston Children's Hospital.
Timothy Driscoll, Director of Technology Strategy at Boston Children’s Hospital, says, “Boston Children's Hospital is using Amazon Echo devices to provide an array of features to patients including entertainment in the form of music and games, hospital and unit-specific frequently-asked questions, and control of the in-room televisions. Our patients will soon be able to express their needs for things like pain management support - "Alexa, tell my nurse I'm in pain," or general comfort - "Alexa, tell my nurse I need a pillow."
Alexa experiences are augmented with endpoint-specific experiences, such as calling nurses or answering hospital-related frequently asked questions.
Two weeks ago we also launched Alexa in Disney resorts. Alexa is also integrated into the infotainment systems of most car manufacturers.
Not a single week goes by without us announcing something about Alexa or an Alexa integration with partners. Just last week we also announced Alexa on Range Rovers.
What is really happening?
Let us talk about the basics first. How do we sense the world around us? We (as humans) have our senses and we primarily rely on our vision and hearing. These are our primary means of getting input from the outside world.
Let us look at computers and how they sense the world. How do they get input from the outside world, and particularly from humans?
Let's look at a slightly longer horizon and go back to 1946. This is ENIAC, the mother of modern-day computers. This picture was taken at the University of Pennsylvania.
The Apple iPhone is different: many elements of its multi-touch user interface require you to touch multiple points on the screen simultaneously (capacitive screens vs. resistive screens).
We have been manually pushing our wills into our tools, literally by using our hands.
ENIAC programmers Frances Bilas (later Frances Spence) and Betty Jean Jennings (later Jean Bartik) stand at its main control panels. With ENIAC's 40 panels still under construction, and its 18,000 vacuum tube technology uncertain, the engineers had no time for programming manuals or classes. Bartik and the other women taught themselves ENIAC's operation from its logical and electrical block diagrams, and then figured out how to program it. They created their own flow charts, programming sheets, wrote the program and placed it on the ENIAC using a challenging physical interface, which had hundreds of wires and 3,000 switches.
ENIAC was impressive: 80 feet long and 8 feet high, weighing 30 tons, with 18,000 vacuum tubes, 70,000 resistors, 10,000 capacitors and 1,500 relays. It had a memory capacity of 20 words and was programmed by setting 6,000 dials and switches, a task that took a crew of workers many hours.
The Apple-1 computer, built by hand in 1976 by Steve Wozniak in Apple co-founder Steve Jobs' garage or his sister's bedroom, fetched nearly twice its pre-sale high estimate, Bonhams said.
Nowadays a typical cell phone has over 40 sensors: accelerometer, gyroscope, thermometer, Wi-Fi, Bluetooth, RF signal gathering, etc.
Low power and cheap sensors are integrated into devices, physical systems and buildings in different industry verticals, from transportation to health care to factories.
Likewise on consumer side, more and more sensors are integrated into many devices and appliances we use everyday.
It is estimated that there will be about 1 trillion sensors in the world; that is about 120 sensors per person.
These sensors measure changes in the environment or in user’s state from temperature, to pressure, touch, camera, light etc. to user’s heart rate to physical movements.
More specifically, with GPS data, camera and microphone, and internet connectivity, we are building a digital nervous system for the environment. This paves the way for an intelligent and responsive environment.
The fundamental challenge of Verity is that we need to convert customer feedback, which is often binary or low-dimensional ('yes/no', 'defect/non-defect'), into high-dimensional predicted labels such as transcriptions or NER annotations. The key strategy is the "hypotheses" → "exploration" → "confirmation" process: the Verity system generates or selects potential alternative hypotheses for the labels we want customer feedback on, injects those hypotheses into the Alexa runtime either through implicit exploration (e.g. query rewriting or n-best ranking) or through explicit exploration (e.g. asking the customer "you mean Frozen the movie, right?"), then collects customer feedback on these hypotheses to generate high-confidence synthetic labels.
Verity is composed of two major components: 1) the Exploration Module and 2) the Feedback Collection and Label Generation Module.
The responsibilities of the exploration module include: 1) Confidence prediction: predict the confidence of the current turn in order to decide whether or not to trigger exploration. 2) Hypothesis generation: generate alternative hypotheses for the labels that Verity wants to collect customer feedback on. 3) Exploration: inject the hypotheses into the Alexa runtime, either through a) implicit exploration, i.e. directly replacing the existing 1-best hypothesis with an alternative hypothesis, or b) explicit exploration, i.e. presenting multiple hypotheses for customers to choose from through mechanisms such as voice confirmation and on-screen dynamic feedback (e.g. presenting a list of candidate transcriptions and asking "which of the following transcriptions matches what you said?").
The responsibility of the feedback collection and label generation module is to collect and interpret customer feedback, and to convert that information into high-quality synthetic labels that can be used for training the respective models. The feedback collection & understanding component collects implicit feedback (barge-in, paraphrases), explicit feedback (e.g. "did I answer your question?", "yes"), and unsolicited feedback (e.g. positive: "Alexa, thank you"; negative: "Alexa, that is not what I said"). The multi-task label generation model is a multi-task deep-learning model that labels data leveraging context derived from customer feedback.
Verity label generation model architecture:
• Turn-level encoder + dialogue/session-level transformer
• Turn-level textual encoder: RoBERTa
• MLP for categorical/numerical features
Input features include:
• Textual features (user utterance, Alexa response, previous turns, next turns)
• Categorical features (domain, intent, dialog status)
• Numerical features (number of tokens, ASR/NLU confidence scores)
• Raw audio data
Training:
• Multi-task learning heads
• Using explicit/implicit user feedback
• Manual annotation
• Self-supervision: synthetic contrastive learning
Left panel: results for goal segmentation and evaluation.
We use single-pass human annotation as the baseline and compare two variants of our models: 1) fine-tuned on each task separately; 2) fine-tuned in a multi-task setting. Note that we use three-pass "Gold" annotation as ground truth.
DA profile: professionally trained human annotators who have passed the annotation quality bar.
Results are broken down into two subsets: 1) "single turn", where each goal contains only one turn; 2) "multi-turn", where each goal contains a dialogue of multiple turns.
We show three metrics: 1) segmentation accuracy; 2) combined (segmentation and evaluation) accuracy; 3) combined F1 score (weighted average of F1 scores for the three goal classes "success", "failure", "not actionable"). "Data support" gives the size of the test dataset.
Key takeaways:
• Humans are more accurate on the goal segmentation task, but models outperform humans on the goal evaluation task.
• Combining goal segmentation and evaluation, the model is slightly better than humans in terms of accuracy but slightly worse on weighted F1 score (due to class imbalance, humans have higher precision than the model, which contributes to humans' higher F1 but lower accuracy).
• Multi-task generally outperforms single-task.
• This is evaluated on the full dataset; the model outperforms humans on the high-confidence subset (the 50% of data with highest confidence).
The right panel shows results for intent classification and slot tagging.
Here we use a RoBERTa model without dialogue context as the baseline and compare our model (RoBERTa with dialogue context) against it. For the slot tagging task we use RoBERTa + CRF as the baseline.
Key takeaways:
• The context-based model outperforms the baseline on all intents and slot types.
• The context-based model brings significant improvements on specific intents and slot types, e.g. a 12% relative improvement for "CheckOrderStatusIntent" and a 7% relative improvement for "ShoppingListType".
• This data is only single-pass annotated, so it is not compared to human performance.
This slide describes how the rewrite system (FLARE) fits into the spoken dialog system and coordinates with other parts to achieve defect reduction.
We further introduce two pipelines to deliver rewrites within the "Reformulation Engine" (FLARE): (1) precompute (offline modeling) and (2) online. The text covers the concept of each pipeline and their pros and cons:
Precompute:
• Pros: (1) no latency constraints, so complex models can be used; (2) rich offline signals that are not available online, e.g. user follow-up turns, user rephrases, Alexa responses, and metrics such as CPDR, Music30sPrecision, and video click-through rate.
• Cons: (1) cannot use online contextual information, e.g. previous turns, on-screen text, etc.; (2) cannot correct long-tail queries.
Online:
• Pros: (1) better use of online contextual signals to improve rewrite quality, e.g. previous turns, on-screen text, ASR n-best, etc.; (2) captures rewrite opportunities for long-tail queries.
• Cons: (1) latency constraints; (2) absence of offline information, e.g. NLU hypotheses, CPDR metrics, Alexa's response to the user query, etc.
Guardrails and metrics (CPDR) are not covered on this slide.
Existing publications from Alexa that can be cited (already in production):
Precompute:
• "Self-aware feedback-based self-learning in large-scale conversational AI", Ponnusamy et al., to appear in NAACL 2022
• "Contextual Rephrase Detection for Reducing Friction in Dialogue Systems", Wang et al., EMNLP 2021
• "Feedback-based self-learning in large-scale conversational AI agents", Ponnusamy et al., IAAI 2020
Online:
• "Personalized Search-based Query Rewrite System for Conversational AI", Cho et al., NLP4ConvAI 2021
• "Search based self-learning query rewrite system in conversational AI", Fan et al., De-MaL 2021
This slide presents a neural approach to extracting rephrases from session data. The left side gives an example session from which a rephrase can be extracted; the right side gives the model architecture, showing how the session is encoded and how a span prediction indicating the rephrase is produced. Example description (left side):
• The first turn ("play tyler hero explicit") and the second turn ("play tyler hero explicit by jack harlow") are defects (marked in red).
• The first turn is a defect because the user is rephrasing; the second turn is a defect because Alexa responds "Sorry, I can't find that".
• The third turn ("play tyler hero by jack harlow") is a non-defect (marked in green). The first and third turns form a rephrase pair that can be used for rewriting.
• The fourth turn ("play record year by eric church") is a non-defect (marked in green). We added this turn to illustrate that a session contains both rephrases and non-rephrases (e.g. the user switches topics), and the model needs to learn to differentiate them.
Model Architecture description (right side)
Description of the task: We consider a dataset D of M multi-turn dialogue sessions, such that D = {Si}, i=1..M, and every session S is an ordered set of N turns: S = {(Qi,Ri)}, i=1..N. Here i indicates the index of turn, and each turn i consists of a pair (Qi,Ri), where Qi is the user’s query and Ri is the agent’s response to query Qi. Any two successive turns have a time gap of less than a minute. Given a dialogue session S and a source turn, i.e., input pair of query and response (Qi,Ri), the goal of our model is to predict whether Qi is rephrased in any of the following turns (Qj,Rj)| i < j ≤ N. If so, the model should predict the span of Qj and return null otherwise.
Input:
input is truncated to a maximum of 512 tokens and a maximum of 10 turns
Time-bin is used to encode time intervals between turns. Consider a source turn (including a request and a response) t_src = (Q_src,R_src), for which we want to detect a rephrase in the session. We refer to its timestamp as ω_src. We calculate the time difference ∆i = ωi − ω_src, where ωi is the timestamp of a turn ti, for all the turns in the session. ∆i ∀i ∈ [1,n] are then mapped to their respective time-bin tokens. These time-bin tokens represent equal sized intervals in ∆’s range of [-60, 60] seconds.
The model used here is BERT, but it can also be RoBERTa.
Output:
We cast rephrase detection as a span-prediction problem, where we predict the probability of start and end span locations at each token position, using the embedding output of the final BERT layer.
We introduce a start vector W_S and an end vector W_E (both W_S and W_E are trainable parameters).
Assuming T_i is the final hidden vector for the i-th input token, the score of a candidate span from position i to position j is defined as s_ij = W_S · T_i + W_E · T_j, where i < j. We use s_none = W_S · T_CLS + W_E · T_CLS to represent the score of the no-rephrase option. We set a threshold τ to decide whether to predict no rephrase: if max_{j>i} s_ij > s_none + τ, we output the maximum-score span as the rephrase span, and null otherwise.
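The scoring and thresholding rule above can be sketched directly from the formulas. This is a minimal NumPy sketch of the decision rule only (the double loop is for clarity, not efficiency), assuming T[0] is the [CLS] embedding:

```python
import numpy as np

def predict_rephrase_span(T, W_S, W_E, tau):
    """T: (n, d) final-layer token embeddings, with T[0] = [CLS].
    Returns the span (i, j) maximizing s_ij = W_S.T_i + W_E.T_j (i < j),
    or None if no score exceeds s_none + tau."""
    start = T @ W_S                 # start score at every position
    end = T @ W_E                   # end score at every position
    s_none = start[0] + end[0]      # W_S.T_CLS + W_E.T_CLS
    best, best_span = -np.inf, None
    n = len(T)
    for i in range(1, n):
        for j in range(i + 1, n):
            if start[i] + end[j] > best:
                best, best_span = start[i] + end[j], (i, j)
    return best_span if best > s_none + tau else None
```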
We apply some heuristic post-filtering to generate the final rewrite pairs launched in production.
Markov based query rewriting model that learns from recurring customer rephrase patterns.
This slide describes how the DFS system works (using retrieval/ranking models to generate rewrites instead of relying on the customer’s own rephrasing). The ‘routine’ aspect is reflected in the “customer routine phrase” injection, which is used to build the personalized index.
Motivation of the work:
Enable rewrites for long-tail traffic
We generate rewrites based on the customer’s habitual usage patterns with the agent. The global layer is added (1) to avoid over-indexing on personalized cases (e.g., a user with a strong affinity for ‘We Don’t Talk Anymore’ might still be interested in exploring the recent song ‘We Don’t Talk About Bruno’); and (2) to supply rewrites that have not appeared in the user’s interaction history but are popular queries among global users.
The example at the top right shows that when the user issues a query that is ambiguous and contains ASR errors, the system finds two possible rewrites, one from the global layer and one from the personalized layer, and ultimately chooses the personalized one.
The retrieval model uses Dense Passage Retrieval (DPR) models that extract embeddings for the index and for the query respectively, and uses a similarity measure to compute the rewrite score.
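DPR-style retrieval reduces to a similarity lookup between a query embedding and the precomputed index embeddings. The dot-product similarity and the toy 2-d embeddings below are illustrative assumptions, not the production system's actual encoders:

```python
import numpy as np

def rewrite_scores(query_emb, index_embs):
    """Dot-product similarity between the query embedding and each
    candidate rewrite's index embedding (bi-encoder / DPR-style scoring)."""
    return index_embs @ query_emb

# Hypothetical 3-candidate index with toy 2-d embeddings.
index = np.array([[0.9, 0.1],
                  [0.1, 0.9],
                  [0.7, 0.7]])
query = np.array([1.0, 0.0])
scores = rewrite_scores(query, index)
best = int(np.argmax(scores))   # candidate 0 is closest to the query
```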
The ranking model combines a fuzzy-match score (e.g., from a single-encoder structure) with various metadata (e.g., impressions, CPDR, etc.) to make the reranking decision.
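One simple way to combine the fuzzy-match score with metadata is a weighted linear blend. The function and weights below are made-up illustrations of the idea, not the real ranker (which the slide does not specify):

```python
def rerank_score(encoder_score: float, impressions: float, cpdr: float,
                 w=(1.0, 0.3, -0.5)) -> float:
    """Illustrative reranking: blend the single-encoder fuzzy-match score
    with metadata signals. Higher impressions help a candidate; a higher
    customer-perceived defect rate (CPDR) penalizes it."""
    return w[0] * encoder_score + w[1] * impressions + w[2] * cpdr
```

In practice a learned model (e.g., gradient-boosted trees or a small MLP) would typically replace the hand-set weights.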
Key Takeaways:
Customers can instantly personalize Alexa in the moment by teaching her (as opposed to waiting for Alexa’s ML models becoming smarter over time)
Modern dialog assistants are complex systems that process user requests in multiple stages (see Figure 1). First, a voice-trigger (or wake-word) model determines whether the user is speaking to the assistant. Next, an ASR module converts the user’s audio stream into text. This text is sent to the NLU component, which determines the user’s request. The ER system recognizes and resolves entities, and the system then generates the best possible response (the Result stage) using several sub-systems specific to each dialog assistant. Finally, the response is rendered into human-like speech by a Text-to-Speech (TTS) system.
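The staged pipeline above can be sketched as a chain of transformations. The stage functions here are toy placeholders that only mirror the stage names from the text, not the assistant's real components:

```python
# Toy placeholders for the pipeline stages named above.
def wake_word(audio):  return audio                  # gate: is the user addressing us?
def asr(audio):        return "play harry potter"    # audio -> text (placeholder)
def nlu(text):         return {"intent": "PlayMusic", "text": text}
def er(frame):         return {**frame, "entity": "Harry Potter"}
def result(frame):     return f"Playing {frame['entity']}"
def tts(response):     return response               # text -> speech (placeholder)

STAGES = [wake_word, asr, nlu, er, result, tts]

def run_pipeline(audio):
    """Pass the input through each stage in order; a bad output from an
    upstream stage (e.g. a wrong ASR transcript) feeds every later stage."""
    x = audio
    for stage in STAGES:
        x = stage(x)
    return x
```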
To keep improving such assistants, it is important to identify defects at scale. Manual analysis to identify defects and their root causes is infeasible at large traffic volumes. Our system not only detects system defects but also identifies which component of the dialog assistant is responsible for each defect.
It is important to note that an error in an upstream component (e.g., ASR) can propagate through the system to the final response. In such cases, multiple components are likely to fail. We therefore focus on the first component that fails in a way that is irrecoverable, which we call the “failure point”.
In this work we recognize five failure points as well as a “Correct” class (meaning no component failed). The possible failure points are: FW (errors in the voice trigger), ASR errors (errors in transcribing user speech), NLU errors (IC+DC errors, e.g., incorrectly routing “play Harry Potter” to Video instead of Music), ER errors (entity recognition and resolution), and Result errors (an incorrect result, e.g., playing the wrong Harry Potter movie).
To better illustrate the failure-point problem, let’s examine this multi-turn dialog. In the first turn, the user is trying to open a garage door; however, the conversational assistant did not recognize the user’s speech correctly and thought the user wanted to open a “garbage door”. The entity-resolution system did not recover from this error and also failed. Finally, the dialog assistant failed to perform the correct action. In this turn, ASR is the failure point. We do not mark entity resolution as the failure point even though it also failed, because ER might have worked had ASR been correct.
In the second turn, the user repeats the request. ASR makes a small error by not recognizing the article “the” in the speech, but the dialog assistant takes the correct action; hence we mark this turn as correct, since the ASR error did not lead to a system failure.
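The "first irrecoverable failure" rule from the garage-door example reduces to a scan over the components in pipeline order. A minimal sketch, assuming per-component failure flags are already available from upstream annotation:

```python
# Components in pipeline order, matching the five failure points above.
FAILURE_POINTS = ["FW", "ASR", "NLU", "ER", "Result"]

def failure_point(component_failed: dict) -> str:
    """Return the first component (in pipeline order) that failed,
    or 'Correct' if no component failed."""
    for component in FAILURE_POINTS:
        if component_failed.get(component, False):
            return component
    return "Correct"
```

For the first turn of the example, both ASR and ER fail, but ASR is upstream, so `failure_point({"ASR": True, "ER": True})` yields `"ASR"`.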
The last turn highlights one of the limitations of our method. The user asks the dialog assistant to make a sandwich, an action dialog assistants cannot perform today. All systems worked correctly, yet the user is not satisfied. In our work we do not consider such turns defective, whereas sentiment-based approaches would mark them as unsatisfactory. Note that, in theory, the expected system behavior might have been to invoke a pleasantry/joke-style response, in which case this would indeed have been a defect even from the system’s perspective.
Our best Failure Point Isolation model achieves close-to-human performance on average across categories (>92% relative to humans). It uses extended dialog context, features derived from the assistant’s logs (e.g., ASR confidence), and traces from decision-making components (e.g., NLU intent). The weakest performance is on the FalseWake (FW) class, at 70%. We outperform humans on Result and Correct class detection; ASR, ER, and NLU are in the 90–95% range.