The presentation covers the following: 1) a long-range view of fundamental trends and shifts in computing and user experience, 2) what IoT and context mean for ambient conversational AI, 3) how conversational AI works, and 4) self-learning: learning based on implicit and explicit customer feedback.
2. Outline
• Long range view of fundamental trends and shifts in computing and User
Experience
• What does IoT and context mean for ambient conversational AI?
• How does Conversational AI work?
• Self-Learning: Implicit and explicit customer feedback based learning
• Q & A
5. Human Interaction with the Digital World
Human senses: sight, hearing, touch, smell, taste
Computer 'senses':
• No sight and no hearing (until recently)
• Form of human input: typing & tactile
Gap
• Computers (and backend services) are not yet designed to operate on voice input
Problem
• You need to physically touch computers
• It tethers you to a screen and 'immobilizes' you
Friction!
• The perceptions of our senses are created and stored in different parts of the brain
6. • Current computing cycle: Mobile internet [Meeker, Morgan Stanley, 2014]
• No room for growth in connecting people to the internet via smartphone (after 2020)
• What is next?
• IoT and intelligent connected systems & services: Ambient Intelligence with Conversational AI as the UX layer
[Figure: "The New Computing Cycle" (log scale, 1960-2020) showing 10X computing cycles: Mainframe (1M+ units), Minicomputer (10M+ units), PC (100M+ units), Desktop Internet (1B+ units/users), Mobile Internet (10B+ units: mobile phones, tablets, eReaders, MP3 players, telematics, ...), IoT (100B+ units: any device). Drivers: increased integration, smaller form factor, increased power & storage, lower costs, improved UI.]
[Figure: Global computing device shipments in millions (smartphone, PC+laptop, tablet), 2010-2021.]
[Figure: IoT worldwide install base in billions of connected devices, 2015-2025, growing from 15.41B (2015) to a projected 75.44B (2025).]
7. Internet of Things (IoT): Connected Smart Devices with Sensors
• Sensors: smaller, lower power, and cheaper
• 1 trillion sensors by 2022
• Digital nervous system: location data (GPS), eyes and ears via camera and microphone, sensors (motion, temperature, light, pressure, etc.)
[Diagram: sensor data flows through data aggregation, linking, reasoning (AI), and decision making (AI) in real time. Industry applications: machinery, smart cities, transportation, healthcare, factories, automation. Consumer applications: phones, wearables, TVs, appliances, home automation, home monitoring. Together these form collective IoT intelligence.]
Smart Home
• Over 90% of our lives is spent inside buildings
• An intelligent & responsive physical environment
• IoT integrates the physical world with the digital world
• The world around us reasons and talks back to us in real time
8. Examples of Ambient Intelligence: Alexa Hunches and Routines
[Screenshot: "It looks like you left the lights on, would you like me to turn them off?" with controls for lights, locks, and appliances]
9. Why does IoT matter for Conversational AI?
• "Alexa, play hunger games"
• What is the user's intent? play_music? play_video? play_audiobook?
• "Alexa, what should I do for dinner?"
• What is the user's intent? book_restaurant? order_food? find_recipe?
• "Alexa, order me two towels"
• What is the user's intent? shopping? room service?
• "Alexa, what is the temperature?"
• What is the user's intent? weather forecast? temperature inside the home? temperature of the oven?
• How do we get ground truth for a large combination of [person x device x context] data? How do we scale learning?
• IoT is increasing the complexity (and opportunity) of the world
• Requires real-time communication with a reasoning environment
• Creates new forms of 'context'
• Context:
• The set of circumstances/facts that surround a particular event, situation, or entity, which AI systems use to sense, reason, and adapt better to the physical and digital world
• Identity & state, device types, physical/digital activity on devices/systems, time, device & user location, state and changes in the environment as measured by sensors, ...
• Why does context matter for conversational AI?
• Contextual ambiguity: users perceive no ambiguity when they issue a command to an intelligent assistant; the ambiguity exists only on the system's side
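The ambiguity examples above can be made concrete with a toy disambiguation rule: the same utterance maps to different intents depending on device context. A minimal sketch (all intent names, scores, and context fields are illustrative, not Alexa's actual schema):

```python
# Toy contextual intent disambiguation: rerank NLU n-best hypotheses
# using simple device-context priors. Purely illustrative.

def disambiguate(nbest, context):
    """nbest: list of (intent, score); context: dict of device facts."""
    boosts = {
        # A screen device makes video more likely; a speaker, music.
        ("play_video", "has_screen"): 0.3,
        ("play_music", "is_speaker"): 0.3,
        # An oven that is on makes "temperature" mean the oven.
        ("get_oven_temp", "oven_on"): 0.5,
    }
    rescored = []
    for intent, score in nbest:
        for (i, feature), boost in boosts.items():
            if i == intent and context.get(feature):
                score += boost
        rescored.append((intent, score))
    return max(rescored, key=lambda p: p[1])[0]

# "Alexa, play hunger games" on a screenless speaker vs. a TV:
nbest = [("play_music", 0.5), ("play_video", 0.45), ("play_audiobook", 0.4)]
print(disambiguate(nbest, {"is_speaker": True}))  # play_music
print(disambiguate(nbest, {"has_screen": True}))  # play_video
```

The point of the sketch is only that context enters the scoring function; a production system would learn these priors rather than hand-code them.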
10. How Does Conversational AI Work?
Example: "Alexa, what is the weather?"
• The user's utterance is sent to the Orchestrator via a recognize event
• ASR returns a recognition result for the utterance
• NLU converts the recognition result into N-best interpretations
• Routing selects (intent, skill) and sends the intent to the matching skill (e.g. Weather)
• The skill returns a "speak" directive with text/SSML
• TTS renders the text/SSML as Alexa's voice, which is played back to the user
The Orchestrator provides:
• Orchestration of ASR, NLU, routing, TTS, and application services
• Intent routing to applications
• Session management
• Dialog management for multi-turn interactions
• Abstraction of device features to applications
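The request flow on this slide can be sketched as a pipeline of stubbed stages; the component names follow the slide, while every implementation below is a stand-in:

```python
# Minimal orchestrator sketch for the ASR -> NLU -> routing -> skill -> TTS
# flow. All stage implementations are stubs for illustration only.

def asr(audio):
    # Stub: a real ASR system returns n-best transcriptions with scores.
    return [("what is the weather", 0.92), ("what is the whether", 0.05)]

def nlu(transcriptions):
    # Stub: map the 1-best transcription to n-best (intent, skill) pairs.
    text, _ = transcriptions[0]
    if "weather" in text:
        return [("get_weather", "WeatherSkill")]
    return [("fallback", "FallbackSkill")]

def route(interpretations):
    # Pick the top (intent, skill) pair.
    return interpretations[0]

def invoke_skill(intent, skill):
    # Stub skill returns a "speak" directive with text/SSML.
    if skill == "WeatherSkill":
        return {"directive": "speak", "ssml": "<speak>It is sunny today.</speak>"}
    return {"directive": "speak", "ssml": "<speak>Sorry, I can't help.</speak>"}

def tts(ssml):
    # Stub: a real TTS engine renders audio; here we just strip the tags.
    return ssml.replace("<speak>", "").replace("</speak>", "")

def orchestrate(audio):
    intent, skill = route(nlu(asr(audio)))
    directive = invoke_skill(intent, skill)
    return tts(directive["ssml"])

print(orchestrate(b"..."))  # It is sunny today.
```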
11. Machine Learning Types
(in terms of types of supervision/feedback)
• Supervised learning is the task of learning a prediction function that maps an input to an output based on example input-output pairs: y = f(x) (e.g. DNN, logistic regression, SVM). Typically achieves very accurate predictions with sufficient data!
Training: given a training set of labeled examples {(x1, y1), ..., (xN, yN)}, estimate the prediction function f by minimizing the prediction error on the training set
Testing: apply f to a never-before-seen test example x and output the predicted value y = f(x)
• Unsupervised learning looks for patterns in input data that has no pre-existing labels. It allows modeling of probability densities over inputs to deduce structure (e.g. K-means, PCA, LDA).
• Semi-supervised learning combines a small amount of labeled data with a large amount of unlabeled data during training. Variants include self-training, co-training, generative methods, graph-based methods, etc.
• Self-supervised learning predicts one part of the input from another part, without any human supervision (e.g. BERT, RoBERTa, GPT-3).
• Reinforcement learning (RL) is concerned with how agents should take actions in an environment in order to maximize a notion of cumulative reward.
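The supervised setup above (estimate f by minimizing training error, then apply it to unseen x) can be illustrated with a tiny logistic regression trained by gradient descent; this is a generic textbook sketch, not any production model:

```python
import math

# Tiny 1-D logistic regression: learn f(x) = sigmoid(w*x + b)
# from labeled pairs {(x_i, y_i)} by stochastic gradient descent.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(examples, lr=0.5, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in examples:  # minimize log-loss on the training set
            p = sigmoid(w * x + b)
            w -= lr * (p - y) * x
            b -= lr * (p - y)
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(w * x + b) >= 0.5 else 0

# Training: labeled examples where the label is 1 when x > 0.
data = [(-2.0, 0), (-1.0, 0), (-0.5, 0), (0.5, 1), (1.0, 1), (2.0, 1)]
w, b = train(data)
# Testing: apply f to never-before-seen examples.
print(predict(w, b, -3.0), predict(w, b, 3.0))  # 0 1
```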
12. Self Learning for Conversational AI
What do we mean by Self-learning?
• Framework that enables learning autonomously from user-system interactions (e.g.
barge-in, reformulations), system signals, and predictive models
• It can be considered as a layer that combines supervised learning, semi-supervised learning and RL
• Zero component-specific manual annotation to train and improve the machine learning models
• Leverages customers' implicit and explicit feedback and system signals to train and improve ML models in the Conversational AI stack, both offline and at runtime
Why self-learning?
• Speed: rapid scenario building and deployment
• Cost: minimizing manual annotation cost
• Ambiguity: customers (vs. annotators) know best what they mean and want
• Privacy: does not require human access to customer data
13. Customer Feedback Based Automated Ground Truth Generation
• A multi-year initiative to shift Alexa ML model development from a manual-annotation based approach to a primarily self-learning based approach by leveraging various forms of feedback
• Explicit feedback (e.g. Alexa: "Did I answer your question?" User: "Yes")
• Implicit feedback (e.g. the user barges in on a turn or rephrases their request)
• Unsolicited feedback (e.g. the user says "Alexa, thank you!" or "Alexa, I am not Derek, I am Dan")
• Mission: automatically generate labels for 100% of Alexa utterances and for all annotation workflows in near-real time by leveraging customer interactions and their feedback
• Goals: provide automatically accumulated signals and data to
• Protect user privacy (by removing human reviewers from the loop)
• Improve model accuracy (by providing more personalized labels)
• Reduce annotation cost
14. Customer Feedback Based Ground Truth Generation Overview
[Diagram: two major modules inside the Alexa runtime system.]
Exploration Module
• Alexa models (ASR, NLU, etc.) produce production model outputs (ASR 1-best, NLU 1-best, etc.)
• A confidence prediction component feeds an exploration decider: high-confidence turns pass through; low-confidence turns trigger exploration
• Alternative hypotheses generation proposes candidate labels for low-confidence turns
• Implicit exploration: directly replace the 1-best with an alternative hypothesis
• Explicit exploration: present multiple hypotheses to the customer (e.g. voice confirmation, on-screen choices)
Feedback Collection and Label Generation Module
• Feedback collection & understanding gathers implicit feedback (e.g. barge-in, stop, rephrase), explicit feedback (e.g. "Did I answer your question?" / "Yes"), and unsolicited feedback (e.g. "Alexa, thank you")
• Multi-task label generation models (NLU: DC, IC, NER; ASR: error prediction, etc.; dialog success estimation) convert the feedback into feedback-based annotation data
• The annotation data, together with other data (unlabeled data, existing annotations, etc.), is used to train new models, which replace the corresponding Alexa modules
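The confidence-gated exploration step described on this slide can be sketched as a simple decision rule; the thresholds and the screen heuristic are illustrative assumptions, not the production logic:

```python
# Toy exploration decider: high-confidence turns pass through unchanged;
# low-confidence turns trigger implicit or explicit exploration.

def decide_exploration(one_best, alternatives, confidence, has_screen):
    if confidence >= 0.9:
        # High confidence: keep the production 1-best, no exploration.
        return {"strategy": "none", "hypothesis": one_best}
    if has_screen and alternatives:
        # Explicit exploration: show choices for the customer to pick from.
        return {"strategy": "explicit", "choices": [one_best] + alternatives}
    if alternatives:
        # Implicit exploration: silently swap in the top alternative and
        # observe feedback (barge-in, rephrase, engagement).
        return {"strategy": "implicit", "hypothesis": alternatives[0]}
    return {"strategy": "none", "hypothesis": one_best}

print(decide_exploration("play buddha", ["play boo'd up"], 0.55, False))
```

Feedback collected on the explored hypothesis (e.g. the user letting the music play vs. barging in) then becomes the synthetic label.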
15. Model Architecture for Customer Feedback Based Ground Truth Generation
Multi-task Label Generation Model
Features
• Dialogue context (user utterance, Alexa response, previous turns, next turns, etc.)
• System metadata (domain, intent, dialog status, confidence scores, etc.)
Model
• Turn encoder + dialogue-level transformer
• The turn-level textual encoder is RoBERTa
Multi-task learning heads
• Explicit user feedback (e.g. the user says "thank you")
• Inferred user feedback (e.g. the user plays music for 30 seconds after a voice command)
• Manual annotation
Self-supervised pretraining
• Synthetic contrastive data (i.e. randomly swap in answers from a different dialog as defect samples)
[Diagram: Model details. Data: turns 1..n, each with a request, response, and speaker ID. At the turn level, a RoBERTa textual encoding of the request/response is concatenated with an MLP over categorical features (domain, intent, dialogue status, ...) and real-valued/binary features. The turn encodings feed a dialogue-level transformer over the session, focused on a target turn. Tasks: E2E defect estimation/annotation, transcription (ASR recognition), NLU annotation (intent classification, named entity recognition), and dialog goal annotation (goal evaluation, goal segmentation).]
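Structurally, the label generation model can be sketched as below; the encoders are stand-in stubs (a real implementation would use RoBERTa and a dialogue-level transformer), and the head names mirror the slide:

```python
# Structural sketch of the multi-task label generation model:
# turn-level encoding -> dialogue-level encoding -> per-task heads.
# All encoders and heads are toy stubs for illustration.

def encode_turn(turn):
    # Stub turn encoder: concatenate a fake "textual" encoding with
    # categorical and numeric features (domain, ASR confidence, ...).
    textual = [float(len(turn["request"])), float(len(turn["response"]))]
    categorical = [1.0 if turn["domain"] == "Music" else 0.0]
    numeric = [turn["asr_confidence"]]
    return textual + categorical + numeric

def encode_dialogue(turns):
    # Stub dialogue-level encoder: mean-pool the turn encodings
    # (stands in for a dialogue-level transformer).
    encs = [encode_turn(t) for t in turns]
    n = len(encs)
    return [sum(col) / n for col in zip(*encs)]

def multi_task_heads(dialogue_encoding):
    # One stub head per task family from the slide.
    return {
        "e2e_defect": False,
        "intent_classification": "play_music",
        "goal_evaluation": "success",
    }

turns = [
    {"request": "play buddha", "response": "Playing Buddha Spa",
     "domain": "Music", "asr_confidence": 0.6},
    {"request": "alexa stop", "response": "OK",
     "domain": "Music", "asr_confidence": 0.95},
]
labels = multi_task_heads(encode_dialogue(turns))
print(labels["goal_evaluation"])  # success
```

The design point is that one shared dialogue encoding feeds several annotation heads, so feedback signals learned for one task transfer to the others.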
16. Automated Ground Truth Generation Results
Goal Segmentation / Evaluation
Table 1. Goal segmentation and evaluation tasks. We compare model prediction accuracy against human (single-pass) annotation accuracy (note that we use 3-pass Gold annotation as ground truth). "Single turn" means dialogues with only 1 turn; "multi turn" means dialogues with multiple turns. "Single-task" denotes models separately fine-tuned on one task at a time, whereas "multi-task" denotes models fine-tuned on multiple tasks together. "Combined accuracy" and "combined weighted F1 score" combine the goal segmentation and evaluation tasks.
Intent Classification / Named Entity Recognition
Table 2. Intent classification. Comparing our model using dialogue context against a RoBERTa-based baseline model on intent classification for the Shopping domain (bolded rows show the intents with the largest improvements).
Table 3. Slot tagging. Comparing our model using dialogue context against a RoBERTa + CRF baseline model on slot tagging for the Shopping domain (bolded rows show the slot types with the largest improvements).
Publications:
• Gupta, S. et al. "RoBERTaIQ: An efficient framework for automatic interaction quality estimation of dialogue systems". KDD 2021
• Wang, Z. et al. "Contextual rephrase detection for reducing friction in dialogue systems". EMNLP 2021
• Park, D. et al. "Large-scale hybrid approach for predicting user satisfaction with conversational agents". NeurIPS 2020
17. Defect Correction with the Self-Learning Framework
• Enable self-learning in Alexa to reduce customer-perceived defects and enhance its understanding in real time, with context, without any human annotator in the loop (covering both prevention and correction)
1. Detect defects
• Customer Perceived Defect (CPD) metric
• Example: User: "Alexa, play Buddha" / Alexa: "Buddha Spa from Ama..." / User: "Alexa, stop" (a customer-perceived defect!)
• Detection: daily to real time
2. Learn corrections
• From rephrases, follow-ups, or dialogs
• Learning & deployment: daily
3. Correct defects
• At runtime, generate alternate utterances (aka query rewriting)
• Example: "Alexa, play Buddha" is rewritten to "play Boo'd Up"; Alexa: "Playing Boo'd Up by ..." (success!)
4. Automatic guardrails
• Several guardrails to prevent trustbusters/regressions: automatic blocklisting, reducing false wakes, sensitive utterances
• Blocklisting: 2 hrs to near-real time
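Steps 3 and 4 can be sketched as a rewrite lookup guarded by a blocklist; the rewrite table and blocklist contents are invented for illustration:

```python
# Toy runtime defect correction: apply a learned query rewrite unless the
# pair has been blocklisted by a guardrail. Contents are illustrative.

LEARNED_REWRITES = {"play buddha": "play boo'd up"}
BLOCKLIST = set()  # (request, rewrite) pairs that caused regressions

def correct(utterance):
    rewrite = LEARNED_REWRITES.get(utterance)
    if rewrite and (utterance, rewrite) not in BLOCKLIST:
        return rewrite
    return utterance

print(correct("play buddha"))  # play boo'd up
BLOCKLIST.add(("play buddha", "play boo'd up"))  # guardrail kicks in
print(correct("play buddha"))  # play buddha
```

Automatic blocklisting lets a bad rewrite be rolled back far faster (hours) than retraining the model that produced it (daily).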
18. Self-Learning Based Defect Reduction in Large-Scale Conversational AI Agents
Two general ways to provide rewrites for the reformulation engine:
• Precomputed rewriting: this pipeline produces request-rewrite key-value pairs offline and loads the pairs at runtime. It takes advantage of the availability of offline information (e.g. the user's own rephrases, offline metrics) and a larger latency budget.
• Online rewriting: this pipeline leverages rewrite models (e.g. retrieval/ranking models or generation models) and online contextual information (e.g. previous dialog turns, location, time) to produce rewrites online. It enables rewriting for long-tail defect queries.
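The two pipelines can be combined as a cascade: try the precomputed key-value table first, then fall back to an online model for long-tail queries. A minimal sketch (table contents and the stub model's behavior are invented):

```python
# Cascade of the two rewrite pipelines: precomputed key-value lookup
# first, then an online model fallback for long-tail queries.

PRECOMPUTED = {"full volume": "volume ten"}  # built offline, loaded at runtime

def online_rewrite_model(query, context):
    # Stub for a retrieval/ranking or generation model that uses online
    # context (previous turns, etc.). Trivial demo behavior only.
    if context.get("previous_turn") == "sorry, i can't find that":
        return query + " song"
    return None

def rewrite(query, context):
    if query in PRECOMPUTED:  # head traffic with offline evidence
        return PRECOMPUTED[query]
    # Long tail: consult the online model; keep the query if no rewrite.
    return online_rewrite_model(query, context) or query

print(rewrite("full volume", {}))  # volume ten
print(rewrite("play drivers license",
              {"previous_turn": "sorry, i can't find that"}))
```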
19. Precomputed Rewriting: Contextual Rephrase Detection in Conversational Agents
Example session (alternating query/response turns):
User: play tyler hero explicit
Agent: Here's hypothetical hero, by Tyler Rothrock
User: play tyler hero explicit by jack harlow
Agent: Sorry, I can't find that
...
Session input (flattened):
[User] play tyler hero explicit [Agent] Here's hypothetical hero, by Tyler Rothrock [User] play tyler hero explicit by jack harlow [Agent] Sorry, I can't find that ...
Model output (rephrase candidates with scores):
Play tyler hero by jack harlow (0.9)
Play tyler hero (0.05)
"Contextual Rephrase Detection for Reducing Friction in Dialogue Systems", Wang et al., EMNLP 2021
20. Precomputed Rewriting: Feedback-Based Self-Learning in Conversational AI Agents
• Users provide feedback to Alexa in the form of rephrases.
• Recurring user rephrases like (a), (b), (c) are encoded in absorbing Markov chains.
• By resolving the Markov model as in (d), we surface the rewrite that is most likely to result in success, as in (e).
• "Feedback-based self-learning in large-scale conversational AI agents", Ponnusamy et al., AAAI 2020
• "Self-aware feedback-based self-learning in large-scale conversational AI", Ponnusamy et al., to appear in NAACL 2022
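The absorbing-Markov-chain idea can be illustrated with a tiny chain: utterances are transient states, success and failure are absorbing states, and we surface the candidate rewrite with the highest probability of absorbing into success. A toy sketch (all transition counts are invented):

```python
# Toy absorbing Markov chain over rephrase behavior. States are
# utterances plus absorbing SUCCESS/FAIL states; counts are invented.

COUNTS = {
    "play buddha":     {"play boo'd up": 60, "play buddha spa": 20, "FAIL": 20},
    "play boo'd up":   {"SUCCESS": 90, "FAIL": 10},
    "play buddha spa": {"SUCCESS": 30, "FAIL": 70},
}

def p_success(state, depth=10):
    """Probability of eventually absorbing into SUCCESS from `state`."""
    if state == "SUCCESS":
        return 1.0
    if state == "FAIL" or depth == 0:
        return 0.0
    outgoing = COUNTS.get(state, {})
    total = sum(outgoing.values())
    return sum(n / total * p_success(nxt, depth - 1)
               for nxt, n in outgoing.items())

# Surface the rewrite for "play buddha" that maximizes success probability.
candidates = ["play boo'd up", "play buddha spa"]
best = max(candidates, key=p_success)
print(best, round(p_success(best), 2))  # play boo'd up 0.9
```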
21. Online Rewriting: Search-Based Self-Learning Query Rewriting System
[Diagram: offline, a global indexer builds a global index from customer interactions with AI devices, and a personalized indexer builds a personalized index from signals such as customer purchase history, customer contact names, and customer routine phrases. Online, the user query goes to global and personalized retrieval/ranking models over the respective indexes, and rewrite merging logic produces the final rewrite.]
Example:
User query: "how's the weather in Wikeson"
Global top-1 rewrite: "how's the weather in Wilkeson Washington"
Personal top-1 rewrite: "how's the weather in Wilkerson California"
Final rewrite: "how's the weather in Wilkerson California"
• "Personalized Search-based Query Rewrite System for Conversational AI", Cho et al., NLP4ConvAI 2021
• "Search based self-learning query rewrite system in conversational AI", Fan et al., De-MaL 2021
22. Selected Experimental Results for Query Rewriting
• Precomputed rewriting: deployed the model in [1] across 11 locales spanning 6 languages. Online A/B testing demonstrated a significant reduction (p-value ≤ 0.0001) in defects experienced, with relative defect reductions ranging from 22.73% to 31.22%.
• Online rewriting: deployed the systems in [2] in en-US. Online A/B testing demonstrated a significant (p-value < 0.001) relative reduction in defect rate (13%). Launching the personalized system on top of the global one led to an additional significant defect rate reduction of 4%.
• Win:loss ratio: 8.5 : 1; learning latency: 24 hrs
[1] "Self-aware feedback-based self-learning in large-scale conversational AI", Ponnusamy et al., to appear in NAACL 2022
[2] "Search based self-learning query rewrite system in conversational AI", Fan et al., De-MaL 2021
Rewrite examples (type: request → rewrite):
Global rewrite: "Full volume" → "Volume ten"
Global rewrite: "Don't ever play that song" → "Thumbs down this song"
Global rewrite: "Play a. b. c." → "Play the alphabet song"
Personalized rewrite: "Open angry sleepy time playlist" → "Open avery sleepy time playlist"
Personalized rewrite: "Pair with johnson's iphone" → "Pair with john's iphone"
Personalized rewrite: "Play drivers license" → "Play the song drivers license by olivia rodrigo"
23. Teachable AI
• Customers can interactively teach Alexa and instantly adapt her to their personal preferences, such as "I'm a Warriors fan," "I like Italian restaurants," or "I prefer Big Sky for my weather," by
• initiating a conversation with Alexa at any time
• Alexa proactively sensing a teachable moment (e.g. repeat usage or an unsatisfactory response) and clarifying a preference
• initiating a guided Q&A with Alexa with a simple cue like "Alexa, learn my preferences," and sharing their favorites across topics such as sports, food, and weather
• Personalized experiences: the next time customers query Alexa on related topics, like their sports update, restaurants nearby, or the weather, Alexa will bear their interests in mind to curate personalized selections.
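A toy version of the mechanism: store a taught preference, then apply it when answering related queries. The slot names and responses are invented for illustration, not Alexa's actual preference schema:

```python
# Toy teachable-preference store: learn a preference from a teaching
# interaction, then personalize later answers. Slot names are invented.

PREFERENCES = {}

def teach(user_id, slot, value):
    PREFERENCES.setdefault(user_id, {})[slot] = value

def answer_sports_update(user_id):
    team = PREFERENCES.get(user_id, {}).get("favorite_team")
    if team:
        return f"Here's the latest on the {team}."
    return "Which team would you like news about?"

teach("dan", "favorite_team", "Warriors")  # "I'm a Warriors fan"
print(answer_sports_update("dan"))    # Here's the latest on the Warriors.
print(answer_sports_update("derek"))  # Which team would you like news about?
```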
25. Failure Point Isolation: Predict Which Component Failed
[Figure: component-level architecture of a typical conversational assistant. Color codes correspond to Turn 1 on the next slide (a fatal ASR error and a non-fatal ERR error).]
Predicted classes:
• False Wakes (FW)
• ASR errors
• NLU errors
• Entity Resolution errors (ERR)
• Result errors
• Correct (no error)
27. Failure Point Isolation (FPI) Model vs. Human Performance*
• Human F1-score is calculated for a single human against a panel of expert annotators
• The FPI model outperforms humans for Result and Correct cases
• False Wake performance is the weakest, at 71.2%
• Detection of ASR, ERR, and NLU errors is at 90-95% of human performance
* Khaziev et al., "FPI: Failure Point Isolation in Large-scale Conversational Assistants", NAACL-HLT 2022 Industry Track
Acknowledgments: Gabriella, the organizing committee, and SIGIR.
Thank you for joining the talk. The summary of the talk is captured in the title itself. We want to enable natural contextual interactions for ambient computing, to do that we need scalable self-learning methods that can handle ambiguity and context for accurately understanding the user’s request and provide the best possible answer. The backdrop for the talk is conversational systems like Alexa. Let us get right to it.
We celebrated Alexa's 7th birthday last Saturday, Nov 6th. Here you see some of the Alexa Echo family of devices launched over the years, starting with the first generation. Note that none of these devices has a specific name; Alexa is the agent, or AI, behind them, and it lives in the cloud.
Alexa AI and devices are not built just for your homes; at Amazon we have partnered with other companies to bring Alexa to senior living communities, hospitality businesses such as Marriott hotels, and hospitals. Here you see Boston Children's Hospital.
Timothy Driscoll, Director of Technology Strategy at Boston Children’s Hospital, says, “Boston Children's Hospital is using Amazon Echo devices to provide an array of features to patients including entertainment in the form of music and games, hospital and unit-specific frequently-asked questions, and control of the in-room televisions. Our patients will soon be able to express their needs for things like pain management support - "Alexa, tell my nurse I'm in pain," or general comfort - "Alexa, tell my nurse I need a pillow."
Alexa experiences are augmented with endpoint-specific experiences, such as calling nurses or answering hospital-related frequently asked questions.
Two weeks ago we also launched Alexa in Disney resorts. Alexa is also integrated into the infotainment systems of most car manufacturers.
Not a single week goes by without us announcing something about Alexa or an Alexa integration with partners. Just last week we also announced Alexa on Range Rovers.
What is really happening?
Let us talk about the basics first. How do we sense the world around us? We (as humans) have our senses and we primarily rely on our vision and hearing. These are our primary means of getting input from the outside world.
Let us look at computers and how they sense the world. How do they get input from the outside world, and particularly from humans?
Let's look at a slightly longer horizon and go back to 1946. This is ENIAC, the mother of modern-day computers. This picture was taken at the University of Pennsylvania.
The Apple iPhone is different: many elements of its multi-touch user interface require you to touch multiple points on the screen simultaneously (capacitive screens vs. resistive screens).
We have been manually pushing our wills into our tools, literally by using our hands.
ENIAC programmers Frances Bilas (later Frances Spence) and Betty Jean Jennings (later Jean Bartik) stand at its main control panels. With ENIAC's 40 panels still under construction, and its 18,000 vacuum tube technology uncertain, the engineers had no time for programming manuals or classes. Bartik and the other women taught themselves ENIAC's operation from its logical and electrical block diagrams, and then figured out how to program it. They created their own flow charts, programming sheets, wrote the program and placed it on the ENIAC using a challenging physical interface, which had hundreds of wires and 3,000 switches.
ENIAC was impressive: 80 feet long and 8 feet high, weighing 30 tons, with 18,000 vacuum tubes, 70,000 resistors, 10,000 capacitors and 1,500 relays. It had a memory capacity of 20 words and was programmed by setting 6,000 dials and switches, a task that took a crew of workers many hours.
The Apple-1 computer, built by hand in 1976 by Steve Wozniak in Apple co-founder Steve Jobs' garage or his sister's bedroom, fetched nearly twice its pre-sale high estimate, Bonhams said.
Nowadays a typical cell phone has over 40 sensors: accelerometer, gyroscope, thermometer, Wi-Fi, Bluetooth, RF signal gathering, etc.
Low power and cheap sensors are integrated into devices, physical systems and buildings in different industry verticals, from transportation to health care to factories.
Likewise on consumer side, more and more sensors are integrated into many devices and appliances we use everyday.
It is estimated that there will be about 1 trillion sensors in the world; that is about 120 sensors per person.
These sensors measure changes in the environment or in user’s state from temperature, to pressure, touch, camera, light etc. to user’s heart rate to physical movements.
More specifically, with GPS data, camera and microphone, and internet connectivity, we are building a digital nervous system for the environment. This paves the way for an intelligent and responsive environment.
The fundamental challenge of Verity is that we need to convert customer feedback, which is often binary or low-dimensional ('yes/no', 'defect/non-defect'), into high-dimensional predicted labels such as transcriptions or NER annotations. The key strategy is the "hypotheses" → "exploration" → "confirmation" process: the Verity system generates or selects potential alternative hypotheses for the labels we want customer feedback on, injects those hypotheses into the Alexa runtime either through implicit exploration (e.g. query rewriting or n-best ranking) or through explicit exploration (e.g. asking the customer "you mean Frozen the movie, right?"), then collects customer feedback on these hypotheses to generate high-confidence synthetic labels.
Verity is composed of two major components: 1) the Exploration Module and 2) the Feedback Collection and Label Generation Module.
The responsibilities of the exploration module include: 1) Confidence prediction: predict the confidence of the current turn in order to decide whether or not to trigger exploration. 2) Hypothesis generation: generate alternative hypotheses for the labels that Verity wants to collect customer feedback on. 3) Exploration: inject the hypotheses into the Alexa runtime, either through a) implicit exploration, i.e. directly replacing the existing 1-best hypothesis with an alternative hypothesis, or b) explicit exploration, i.e. presenting multiple hypotheses for customers to choose from through mechanisms such as voice confirmation and on-screen dynamic feedback (e.g. presenting a list of candidate transcriptions and asking "which of the following transcriptions matches what you said?").
The responsibility of the feedback collection and label generation module is to collect and interpret customer feedback, and to convert that information into high-quality synthetic labels that can be used for training the respective models. The feedback collection & understanding component collects implicit feedback (barge-in, paraphrases), explicit feedback (e.g. "did I answer your question?", "yes"), and unsolicited feedback (e.g. positive: "Alexa, thank you"; negative: "Alexa, that is not what I said"). The multi-task label generation model is a multi-task deep-learning model that labels data leveraging context derived from customer feedback.
Verity label generation model architecture:
• Turn-level encoder + dialogue/session-level transformer
• Turn-level textual encoder: RoBERTa
• MLP for categorical/numerical features
Input features include:
• Textual features (user utterance, Alexa response, previous turns, next turns)
• Categorical features (domain, intent, dialog status)
• Numerical features (number of tokens, ASR/NLU confidence scores)
• Raw audio data
Training:
• Multi-task learning heads
• Using explicit/implicit user feedback
• Manual annotation
• Self-supervision: synthetic contrastive learning
Left panel: results for goal segmentation and evaluation.
We use single-pass human annotation as the baseline and compare two variants of our models: 1) fine-tuned on each task separately; 2) fine-tuned in a multi-task setting. Note that we use three-pass "Gold" annotation as ground truth.
DA profile: professionally trained human annotators who have passed the annotation quality bar.
Results are broken down into two subsets: 1) "single turn", where each goal contains only one turn; 2) "multi-turn", where each goal contains a dialogue of multiple turns.
We show three metrics: 1) segmentation accuracy; 2) combined (segmentation and evaluation) accuracy; 3) combined F1 score (weighted average of F1 scores for the three goal classes "success", "failure", "not actionable"). "Data support" gives the size of the test dataset.
Key takeaways:
• Humans are more accurate on the goal segmentation task, but models outperform humans on the goal evaluation task.
• Combining goal segmentation and evaluation, the model is slightly better than humans in terms of accuracy but slightly worse on weighted F1 score (due to class imbalance, humans have higher precision than the model, which contributes to humans' higher F1 but lower accuracy).
• Multi-task generally outperforms single-task.
• This is evaluated on the full dataset; the model outperforms humans on the high-confidence subset (the 50% of data with highest confidence).
The right panel shows results for intent classification and slot tagging.
Here we use a RoBERTa model without dialogue context as the baseline and compare our model (RoBERTa with dialogue context) against it. For the slot tagging task we use RoBERTa + CRF as the baseline.
Key takeaways:
• The context-based model outperforms the baseline on all intents and slot types.
• The context-based model brings significant improvements on specific intents and slot types, e.g. a 12% relative improvement for "CheckOrderStatusIntent" and a 7% relative improvement for "ShoppingListType".
• This data is only single-pass annotated, so it is not compared to human performance.
This slide describes how the rewrite system (FLARE) fits into the spoken dialog system and coordinates with other parts to achieve defect reduction.
We further introduce two pipelines to deliver rewrites within the "Reformulation Engine" (FLARE): (1) precompute (offline modeling) and (2) online. The text covers the concept of each pipeline and their pros and cons:
Precompute:
• Pros: (1) no latency constraints, so complex models can be used; (2) rich offline signals that are not available online, e.g. user follow-up turns, user rephrases, Alexa responses, and metrics such as CPDR, Music30sPrecision, and video click-through rate.
• Cons: (1) cannot use online contextual information, e.g. previous turns, on-screen text, etc.; (2) cannot correct long-tail queries.
Online:
• Pros: (1) better use of online contextual signals to improve rewrite quality, e.g. previous turns, on-screen text, ASR n-best, etc.; (2) captures rewrite opportunities for long-tail queries.
• Cons: (1) latency constraints; (2) absence of offline information, e.g. NLU hypotheses, CPDR metrics, Alexa's response to the user query, etc.
Guardrails and metrics (CPDR) are not covered on this slide.
Existing publications from Alexa that can be cited (already in production):
Precompute:
• "Self-aware feedback-based self-learning in large-scale conversational AI", Ponnusamy et al., to appear in NAACL 2022
• "Contextual Rephrase Detection for Reducing Friction in Dialogue Systems", Wang et al., EMNLP 2021
• "Feedback-based self-learning in large-scale conversational AI agents", Ponnusamy et al., IAAI 2020
Online:
• "Personalized Search-based Query Rewrite System for Conversational AI", Cho et al., NLP4ConvAI 2021
• "Search based self-learning query rewrite system in conversational AI", Fan et al., De-MaL 2021
This slide presents a neural approach to extracting rephrases from session data. The left side gives an example session from which a rephrase can be extracted; the right side gives the model architecture, showing how the session is encoded and how a span prediction indicating the rephrase is produced. Example description (left side):
• The first turn ("play tyler hero explicit") and the second turn ("play tyler hero explicit by jack harlow") are defects (marked in red).
• The first turn is a defect because the user is rephrasing; the second turn is a defect because Alexa responds "Sorry, I can't find that".
• The third turn ("play tyler hero by jack harlow") is a non-defect (marked in green). The first and third turns form a rephrase pair that can be used for rewriting.
• The fourth turn ("play record year by eric church") is a non-defect (marked in green). We added this turn to illustrate that a session contains both rephrases and non-rephrases (e.g. the user switches topics), and the model needs to learn to differentiate them.
Model Architecture description (right side)
Description of the task: We consider a dataset D of M multi-turn dialogue sessions, such that D = {Si}, i=1..M, and every session S is an ordered set of N turns: S = {(Qi,Ri)}, i=1..N. Here i indicates the index of turn, and each turn i consists of a pair (Qi,Ri), where Qi is the user’s query and Ri is the agent’s response to query Qi. Any two successive turns have a time gap of less than a minute. Given a dialogue session S and a source turn, i.e., input pair of query and response (Qi,Ri), the goal of our model is to predict whether Qi is rephrased in any of the following turns (Qj,Rj)| i < j ≤ N. If so, the model should predict the span of Qj and return null otherwise.
Input:
input is truncated to a maximum of 512 tokens and a maximum of 10 turns
Time-bin is used to encode time intervals between turns. Consider a source turn (including a request and a response) t_src = (Q_src,R_src), for which we want to detect a rephrase in the session. We refer to its timestamp as ω_src. We calculate the time difference ∆i = ωi − ω_src, where ωi is the timestamp of a turn ti, for all the turns in the session. ∆i ∀i ∈ [1,n] are then mapped to their respective time-bin tokens. These time-bin tokens represent equal sized intervals in ∆’s range of [-60, 60] seconds.
The model used here is BERT, but it can also be RoBERTa.
Output:
We cast rephrase detection as a span-prediction problem, where we predict the probability of start and end span locations at each token position, using the embedding output of the final BERT layer.
We introduce a start vector W_S and an end vector W_E (both W_S and W_E are trainable parameters).
Assuming T_i is the final hidden vector for the i-th input token, the score of a candidate span from position i to position j is defined as s_ij = W_S · T_i + W_E · T_j, where i < j. We use s_none = W_S · T_CLS + W_E · T_CLS to represent the score of the no-rephrase option. We set a threshold τ to decide whether to predict no rephrase: if max_{j>i} s_ij > s_none + τ, we output the maximum-score span as the rephrase span, and null otherwise.
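The scoring and thresholding rule above can be sketched directly from the formulas. This is a minimal NumPy sketch of the decision rule only (the double loop is for clarity, not efficiency), assuming T[0] is the [CLS] embedding:

```python
import numpy as np

def predict_rephrase_span(T, W_S, W_E, tau):
    """T: (n, d) final-layer token embeddings, with T[0] = [CLS].
    Returns the span (i, j) maximizing s_ij = W_S.T_i + W_E.T_j (i < j),
    or None if no score exceeds s_none + tau."""
    start = T @ W_S                 # start score at every position
    end = T @ W_E                   # end score at every position
    s_none = start[0] + end[0]      # W_S.T_CLS + W_E.T_CLS
    best, best_span = -np.inf, None
    n = len(T)
    for i in range(1, n):
        for j in range(i + 1, n):
            if start[i] + end[j] > best:
                best, best_span = start[i] + end[j], (i, j)
    return best_span if best > s_none + tau else None
```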
We apply some heuristic post-filtering to generate the final rewrite pairs launched in production.
Markov based query rewriting model that learns from recurring customer rephrase patterns.
This slide describes how the DFS system works (using retrieval/ranking models to generate rewrites instead of relying on the customer’s own rephrasing). The ‘routine’ aspect is reflected in the “customer routine phrase” injection, which is used to build the personalized index.
Motivation of the work:
Enable rewrites for long-tail traffic
We generate rewrites based on the customer’s habitual usage patterns with the agent. The global layer is added (1) to avoid over-indexing on personalized cases (e.g., a user with a strong affinity for ‘We Don’t Talk Anymore’ might still be interested in exploring the recent song ‘We Don’t Talk About Bruno’); and (2) to supply rewrites that have not appeared in the user’s interaction history but are popular queries among global users.
The example at the top right shows that when the user issues a query that is ambiguous and contains ASR errors, the system finds two possible rewrites, one from the global layer and one from the personalized layer, and ultimately chooses the personalized one.
The retrieval model uses Dense Passage Retrieval (DPR) models that extract embeddings for the index and for the query respectively, and uses a similarity measure to compute the rewrite score.
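DPR-style retrieval reduces to a similarity lookup between a query embedding and the precomputed index embeddings. The dot-product similarity and the toy 2-d embeddings below are illustrative assumptions, not the production system's actual encoders:

```python
import numpy as np

def rewrite_scores(query_emb, index_embs):
    """Dot-product similarity between the query embedding and each
    candidate rewrite's index embedding (bi-encoder / DPR-style scoring)."""
    return index_embs @ query_emb

# Hypothetical 3-candidate index with toy 2-d embeddings.
index = np.array([[0.9, 0.1],
                  [0.1, 0.9],
                  [0.7, 0.7]])
query = np.array([1.0, 0.0])
scores = rewrite_scores(query, index)
best = int(np.argmax(scores))   # candidate 0 is closest to the query
```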
The ranking model combines a fuzzy-match score (e.g., from a single-encoder structure) with various metadata (e.g., impressions, CPDR, etc.) to make the reranking decision.
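One simple way to combine the fuzzy-match score with metadata is a weighted linear blend. The function and weights below are made-up illustrations of the idea, not the real ranker (which the slide does not specify):

```python
def rerank_score(encoder_score: float, impressions: float, cpdr: float,
                 w=(1.0, 0.3, -0.5)) -> float:
    """Illustrative reranking: blend the single-encoder fuzzy-match score
    with metadata signals. Higher impressions help a candidate; a higher
    customer-perceived defect rate (CPDR) penalizes it."""
    return w[0] * encoder_score + w[1] * impressions + w[2] * cpdr
```

In practice a learned model (e.g., gradient-boosted trees or a small MLP) would typically replace the hand-set weights.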
Key Takeaways:
Customers can instantly personalize Alexa in the moment by teaching her (as opposed to waiting for Alexa’s ML models becoming smarter over time)
Modern dialog assistants are complex systems that process user requests in multiple stages (see Figure 1). First, a voice-trigger (or wake-word) model determines whether the user is speaking to the assistant. Next, an ASR module converts the user’s audio stream into text. This text is sent to the NLU component, which determines the user’s request. The ER system recognizes and resolves entities, and the system then generates the best possible response (the Result stage) using several sub-systems specific to each dialog assistant. Finally, the response is rendered into human-like speech by a Text-to-Speech (TTS) system.
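The staged pipeline above can be sketched as a chain of transformations. The stage functions here are toy placeholders that only mirror the stage names from the text, not the assistant's real components:

```python
# Toy placeholders for the pipeline stages named above.
def wake_word(audio):  return audio                  # gate: is the user addressing us?
def asr(audio):        return "play harry potter"    # audio -> text (placeholder)
def nlu(text):         return {"intent": "PlayMusic", "text": text}
def er(frame):         return {**frame, "entity": "Harry Potter"}
def result(frame):     return f"Playing {frame['entity']}"
def tts(response):     return response               # text -> speech (placeholder)

STAGES = [wake_word, asr, nlu, er, result, tts]

def run_pipeline(audio):
    """Pass the input through each stage in order; a bad output from an
    upstream stage (e.g. a wrong ASR transcript) feeds every later stage."""
    x = audio
    for stage in STAGES:
        x = stage(x)
    return x
```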
To keep improving such assistants, it is important to identify defects at scale. Manual analysis to identify defects and their root causes is infeasible at large traffic volumes. Our system not only detects system defects but also identifies which component of the dialog assistant is responsible for each defect.
It is important to note that an error in an upstream component (e.g., ASR) can propagate through the system to the final response. In such cases, multiple components are likely to fail. We therefore focus on the first component that fails in a way that is irrecoverable, which we call the “failure point”.
In this work we recognize five failure points as well as a “Correct” class (meaning no component failed). The possible failure points are: FW (errors in the voice trigger), ASR errors (errors in transcribing user speech), NLU errors (IC+DC errors, e.g., incorrectly routing “play Harry Potter” to Video instead of Music), ER errors (entity recognition and resolution), and Result errors (an incorrect result, e.g., playing the wrong Harry Potter movie).
To better illustrate the failure-point problem, let’s examine this multi-turn dialog. In the first turn, the user is trying to open a garage door; however, the conversational assistant did not recognize the user’s speech correctly and thought the user wanted to open a “garbage door”. The entity-resolution system did not recover from this error and also failed. Finally, the dialog assistant failed to perform the correct action. In this turn, ASR is the failure point. We do not mark entity resolution as the failure point even though it also failed, because ER might have worked had ASR been correct.
In the second turn, the user repeats the request. ASR makes a small error by not recognizing the article “the” in the speech, but the dialog assistant takes the correct action; hence we mark this turn as correct, since the ASR error did not lead to a system failure.
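The "first irrecoverable failure" rule from the garage-door example reduces to a scan over the components in pipeline order. A minimal sketch, assuming per-component failure flags are already available from upstream annotation:

```python
# Components in pipeline order, matching the five failure points above.
FAILURE_POINTS = ["FW", "ASR", "NLU", "ER", "Result"]

def failure_point(component_failed: dict) -> str:
    """Return the first component (in pipeline order) that failed,
    or 'Correct' if no component failed."""
    for component in FAILURE_POINTS:
        if component_failed.get(component, False):
            return component
    return "Correct"
```

For the first turn of the example, both ASR and ER fail, but ASR is upstream, so `failure_point({"ASR": True, "ER": True})` yields `"ASR"`.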
The last turn highlights one of the limitations of our method. The user asks the dialog assistant to make a sandwich, an action dialog assistants cannot perform today. All systems worked correctly, yet the user is not satisfied. In our work we do not consider such turns defective, whereas sentiment-based approaches would mark them as unsatisfactory. Note that, in theory, the expected system behavior might have been to invoke a pleasantry/joke-style response, in which case this would indeed have been a defect even from the system’s perspective.
Our best Failure Point Isolation model achieves close-to-human performance on average across categories (>92% relative to humans). It uses extended dialog context, features derived from the assistant’s logs (e.g., ASR confidence), and traces from decision-making components (e.g., NLU intent). The weakest performance is on the FalseWake (FW) class, at 70%. We outperform humans on Result and Correct class detection; ASR, ER, and NLU are in the 90–95% range.