[AAAI 2019 tutorial] End-to-end goal-oriented question answering systems

LinkedIn
Deepak Agarwal Bee-Chung Chen Qi He Jaewon Yang Liang Zhang

1:30 Introduction
1:40 End-to-End Workflow
1:50 Basic version: Single-Turn Question Answering
2:40 Advanced version: Multi-Turn Question Answering
3:30 break
4:00 Overview of Our Approach at LinkedIn
4:15 LinkedIn Help Center
4:30 Analytics Bot
5:15 Conclusion and Q & A

Where is AAAI
2019?
Hilton Hawaiian
Village Waikiki
Beach Resort

Where can I have
Japanese food
there?
Tomi Sushi is a
nice Japanese
restaurant

Please help me
make a reservation
at Tomi Sushi.
Great. When and
how many people?

I forgot my
password for the
registration site.
Can you help?
Sure. What is
your account id?


Article/paragraph
Question:
Answer:
Question:
Answer:
Question:
Answer:

■ Many datasets are available!

Who founded
LinkedIn?
Reid Hoffman,
Allen Blue, ...
Knowledge
Base
(KB)
KB Query:
(?, IsFounderOf, LinkedIn)

LinkedIn
Knowledge
Base
Recruiter Assistant
How should I hire for this AI engineer position?
… urgency ... … skills ... … job market ... other considerations
These are good candidates ...
Job Seeker Assistant
Find good jobs for me.
… career goals ... … skill gaps ... … job markets ... other considerations
These are good job positions for you ...



Where can I have
Japanese food
in the downtown?
Natural Language
Understanding
(NLU)
Dialogue State
Tracking
(DST)
Action Generation
(Dialogue Policy)
Natural Language
Generation
(NLG)
KB, DB, Index
Tomi Sushi is a
nice Japanese
restaurant

Natural Language
Understanding
(NLU)
Input: User Utterance
Where can I have
Japanese food
in the downtown?
(Speech-to-Text)
Output: Interpretation
Intent: Find_Restaurant
Type = Japanese
Area = Downtown
KB, DB, Index
Restaurant types: Japanese, Chinese, …
Location:
Country: USA
State: Hawaii
City: Honolulu
Area: Downtown

Dialogue State
Tracking
(DST)
Input:
Current Interpretation
Type = Japanese
Area = Downtown
Past State
Intent: Find_Flight
FromCity = San Jose
FromState = California
FromCountry = US
ToCity = Honolulu
ToState = Hawaii
ToCountry = US
Output: State
Type = Japanese
Area = Downtown
City = Honolulu
State = Hawaii
Country = US
KB, DB, Index
Restaurant types: Japanese, Chinese, …
Location:
Country: USA
State: Hawaii
City: Honolulu
Area: Downtown
Or, an embedding vector

Action Generation
(Dialogue Policy)
Output: Action
Action: Suggest_Restaurant
Type = Japanese
Name = Tomi Sushi
KB, DB, Index
Restaurant Search API
Input: (Type, Location)
Output: A list of restaurants ranked by their ratings
Input: State
Type = Japanese
Area = Downtown
City = Honolulu
State = Hawaii
Country = US

Natural Language
Generation
(NLG)
Output: System Utterance
Tomi Sushi is a nice
Japanese restaurant
(Text-to-Speech)
Or, other UI elements
KB, DB, Index
Input: Action
Type = Japanese
Name = Tomi Sushi
Knowledge cards of restaurants
Address, phone number, hours
Menu, price range, reviews

Where can I have
Japanese food
in the downtown?
Sequence-to-
Sequence Model
with
Memory KB, DB, Index
Tomi Sushi is a
nice Japanese
restaurant

Modular approach: Practical
End-to-end learning approach: Research, reading comprehension
Basic version:
Single-turn question answering
NLU:
Main Focus
No DST
Action Generation:
Rule-Based
NLG:
Template-Based
KB, DB, Index

Modular approach: Practical
End-to-end learning approach: Research, reading comprehension
Advanced version:
Multi-turn question answering
NLU:
Neural Net
DST:
Neural Net
Dialogue Policy:
Neural Net
NLG:
Neural Net
KB, DB, Index

NLU:
Main Focus
No DST
Dialogue Policy:
Rule-Based
NLG:
Template-Based
KB, DB, Index
Pipelined or Joint Learning
Logic
Form
KB, DB
execute
rule/template-based
action generation
Sequence to Sequence
Domain Detection Intent Detection Slot Filling
ASR
Text
Audio
Text
Direct Search

Logic
Form
KB, DB
action generation
ASR
Text
Audio
Motivations
● Template / slot filling: simple. Practical in
a domain-specific goal-oriented Q&A

Domain/Intent detection is a semantic text classification problem.
domain/intent detection
Select Flight
From Airline_Travel_Table
...
fill in arguments
semantic frame/template

Classic Sentence Classification
Input: Written language sentence
Training data: Rich (News articles, Reviews, Tweets, TREC)
State of the art: CNN [Kalchbrenner et al., ACL 2014] [Kim, EMNLP 2014]
Query Classification in Search
Input: Keywords
Training data: Rich (Click-through)
State of the art: CLSM, LSTM-DSSM [Shen et al., CIKM 2014] [Palangi et al., TASLP 2016]
Domain/Intent Detection (Text Classification) in Q&A
Input: Spoken language sentence with significant utterance variations
Training data: Few (Human labels)
State of the art: DCN, RNN, BERT (LinkedIn) [Tur et al., ICASSP 2012] [Ravuri and Stolcke, Interspeech 2015]

Temporal scope of entity
• 2 questions with different intents
- “How was the Mexican restaurant”
- “Tell me about Mexican restaurants”
• Q1: Can we automatically generate contextual features for entities?
Approach: Deep Neural Networks
Utterance Variations
• 2 questions with the same intent
- “Show me weekend flights between JFK and SFO”
- “I want to fly from San Francisco to New York next Sunday”
• Q2: Can they generate the same answer?
• Q3: Which question will generate the better answer?
Approaches:
- Paraphrase
- Active Q&A (RL)
Lack of training data
• Significant unknown words, unknown syntactic structures for the same semantics
• Q4: Can we efficiently expand the training data?
• Q5: Can we significantly reduce the need for training data?
Approaches:
- Bots Simulator, Domain-independent Grammar
- Paraphrase, Character-level Modeling, Transfer Learning

Slot filling extracts slot/concept values from the question for a set of predefined slots/concepts.
More specifically, it is often modeled as a sequence labeling task with explicit alignment.
extract semantic concepts
Select Flight
From Airline_Travel_Table
Where dept_city = “Boston” and arr_city = “New York”
and date = today
fill in arguments/slots
semantic frame/template
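As a rough sketch of the "fill in arguments/slots" step, the snippet below groups BIO tags into slot values and substitutes them into a query template. The tag names (`dept`, `arr`, `date`), the `TEMPLATE` string, and the function name are assumptions for illustration; the tags would normally come from a trained sequence labeling model.

```python
def bio_to_slots(tokens, tags):
    """Group B-/I- tagged tokens into {slot: value}; relies on one tag per token."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    slots, current = {}, None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):            # a new slot value starts here
            current = tag[2:]
            slots[current] = token
        elif tag.startswith("I-") and current == tag[2:]:
            slots[current] += " " + token   # continue the current slot value
        else:                               # "O" or an inconsistent I- tag
            current = None
    return slots

# Hypothetical semantic frame for a flight-search intent.
# String formatting is for illustration only; real systems should use
# parameterized queries instead of building SQL from user text.
TEMPLATE = ("SELECT Flight FROM Airline_Travel_Table "
            "WHERE dept_city = '{dept}' AND arr_city = '{arr}' AND date = '{date}'")

tokens = "show flights from Boston to New York today".split()
tags = ["O", "O", "O", "B-dept", "O", "B-arr", "I-arr", "B-date"]
slots = bio_to_slots(tokens, tags)
query = TEMPLATE.format(**slots)
```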

Semantic frame
• Pre-defined slots/concepts by
goal-oriented dialog systems
- vs. open-domain dialog systems
• Q6: Can we leverage domain
knowledge?
• Long entity phrase has strong
slot dependency
• Q7: Is model sensitive to slot
position?
• Q8: Shall we globally assign
labels together?
Slot dependency
Sentence watch star war episode IV a new hope
Slots O B-mov I-mov I-mov I-mov I-mov I-mov I-mov
• Input and output are of the same length
- vs. other sequence labeling tasks (machine translation and speech recognition): the output has variable length
• Q9: Can slot filling model
leverage the “explicit
alignment”?
Explicit alignment
Knowledge-based Model
Bidirectional RNN, Slot
Language Model, RNN-CRF
Attention-Based RNN

Benefits
1. Only 1 model needs to be trained, fine-tuned for
multiple tasks, and deployed
2. Tasks enhance each other. For example, if the
intent of a sentence is to find a flight, the sentence
likely contains the departure and arrival cities, and
vice versa
3. Outperforms separate models for each task
Joint Learning
Q10: What is the most effective learning
structure of this multi-task learning?
Q11: Should we jointly optimize the loss
function or not?
append intent to the
beginning/end of slots
2 different tasks
…...

Logic
Form
KB, DB
execute
Sequence to Sequence
Text
Motivations
● Simplify the workflow: directly generate
the final action
● Theoretically, model very complex Q&A
Challenges
● Grammar mismatch between question
and logic form
● Practically, suffer from the lack of
training data for complex Q&A
Question
● Q12: How to collect domain-specific
training data?
Grammar-compliant Methods

KB, DB
Direct Search
Text
Motivations
● Can we retrieve the answer from the KB/DB like querying a
search engine? Yes, but:
● Limited to simple Q&A (retrieve a single fact)
Challenge
● It is hard to find the answer if the answer is
supported by multiple knowledge facts
Question
● Q13: Can we build a scalable framework for
extracting the answer from multiple knowledge
facts?
Memory Network

Logic
Form
KB, DB
action generation
ASR
Text
Audio
Text
Direct Search
Covered
● Deep Neural Networks for
template-based Q&A
● Memory Networks for direct search Q&A
● Reduce the need for training data
Not covered
● Sequence to Sequence: comply with the
grammar of target
● Active Q&A (RL)

CNN
Advantages: fit intuition that most dependencies are local, easy to parallelize, easy to model hierarchy
Problems: path length between positions can be logarithmic for long-range dependency (O(log n)), needs a lot of tricks to model the position relationships (not natural)
RNN/LSTM
Advantages: good for variable-length representations such as sequences and long-range info propagation
Problems: the sequentiality prohibits parallelization, sequence-aligned states are wasteful (O(n)) for long-range dependencies, hard to model hierarchy
Seq2Seq
RNN/Gated RNN (LSTM/GRU) is the core of Seq2Seq
Advantages: good for variable-length output
Problems: the same as RNN
Attention
Attention + RNN-based Seq2Seq
Advantages: O(1) for long-range dependency, yield more interpretable models
Problems: the same as RNN
Transformer
Multi-head attention + non-recurrent Seq2Seq
Advantages: O(1) for long-range dependency, yield more interpretable models, easy to parallelize
So far, state-of-the-art

[Kalchbrenner et al., ACL 2014, Kim, EMNLP 2014, Shen et al., CIKM 2014]
● 1-d convolution → k-d, 1 CNN layer → multiple CNN layers
● multiple filters: capture various lengths of local contexts for each word
● max pooling → k-Max pooling: retain salient features from a few keywords in global feature vector
● Convolution: produce n-gram features
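To make the convolution and k-max pooling bullets concrete, here is a small pure-Python sketch on scalar sequences. It is a simplification under stated assumptions: real models convolve over word-embedding matrices with many learned filters, while this example uses one hand-picked kernel and made-up activations.

```python
def conv1d(sequence, kernel):
    """Valid 1-d convolution (cross-correlation): one feature per n-gram window."""
    n = len(kernel)
    return [sum(sequence[i + j] * kernel[j] for j in range(n))
            for i in range(len(sequence) - n + 1)]

def k_max_pooling(values, k):
    """Keep the k largest activations, preserving their left-to-right order."""
    if k >= len(values):
        return list(values)
    # Indices of the k largest values (ties go to the earlier position).
    top = sorted(range(len(values)), key=lambda i: (-values[i], i))[:k]
    return [values[i] for i in sorted(top)]

# A width-3 kernel of ones sums each 3-gram window of the input.
features = conv1d([1.0, 3.0, 2.0, 5.0, 4.0], [1.0, 1.0, 1.0])
pooled = k_max_pooling(features, 2)
```

Unlike plain max pooling, which keeps a single value, k-max pooling keeps the relative positions of the selected features, which is the property the slide refers to.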

[Kalchbrenner et al., ACL 2014, Kim, EMNLP 2014, Shen et al., CIKM 2014]
● TREC question classification:
CNNs are close to “SVM with
careful feature engineering”
● Large window width: long-term
dependency
● k-Max pooling maintains relative
positions of most relevant
n-grams
● Web query:
○ DSSM < C-DSSM (CLSM)
○ Short text: CNN is slightly
better than unigram model
Learned 3-gram features:
keywords win at 5 active neurons in max pooling:

The clouds are in the sky
I grew up in France … I speak fluent French
RNN
LSTM
github: 2015-08-Understanding-LSTMs


[Palangi et al., TASLP 2016]
● Memory: become richer (more
info) over x-axis
● Input gates: do not update
words 3, 7, 9
● Peephole, forget updates are
not too helpful when text is short
and memory is initialized as 0
(just do not update)
● Web query:
○ DSSM < C-DSSM (CLSM)
< LSTM-DSSM
Case: match “hotels in shanghai” with “shanghai hotels accommodation (3) hotel in
shanghai discount (7) and reservation (9)”
Input gates cell gates (memory)

[Ravuri and Stolcke, Interspeech 2015]
When sentence is long:
Basic RNN < LSTM
When sentence is short:
Basic RNN > LSTM

[Sutskever, et al., NIPS 2014]
use the last state of the
encoder
forget the first part when
finishing the whole input
Attention

Decoder focuses on different information from encoder at every step (weighted sum).
[Bahdanau et al., ICLR 2015]
Model dependencies without regard to
their distance in the input or
output sequences.
Transformer
sequentially process:
cannot parallelize

http://jalammar.github.io/illustrated-transformer/
Novelty: eliminate recurrence

[Vaswani, et al., NIPS 2017]
● Encoded representation of the input as a set of key-value pairs,
(K,V)
● In the decoder,
○ The previous output is compressed into a query Q,
○ The next output is the weighted sum of the values, where
the weight assigned to each value is determined by the
dot-product of the query with all the keys:
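The weighting described above is commonly written as Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, the scaled dot-product attention of [Vaswani, et al., NIPS 2017]. The sketch below computes it for a single query vector in pure Python; the function names and the toy keys and values are illustrative assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query: weighted sum of the values."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
              for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return output, weights

keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
uniform_out, uniform_w = attention([0.0, 0.0], keys, values)   # equal scores
focused_out, focused_w = attention([10.0, 0.0], keys, values)  # matches the first key
```

A zero query scores all keys equally, so the output is the plain average of the values; a query aligned with the first key puts almost all weight on the first value.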

- positional encoding
- residuals
- masked self-attention: only
attend to earlier positions in
the output sequence
- Q: from the layer below it
- K/V: from the output of the
encoder stack

[Vaswani, et al., NIPS 2017]
● Multi-head attention: run the scaled dot-product attention multiple times
in parallel
● Why does this work? similar to ensembling
The application of Transformer to domain/intent classification will be discussed in the LinkedIn Anabot scenario.

Forward RNN is better:
Sentence: show flights from Boston to New York today
Slots: O O O B-dept O B-arr I-arr B-date
Backward RNN is better:
Sentence: is today’s New York arrival flight schedule available to see
Slots: O B-date B-arr I-arr O O O O O O
[Mesnil et al., Interspeech 2013]

[Mesnil et al., TASLP 2015]
Consider global sequence optimization

[Liu and Lane, Interspeech 2016]
attention
alignment
attention + alignment
attention: normalized weighted sum of encoder states,
conditioned on previous decoder state. Carry additional
longer term dependencies (vs. h codes whole sentence
info already)
alignment: Do not learn alignment from training data
for slot filling task -- waste explicit attention
same encoder for 2 decoders

[Liu and Lane, Interspeech 2016]
attention-based bidirectional RNN
attention-based encoder-decoder
performed similarly; faster

Related work: Idea
[Xu and Sarikaya, ASRU 2013]: CNN features for CRF optimization framework
[Zhang and Wang, IJCAI 2016]: Similar to Attention-based RNN, 1) no attention, 2) CNN contextual layer on top of input, 3) global label assignment, 4) replace LSTM by GRU
[Hakkani-Tür et al., Interspeech 2016]: Append intent to the end of slots, Bidirectional LSTM
[Wen et al., CCF 2017]: Modeling slot filling at lower layer and intent detection at higher layer is slightly better than other variations
[Goo et al., NAACL-HLT 2018]: Attention has a higher weight, if slot attention and intent attention pay more attention to the same part of the input sequence (indicates 2 tasks have higher correlation)
[Wang et al., NAACL-HLT 2018]: Optimize loss function separately, alternatively update hidden layers of 2 tasks

Results
1. Most models achieved similar results
2. Attention-based RNN (2nd) beats the majority
3. Optimizing the loss functions separately is slightly better, partially because the weights in the joint loss function need fine-tuning
[Wang et al., NAACL-HLT 2018]

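[Yih et al., ACL 2014]
Figure: the question is split into an entity mention (Entity) and the remaining pattern (Question - Entity). One CNN similarity model matches the pattern to a Relation in the KB/DB and another matches the mention to an Entity; the similarity scores are combined into a score used to rank (Entity, Relation, Entity) triples and return the answer.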

[Weston, ICML 2016]
1) Input module: input KB and questions to memory
2) Generalization module: add new KB/question to
next available slot in memory

[Weston, ICML 2016]
3) Output module: return
supporting facts based on
iterative memory lookups

[Weston, ICML 2016]
4) Response module:
score and return objects
of the support facts OR all
words in the dictionary

[Bordes et al., 2015; Weston, ICML 2016]

[Duboue and Chu-Carroll, HLTC 2006]
lexical paraphrase
syntactical paraphrase
Problem
1. QA is sensitive to small variations in question
2. QA returns different answers for questions that are semantically equivalent
3. Lack of training data to cover all paraphrases
Solution
1. Replace user question by the paraphrase canonical form
2. Use MT to generate paraphrase candidates
3. Multiple MTs to enhance diversity
4. Feature-based paraphrase selection
Impact
1. Oracle of paraphrase selection: +35% (high reward)
2. Random paraphrase selection: -55% (high risk)
3. A feature-based selection: +0.6%

Definition: paraphrase generation is a sequence-to-sequence modeling problem.
Characteristics:
1. Monolingual parallel data is not readily available (vs. bilingual parallel data in MT). Use a pivot
language (pairs of MT systems, especially with different methods)
2. Not all of the words or phrases need to be replaced (vs. MT)
3. Hard evaluation (vs. MT)
a. MT uses BLEU: translations are scored based on their similarity to the human references
b. More difficult to provide human references (canonical forms) in paraphrase generation
[Androutsopoulos and Malakasiotis, JAIR 2010]

[Zhao et al., ACL 2009]
Utility example: sentence compression
Human evaluation
1. Adequacy: {evidently not, generally, completely} preserved meaning
2. Fluency: {incomprehensible, comprehensible, flawless} paraphrase
3. Usability: {opposite to, does not achieve, achieve} the application
Model
1. Joint likelihood of Paraphrase Tables
2. Trigram language model
3. Application dependent utility score (e.g., similarity to canonical form in “paraphrase generation”)
Analysis
1. Prefer paraphrases which are a part of the canonical form
2. Better than pure MT-based methods
3. Utility score is crucial

[Cho et al., EMNLP 2014]
● Semantically similar (most are
about duration of time in the left
figure)
● Syntactically similar (those phrases
that are syntactically similar are
clustered together)

[Devlin et al., 2018]
Need: Reduce the need for a large amount of training data
Solution: Transfer learning: pre-train unsupervised LM before fine-tuning it for a supervised task [OpenAI GPT, BERT]
Need: Parallelization
Solution: OpenAI GPT, a Transformer Decoder Stack (predict next word → LM); BERT, a Transformer Encoder Stack
Need: Context from both directions is crucial to sentence-level and (esp.) token-level tasks (e.g., slot filling)
Solution: Bidirectional representation [ELMo, BERT]

http://jalammar.github.io/illustrated-bert/
- Thousands of books
- Wikipedia
...
Transformer Encoder Stack

[Devlin et al., 2018]
IsNext
NotNext
Next Sentence Prediction does not
need human-labeled data either

Paraphrase generation
Intent detection
Answer Generation

Logic
Form
KB, DB
action generation
ASR
Text
Audio
Text
Direct Search
Covered
● Deep NNs for template-based Q&A, direct search Q&A, and reducing the need for training data
Single-turn Q&A
● Technology is relatively mature but still evolving very fast
● Lays the foundation for complex Q&A infrastructure

Where can I have
Japanese food
in the downtown?
Natural Language
Understanding
(NLU)
Dialogue State
Tracking
(DST)
Action Generation
(Dialogue Policy)
Natural Language
Generation
(NLG)
KB, DB, Index
San Jose
downtown or
Honolulu
downtown?

Much more difficult than evaluating a single-turn system
● In a single-turn system, it is easier to label the correct answer to a question
○ Once we have labels, we can do offline evaluation
● In a multi-turn system, it is unclear what is “the correct response”
○ There can be many “successful paths” to achieve the goal
■ Example: Which action is better when the question is ambiguous?
● Ask for clarification
● Answer it based on the best guess and address misunderstanding later
○ It is difficult to label sufficiently many successful paths for offline evaluation

Evaluation by hired human evaluators
● (also used for single-turn systems)
● Expensive
● Difficult to cover all possible
scenarios
Experiments with end users
● Sometimes difficult to assess
whether a user is satisfied (and need
to predict user satisfaction)
● Only available after we launch the
product
Treating it as a ranking problem
● For each turn, use the model to rank
a set of predetermined possible
system utterances and compute
precision
● Far from a realistic setting
Evaluation based on a user simulator
● Limited by the capability of the
simulator
● However, can be used to provide
unlimited training and test data

1. Define the problem space and collect annotated conversations
2. Build a user simulator
● Use annotated conversations to fit the parameters of the simulator
● Collect more annotated conversations based on simulator + crowdsourcing
3. Train a model using supervised learning on annotated conversations
4. Improve the model using reinforcement learning (RL) based on the simulator
5. Test the model with friends and/or hired human evaluators and apply RL
6. Test the model with end users and apply RL

User Goals
● Request & Constraints
● Example: Make a reservation at a
Japanese restaurant in San Jose,
… and let me know the address
Intents / Dialogue Acts
● inform
● request
● confirm_question
● confirm_answer
● greeting
● closing
● multiple_choice
● thanks
● welcome
● deny
● not_sure
Slot & Values
● Cuisine
Japanese
Chinese
…
● Rating
5
4
…
● City, State
San Jose, CA
New York, NY
...
request: address, reservation
constraints:
cuisine = “Japanese”,
city = “San Jose”, state = “CA”,
rating = “5”,
date = “today”, time = “7pm”,
number_of_people = “4”
Example: https://github.com/xiul-msr/e2e_dialog_challenge

Role | Utterance | Annotation (logical form, semantic frame)
User | Hello, I was wondering if you can book a restaurant for me? Pizza would be good. | greeting(greeting=hello), request(reservation), inform(food=pizza)
Agent | Sure! How many people are in your party? | request(number_of_people)
User | Please book a table for 4 | inform(number_of_people=4)
Agent | Great! What city are you dining in? | request(city)
User | Portland | inform(city=Portland)
Agent | Ciao Pizza or Neapolitan Pizzeria? | multiple_choice(restaurant_name=...)
User | What is the price range of Ciao Pizza | inform(restaurant_name=Ciao Pizza), request(pricing)

User Goal
● Request R_t: e.g., name, address, phone
● Constraint C_t: e.g., { type=bar, drinks=beer, area=central }
● R_t and C_t can change over time t
Agenda A_t
● A stack of user actions to be performed
● Generated by a set of probabilistic rules
● Pop to perform a user action
● Push to add future actions in response to the agent’s actions

Probabilistic user model
● Take n user actions
○ Pr(#user actions at this turn | user state)
○ Pop n actions from the user’s agenda
● Receive agent actions
● Update the user’s goal
○ Pr(add constraint S=V | user state, agent actions)
○ Pr(satisfy request X | user state, agent actions)
● Update the user’s agenda
○ Pr(push user action A to the agenda | user state, agent actions)
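The agenda mechanics above (a stack that is popped to act and pushed to in response to the agent) can be sketched as follows. This is a toy illustration under stated assumptions: the class name, the action tuples, the 0.7 probability of taking a single action per turn, and the example address are all hypothetical, and a real simulator would fit such probabilities to annotated conversations.

```python
import random

class AgendaUser:
    """Toy agenda-based user simulator; probabilities are illustrative assumptions."""

    def __init__(self, constraints, requests, seed=0):
        self.rng = random.Random(seed)
        self.constraints = dict(constraints)          # C_t: slot -> value
        self.requests = {r: None for r in requests}   # R_t: None means unanswered
        # The agenda is a stack (top = end of the list):
        # inform all constraints first, then ask the requests.
        self.agenda = [("request", r) for r in reversed(list(requests))]
        self.agenda += [("inform", s, v)
                        for s, v in reversed(list(self.constraints.items()))]

    def take_turn(self):
        """Sample n from Pr(#user actions | user state), then pop n actions."""
        if not self.agenda:
            return [("closing",)]
        n = 1 if self.rng.random() < 0.7 else 2       # assumed distribution
        n = min(n, len(self.agenda))
        return [self.agenda.pop() for _ in range(n)]

    def receive(self, agent_action):
        """Update the goal and push follow-up actions in response to the agent."""
        kind = agent_action[0]
        if kind == "inform":                          # agent satisfies a request
            _, slot, value = agent_action
            if slot in self.requests:
                self.requests[slot] = value
                self.agenda = [a for a in self.agenda if a != ("request", slot)]
        elif kind == "request":                       # agent asks for a slot value
            _, slot = agent_action
            if slot in self.constraints:
                item = ("inform", slot, self.constraints[slot])
                self.agenda = [a for a in self.agenda if a != item]
                self.agenda.append(item)              # answer on the next turn

    def done(self):
        return not self.agenda and all(v is not None for v in self.requests.values())

user = AgendaUser({"cuisine": "Japanese", "city": "San Jose"}, ["address"], seed=0)
user.receive(("request", "city"))                     # agent: request(city)
first_actions = user.take_turn()
user.receive(("inform", "address", "123 Main St"))    # hypothetical agent answer
while user.agenda:
    user.take_turn()
```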

Rule-Based
Agent
User
Simulator
Simulated
Conversations
Contextual
Paraphrasing
Crowdsourcing Task #1
Make conversation more
natural with coreferences,
linguistic variations and
shortened sentences
(because of the context)
Validation
Crowdsourcing Task #2
Verify the created paraphrases
have the same meaning by
consensus of n workers
Generate both
utterances and
annotations
Annotated
Conversation
Paraphrasing and validation tasks are
much easier than annotation tasks

Turn 1 Turn 2 Turn 3
Feed Forward
Neural Net
After we collect more annotated data, we can improve the user simulator with more advanced models
Example: Sequence-to-sequence models learned from annotated conversations
v_t: Feature vector of the user state and agent’s action at turn t
User utterance of turn 4 =

Where can I have
Japanese food
in the downtown?
Natural Language
Understanding
(NLU)
Dialogue State
Tracking
(DST)
Action Generation
(Dialogue Policy)
Natural Language
Generation
(NLG)
KB, DB, Index
San Jose
downtown or
London downtown?

State-of-the-art: A neural network
model combining NLU & DST
Input: Previous state & conversation
State: Example - Belief state of
the user’s goal
Request:
Pr(request: reservation) = 0.2
Pr(request: phone) = 0.1
….
Constraints:
Pr(price=cheap) = 0.1
Pr(city=Honolulu) = 0.5
….
Agent: What kind of food at what price?
request(food, price)
User: I just want something cheap.
Can you book a table for me?
Output: New state
Pr(request: reservation) = 0.7
Pr(price=cheap) = 0.9

With annotated data, this is a
supervised learning problem
Input: Previous state & conversation
Agent: What kind of food at what price?
User: I want something cheap.
What are the available food types?
Output:
Request:
….
Constraints:
….
Label
1
0
Label
1
1

Predict
Pr(price=cheap)
For each (slot, value)
CNN

slot-specific slot-specific
Utterance Utterance
Representation

For each candidate
slot=value Predict
Pr(price_range=cheap | inform)
using global-locally
self-attentive encoders

Input: State
Request:
….
Constraints:
….
Output: Action
confirm(city=Honolulu)
Methods:
- Rules
- Supervised Learning
model(state) => intent(slot=value, ...)
- Reinforcement Learning

Policy: 𝜋(s_t) → a_t
Training Data:
State | Correct Action
s_0 | greeting(hello)
s_1 | request(city)
s_2 | confirm(city=...)
... | ...
Example state:
s_t: Pr(request: reserv..) = 0.7
...
Or, an embedding vector
Model (figure): feature vector → one of greeting(hello), request(food), confirm(city=...), ...

Action template (summary action)
Without templates, 𝜋(s_t) chooses among fully specified actions:
confirm(city=San Jose)
confirm(city=London)
confirm(food=Chinese)
confirm(food=Korean)
...
With action templates, 𝜋(s_t) chooses among templates:
confirm(city=GET_CITY)
confirm(food=GET_FOOD)
Slot values (San Jose, London, Chinese, Korean, ...) are filled in by an API call or rules
Action mask
- Remove invalid candidate actions based on rules (domain knowledge, common sense)
- Example: Don’t recommend a restaurant if location is unknown
Don’t make a reservation if the user has not yet selected a restaurant
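A minimal sketch of an action mask and template filling, assuming hypothetical action names, state keys, and scores; the two masking rules are the ones listed above, and the scores stand in for the output of a learned policy.

```python
def valid_actions(state, candidates):
    """Drop candidate actions that violate simple hand-written rules."""
    allowed = []
    for action in candidates:
        if action == "recommend_restaurant" and "city" not in state:
            continue  # don't recommend a restaurant if the location is unknown
        if action == "make_reservation" and "restaurant_name" not in state:
            continue  # don't reserve before the user has selected a restaurant
        allowed.append(action)
    return allowed

def choose_action(state, scores):
    """Pick the highest-scoring action among those that survive the mask."""
    allowed = valid_actions(state, list(scores))
    if not allowed:
        raise ValueError("all candidate actions were masked out")
    return max(allowed, key=lambda a: scores[a])

def fill_template(template, state):
    """Instantiate a template such as confirm(city=GET_CITY) from the state."""
    for slot, value in state.items():
        template = template.replace("GET_" + slot.upper(), str(value))
    return template

# Hypothetical policy scores for three summary actions.
scores = {"make_reservation": 0.9, "recommend_restaurant": 0.6, "request(city)": 0.3}
action_without_city = choose_action({}, scores)
action_with_city = choose_action({"city": "San Jose"}, scores)
confirmation = fill_template("confirm(city=GET_CITY)", {"city": "San Jose"})
```

Masking before the argmax keeps the learned policy from ever emitting an action that the rules forbid, even when its score is the highest.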

State Tracker:
RNN
Input feature vector to RNN

Natural Language
Generation
(NLG)
Japanese restaurant
(Text-to-Speech)
Input: Action
Type = Japanese
Name = Tomi Sushi
Basic version:
- Template-based approach
[Name] is a nice [Type] restaurant
How about a [Type] restaurant like [Name]
I would recommend a [Type] restaurant like [Name]
- Retrieval-based approach
Retrieve the most relevant utterance from a large corpus
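The template-based approach can be sketched in a few lines. The `TEMPLATES` dictionary and the `generate` function are illustrative names; the three template strings are the ones listed above, and one is picked at random to add surface variety.

```python
import random

# Templates from the slide, keyed by a hypothetical action name.
TEMPLATES = {
    "Suggest_Restaurant": [
        "{Name} is a nice {Type} restaurant",
        "How about a {Type} restaurant like {Name}",
        "I would recommend a {Type} restaurant like {Name}",
    ],
}

def generate(action, slots, rng=None):
    """Pick one template for the action and fill in the slot values."""
    rng = rng or random.Random()
    template = rng.choice(TEMPLATES[action])
    return template.format(**slots)

utterance = generate(
    "Suggest_Restaurant",
    {"Type": "Japanese", "Name": "Tomi Sushi"},
    random.Random(0),
)
```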

Natural Language
Generation
(NLG)
Japanese restaurant
(Text-to-Speech)
Input: Action
Type = Japanese
Name = Tomi Sushi
Advanced version: RNN Decoder (e.g., LSTM)
● Add the action as additional input to each RNN cell
● Use multiple layers of RNN to improve performance
● Use a backward RNN to further improve performance

Figure: LSTM cell with input gate, forget gate, output gate, and an additional reading gate; inputs x_t and h_{t-1}, initial vector d_0
Likelihood that a dialog act has been used in the previous step

Agent-user interaction (figure): at each turn t = 0, 1, 2, ..., the agent receives user input x_t, updates its state to s_t, and takes action a_t, until the conversation ends.
DST: 𝛿(s_{t-1}, a_{t-1}, x_t) → s_t
Policy: 𝜋(s_t) → a_t
Example:
a_{t-1} = request(food, price)
s_{t-1}: Pr(request: reserv..) = 0.7
...
x_t = “I want something cheap. What are the available food types?”
a_t = inform(food, GET...)

Agent-user interaction with rewards (figure): at each time t = 0, 1, 2, the agent in state s_t receives user input x_t, takes action a_t, and receives reward r_t (in the example, r_0 = -1, r_1 = -1, r_2 = 20), until the conversation ends.
Policy: 𝜋(s_t) → a_t
Supervised Learning: Annotate correct actions
Reinforcement Learning: Define the reward
Example Reward
- Each step: -1
- Success: 20
- Failure: 0
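As a small worked example of the reward scheme above, the sketch below builds the per-turn rewards of a finished dialogue and the total reward from each step onward. The function names are assumptions; following the figure's "-1 -1 20" example, the final turn receives the outcome reward instead of the step cost.

```python
def episode_rewards(num_turns, success, step_reward=-1, success_reward=20, failure_reward=0):
    """Per-turn rewards: a step cost for every turn except the last, which gets the outcome reward."""
    if num_turns < 1:
        raise ValueError("an episode needs at least one turn")
    final = success_reward if success else failure_reward
    return [step_reward] * (num_turns - 1) + [final]

def returns(rewards, gamma=1.0):
    """Total (optionally discounted) reward from each step t to the end of the episode."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]

rewards = episode_rewards(3, success=True)
totals = returns(rewards)
```

With the step cost, shorter successful dialogues earn a higher total reward than longer ones, which pushes the agent toward efficient conversations.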

Basic methods
- Q-learning: Deep Q-Network
- Policy gradient: REINFORCE (Monte-Carlo gradient ascent)
Advanced methods
- Actor-Critic policy gradient method with experience replay
- Deep Dyna-Q & BBQ-Networks
- Multi-level reinforcement learning

Q(s, a) → Value
Q^𝜋(s, a) = E[ total reward | we start from state s, take action a, and then follow 𝜋 ]
Policy: 𝜋(s_t) → a_t
Q*(s, a) = Q^𝜋(s, a) when 𝜋 is the optimal policy
Optimal policy: 𝜋*(s_t) = argmax_a Q*(s_t, a)
Q*(s_t, a_t) = E[ r_t + 𝛾 max_a Q*(s_{t+1}, a) | s_t, a_t ]

Goal: Learn Q*(s_t, a_t) = E[ r_t + 𝛾 max_a Q*(s_{t+1}, a) | s_t, a_t ]
Optimal policy: 𝜋*(s_t) = argmax_a Q*(s_t, a)
Neural Net: Q(s, a | w) → Value
w = model parameters (weights)
Q-Learning: Find w that minimizes E[ ( Q(s, a | w) - Q*(s, a) )^2 ]
SGD: Use a single sample (s_t, a_t, r_t, s_{t+1}) to estimate Q*(s_t, a_t)
Compute the gradient using the sample and do gradient descent

Q*(s_t, a_t) = E[ r_t + 𝛾 max_a Q*(s_{t+1}, a) | s_t, a_t ]
Deep Neural Net: Q(s, a | w), w = argmin_w E[ ( Q(s, a | w) - Q*(s, a) )^2 ]
While( current state s_t )
  Take action a_t = argmax_a Q(s_t, a | w_t) with probability (1 - 𝜀); random, otherwise. (𝜀-greedy)
  Receive (r_t, s_{t+1}) and save (s_t, a_t, r_t, s_{t+1}) in replay memory D
  Sample a mini-batch B from buffer D
  Update w_t based on B
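The loop above can be sketched on a toy problem. To stay dependency-free, this example swaps the deep network for a lookup table, so it is tabular Q-learning with experience replay rather than a Deep Q-Network. The chain environment, its rewards, and all hyperparameters are assumptions chosen for illustration.

```python
import random

N_STATES, GOAL, ACTIONS = 4, 3, (0, 1)   # toy chain: action 1 moves right, action 0 moves left

def step(state, action):
    """Hypothetical environment: reward 20 on reaching the goal, -1 per step otherwise."""
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    done = nxt == GOAL
    return (20.0 if done else -1.0), nxt, done

def train(episodes=500, gamma=0.9, lr=0.5, epsilon=0.3, batch_size=8, capacity=1000, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]   # table standing in for Q(s, a | w)
    replay = []                                             # replay memory D
    for _ in range(episodes):
        state = 0
        for _ in range(50):                                 # cap the episode length
            if rng.random() < epsilon:                      # explore: random action
                action = rng.choice(ACTIONS)
            else:                                           # exploit: greedy action
                action = max(ACTIONS, key=lambda a: q[state][a])
            reward, nxt, done = step(state, action)
            replay.append((state, action, reward, nxt, done))
            if len(replay) > capacity:
                replay.pop(0)                               # forget the oldest experience
            # Sample a mini-batch B from D and move Q toward the bootstrapped target.
            for s, a, r, s2, d in rng.sample(replay, min(batch_size, len(replay))):
                target = r if d else r + gamma * max(q[s2])
                q[s][a] += lr * (target - q[s][a])
            if done:
                break
            state = nxt
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(GOAL)]
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest path to the goal in this toy chain.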

Policy 𝜋(s, a | 𝜃) = Pr(take action a | state s, model parameter 𝜃)
Reward

(No reward discounting over time)

Recursion
(No reward discounting over time)

Supervised (imitation) Learning
Reinforcement Learning
correct actions
agent’s actions value of action at

While( run policy 𝜋(⋅ | 𝜃) to generate s_0, a_0, …, s_T, a_T )
  Compute v_t = total reward starting from step t (based on this sample run)
  Update 𝜃 based on the sample gradient
SGD: Do a sample run of the policy s_0, a_0, …, s_T, a_T
Use this run to compute the sample Q value
Compute the gradient using this sample and do gradient ascent
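A minimal REINFORCE sketch, assuming a tabular softmax policy and a made-up two-turn dialogue in which action 1 continues (and finally succeeds) while action 0 ends the dialogue in failure. The environment, rewards, learning rate, and names are illustrative assumptions; the update is theta += lr * v_t * grad log pi(a_t | s_t) with no reward discounting.

```python
import math
import random

def softmax(prefs):
    """Numerically stable softmax over action preferences."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def run_episode(theta, rng):
    """Sample one run of the policy; returns a list of (state, action, reward)."""
    trajectory, state = [], 0
    while True:
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1
        if action == 0:
            trajectory.append((state, action, 0.0))     # failure: reward 0, dialogue ends
            return trajectory
        if state == 1:
            trajectory.append((state, action, 20.0))    # success on the final turn
            return trajectory
        trajectory.append((state, action, -1.0))        # per-step cost
        state += 1

def reinforce(episodes=300, lr=0.05, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0], [0.0, 0.0]]                    # action preferences per state
    for _ in range(episodes):
        trajectory = run_episode(theta, rng)
        v = 0.0
        for state, action, reward in reversed(trajectory):
            v += reward                                 # v_t: total reward from step t onward
            probs = softmax(theta[state])
            for a in range(2):
                # Gradient of log softmax w.r.t. theta[state][a]: 1[a == action] - pi(a | s)
                grad = (1.0 if a == action else 0.0) - probs[a]
                theta[state][a] += lr * v * grad        # gradient ascent
    return theta

theta = reinforce()
policy = [softmax(row) for row in theta]
```

Because each state appears at most once per episode here, updating theta while walking the trajectory backwards does not change the probabilities used for the remaining steps.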

Collect data D = { (s_0, a_0, p_0, v_0, …, s_T, a_T, p_T, v_T) }
- p_t = Pr(take action a_t at step t), recorded during data collection
While( sample (s_0, a_0, p_0, v_0, …, s_T, a_T, p_T, v_T) from D )
Can we learn from past data?
- Importance sampling: weight each past example by Pr(past example | new policy) / Pr(past example | old policy), which is capped to prevent high variance
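The capped importance weight can be illustrated with a one-line ratio. The function name and the cap value of 10 are assumptions for this sketch; the old probability is the p_t recorded during data collection.

```python
def importance_weight(new_prob, old_prob, cap=10.0):
    """Pr(action | new policy) / Pr(action | old policy), truncated at `cap`."""
    if old_prob <= 0.0:
        raise ValueError("old_prob must be positive: the old policy must have taken this action")
    return min(new_prob / old_prob, cap)

# The new policy is twice as likely to take the logged action: weight 2.
moderate = importance_weight(0.5, 0.25)
# The logged action was rare under the old policy: the raw ratio of 90 is capped.
capped = importance_weight(0.9, 0.01)
```

Capping trades a small bias for a large reduction in variance, since a handful of rare past examples can otherwise dominate the gradient estimate.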

While( run policy 𝜋(⋅ | 𝜃) to generate s_0, a_0, …, s_T, a_T )
  Save (s_0, a_0, p_0, v_0, …, s_T, a_T, p_T, v_T) in replay memory D
  Train w_1 and w_2 using experience replay (with importance sampling weighting)
Problem: v_t has high variance
- Predict v_t by a model Q(s_t, a_t | w)
- Q(s_t, a_t | w) also has high variance
- Replace Q(s_t, a_t | w) by A(s_t, a_t | w) = Q(s_t, a_t | w_1) - V(s_t | w_2)

Train a “world model” to predict rewards and user actions
M(s, a | w_M) → (reward, user action, terminate or not)
Use the world model to generate simulated data
Apply Q-learning to simulated data
=> Planning
(The agent thinks about and “plans” for hypothetical scenarios)

For each step:
- Serve user based on Q(s, a | w_Q) using 𝜺-greedy
- Save experience in D
- Update w_Q by Q-learning based on a sample from D
- Update w_M by learning from a sample from D
- Update w_Q by Q-learning based on simulation (a.k.a. planning) using M(s, a | w_M) and Q(s, a | w_Q)
Q(s, a | w_Q) → value
M(s, a | w_M) → (reward, user action, terminate or not)

Model the uncertainty of Deep Q-Network: Q(s, a | w)
Bayes-by-Backprop [Blundell et al., 2015]
Assume prior w ~ N( 𝛍_0, diag( 𝛔_0^2 ) )
D = { (s_i, a_i, v_i) }, where v_i is the observed Q(s_i, a_i)
Approximate p(w | D) by q(w | 𝛍, 𝛒) = N( 𝛍, diag( 𝛔^2 ) ) s.t. the KL divergence between q(w | 𝛍, 𝛒) and p(w | D) is minimized,
where 𝛔 = log(1 + exp( 𝛒 ))
Thompson sampling
Draw w_t ~ q(w | 𝛍_t, 𝛒_t) and take action argmax_a Q(s_t, a | w_t)
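The Thompson sampling step can be sketched as: draw one weight vector from the approximate posterior using sigma = log(1 + exp(rho)), then act greedily under the sampled Q. To keep the example small, Q is assumed to be linear in hypothetical per-action features rather than a deep network, and all names and numbers are illustrative.

```python
import math
import random

def softplus(rho):
    """sigma = log(1 + exp(rho)), computed stably; always positive."""
    return max(rho, 0.0) + math.log1p(math.exp(-abs(rho)))

def sample_weights(mu, rho, rng):
    """Draw w ~ N(mu, diag(sigma^2)) via w = mu + sigma * eta with eta ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]

def thompson_action(features_by_action, mu, rho, rng):
    """Sample one weight vector, then take argmax_a Q(s, a | w) under it."""
    w = sample_weights(mu, rho, rng)

    def q_value(action):
        return sum(wi * xi for wi, xi in zip(w, features_by_action[action]))

    return max(features_by_action, key=q_value)

rng = random.Random(0)
# Hypothetical one-hot features for two candidate actions in the current state.
features = {"confirm": [1.0, 0.0], "request": [0.0, 1.0]}
# Very negative rho gives near-zero sigma, so sampling is almost deterministic.
confident_choice = thompson_action(features, mu=[1.0, -1.0], rho=[-50.0, -50.0], rng=rng)
```

With larger rho the sampled weights vary more from draw to draw, so actions whose value is uncertain get tried more often; that is how the posterior uncertainty drives exploration.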

Draw 𝜼 ~ N(0, 1) for L times Take a minibatch of size M from D
Compute the gradient and perform one step of SGD

Action hierarchy
- Action group → individual action
Learn K+1 policies
- Master policy: 𝜋(state) → action group g
- K sub-policies: For each group g, 𝜋_g(state) → action a
Deep Q-Network
- Master policy: Q(s, g | w)
- Sub-policy for group g: Q(s, a | w_g)
- Share some parameters across different groups

E.g., recommend a
restaurant
E.g., make a
reservation

Multi-turn question answering is an active research area
How to obtain training data is a key challenge
- Simulation + crowdsourcing is a promising direction
- Continuously improve the simulator to generate better data
Reinforcement learning is promising
- It is important to pretrain an RL model on reasonable sample data
- Otherwise, it will rarely succeed and will learn to end conversations early
- Interesting directions: Reduce variance in training, model uncertainty better, leverage hierarchical structure

Our Use Cases of Goal-Oriented Question Answering

Supervised ML Problems
Feature X Label Y
Model
QA Problems


“Who founded LinkedIn?”
Entity Recognition: “LinkedIn”
Intent Detection: “Founder”
Slot Filling &
Rule/Template-based
Action Generation
Output: “Reid Hoffman, Allen Blue, ...”
Model
DB Query: (_, Founder, LinkedIn)

A Set of Seed Questions & Answers
Define Problem & Scope
“Who founded LinkedIn?” “What jobs do you have?” …...
“Who is the founder of LinkedIn?”
“Founder of LinkedIn?”
“LinkedIn founders”
…...
Expansion of similar questions that lead to
same answer
Sometimes challenging!
● Out-of-scope questions collected from public databases with similar domains
● Transfer learning techniques (e.g. pre-trained embeddings from BERT) help
reduce the data volume requirement

● Entity Recognition:
● Intent detection
● Slot Filling & Rule/Template-based Action generation
“Who founded LinkedIn?” Entity Annotation: “LinkedIn” => Company
“Who founded LinkedIn?” Intent Annotation: “Founder”
“Who founded LinkedIn?” Result Annotation: DB Query: (_, Founder, LinkedIn)

How many software engineers know Java in United States?
Title Skill Country
● Semi-CRF Model (Sarawagi and Cohen
2005)
● Deep Neural Network (Lample et al.
2016)

● Define a set of intents
○ Depend heavily on product design!
● Multi-class classification model
○ Problem: Question => Intent
○ Model: Logistic regression, CNN, RNN, LSTM, …
○ Features: Bag of words, word embeddings, tagged entities, ...
● An out-of-scope intent is a must!
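A minimal sketch of the Question => Intent problem with an explicit out-of-scope fallback. To stay dependency-free it swaps the trained classifiers listed above for a nearest-neighbor lookup over bag-of-words vectors with cosine similarity. The training examples, the intent names other than “Founder”, and the 0.5 threshold are all assumptions for illustration.

```python
import math
import re
from collections import Counter

# Tiny labeled set; intents and examples are illustrative assumptions.
TRAINING = [
    ("who founded linkedin", "Founder"),
    ("founder of linkedin", "Founder"),
    ("good jobs at google", "Find_Job"),
    ("what jobs do you have", "Find_Job"),
    ("change my password", "Change_Password"),
]


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def detect_intent(question, threshold=0.5):
    """Return the closest example's intent, or Out_Of_Scope if nothing is close."""
    q = bag_of_words(question)
    score, intent = max((cosine(q, bag_of_words(x)), y) for x, y in TRAINING)
    return intent if score >= threshold else "Out_Of_Scope"


print(detect_intent("Who founded LinkedIn?"))
print(detect_intent("How is the weather?"))
```

The threshold plays the role of the out-of-scope class here; a trained multi-class model would instead learn that class from out-of-scope examples.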

● “Good jobs at Google?”
○ Call LinkedIn job recommendation engine with company == “google”
● “How many members are in active community?”
○ Convert to SQL query with slot filling and query the database
● “Change my password”
○ Show the article section that contains “how to change the password” step by step
● ...

* Special thanks to Weiwei Guo for providing the material

• Wide Component (keyword matches)

Model                                   Lift % in Precision@1 vs Control (IR-based)
Embedding Similarity (GloVe)            -35%
Embedding Similarity (LinkedIn data)    -8%
Deep Model (CNN)                        +55%
Deep + Wide Model                       +57%

Metric                                  Lift % vs Control (IR-based)
Happy Path Rate                         +14%
Undesired Path Rate                     -18%
Search Sessions w/ Clicks on Results    +7%
● Happy Path: Users who clicked only one search result and then left the help center without creating a case
● Undesired Path: Users who searched and went directly to “contact us” without reading any articles from the search results
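The two path definitions can be written as predicates over a search session. The sketch below is one possible reading: the `SearchSession` fields are hypothetical, and the lift formula (relative change against control) is an assumption, since the slide reports lift percentages without stating how they were computed.

```python
from dataclasses import dataclass


@dataclass
class SearchSession:
    # Hypothetical log schema, invented for this example.
    result_clicks: int        # number of search results the user clicked
    created_case: bool        # user filed a support case
    went_to_contact_us: bool  # user opened "contact us"


def is_happy_path(s):
    """Clicked exactly one search result and left without creating a case."""
    return s.result_clicks == 1 and not s.created_case


def is_undesired_path(s):
    """Searched, read no results, and went to "contact us"."""
    return s.result_clicks == 0 and s.went_to_contact_us


def path_rates(sessions):
    """Return (happy path rate, undesired path rate) over non-empty sessions."""
    n = len(sessions)
    happy = sum(is_happy_path(s) for s in sessions) / n
    undesired = sum(is_undesired_path(s) for s in sessions) / n
    return happy, undesired


def lift_pct(treatment, control):
    """Relative change of treatment vs control, in percent (assumed definition)."""
    return 100.0 * (treatment - control) / control


sessions = [
    SearchSession(1, False, False),  # happy path
    SearchSession(0, False, True),   # undesired path
    SearchSession(2, False, False),  # neither
    SearchSession(1, True, True),    # one click, but a case was created
]
print(path_rates(sessions))
```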

http://tanerakcok.com/data-driven-product-management-taner-akcok/

Intent Detection
• Intent: Definition, Query, Out-of-scope (OOS)
• Definition: What is “contributor”?
• Query: How many contributors 2 days ago?
• OOS: How is the weather?
Question2Definition
• Find the right definition for the definition question
• In: What is contributor?
• Out: Contributor is a user who initiates or continues conversation at LinkedIn
Question2Query
• Create a database query from a given question
• In: How many contributor yesterday?
• Out: SELECT COUNT(DISTINCT mem_id) FROM contributors WHERE date = '08-18-2018';

[Architecture diagram. Components: UI, Intent Detection, Intent-specific NLU (Question2Definition, Question2Query), Knowledge Base, Presto, UI]

“How many contributors messaged last week?”
SELECT COUNT(DISTINCT mem_id)
FROM contributors
WHERE contribution_type = 'message'
AND activity_time > DATE '2018-08-13'

Challenges
• Expensive to get large training data
• Need to onboard new metrics
• SQL is hard to canonicalize
• “Almost success” does not count
Our Approaches
• Leverage models trained on public data sets
• Leverage metadata available in the company
• Formulate slot filling for SQL
• Show our interpretation and allow users to make minor changes.

Initial User Study
• Survey: “What questions would you ask Ana?”
• 60 seed questions from 20 domain experts
• Discovered “Definition” intent from the seed Qs
• Selected 20 target metrics (tables and columns)
• Defined slot filling problem for SQL query generation
Training Data Generation
• 60 seed questions -> 3k (question, answer) pairs
• Initial annotation by data scientists
• Annotators generate paraphrases
• Multiple reviews to get consistent annotation


Slots:
○ Metric
○ Time
○ Filter
○ Breakdown
• in English
Example: “How many contributors messaged last week?”
Metric: unique_contributors
Filter: contribution_type == ‘message’
Time: Last 7 days
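Once the slots are filled, producing the query is a template-rendering step. The sketch below shows one way this could look; it is not the production system. `METRICS`, `slots_to_sql`, and the slot keys (`metric`, `filter`, `last_n_days`, `breakdown`) are hypothetical, and the metadata entry is an assumption based on the example query in these slides.

```python
from datetime import date, timedelta

# Hypothetical metric metadata: metric name -> (aggregate expression, table).
METRICS = {
    "unique_contributors": ("COUNT(DISTINCT mem_id)", "contributors"),
}


def slots_to_sql(slots, today):
    """Render a SQL string from filled slots (Metric, Filter, Time, Breakdown).

    Values are interpolated directly for readability; a real system would
    validate them against known columns and values before building a query.
    """
    select_expr, table = METRICS[slots["metric"]]
    where = []
    for column, value in slots.get("filter", {}).items():
        where.append(f"{column} = '{value}'")
    if "last_n_days" in slots:
        start = today - timedelta(days=slots["last_n_days"])
        where.append(f"activity_time > DATE '{start.isoformat()}'")
    breakdown = slots.get("breakdown")
    columns = f"{breakdown}, {select_expr}" if breakdown else select_expr
    sql = f"SELECT {columns} FROM {table}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    if breakdown:
        sql += f" GROUP BY {breakdown}"
    return sql


slots = {
    "metric": "unique_contributors",
    "filter": {"contribution_type": "message"},
    "last_n_days": 7,
}
print(slots_to_sql(slots, today=date(2018, 8, 20)))
# SELECT COUNT(DISTINCT mem_id) FROM contributors WHERE contribution_type = 'message' AND activity_time > DATE '2018-08-13'
```

Predicting a handful of slots instead of free-form SQL keeps the output space small and gives the user a compact interpretation to review and correct.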

• Transfer learning:

Model                                                          Improvement in accuracy over baseline
Unsupervised (Finding Nearest Neighbor in Training Data)       --
Logistic Regression model with All features                    +21%
Prod Model: Logistic Regression with Pairwise model training   +26%
Prod Model with Top 3 Accuracy                                 +44%

Input:                 [CLS]   How   many   users   contributed   ?   [SEP]   number   of   contributors
Token Embedding:       E[CLS]  EHow  Emany  Eusers  Econtributed  E?  E[SEP]  Enumber  Eof  Econtributors
Positional Embedding:  E0      E1    E2     E3      E4            E5  E6      E7       E8   E9
Segment Embedding:     EA      EA    EA     EA      EA            EA  EA      EB       EB   EB
Transformers
Final Embedding:       E[CLS]  EHow  Emany  Eusers  Econtributed  E?  E[SEP]  Enumber  Eof  Econtributors
Label
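The input layout in this diagram can be reproduced with a few lines of code. The sketch below builds the token sequence, position ids, and segment ids for a question paired with a candidate text, following the slide exactly. `build_pair_input` is a hypothetical helper, and tokens are passed in pre-split. Standard BERT implementations use WordPiece tokenization and also append a final [SEP] after the second segment, which the slide does not show.

```python
def build_pair_input(question_tokens, candidate_tokens):
    """Return (tokens, position_ids, segment_ids) for a question/candidate pair.

    Layout follows the slide: [CLS] question [SEP] candidate.
    Segment A covers [CLS] through [SEP]; segment B covers the candidate.
    """
    first_segment = ["[CLS]"] + question_tokens + ["[SEP]"]
    tokens = first_segment + candidate_tokens
    position_ids = list(range(len(tokens)))
    segment_ids = ["A"] * len(first_segment) + ["B"] * len(candidate_tokens)
    return tokens, position_ids, segment_ids


tokens, positions, segments = build_pair_input(
    ["How", "many", "users", "contributed", "?"],
    ["number", "of", "contributors"],
)
for token, position, segment in zip(tokens, positions, segments):
    print(f"{token:12s} position={position} segment={segment}")
```

In BERT, the representation fed to the Transformer layers for each token is the sum of its token, positional, and segment embeddings.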

Evaluation Metric    Relative Improvement over Production Model
Top1 Accuracy        +11%
Top3 Accuracy        +3%
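For reference, Top1 and Top3 accuracy can be computed from ranked candidate lists as sketched below. The function name, the candidate names other than `unique_contributors`, and the toy data are assumptions for illustration, not the evaluation code behind the numbers above.

```python
def top_k_accuracy(ranked_predictions, gold_labels, k):
    """Fraction of questions whose gold label is among the first k predictions."""
    hits = sum(
        gold in ranked[:k]
        for ranked, gold in zip(ranked_predictions, gold_labels)
    )
    return hits / len(gold_labels)


# Toy ranked outputs for four questions that share the same gold metric.
ranked = [
    ["unique_contributors", "page_views", "sessions"],
    ["page_views", "unique_contributors", "sessions"],
    ["sessions", "page_views", "unique_contributors"],
    ["page_views", "sessions", "signups"],
]
gold = ["unique_contributors"] * 4

print(top_k_accuracy(ranked, gold, k=1))
print(top_k_accuracy(ranked, gold, k=3))
```

Top3 accuracy matters when the UI shows several interpretations and lets the user pick one, so a correct answer in second or third place still counts as a success.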

[Plot: accuracy (Acc.) vs. number of training questions]


● Literature review of Question Answering Systems for
○ Basic Version: Single-Turn Question Answering
○ Advanced Version: Multi-Turn Question Answering
● Our practical lessons learned through 3 LinkedIn use cases
● An area with a lot of potential and many challenges, e.g.
○ How to handle cold start at scale?
○ How to make the model work more reliably?
○ How to have seamless, human-like interactions with humans?
○ How to make the model generic enough so that it works for
different domains with little effort?
○ …

● [Androutsopoulos and Malakasiotis, JAIR 2010] A Survey of Paraphrasing and Textual Entailment Methods
● [Berant and Liang, ACL 2014] Semantic Parsing via Paraphrasing
● [Bordes et al., 2015] Large-scale Simple Question Answering with Memory Networks
● [Buck et al., ICLR 2018] Ask The Right Questions: Active Question Reformulation With Reinforcement
Learning
● [Casanueva et al., 2018] Casanueva, Iñigo, et al. "Feudal Reinforcement Learning for Dialogue
Management in Large Domains." arXiv preprint arXiv:1803.03232 (2018).
● [Chen et al., KDD Explorations 2017] A Survey on Dialogue Systems: Recent Advances and New Frontiers
● [Chen et al., WWW 2012 – CQA'12 Workshop] Understanding User Intent in Community Question
Answering
● [Cho et al., EMNLP 2014] Learning Phrase Representations using RNN Encoder–Decoder for Statistical
Machine Translation
● [Crook & Marin, 2017] Crook, Paul, and Alex Marin. "Sequence to sequence modeling for user simulation in
dialog systems." Proceedings of the 18th Annual Conference of the International Speech Communication
Association (INTERSPEECH 2017). 2017.
● [Dauphin et al., ICLR 2014] Zero-Shot Learning for Semantic Utterance Classification
● [Deng and Yu, Interspeech 2011] Deep Convex Net: A Scalable Architecture for Speech Pattern
Classification

● [Deng et al., SLT 2012] Use Of Kernel Deep Convex Networks And End-to-end Learning For Spoken
Language Understanding
● [Duboue and Chu-Carroll, HLTC 2006] Answering the Question You Wish They Had Asked: The Impact of
Paraphrasing for Question Answering
● [Goo et al., NAACL-HLT 2018] Slot-Gated Modeling for Joint Slot Filling and Intent Prediction
● [Hakkani-Tür et al., Interspeech 2016] Multi-Domain Joint Semantic Frame Parsing using Bi-directional
RNN-LSTM
● [Hashemi et al., QRUMS 2016] Query Intent Detection using Convolutional Neural Networks
● [Hu et al., NIPS 2014] Convolutional Neural Network Architectures for Matching Natural Language
Sentences
● [Kalchbrenner et al., ACL 2014] A convolutional neural network for modelling sentences
● [Kavosh & Williams, 2016] Asadi, Kavosh, and Jason D. Williams. "Sample-efficient deep reinforcement
learning for dialog control." arXiv preprint arXiv:1612.06000 (2016).
● [Kim, EMNLP 2014] Convolutional Neural Networks for Sentence Classification
● [Kingma & Welling, 2014] Kingma, Diederik P., and Max Welling. "Stochastic gradient VB and the variational
auto-encoder." Second International Conference on Learning Representations, ICLR. 2014.
● [Kreyssig et al., 2018] Kreyssig, Florian, et al. "Neural User Simulation for Corpus-based Policy Optimisation
for Spoken Dialogue Systems." arXiv preprint arXiv:1805.06966 (2018).

● [Kwiatkowski et al., EMNLP 2013] Scaling Semantic Parsers with On-the-fly Ontology Matching
● [Lample et al. 2016] Lample, Guillaume, et al. "Neural architectures for named entity recognition." arXiv
preprint arXiv:1603.01360 (2016).
● [Lee and Dernoncourt, NAACL 2016] Sequential Short-Text Classification with Recurrent and Convolutional
Neural Networks
● [Lipton et al., 2017] Lipton, Zachary, et al. "BBQ-Networks: Efficient Exploration in Deep Reinforcement
Learning for Task-Oriented Dialogue Systems." arXiv preprint arXiv:1711.05715 (2017).
● [Liu and Lane, Interspeech 2016] Attention-Based Recurrent Neural Network Models for Joint Intent
Detection and Slot Filling
● [Mesnil et al., Interspeech 2013] Investigation of Recurrent-Neural-Network Architectures and Learning
Methods for Spoken Language Understanding
● [Mesnil et al., TASLP 2015] Using Recurrent Neural Networks for Slot Filling in Spoken Language
Understanding
● [Mnih et al., 2015] Mnih, Volodymyr, et al. "Human-level control through deep reinforcement learning."
Nature 518.7540 (2015): 529.
● [Mrksic et al., 2018] Mrkšić, Nikola, and Ivan Vulić. "Fully statistical neural belief tracking." arXiv preprint
arXiv:1805.11350 (2018).
● [Palangi et al., TASLP 2016] Deep Sentence Embedding Using Long Short-Term Memory Networks:
Analysis and Application to Information Retrieval

● [Peng et al., 2017] Peng, Baolin, et al. "Composite task-completion dialogue policy learning via hierarchical
deep reinforcement learning." arXiv preprint arXiv:1704.03084 (2017).
● [Peng et al., 2018] Peng, Baolin, et al. "Deep Dyna-Q: Integrating Planning for Task-Completion Dialogue
Policy Learning." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics
(Volume 1: Long Papers). Vol. 1. 2018.
● [Rajpurkar et al. 2016] Rajpurkar, Pranav, et al. "Squad: 100,000+ questions for machine comprehension of
text." arXiv preprint arXiv:1606.05250 (2016).
● [Ravuri and Stolcke, Interspeech 2015] Recurrent Neural Network and LSTM Models for Lexical Utterance
Classification
● [Reddy et al., TACL 2014] Large-scale Semantic Parsing without Question-Answer Pairs
● [Sarawagi and Cohen 2005] Sarawagi, Sunita, and William W. Cohen. "Semi-markov conditional random
fields for information extraction." Advances in neural information processing systems. 2005.
● [Sarikaya et al., ICASSP 2011] Deep Belief Nets For Natural Language Call–routing
● [Schatzmann & Young, 2009] Schatzmann, Jost, and Steve Young. "The hidden agenda user simulation
model." IEEE transactions on audio, speech, and language processing 17.4 (2009): 733-747.
● [Serdyuk et al., 2018] Towards End-to-end Spoken Language Understanding
● [Shah et al., 2018] Shah, Pararth, et al. "Building a Conversational Agent Overnight with Dialogue
Self-Play." arXiv preprint arXiv:1801.04871 (2018).

● [Shen et al., CIKM 2014] A Latent Semantic Model with Convolutional-Pooling Structure for Information
Retrieval
● [Shen et al., WWW 2014] Learning Semantic Representations Using Convolutional Neural Networks for
Web Search
● [Sutton et al., 2000] Sutton, Richard S., et al. "Policy gradient methods for reinforcement learning with
function approximation." Advances in neural information processing systems. 2000.
● [Tur et al., ICASSP 2012] Towards deeper understanding: Deep convex networks for semantic utterance
classification
● [Wang et al., ACL 2015] Building a Semantic Parser Overnight
● [Wang et al., NAACL-HLT 2018] A Bi-model based RNN Semantic Frame Parsing Model for Intent Detection
and Slot Filling
● [Weisz et al., 2018] Weisz, Gellért, et al. "Sample efficient deep reinforcement learning for dialogue systems
with large action spaces." arXiv preprint arXiv:1802.03753 (2018).
● [Wen et al., 2015] Wen, Tsung-Hsien, et al. "Semantically conditioned lstm-based natural language
generation for spoken dialogue systems." arXiv preprint arXiv:1508.01745 (2015).
● [Wen et al., CCF 2017] Jointly Modeling Intent Identification and Slot Filling with Contextual and Hierarchical
Information
● [Weston, ICML 2016] Memory Networks for Language Understanding, ICML Tutorial 2016

● [Williams et al., 1988] Williams, R. J. Toward a theory of reinforcement-learning connectionist systems.
Technical Report NU-CCS-88-3, Northeastern University, College of Computer Science.
● [Williams et al., 2017] Williams, Jason D., Kavosh Asadi, and Geoffrey Zweig. "Hybrid code networks:
practical and efficient end-to-end dialog control with supervised and reinforcement learning." arXiv preprint
arXiv:1702.03274 (2017).
● [Xiao et al., ACL 2016] Sequence-based Structured Prediction for Semantic Parsing
● [Xu and Sarikaya, ASRU 2013] Convolutional Neural Network Based Triangular CRF For Joint Intent
Detection And Slot Filling
● [Yan et al., AAAI 2017] Building Task-Oriented Dialogue Systems for Online Shopping
● [Yao et al., Interspeech 2013] Recurrent Neural Networks for Language Understanding
● [Yih et al., ACL 2014] Semantic Parsing for Single-Relation Question Answering
● [Zhang and Wang, IJCAI 2016] A Joint Model of Intent Determination and Slot Filling for Spoken Language
Understanding
● [Zhong et al., 2017] Zhong, Victor, Caiming Xiong, and Richard Socher. "Global-Locally Self-Attentive
Encoder for Dialogue State Tracking." Proceedings of the 56th Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers). Vol. 1. 2018.
● [Zhong et al., 2017b] Zhong, Victor, Caiming Xiong, and Richard Socher. “Seq2SQL: Generating Structured
Queries from Natural Language using Reinforcement Learning.” arXiv preprint arXiv:1709.00103 (2017).
