[DL Reading Group] Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

DEEP LEARNING JP
[DL Papers]
“Towards an Automatic Turing Test:
Learning to Evaluate Dialogue Responses (ACL 2017)”
Hiromi Nakagawa, Matsuo Lab
http://deeplearning.jp/

1. Paper Information
2. Introduction
3. Related Works
4. Proposed Model
5. Experiments
6. Results
7. Discussion
Agenda

• Authors
– Ryan Lowe¹, Michael Noseworthy¹, Iulian V. Serban¹, Nicolas A.-Gontier¹,
– Yoshua Bengio²,³, Joelle Pineau¹,³
1. Reasoning and Learning Lab, School of Computer Science, McGill University
2. Montreal Institute for Learning Algorithms, Université de Montréal
3. CIFAR Senior Fellow
• ACL 2017
– https://arxiv.org/abs/1708.07149
• Summary
– Word-overlap metrics such as BLEU show almost no correlation with human judgments when used to evaluate dialogue response generation
– Proposes ADEM, a method that evaluates generated responses with a model trained to predict human scores
– Verifies that it automatically outputs scores that correlate highly with human scores
• The implementation and a trained model are publicly available (https://github.com/mike-n-7/ADEM)
1. Paper Information

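• History of dialogue system development
– Building (non-task-oriented) systems that can converse with humans in a "human-like" way has been one of the major goals throughout the history of AI research [Turing, 1950; Weizenbaum, 1966]
– Recently, neural networks have spurred active research on large-scale non-task-oriented dialogue systems [Sordoni et al., 2015b; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016a; Li et al., 2015]
– In some cases, models trained end-to-end for a specific purpose have been successful
• Google’s Smart Reply system [Kannan et al., 2016], Microsoft’s Xiaoice chatbot [Markoff and Mozur, 2015]
• On the other hand, performance evaluation has always been a challenge in dialogue system development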
2. Introduction

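• Evaluating the performance of dialogue systems
– Turing test: a human judges whether the counterpart is a system or a human
• Reasonable, but it has many constraints and scales poorly because it requires human evaluation
• Prone to bias unless the evaluation setup is designed very carefully
– Alternatively, humans subjectively rate dialogue quality without going as far as "telling apart" system and human
• Either way, the time, cost, and scalability problems remain
• For specific conversation domains in particular, it is hard to recruit experts who can do the evaluation [Lowe et al., 2015]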
2. Introduction

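• Despite advances in neural network-based models, evaluation metrics remain a problem for non-task-oriented tasks
– Word-overlap metrics, including BLEU, show almost no correlation with human judgments [Liu et al., 2016]
– The problem is that they cannot account for semantic similarity between responses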
2. Introduction

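• Nevertheless, BLEU is currently used in most dialogue system evaluations
– Human evaluation is too costly
– Taken to the extreme: would you run a human evaluation for every hyperparameter setting?
• A model that automatically evaluates dialogue system quality should have a large impact on dialogue system development
– rapid prototyping & testing
– “Automatic Turing Test”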
2. Introduction

• What is a ‘good’ chatbot?
– one whose responses are scored highly on appropriateness by human evaluators.
– This should be a sufficient measure for improving current dialogue systems (which still produce broken responses)
• Collect human evaluation scores for a variety of dialogues and train an automatic dialogue evaluation model (ADEM)
– A hierarchical RNN learns to predict human scores in a semi-supervised manner
– ADEM's scores correlated highly with human judgments at both the utterance level and the system level
2. Introduction

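• word-overlap metrics
– BLEU [Papineni et al., 2002]
• Used for machine translation
– ROUGE [Lin, 2004]
• Used for summarization
– Cannot measure semantic similarity or context dependence
• They only look at how many words are shared
• Not much of a problem in machine translation (the set of reasonable translations is fairly limited)
• A critical problem in dialogue generation, where response diversity is very high [Artstein et al., 2009]
– Reported to show almost no correlation with human judgments in dialogue generation [Liu et al., 2016]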
3. Related Works
Reference: BLEU
Computes n-gram precision and applies a penalty to short outputs
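
To make the BLEU note concrete, here is a minimal single-reference sketch in Python. It is a simplification for illustration (bigram order by default, no smoothing, one reference), not the exact configuration used in the paper; the function names and example sentences are made up here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against one reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


def bleu(candidate, reference, max_n=2):
    """Geometric mean of n-gram precisions times the brevity penalty."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # no smoothing in this sketch
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical example sentences.
reference = "i am fine thank you".split()
short_copy = "i am fine".split()             # perfect precision, but penalized for brevity
valid_but_different = "doing great thanks".split()  # reasonable reply, zero word overlap
```

In this sketch `bleu(valid_but_different, reference)` is 0 even though the reply is acceptable, which is the weakness the slide points at for dialogue.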

• Work on estimating response quality in chat-oriented dialogue systems
– automatic dialogue policy evaluation metric [DeVault et al., 2011]
– semi-automatic evaluation metric for dialogue coherence (similar to BLEU and ROUGE) [Gandhe and Traum, 2016]
– a framework to predict utterance-level problematic situations using intent and sentiment factors [Xiang et al., 2014]
– train a classifier to distinguish user utterances from system-generated utterances using various dialogue features [Higashinaka et al., 2014]
3. Related Works

• Reinforcement learning with hand-crafted reward features
– ease of answering and information flow [Li et al., 2016b]
– turn-level appropriateness and conversational depth [Yu et al., 2016]
• These are hand-crafted features and capture only one aspect of a conversation
– sub-optimal performance
– It is unclear whether this is better than optimizing a retrieval-based cross-entropy or word-level maximum log-likelihood objective
• They evaluate at the conversation level, so they often cannot evaluate a single dialogue response
– Metrics that can evaluate at the response level can be incorporated into the proposed metric
3. Related Works

• Evaluation methods are better developed for task-oriented dialogue systems
– e.g., finding a restaurant
– Metrics that take a task completion signal into account (PARADISE [Walker et al., 1997], MeMo [Möller et al., 2006])
– Usable only in domains where task completion or task complexity can be measured
3. Related Works

• An Automatic Dialogue Evaluation Model (ADEM)
– captures semantic similarity beyond word overlap statistics
– exploits both the context and the reference response to calculate its score
4. Proposed Method

1. Encode the context, model response, and reference response with an RNN encoder
2. Compute the score
4. Proposed Method

• Hierarchical RNN encoder [El Hihi and Bengio, 1995; Sordoni et al., 2015a]
– utterance-level encoder
• input: word
• output: a vector at the end of each utterance
– context-level encoder
• input: utterance
• output: a vector representation of the context
– Why hierarchical? -> incorporate information from early utterances
– The RNN parameters are pre-trained (described later)
• not learned from human scores
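
The two-level structure above can be sketched roughly as follows. This is a toy illustration with a plain tanh RNN and random weights; the cell type, sizes, vocabulary, and all names here are assumptions for the example, not the pre-trained encoder from the paper.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix as a list of rows (toy initialization)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def rnn_step(w_in, w_h, x, h):
    """One plain RNN step: h' = tanh(W_in x + W_h h)."""
    return [
        math.tanh(sum(a * b for a, b in zip(w_in[j], x)) + sum(a * b for a, b in zip(w_h[j], h)))
        for j in range(len(h))
    ]


def run_rnn(w_in, w_h, inputs, hidden_size):
    """Feed a sequence of vectors and return the last hidden state."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = rnn_step(w_in, w_h, x, h)
    return h


def encode_context(context, embeddings, params, hidden_size):
    """Utterance-level RNN over words, then context-level RNN over utterance vectors."""
    utterance_vectors = [
        run_rnn(params["U_in"], params["U_h"], [embeddings[w] for w in utterance], hidden_size)
        for utterance in context
    ]
    return run_rnn(params["C_in"], params["C_h"], utterance_vectors, hidden_size)


rng = random.Random(0)
hidden_size, embed_size = 4, 3
vocab = ["hi", "how", "are", "you", "fine", "thanks"]  # hypothetical vocabulary
embeddings = {w: [rng.uniform(-1, 1) for _ in range(embed_size)] for w in vocab}
params = {
    "U_in": make_matrix(hidden_size, embed_size, rng),
    "U_h": make_matrix(hidden_size, hidden_size, rng),
    "C_in": make_matrix(hidden_size, hidden_size, rng),
    "C_h": make_matrix(hidden_size, hidden_size, rng),
}
context = [["hi"], ["how", "are", "you"], ["fine", "thanks"]]
context_vector = encode_context(context, embeddings, params, hidden_size)
```

Because the context-level RNN runs over one vector per utterance, an early utterance is only a few steps away from the final state, which is the motivation given on the slide.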
4. Proposed Method

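• $\mathrm{score}(c, r, \hat{r}) = (\mathbf{c}^\top M \hat{\mathbf{r}} + \mathbf{r}^\top N \hat{\mathbf{r}} - \alpha) / \beta$
– Parameters: $M$, $N$
• linear projection
• map $\hat{\mathbf{r}}$ into the space of $\mathbf{c}$ and $\mathbf{r}$
– Constants: $\alpha$, $\beta$
• scale the model output so that it falls in the range 1 to 5
– Outputs a high score for a response vector that is similar to the context and the reference response
– Trained to minimize the squared error between the score and the human score (with L2 regularization)
• simple -> accurate prediction & fast evaluation (cf. supp. material in original paper)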
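
As a rough sketch of the scoring step, the snippet below computes the bilinear score and a squared-error training loss. The vectors, the identity matrices, the values of alpha and beta, and the use of a squared L2 penalty on M and N are placeholder assumptions for illustration; in the actual model the vectors come from the pre-trained encoder.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def matvec(matrix, vector):
    """Matrix-vector product with the matrix stored as a list of rows."""
    return [dot(row, vector) for row in matrix]


def adem_score(c, r, r_hat, M, N, alpha, beta):
    """score = (c^T M r_hat + r^T N r_hat - alpha) / beta"""
    return (dot(c, matvec(M, r_hat)) + dot(r, matvec(N, r_hat)) - alpha) / beta


def training_loss(examples, M, N, alpha, beta, gamma):
    """Sum of squared errors against human scores plus an L2 penalty (assumed squared norm)."""
    squared_error = sum(
        (adem_score(c, r, r_hat, M, N, alpha, beta) - human) ** 2
        for c, r, r_hat, human in examples
    )
    l2_penalty = sum(w * w for matrix in (M, N) for row in matrix for w in row)
    return squared_error + gamma * l2_penalty


# Toy 2-dimensional vectors; with identity M and N the score reduces to dot products.
identity = [[1.0, 0.0], [0.0, 1.0]]
context = [1.0, 0.0]
reference = [0.0, 1.0]
good_response = [0.0, 1.0]   # points the same way as the reference
bad_response = [0.0, -1.0]   # points the opposite way

good = adem_score(context, reference, good_response, identity, identity, alpha=0.0, beta=1.0)
bad = adem_score(context, reference, bad_response, identity, identity, alpha=0.0, beta=1.0)
```

A response vector aligned with the context and reference vectors scores higher than one pointing away from them; alpha and beta only shift and rescale the result.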
4. Proposed Method

• Pre-training with VHRED
– Train the encoder as part of a neural dialogue model
• Add a third decoder RNN that takes the encoder output and predicts the next utterance
– VHRED (latent variable hierarchical recurrent encoder-decoder [Serban et al., 2016b])
• stochastic latent variable
• Generates more diverse and coherent responses than HRED
4. Proposed Method

• Pre-training with VHRED
1. The context is encoded into a vector using the hierarchical encoder
2. VHRED then samples a Gaussian variable that is used to condition the decoder
3. use the last hidden state of the context-level encoder ($c, r, \hat{r} \to \mathbf{c}, \mathbf{r}, \hat{\mathbf{r}}$)
4. Proposed Method

• Settings
– BPE (Byte Pair Encoding) [Gage, 1994; Sennrich et al., 2015]
• reduce the effective vocabulary size
– layer normalization [Ba et al., 2016] for hierarchical encoder
• better than batch normalization [Ioffe and Szegedy, 2015; Cooijmans et al., 2016]
– used several techniques to train the VHRED [Serban et al., 2016b; Bowman et al., 2016]
• drop 25% of the words in the decoder
• anneal the KL term linearly from 0 to 1 over the first 60,000 batches
– Adam [Kingma and Ba, 2014] optimizer
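
The two VHRED training tricks listed above can be sketched as below. This is only an illustration of the schedule and the masking step; replacing dropped words with an `<unk>` token and the function names are assumptions made here, not details stated on the slide.

```python
import random


def kl_weight(batch_index, anneal_batches=60000):
    """Linear KL annealing: weight goes from 0 to 1 over the first `anneal_batches` batches."""
    return min(1.0, batch_index / anneal_batches)


def drop_words(tokens, rate=0.25, unk="<unk>", rng=random):
    """Replace each decoder input token with `unk` with probability `rate` (assumed masking scheme)."""
    return [unk if rng.random() < rate else token for token in tokens]


# Hypothetical usage: the KL weight multiplies the KL term of the training objective.
rng = random.Random(0)
decoder_input = "i am fine thank you".split()
noisy_input = drop_words(decoder_input, rate=0.25, rng=rng)
weight_at_30k = kl_weight(30000)
```

Both tricks aim to keep the decoder from ignoring the latent variable early in training.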
5. Experiments

• Settings
– training ADEM
• employ a subsampling procedure based on the model response length
• ensure that ADEM does not use response length to predict the score
– humans have a tendency to give a higher rating to shorter responses
– training VHRED
• embedding size = 2,000
– after training VHRED, use PCA to reduce the dimensionality (n = 50)
– Early stopping
5. Experiments

• Data Collection
– Generate responses for the Twitter Corpus [Ritter et al., 2011] and have humans score them via crowdsourcing (Amazon Mechanical Turk)
• relevant / irrelevant responses
• coherent / incoherent responses
– Prepare four types of candidate responses to increase response variety
• a response selected by a TF-IDF retrieval-based model
• a response selected by the Dual Encoder (DE) [Lowe et al., 2015]
• a response generated by the hierarchical recurrent encoder-decoder (HRED)
• human-generated responses
– novel human response, different from a fixed corpus
5. Experiments

• Utterance-level correlations
– How well each metric correlates with human judgments at the utterance level
– ADEM achieves far higher correlation coefficients than the word-overlap metrics
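
For reference, an utterance-level correlation can be computed as in this stdlib-only sketch. Pearson and Spearman are shown as two common choices; the score lists are made-up numbers for illustration, not results from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient (assumes neither list is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical per-response scores: one human rating and two metric scores each.
human_scores = [1, 2, 3, 4, 5]
learned_metric = [1.5, 2.0, 2.8, 4.1, 4.6]   # tracks the human ordering
overlap_metric = [0.3, 0.0, 0.2, 0.0, 0.1]   # mostly unrelated to the human ordering
```

Each response contributes one (human score, metric score) pair, so the coefficient measures agreement response by response rather than on model averages.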
6. Results

– C-ADEM, R-ADEM: ADEM trained using only the context / only the reference response
– ADEM (T2V): uses a pre-trained tweet2vec model instead of the pre-trained VHRED
6. Results

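• System-level correlations
– Average score of the responses from each dialogue model (TF-IDF, DE, HRED, human)
– The horizontal axis is the human score; the vertical axis is the score from each metric (BLEU, etc.)
– ADEM is the metric that rates the weak models as weak and the ideal model (human) as good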
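
The system-level view amounts to averaging per-response scores for each dialogue model before comparing them. A small sketch, with an assumed record layout of (model name, human score, metric score) and made-up numbers:

```python
from collections import defaultdict


def system_level_scores(records):
    """Average human and metric scores per dialogue model.

    `records` is an iterable of (model_name, human_score, metric_score) tuples.
    Returns {model_name: (mean_human_score, mean_metric_score)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0, 0])
    for model, human_score, metric_score in records:
        entry = totals[model]
        entry[0] += human_score
        entry[1] += metric_score
        entry[2] += 1
    return {model: (h / n, m / n) for model, (h, m, n) in totals.items()}


# Hypothetical scored responses.
records = [
    ("TF-IDF", 1, 2.0),
    ("TF-IDF", 3, 2.0),
    ("human", 5, 4.0),
    ("human", 4, 5.0),
]
averages = system_level_scores(records)
```

Each model then becomes one point (mean human score, mean metric score), matching the axes described above.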
6. Results

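• Generalization to previously unseen models
– For practical use, the metric must correctly evaluate responses from a new model not seen in training
– Hold out one of the {TF-IDF, DE, HRED, human} models, train on the rest, and test on the held-out model's responses (leave-one-out evaluation)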
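
The leave-one-out protocol can be sketched as a simple data split. The record layout (a dict with a "model" key) and the example data are assumptions for illustration.

```python
MODELS = ["TF-IDF", "DE", "HRED", "human"]


def leave_one_out_splits(records, models=MODELS):
    """Yield (held_out_model, train_records, test_records) for each model in turn."""
    for held_out in models:
        train = [rec for rec in records if rec["model"] != held_out]
        test = [rec for rec in records if rec["model"] == held_out]
        yield held_out, train, test


# Hypothetical scored responses, two per model.
records = [
    {"model": model, "response": f"response {i} from {model}", "human_score": 3}
    for model in MODELS
    for i in range(2)
]
splits = list(leave_one_out_splits(records))
```

For each split, the evaluation model would be trained on `train` and its correlation with human scores measured on `test`, so the tested responses always come from a model it has never seen.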
6. Results

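• Generalization to previously unseen models
– Works well except when the Dual Encoder is held out
– The result when HRED is held out is surprising
• Trained only on human-written responses (from retrieval models or human-generated), it still generalizes to responses generated by a neural network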
6. Results

• Qualitative Analysis
– Correctly assigns low scores to poor responses
– Also rates good responses highly
– Arguably, humans should have rated the 4th response of the 2nd context more highly
6. Results

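• Qualitative Analysis
– In some cases it assigned a low score even when humans rated the response highly
– Because it is trained with squared error, it tends to output average scores (and rarely outputs extreme values)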
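
One way to see the pull toward average scores, under the simplifying assumption that the model outputs a single constant score $s$ for a group of inputs it cannot tell apart, with human scores $h_1, \dots, h_n$ (notation introduced here, not in the paper):

```latex
\frac{d}{ds}\sum_{i=1}^{n}(s - h_i)^2 = 2\sum_{i=1}^{n}(s - h_i) = 0
\quad\Longrightarrow\quad
s = \frac{1}{n}\sum_{i=1}^{n} h_i
```

The squared-error minimizer is the mean of the human scores, so whenever the model is uncertain, predicting something near the average is the cheapest choice.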
6. Results

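• The proposed model can be applied to datasets with a variety of purposes
– Once a pre-trained model is released, it can be used for that purpose
• domain transfer ability is future work
• A dialogue model that outputs responses humans rate highly is not the desired end goal for a chatbot
– The generic response problem (humans prefer safe, widely applicable responses [Shang et al., 2016])
– Extending ADEM so that it is not subject to this bias is future work
• Do not tolerate responses that carry little information for their length
• adversarial evaluation model [Kannan and Vinyals, 2017; Li et al., 2017]
– Distinguishes human responses from non-human ones; generic responses are easy to distinguish, so they receive low scores
• A model that can evaluate whether a dialogue system is having engaging and meaningful interactions with humans is important
– This is hard, but the proposed method should be one step toward achieving it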
7. Discussion

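• Analyzing the trained model could enable things like a qualitative assessment of human-likeness
– Differences across languages and datasets would also be interesting
• With better annotation checks and loss function design, it could probably output scores closer to human intuition
• How far does it actually generalize?
– "Human-likeness" beyond a single dataset
• Could it also be useful beyond chatbots, in other areas where human subjective evaluation matters?
Impressions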
