
[ACL2018 Reading Group Slides] Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context

29 Oct 2018

  1. Sharp Nearby, Fuzzy Far Away: How Neural Language Models Use Context
     Urvashi Khandelwal, He He, Peng Qi, Dan Jurafsky (Stanford University)
     Presenter: Hayahide Yamagishi (M2) @ ACL2018 Reading Group
  2. Introduction
     ● Compared to n-gram language models, Neural Language Models (NLMs) are said to make use of long-range context
     ● Ablation tests to check whether they actually capture long-range context
     ● Investigate how the Neural Cache Model affects the LM
     Why I read this paper
     ● I wanted insights about how context is used
     ● I was tired of "We propose a novel architecture …" papers
  3. Review of language models and the input example used here
     ● Compute the per-word conditional probability (see the formulas after this slide)
     ● Compute the Negative Log Likelihood
     ● Evaluate with Perplexity
     Input example:
     ... the company reported a loss after taxation and minority interests of NUM million irish borrowings under the short-term parts of a credit agreement </s> berlitz which is based in princeton n.j. provides language instruction and translation services through more than NUM language centers in NUM countries </s> in the past five years more sim has set a fresh target of $ NUM a share by the end of </s> reaching that goal says robert t. UNK applied 's chief financial officer than NUM NUM of its sales have been outside the u.s. </s> macmillan has owned berlitz since NUM </s> in the first six
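For reference, the three bullets above correspond to the standard language-model definitions, written here in the usual notation:

```latex
% Per-word conditional probability of a sequence w_1, ..., w_T
P(w_1, \dots, w_T) = \prod_{t=1}^{T} P(w_t \mid w_1, \dots, w_{t-1})

% Average negative log likelihood over the evaluation data
\mathrm{NLL} = -\frac{1}{T} \sum_{t=1}^{T} \log P(w_t \mid w_{<t})

% Perplexity
\mathrm{PPL} = \exp(\mathrm{NLL})
```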
  4. Experimental setup
     ● Corpus: Penn TreeBank and Wikitext-2
     ● The model is a standard NLM
       ○ Dropout is also applied along the time dimension
       ○ Three runs with different random seeds → the average is reported
     ● During training, all sentences preceding the target sentence are used
     ● Evaluation on Dev (apparently they felt uneasy about probing the characteristics of Test)
  5. How much context is used?
     ● Experiment 1: how many words can an LSTM remember?
       ○ δ-function: specifies how the test data is modified
     ● Measure the effective context size
       ○ The context length at which perplexity converges (roughly the full-context perplexity + 1%)
     ● Evaluation is the change in loss or perplexity (used for all remaining experiments)
       ○ When n words are removed, the loss is measured over the remaining (sequence length - n) words
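A minimal sketch of the truncation ablation described on this slide, assuming a hypothetical `model_nll(context, target)` that returns the model's negative log-likelihood of a target word given a (possibly truncated) context; the function names and the 300-word full context are assumptions, while the 1% convergence criterion comes from the slide.

```python
import math

def truncate_context(context, n):
    """The delta-function used here: keep only the last n context tokens."""
    return context[-n:] if n > 0 else []

def avg_loss(model_nll, examples, n):
    """Average NLL when every target word sees only its last n context tokens.

    model_nll(context, target) is a hypothetical API returning the model's
    negative log-likelihood of `target` given `context`.
    """
    losses = [model_nll(truncate_context(ctx, n), tgt) for ctx, tgt in examples]
    return sum(losses) / len(losses)

def effective_context_size(model_nll, examples, full_size=300, tol=0.01):
    """Smallest n whose perplexity is within tol (1%) of the full-context perplexity."""
    full_ppl = math.exp(avg_loss(model_nll, examples, full_size))
    for n in range(1, full_size + 1):
        if math.exp(avg_loss(model_nll, examples, n)) <= full_ppl * (1 + tol):
            return n
    return full_size
```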
  6. Result 1: context length and hyperparameters (right: PTB)
     ● The limit is around 150 words on PTB and around 250 words on Wiki
     ● Hyperparameters affect performance but have no effect on how much context is remembered
  7. Result 2: loss by word class (right: Wiki)
     ● Infrequent words (at most 800 occurrences in Train) need long-range context
     ● Function words (prepositions and articles) only need nearby words
  8. Nearby vs. long-range context
     ● LSTMs can remember roughly 200 words → does the position of the context matter?
     ● Perturb a middle portion of the context (the region is given by span = (s1, s2])
       ○ ρ is either shuffle or reverse
     ● Context length is fixed at 300 words
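A sketch of the perturbation ρ over span = (s1, s2]; here the span is indexed by distance from the target word (context[-1] being the word immediately before it), which is my reading of the setup rather than something stated on the slide.

```python
import random

def perturb_span(context, s1, s2, rho="shuffle", seed=0):
    """Apply rho to the context words that lie between s1 (exclusive) and
    s2 (inclusive) positions before the target word.

    rho is either "shuffle" or "reverse"; everything outside the span is untouched.
    """
    n = len(context)
    lo, hi = max(0, n - s2), n - s1   # slice covering the (s1, s2] region
    region = list(context[lo:hi])
    if rho == "shuffle":
        random.Random(seed).shuffle(region)
    elif rho == "reverse":
        region.reverse()
    else:
        raise ValueError("rho must be 'shuffle' or 'reverse'")
    return list(context[:lo]) + region + list(context[hi:])
```

With the context fixed at 300 words, s2 = s1 + 20 perturbs a 20-word window, and s2 = n perturbs everything farther than s1 words from the target, which are the two cases compared on the next slide.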
  9. Result 3 (right: Wiki)
     a. When s2 = s1 + 20: word order matters for nearby context
     b. When s2 = n: for distant context, what matters is that the words occurred at all;
        replacing it with a different word sequence (even one with proper word order) hurts
  10. Types of words and the region of context
     ● If all that matters is that a word occurred, are function words unnecessary?
     ● fPOS(y, span): remove the words whose POS is y within the span
     ● They also ran a control that removes the same number of words at random
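A sketch of fPOS(y, span) and the random-removal control, using the same distance-from-target convention as the previous sketch; the (word, POS) pairing of the context is an assumed input format.

```python
import random

def f_pos(tagged_context, y, s1, s2):
    """fPOS(y, span): drop words whose POS tag is in the set y inside the
    (s1, s2] region before the target; return the surviving words and the
    number of words dropped (so the random control can match it)."""
    n = len(tagged_context)
    lo, hi = max(0, n - s2), n - s1
    kept, dropped = [], 0
    for i, (word, pos) in enumerate(tagged_context):
        if lo <= i < hi and pos in y:
            dropped += 1
        else:
            kept.append(word)
    return kept, dropped

def random_removal(context, k, seed=0):
    """Control condition: drop k words at random positions."""
    drop = set(random.Random(seed).sample(range(len(context)), k))
    return [w for i, w in enumerate(context) if i not in drop]
```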
  11. Result 4: removing function/content words (left: PTB, right: Wiki)
     ● Nearby content words are indispensable
     ● Once more than about 20 words away, function words have little effect
     ● Do the models retain only a rough memory of the meaning of distant words?
  12. Can LSTMs copy words without caches?
     ● Copy mechanisms are widely used in neural language generation
       ○ e.g., attention, CopyNet, caches
     ● "If the model can remember 200 words, maybe a copy mechanism is unnecessary?"
     Experiments with the following case split:
     ● Context distance: "nearby" ≤ 50 < "long-range"
     ● Where does the word that would be the correct copy target occur? → that occurrence is removed
       ○ Cnear: it occurs in the "nearby" region
       ○ Cfar: it occurs in the "long-range" region
       ○ Cnone: it occurs nowhere in the context
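A sketch of the case split on this slide: each target word is classified by the distance to its most recent occurrence in the preceding context. The 50-word boundary is taken from the slide; the rest of the bookkeeping is assumed.

```python
def copy_class(context, target, nearby=50):
    """Classify a target word as Cnear, Cfar, or Cnone depending on where it
    last occurred in the preceding context (distance counted back from the target)."""
    for dist, word in enumerate(reversed(context), start=1):
        if word == target:
            return "Cnear" if dist <= nearby else "Cfar"
    return "Cnone"
```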
  13. Result 5: removing C (left: PTB, right: Wiki)
     ● Removing Cfar does not hurt much → the model may learn only a rough meaning
     ● Cnear must not be removed → the model seems able to copy nearby words
     ● Removing long-range context hurts performance on Cnone
  14. Result 6: replacing instead of removing (left: PTB, right: Wiki)
     ● "Similar": words with comparable frequency and the same POS
     ● Nearby, having the same surface form is what matters
     ● Far away, even if Cfar is removed, prediction may still work through something like the distributional hypothesis
  15. How does the cache help?
     ● Neural Cache Model [Grave+, ICLR2017]
       ○ hi are the hidden states seen so far
       ○ Compute Pcache for each word and combine Plm + Pcache into the generation probability (see below)
     ● More than 300 words of context are used (the average document length)
       ○ PTB: 500 words
       ○ Wiki: 3875 words
     ● Evaluation: relative increase in the plain NLM's perplexity, with the cache model as the reference
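For reference, the Neural Cache Model of [Grave+, ICLR2017] scores every word in the recent history by the similarity between its hidden state and the current one, and mixes the result with the regular LM distribution (the slide writes this combination loosely as Plm + Pcache; the original paper uses a linear interpolation):

```latex
% Cache distribution over the last t-1 positions, with hidden states h_i and words x_i
p_{\mathrm{cache}}(w \mid h_{1..t}) \;\propto\; \sum_{i=1}^{t-1} \mathbb{1}\{x_{i+1} = w\}\, \exp\!\left(\theta\, h_t^{\top} h_i\right)

% Final generation probability: interpolation of the LM and the cache
p(w \mid h_{1..t}) = (1 - \lambda)\, p_{\mathrm{lm}}(w \mid h_t) + \lambda\, p_{\mathrm{cache}}(w \mid h_{1..t})
```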
  16. Result 7: effect of the cache (left: PTB, right: Wiki)
     ● Words that appeared in the context do seem to be copied
     ● The cache is not suited to producing words that have not appeared in the context
     ● LSTM and cache are good at different things → they appear to complement each other
  17. If a word has appeared in the context, it can be generated
  18. Summary
     ● Investigated how LSTM-based neural language models behave
     ● Findings:
       ○ LSTMs can remember about 200 words of context
       ○ Hyperparameters change performance but do not affect how much context is remembered
       ○ For nearby words, order matters; for distant words, mere presence matters
       ○ With a cache, distant words become usable
     ● "These results may be data-driven, so they need further verification"
       ○ "We tried to ensure some data diversity by using both PTB and Wiki"
  19. Impressions
     ● Easy to read, modest in tone, and carefully experimented; a very likeable paper
     ● Results may differ for languages with free word order
     ● The training setup is a bit unusual
       ○ Normally you would not train in a setting where unlimited context is available
       ○ Results would likely differ between a setting with long-range context and one with only a single sentence
     ● Is it that the model does not need to remember more than 200 words, or that it cannot?
       ○ With an average sentence length of about 20 words, 200 words is roughly 10 sentences
       ○ In principle an LSTM should be able to remember everything......
     ● I would have liked data such as how many words were actually removed