2. 概要
nImage Captioningの生成文の長さを調整する手法の提案
• 既存のモデルに対して適用可能
• 既存のモデルに適用することで同等以上の性能
nImage CaptionigモデルのLaBERTの提案
• 非自己回帰モデル
• 計算量が生成する文の長さに比例しない
• 生成文の長さの調整を適応可能
2 C. Deng et al.
Reference image captions
A pizza on a pan sitting on a table.
A close up of a pizza in a pan on a table.
A pizza sits on a plate on a dark surface.
A person sitting at a table where a pizza is sitting.
A pizza topped with different toppings is brought to a table.
Predicted image captions
Rough VLP A pizza sitting on top of a pan on a table.
Ours Lv1 A pizza that is sitting on a table.
Ours Lv2 A pizza with tomatoes and spinach on a table.
Ours Lv3 A pizza with tomatoes cheese and toppings on it sitting on a table.
Detailed Ours Lv4 A pizza sitting on top of a pan with a lot of cheese spinach and tomatoes on it.
6. LaBERT
nWord embedding
• 長さレベル埋め込みを同様に行う
• 学習時
• 文章にランダムにマスクをかけてembeddingをする
• マスクをかけた部分の単語を推論する
Length-Controllable Image Captioning 7
A . . .
N× Transformer Blocks
[MASK] in a [MASK] [EOS]
. . .
. . .
. . .
"#$
"#%
"#&
"#'
"($
"(%
"(&
"()
"(*
"(+
person kayak
Image type embedding
Location embedding
Visual embedding
Length level embedding
Position embedding
Word embedding
7. LaBERT
n推論時
• 生成文の最大の長さのマスクを入力
• 各単語を推論する
• 𝑝& 𝑠& = [𝐸𝑂𝑆] ← 𝛾0"!#"1&
𝑝& 𝑠& = [𝐸𝑂𝑆] , ∀&∈ [𝐿"2#, 𝐿3&/3]
• 系列が最後の方になるにつれて[𝐸𝑂𝑆]の確率をあげる
• 推論時の各単語のconfidence scoreを以下のように更新する
• 𝑐& ← :
max
!
𝑝& 𝑠& = 𝑠 i is masked position
,!4567
$
%! !!8!
9
otherwise
• 各単語のconfidence scoreが下位n個にマスクをかけてもう一度推論を行う
• nはステップが進む毎に減衰させる
Length-Controllable Image Captioning 9
Mask-predict-update process
Step 1 [MASK] [MASK] [MASK] [MASK] [MASK] [MASK]…..[MASK] [MASK] [MASK]
A dog is on the the on a a a a a a a a walking in the sidewalk.
A dog laying on the side of a street with a a walking a a walking in the sidewalk.
A dog laying on the side of a street with a woman walking on the sidewalk of the street.
Step T A dog sitting on the side of a street with a woman walking on the sidewalk in the background.