[DL Reading Group] The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision
DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
The Neuro-Symbolic Concept Learner: Interpreting Scenes,
Words, and Sentences From Natural Supervision
Kazuki Fujikawa, DeNA
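5. Background
• Recognizing the concepts tied to objects (attributes such as color and shape) is important
– When humans answer a complex VQA question, they treat concept information separately from logic (such as counting)
– The same may hold for machine learning models: learning and outputting concept information and logic separately could improve data efficiency and interpretability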
Published as a conference paper at ICLR 2019
[Figure 1 panels: I. Learning basic, object-based concepts. II. Learning relational concepts based on referential expressions. III. Interpret complex questions from visual cues.]
Figure 1: Humans learn visual concepts, words, and semantic parsing jointly and incrementally. I. Learning visual concepts (red vs. green) starts from looking at simple scenes, reading simple questions, and reasoning over contrastive examples (Fazly et al., 2010). II. Afterwards, we can interpret referential expressions based on the learned object-based concepts, and learn relational concepts (e.g., on the right of, the same material as). III. Finally, we can interpret complex questions from visual cues by exploiting the compositional structure.
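7. Related Work
• Related work and the positioning of this work
– End-to-End (Hudson+ 2018, Mascharka+ 2018, etc.): module separation ×; interpretability △; supervision: image, question → answer
– Program-based approach (Yi+ 2018): module separation ○; interpretability ○; supervision: image → concepts and question → program, then concepts, program → answer
– This work: module separation ○; interpretability ○; supervision: image, question → answer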
[Figure: three pipeline schematics for the question "What is the shape of the red object?", each taking Input 1 (image data) and Input 2 (question) and producing the output (answer), drawn over Figure 4 of the paper.]
– Pipeline with symbolic intermediate outputs: NN → intermediate output 1: concepts (ID 1: Green, Cube; ID 2: Red, Sphere); NN → intermediate output 2: program (Filter(Red) → Query(Shape)); NN → answer
– Pipeline with no intermediate output: a single NN maps the inputs directly to the answer
– Pipeline with vector intermediate outputs: NN → intermediate output 1: vectors (Obj:1, Obj:2; Green, Red); NN → intermediate output 2: program (Filter(Red) → Query(Shape)); NN → answer
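9. Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 1. Supervised learning of the image and concept representation spaces (the program generator is kept fixed)
[Figure: Input 1 (image data) passes through Mask R-CNN and ResNet-34 into the Visual Feature Space (Obj:1, Obj:2). Input 2 (question, "What is the shape of the red object?") passes through BiGRU-GRU to give the program Filter(Red) → Query(Shape). NNs map the objects into the Color Embedding Space (Red) and the Shape Embedding Space (Cylinder, Sphere, Box). Output (answer): the prediction "Box" is compared with the ground truth "Sphere", and the error is backpropagated (BP).]
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method [Dong+, 2016]
③ Obtain the concept embedding needed for line 1 of the program (color embedding)
④ Execute Filter (restrict to the object with the highest cosine similarity to Red)
⑤ Obtain the concept embedding needed for line 2 of the program (shape embedding)
⑥ Execute Query (take the shape with the highest cosine similarity to Obj:2) and output it as the prediction
⑦ Backpropagate the error against the ground-truth answer and update the embedding spaces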
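Steps ③ to ⑦ can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the embedding values, the dictionary layout, and the helper names (`filter_op`, `query_op`, `query_loss`) are hypothetical, and the temperature-scaled cross-entropy is one plausible choice of loss rather than the one confirmed by the slides. The hard "highest cosine similarity" selection follows the wording of step ④; a hard argmax passes no gradient through Filter, so the loss is shown for the Query step only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical object features, already projected into each attribute space.
objects = {
    "Obj:1": {"color": [0.1, 0.9], "shape": [0.9, 0.1, 0.1]},
    "Obj:2": {"color": [0.9, 0.2], "shape": [0.1, 0.3, 0.9]},
}

# Hypothetical concept embeddings (these would be learned in practice).
concepts = {
    "color": {"Red": [1.0, 0.0], "Green": [0.0, 1.0]},
    "shape": {
        "Cylinder": [1.0, 0.0, 0.0],
        "Sphere": [0.0, 1.0, 0.0],
        "Box": [0.0, 0.0, 1.0],
    },
}


def filter_op(attribute, concept):
    """Step 4: keep the object most similar to the concept embedding."""
    target = concepts[attribute][concept]
    return max(objects, key=lambda o: cosine(objects[o][attribute], target))


def query_op(obj, attribute):
    """Step 6: return the concept most similar to the object's feature."""
    feature = objects[obj][attribute]
    return max(concepts[attribute],
               key=lambda c: cosine(feature, concepts[attribute][c]))


def query_loss(obj, attribute, truth, temperature=0.1):
    """Step 7 (assumed form): cross-entropy over cosine-similarity logits."""
    feature = objects[obj][attribute]
    logits = {c: cosine(feature, e) / temperature
              for c, e in concepts[attribute].items()}
    log_z = math.log(sum(math.exp(v) for v in logits.values()))
    return log_z - logits[truth]


selected = filter_op("color", "Red")    # program line 1: Filter(Red)
answer = query_op(selected, "shape")    # program line 2: Query(Shape)
loss = query_loss(selected, "shape", "Sphere")
```

With these toy values the prediction is "Box" while the ground truth is "Sphere", mirroring the slide, so the loss is large; minimizing it would move the feature of Obj:2 and the Sphere embedding closer together.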
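18. Proposed Method: Joint Learning of Concepts and Semantic Parsing
• 2. Reinforcement learning of the program generator (the concept embeddings are kept fixed)
[Figure: Input 1 (image data) passes through Mask R-CNN and ResNet-34 into the Visual Feature Space (Obj:1, Obj:2). Input 2 (question, "What is the shape of the red object?") passes through BiGRU-GRU to give the program Filter(Red) → Query(Shape). NNs map the objects into the Color Embedding Space (Red) and the Shape Embedding Space (Cylinder, Sphere, Box). Output (answer): the prediction "Box" is compared with the ground truth "Sphere", and the result is fed back to the program generator via REINFORCE.]
① Detect object regions in the image with Mask R-CNN and extract visual features with ResNet-34
② Generate a program from the question with an encoder-decoder (BiGRU-GRU) method [Dong+, 2016]
③ Obtain the concept embedding needed for line 1 of the program (color embedding)
④ Execute Filter (restrict to the object with the highest cosine similarity to Red)
⑤ Obtain the concept embedding needed for line 2 of the program (shape embedding)
⑥ Execute Query (take the shape with the highest cosine similarity to Obj:2) and output it as the prediction
⑦ Use correct / incorrect as the reward and update the program-generation policy with REINFORCE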
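The policy update in step ⑦ can be illustrated with a deliberately tiny REINFORCE loop. The sketch below makes several assumptions that are not in the slides: the policy is a softmax over three hypothetical candidate programs rather than a BiGRU-GRU sequence decoder, the frozen executor is replaced by a lookup table (`EXECUTOR_OUTPUT`), and the reward is 1 for a correct answer and 0 otherwise with no baseline.

```python
import math
import random

# Hypothetical candidate programs for "What is the shape of the red object?"
CANDIDATES = [
    ("Filter(Red)", "Query(Shape)"),
    ("Filter(Green)", "Query(Shape)"),
    ("Filter(Red)", "Query(Color)"),
]

# Stand-in for the frozen executor: the answer each program yields on a toy scene.
EXECUTOR_OUTPUT = {
    CANDIDATES[0]: "Sphere",
    CANDIDATES[1]: "Cube",
    CANDIDATES[2]: "Red",
}


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce(ground_truth, steps=200, lr=0.5, seed=0):
    """Train the softmax policy with the REINFORCE gradient estimator."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        # Sample a program from the current policy.
        action = rng.choices(range(len(CANDIDATES)), weights=probs)[0]
        # Reward is 1 when the executed program gives the correct answer.
        answer = EXECUTOR_OUTPUT[CANDIDATES[action]]
        reward = 1.0 if answer == ground_truth else 0.0
        # Gradient of log pi(action) w.r.t. the logits is one_hot(action) - probs.
        for i in range(len(logits)):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)


probs = reinforce("Sphere")
best_program = CANDIDATES[probs.index(max(probs))]
```

Because only answers matter for the reward, no program annotation is needed: the policy is pushed toward whichever program makes the executor answer correctly.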
20. Proposed Method: Joint Learning of Concepts and Semantic Parsing
• Training proceeds by alternating between 1. and 2.
– Within a curriculum learning framework, the difficulty of the problems is raised gradually
[Figure 4 panels: A. Curriculum concept learning. B. Illustrative execution of NS-CL.]
– Panel A: Initialized with DSL and executor → Lesson 1: Object-based questions (e.g. "What is the shape of the red object?" A: Cube.) → Lesson 2: Relational questions (e.g. "How many cubes are behind the sphere?" A: 3) → Lesson 3: More complex questions (e.g. "Does the red object left of the green cube have the same shape as the purple matte thing?" A: No) → Deploy: complex scenes, complex questions (e.g. "Does the matte thing behind the big sphere have the same color as the cylinder left of the small matte cube?" A: No.)
– Panel B: Step 1: Visual Parsing (Obj 1 to Obj 4). Steps 2, 3: Semantic Parsing and Program Execution for "Does the red object left of the green cube have the same shape as the purple matte thing?": Filter (Green Cube); Relate (Object 2, Left); Filter (Red); Filter (Purple Matte); AEQuery (Object 1, Object 3, Shape); output: No (0.98)
Figure 4: A. Demonstration of the curriculum learning of visual concepts, words, and semantic parsing of sentences by watching images and reading paired questions and answers. Scenes and questions of different complexities are illustrated to the learner in an incremental manner. B. … neuro-symbolic inference model for VQA. The perception module begins with … into object-based deep representations, while the semantic parser parses sentences … programs. A symbolic execution process bridges two modules.
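The alternation between phase 1 and phase 2 inside a curriculum can be written down as a training schedule. The sketch below only builds that schedule; `update_concepts` and `update_parser` are hypothetical placeholders for the two phases, and the number of rounds per lesson is an assumed parameter, not a value taken from the slides.

```python
LESSONS = [
    "Lesson 1: object-based questions",
    "Lesson 2: relational questions",
    "Lesson 3: more complex questions",
]


def update_concepts(lesson):
    # Placeholder for phase 1: supervised update of the embedding spaces
    # while the program generator is frozen.
    return (lesson, "concepts")


def update_parser(lesson):
    # Placeholder for phase 2: REINFORCE update of the program generator
    # while the concept embeddings are frozen.
    return (lesson, "parser")


def build_schedule(rounds_per_lesson=2):
    """Return the ordered list of (lesson, phase) training steps."""
    schedule = []
    for lesson in LESSONS:  # difficulty rises from lesson to lesson
        for _ in range(rounds_per_lesson):
            schedule.append(update_concepts(lesson))
            schedule.append(update_parser(lesson))
    return schedule


schedule = build_schedule()
```

Each lesson therefore sees both phases several times before harder questions are introduced, so neither module has to learn from questions it cannot yet execute.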
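23. Experiments: Qualitative Evaluation
• Experiment: CLEVR Dataset [Johnson+, 2017]
– One advantage of the proposed method is that it makes the decision process leading to an answer explicit
• When the answer is wrong, we can see what caused the error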
[Figure 11 panels, each shown as Concept / Program / Result:]
– Example A. Q: Do the cyan cylinder that is behind the gray cylinder and the gray cylinder have the same material? Program: Filter (Gray Cylinder); Relate (Behind); Filter (Cyan Cylinder); Filter (Gray Cylinder); AEQuery (Material). Result: Yes (0.92)
– Example B. Q: There is a small blue object that is to the right of the small red matte object; what shape is it? Program: Filter (Small Red Matte Object); Relate (Right); Filter (Small Blue Object); Query (Shape). Result: Cube (0.85)
– Example C. Failure Case. Q: What is the color of the big box left of the blue metal cylinder? Program: Filter (Blue Metal Cylinder); Relate (Left); Filter (Big Box); Query (Color). Result: Execution Abort, "No such object found!" Read-out of the object: Color: Blue ✓, Material: Rubber ✕, Shape: Cylinder ✓, Size: Small ✓
– Example D. Ambiguous Program Case. Q: What is the color of the big metal object? Program: Filter (Big Metal Object); Query (Color). Result: Execution Abort, "Ambiguous Referral!"
Figure 11: Visualization of the execution trace generated by our Neuro-Symbolic Concept Learner on the CLEVR dataset. Example A and B are successful executions that generate correct answers. In example C, the execution aborts at the first operator. To inspect the reason why the execution engine fails to find the corresponding object, we can read out the visual representation of the object, …
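The two abort cases in Examples C and D can be mimicked with a small symbolic executor that records a trace. This is a simplified sketch under assumptions: the scene is a hypothetical list of already-recognized attributes instead of learned embeddings, the program is reduced to one Filter followed by one Query, and `execute` and `read_out` are illustrative names, not functions from the authors' code.

```python
class ExecutionAbort(Exception):
    """Raised when a program step cannot be executed on the scene."""


def filter_objects(scene, required):
    # Keep objects whose perceived attributes match every required concept.
    return [obj for obj in scene
            if all(obj.get(k) == v for k, v in required.items())]


def read_out(obj, required):
    # Per-attribute check, similar to the tick/cross read-out in Example C.
    return {k: obj.get(k) == v for k, v in required.items()}


def execute(scene, required, attribute):
    """Run Filter(required) then Query(attribute), recording each step."""
    trace = []
    try:
        matches = filter_objects(scene, required)
        trace.append(("Filter", dict(required), len(matches)))
        if not matches:
            raise ExecutionAbort("No such object found!")
        if len(matches) > 1:
            raise ExecutionAbort("Ambiguous Referral!")
        answer = matches[0][attribute]
        trace.append(("Query", attribute, answer))
        return {"answer": answer, "error": None, "trace": trace}
    except ExecutionAbort as abort:
        return {"answer": None, "error": str(abort), "trace": trace}


# Hypothetical perceived scene: the cylinder's material was recognized as rubber.
scene = [
    {"color": "blue", "material": "rubber", "shape": "cylinder", "size": "small"},
    {"color": "gray", "material": "metal", "shape": "cube", "size": "large"},
    {"color": "red", "material": "metal", "shape": "sphere", "size": "large"},
]

wanted = {"color": "blue", "material": "metal", "shape": "cylinder"}
failure = execute(scene, wanted, "size")
diagnosis = read_out(scene[0], wanted)
ambiguous = execute(scene, {"size": "large", "material": "metal"}, "color")
success = execute(scene, {"color": "red"}, "shape")
```

Here `failure` aborts at the Filter step and `diagnosis` shows that only the material check fails, which is the kind of inspection the slide describes; `ambiguous` aborts because two large metal objects match.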
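24. Experiments: Qualitative Evaluation
• Experiment: VQS Dataset [Gan+, 2017]
– The method can also be applied to real-image data
• CLEVR is generated programmatically, so program annotations can be produced as well, but annotating programs for real-image data is costly
• The proposed method needs no program annotations, so it can be applied to real-image data too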
[Figure panels, each shown as Concept / Program / Result:]
– Example A. Q: How many zebras are there? Program: Filter (Zebra); Count. Result: 3 ✓
– Example B. Q: What is the sharp object on the table? Program: Filter (Table); Relate (On); Filter (Shape Object); Query (What). Result: Knife (0.85) ✓
– Example C. Q: What kind of desert is plated? Program: Filter (Desert, Plated); Query (Kind). Result: Cake (0.68) ✓
– Example D. Q: What are the kids doing? Program: Filter (Kids); Query (What). Result: Playing_Frisbee (0.70) ✕ (Groundtruth: Playing_Baseball)
26. References
• Mao, Jiayuan, et al. "The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision." in Proc. of ICLR, 2019.
• Hudson, Drew A, et al. "Compositional attention networks for machine reasoning." in Proc. of ICLR, 2018.
• Mascharka, David, et al. "Transparency by design: Closing the gap between performance and interpretability in visual reasoning." in Proc. of CVPR, 2018.
• Yi, Kexin, et al. "Neural-Symbolic VQA: Disentangling reasoning from vision and language understanding." in Proc. of NeurIPS, 2018.
• Johnson, Justin, et al. "CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning." in Proc. of CVPR, 2017.
• Gan, Chuang, et al. "VQS: Linking segmentations to questions and answers for supervised attention in VQA and question-focused semantic segmentation." in Proc. of ICCV, 2017.