Machine
Learning
LABoratory
Seungjoon Lee. 2023-09-06. sjlee1218@postech.ac.kr
Learning to Model the World
with Language
Paper Summary
1
Contents
• Introduction
• Methods
• Experiments
2
Caution!!!
• This is a paper summary I prepared for my personal research meeting.
• Some of the contents may be incorrect!
• Some contributions and experiments are intentionally excluded because they
are not directly related to my research interest.
• Methods are simplified for easy explanation.
• Please send me an email if you want to contact me: sjlee1218@postech.ac.kr
(for corrections or additions, ideas to develop this paper, or anything else).
3
Situations
• Most language-conditioned RL methods use language only as instructions
(e.g., “Pick the blue box”).
• However, language does not always match the optimal action.
• Therefore, mapping language only to actions is a weak learning signal.
4
“Put the bowls away”
Complication
• On the other hand, humans can predict the future using language.
• Humans can predict environment dynamics (e.g., “wrenches tighten nuts.”)
• Humans can predict future observations (e.g., “the paper is outside.”)
5
Questions & Hypothesis
• Question:
• If we let a reinforcement learning agent predict the future using language,
will its performance improve?
• Hypothesis:
• Predicting the future representation provides agents with a rich learning
signal about how language relates to the world.
• Rich learning signal: frequent, stable training signal.
6
Contributions
• DynaLang enables RL agents to use diverse types of language, for example
hints or dynamics descriptions, along with instructions.
• DynaLang proposes a self-supervised future-prediction objective to improve
training performance.
7
Why is This New?
• Previous language-based RL methods used language either only as
instructions or only as descriptions of the environment.
• DynaLang unifies these settings so that agents learn from diverse types of
language.
• Previous works mostly condition policies directly on language to generate
actions.
• DynaLang proposes a future-prediction objective to train a world model
that associates language, images, and dynamics.
8
Methods
9
Problem Setting
• Observation: ot = (xt, lt), where xt is an image and lt is a language token.
• An agent chooses action at, then the environment returns:
• reward rt+1,
• a flag ct+1 indicating whether the episode continues,
• and next observation ot+1.
• The agent’s goal is to maximize E[ ∑t=1..T γ^(t−1) rt ].
10
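As a concrete illustration of the objective on this slide, here is a minimal sketch of the discounted return for a single sampled episode (the function name and structure are my own, not from the paper):

```python
# Sketch of the objective E[ sum_{t=1}^T gamma^(t-1) * r_t ]
# evaluated on one sampled episode. Illustrative only.
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards r_t discounted by gamma^(t-1)."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Example: rewards [1, 1, 1] with gamma = 0.5
# gives 1 + 0.5 + 0.25 = 1.75.
```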
Method Outline
• DynaLang components
• World model: encodes the current image observation and language into a representation.
• RL agent: using the encoded representation, acts to maximize the sum of
discounted rewards.
11
Method - World Model
Outline
• World model components:
• Encoder - Decoder: learns to represent the current state.
• Sequence model: learns to predict the future state representation.
12
Method - World Model
Base model (previous work)
• DynaLang = Dreamer V3 + language + future prediction objective.
• Dreamer V3 learns to compute compact representations of the current state,
and learns how these representations change under actions.
13
Architecture of Dreamer V3
Method - World Model
Incorporation of language
• DynaLang incorporates language into the encoder-decoder of Dreamer V3.
• In this way, DynaLang obtains representations unifying visual observations
and language.
14
Method - World Model
Prediction of the future
• DynaLang adds the future representation prediction into the sequence model
of Dreamer V3.
• Future representation prediction lets the agent extract information from
language that relates to the dynamics across multiple modalities.
15
Method - World Model
Model Losses
• World model loss: L = Lx + Ll + Lr + Lc + Lreg + Lpred, where
• Image loss: Lx = ||x̂t − xt||²₂
• Language loss: Ll = categorical_cross_entropy(l̂t, lt)
• Reward loss: Lr = (r̂t − rt)²
• Continue loss: Lc = binary_cross_entropy(ĉt, ct)
• Regularizer: Lreg = βreg max(1, KL[zt ∥ sg(ẑt)]), where sg is stop-gradient
• Future prediction loss: Lpred = βpred max(1, KL[sg(zt) ∥ ẑt])
16
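As a rough sketch of how these terms combine, here is a toy NumPy version (all names are mine; the real losses operate on the world model's latent distributions, and the language/continue cross-entropy terms are omitted for brevity):

```python
import numpy as np

def kl(p, q, eps=1e-8):
    """KL divergence between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def world_model_loss(x_hat, x, r_hat, r, z, z_hat,
                     beta_reg=0.1, beta_pred=1.0):
    """Toy combination of the slide's loss terms on raw arrays."""
    L_x = float(np.sum((x_hat - x) ** 2))   # image reconstruction (L2)
    L_r = (r_hat - r) ** 2                  # reward prediction
    # "Free bits"-style flooring: each KL term is clipped below 1 nat.
    # sg() (stop-gradient) is omitted since nothing is differentiated here.
    L_reg = beta_reg * max(1.0, kl(z, z_hat))
    L_pred = beta_pred * max(1.0, kl(z, z_hat))
    return L_x + L_r + L_reg + L_pred
```

With perfect predictions, only the floored KL terms contribute, giving βreg + βpred.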
Method - RL Agent
Outline
• The RL agent used is a simple actor-critic agent.
• Actor: π(at | zt, ht)
• Critic: V(zt, ht)
• Note that the RL agent is not conditioned on language directly.
17
Method - RL Agent
Environment interaction
• The RL agent interacts with the environment using the encoded
representation zt and history ht.
18
Method - RL Agent
Training
• Let Rt = rt + γ ct ((1 − λ) V(zt+1, ht+1) + λ Rt+1), the estimated
discounted sum of future rewards.
• Critic loss: Lϕ = (Vϕ(zt, ht) − Rt)²
• Actor loss: Lθ = −(Rt − V(zt, ht)) log πθ(at | zt, ht), maximizing the return estimate.
• The agent is trained only on imagined rollouts generated by the world model,
i.e., on its own actions and the world model’s predicted states and rewards.
19
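The λ-return recursion on this slide can be sketched as a backward pass over an imagined rollout (variable names are illustrative, not the authors' code):

```python
# Sketch of the bootstrapped lambda-return
#   R_t = r_t + gamma * c_t * ((1 - lam) * V_{t+1} + lam * R_{t+1}),
# computed backwards over an imagined rollout.
def lambda_returns(rewards, values, continues, gamma=0.99, lam=0.95):
    """`values` has one extra entry: the bootstrap value of the final state."""
    R = values[-1]
    out = []
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * continues[t] * (
            (1 - lam) * values[t + 1] + lam * R
        )
        out.append(R)
    return out[::-1]
```

With γ = λ = 1 and zero values, this reduces to the undiscounted reward-to-go at each step.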
Experiments
20
Experiments 1 - Diverse Types of Language
Questions
• Questions to address:
• Can DynaLang use diverse types of language along with instructions?
• If so, does this improve task performance?
21
Experiments 1 - Diverse Types of Language
Setup
• Env: HomeGrid
• multitask grid world where agents receive task
instructions in language as well as language hints.
• Agents get a reward of 1 when a task is completed,
and then a new task is sampled.
• Therefore, agents must complete as many tasks
as possible before the episode terminates in 100
steps.
22
HomeGrid env. Agents receive 3 types of hints.
Experiments 1 - Diverse Types of Language
Results
• Baselines: model-free off-policy algorithms, IMPALA and R2D2.
• Image and language embeddings are simply fed to the policy.
• DynaLang solves more tasks with hints, while simple language-conditioned RL
gets worse with hints.
23
HomeGrid training performance after 50M steps (2 seeds)
Experiments 2 - Future Prediction
Questions
• Questions to address:
• Is adding future prediction more effective than using language only to
generate actions?
24
Experiments 2 - Future Prediction
Setup
• Env: Messenger
• grid world where agents must deliver a message
while avoiding enemies, using text manuals.
• Agents must understand the manuals and relate them
to the environment to achieve a high score.
25
Messenger env. Agents receive text manuals.
Experiments 2 - Future Prediction
Results
• EMMA is added for comparison:
• a language + gridworld-specific method, using language only to generate actions.
• Only DynaLang can learn in S3, the most difficult setting.
• Adding future prediction helps training more than action generation alone.
• However, the authors do not include ablation studies that exclude the future
prediction loss from their architecture.
26
Messenger training performance (2 seeds). S1 is the easiest, S3 the hardest.
