The document describes the Dreamer model for reinforcement learning. Dreamer learns a world model from images of its experience. It then learns behaviors by imagining future sequences predicted by the world model and backpropagating value gradients through the imagined sequences. Experiments show Dreamer outperforms prior model-free and model-based methods on a variety of visual control tasks, demonstrating it can learn behaviors purely from latent imagination to solve challenging problems.
1. RL Comparison
4/18/2023 Deep Learning Paper Seminar - Reinforcement Learning
INDEX
Introduction
Methods
Performance
Conclusion
Model-Free RL
• No model of the environment
• Learn a value function (and/or policy) directly from real experience
Model-Based RL
• Learn a model of the environment from real experience
• Plan a value function (and/or policy) from simulated experience
RL Comparison, from slides of Sergey Levine
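The contrast above can be sketched on a toy 1-D environment (everything here is a hypothetical stand-in, not from the paper): the model-free agent learns a Q-table from real transitions only, while the model-based agent plans by simulating rollouts through a model of the dynamics.

```python
import random

# Toy 1-D environment: states 0..10, actions -1/+1, reward increases as the
# agent nears the goal state 5. A hypothetical stand-in for illustration.
def step(state, action):
    next_state = max(0, min(10, state + action))
    return next_state, -abs(next_state - 5)

# Model-free: tabular Q-learning learns values from real experience only.
def model_free(episodes=200, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(11) for a in (-1, 1)}
    for _ in range(episodes):
        s = rng.randrange(11)
        for _ in range(20):
            if rng.random() < 0.2:
                a = rng.choice((-1, 1))                        # explore
            else:
                a = max((-1, 1), key=lambda x: Q[(s, x)])      # exploit
            s2, r = step(s, a)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, -1)], Q[(s2, 1)]) - Q[(s, a)])
            s = s2
    return Q

# Model-based: choose an action by simulating rollouts through the model
# (here the true `step`; Dreamer instead learns the model from images).
def plan(state, horizon=5, rollouts=50, seed=0):
    rng = random.Random(seed)

    def simulate(first_action):
        total = 0.0
        for _ in range(rollouts):
            s, a = state, first_action
            for _ in range(horizon):
                s, r = step(s, a)
                total += r
                a = rng.choice((-1, 1))    # random tail actions per rollout
        return total

    return max((-1, 1), key=simulate)
```

Both agents prefer moving toward the goal, but they get there differently: the model-free agent needs many real transitions, while the model-based agent can answer "what if" queries from simulated experience alone.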
2. World Model
"Intelligent agents can achieve goals in complex environments
even though they never encounter the exact same situation twice."
"This ability requires building representations of the world from past
experience that enable generalization to novel situations."
"World models offer an explicit way to represent an agent's knowledge
about the world in a parametric model that can make predictions about the
future."
A World Model, from Scott McCloud's Understanding Comics.
3. Visual Control
"Sensory inputs are high-dimensional images; latent dynamics models can
abstract observations to predict forward in compact state spaces."
→ latent states have a small memory footprint
"Behaviors can be derived from dynamics models in many ways."
→ Considering only rewards within a fixed imagination horizon results in shortsighted behaviors
→ Prior work commonly resorts to derivative-free optimization for robustness
4. PlaNet
An RL agent that learns the environment dynamics from
images and chooses actions through fast online planning in
latent space.
Learning Latent Dynamics for Planning from Pixels (Danijar Hafner et al., 2019)
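A minimal sketch of this planning style, using the cross-entropy method (CEM) over imagined action sequences on a hypothetical 1-D latent space with known dynamics. The real PlaNet agent plans in the latent space of its learned recurrent state-space model; the dynamics, reward, and goal here are stand-ins.

```python
import random

# Assumed toy latent dynamics and reward (not from the paper).
def transition(s, a):
    return s + a

def reward(s, goal=2.0):
    return -abs(s - goal)

def cem_plan(s0, horizon=4, iters=5, pop=100, elite=10, seed=0):
    """Cross-entropy method: refine a Gaussian over action sequences."""
    rng = random.Random(seed)
    mean = [0.0] * horizon
    std = [1.0] * horizon
    for _ in range(iters):
        scored = []
        for _ in range(pop):
            # Sample a candidate action sequence and evaluate it by an
            # imagined rollout through the (latent) dynamics model.
            seq = [rng.gauss(m, sd) for m, sd in zip(mean, std)]
            s, total = s0, 0.0
            for a in seq:
                s = transition(s, a)
                total += reward(s)
            scored.append((total, seq))
        # Refit the Gaussian to the best-scoring (elite) sequences.
        scored.sort(key=lambda c: -c[0])
        elites = [seq for _, seq in scored[:elite]]
        cols = list(zip(*elites))
        mean = [sum(col) / elite for col in cols]
        std = [max(1e-3, (sum((x - m) ** 2 for x in col) / elite) ** 0.5)
               for col, m in zip(cols, mean)]
    return mean[0]  # execute only the first action, then replan
```

Note the cost structure: every environment step requires re-running this whole search, which is the expense Dreamer later removes by training an actor network instead.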
5. Dreamer
An RL agent that learns long-horizon behaviors from
images purely by latent imagination.
The three processes of the Dreamer agent.
1. The world model is learned from past experience.
2. From predictions of this model, the agent then learns a value network
to predict future rewards and an actor network to select actions.
3. The actor network is used to interact with the environment.
1. Learning the World Model
Dreamer learns a world model from experience. Using past images o1 ~ o3 and actions a1 ~
a2, it computes a sequence of compact model states (green circles) from which it reconstructs
the images ô1 ~ ô3 and predicts the rewards r̂1 ~ r̂3. → Leveraging PlaNet
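The world-model components can be sketched with toy linear stand-ins (all hypothetical): a representation model o → s, a transition model (s, a) → s, an image decoder s → ô, and a reward predictor s → r̂. Finite-difference gradients stand in for backpropagation, and a squared latent-consistency term stands in for the KL regularizer of the actual recurrent state-space model.

```python
import random

def make_episode(rng, T=5):
    """Generate a toy episode from assumed true dynamics x' = 0.9x + a."""
    x = rng.uniform(-1.0, 1.0)
    obs, acts, rews = [x], [], []
    for _ in range(T):
        a = rng.uniform(-1.0, 1.0)
        x = 0.9 * x + a
        obs.append(x)
        acts.append(a)
        rews.append(x)  # reward equals the hidden state in this toy setup
    return obs, acts, rews

def model_loss(params, episodes):
    enc, trans_s, trans_a, dec, rew = params
    total, n = 0.0, 0
    for obs, acts, rews in episodes:
        s_prev = enc * obs[0]
        for t, a in enumerate(acts):
            s = enc * obs[t + 1]                     # posterior state from the image
            prior = trans_s * s_prev + trans_a * a   # prior from the transition model
            total += (dec * s - obs[t + 1]) ** 2     # image reconstruction
            total += (rew * s - rews[t]) ** 2        # reward prediction
            total += (prior - s) ** 2                # latent consistency (KL stand-in)
            s_prev, n = s, n + 3
    return total / n

def train_world_model(episodes, steps=3000, lr=0.05, eps=1e-4):
    """Gradient descent with finite differences (backprop stand-in)."""
    params = [0.5] * 5
    for _ in range(steps):
        grads = []
        for i in range(len(params)):
            up, dn = params[:], params[:]
            up[i] += eps
            dn[i] -= eps
            grads.append((model_loss(up, episodes) - model_loss(dn, episodes)) / (2 * eps))
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

After training, the learned states reconstruct observations and predict rewards well, which is what makes them a usable substrate for imagining trajectories.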
2. Learning Behavior in Imagination
Dreamer learns long-sighted behaviors from predicted sequences of model states. It first
learns the long-term values v̂2 ~ v̂3 of each state, and then predicts actions â1 ~ â2 that lead to
high rewards and values by backpropagating them through the state sequence to the actor
network.
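This step can be sketched on a toy latent space (all stand-ins hypothetical): a linear actor a = w·s + b is improved by gradient ascent on the returns of rollouts imagined under known dynamics s' = s + a with reward -(s - 2)². Finite differences stand in for backpropagation through the rollout, gradients are clipped for stability (Dreamer's training also clips gradients), and the value bootstrap beyond the horizon is omitted for brevity.

```python
import math

def imagined_return(policy, s0, horizon=5, gamma=0.9):
    w, b = policy
    s, total, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = w * s + b                      # actor network (linear stand-in)
        s = s + a                          # imagined latent transition
        total += discount * -(s - 2.0) ** 2
        discount *= gamma
    return total

def avg_return(policy, starts):
    return sum(imagined_return(policy, s0) for s0 in starts) / len(starts)

def improve_actor(starts, steps=2000, lr=0.02, eps=1e-4):
    policy = [0.0, 0.0]
    for _ in range(steps):
        # Finite-difference gradient of the imagined return w.r.t. the actor
        # parameters (backpropagation stand-in).
        grads = []
        for i in range(len(policy)):
            up, dn = policy[:], policy[:]
            up[i] += eps
            dn[i] -= eps
            grads.append((avg_return(up, starts) - avg_return(dn, starts)) / (2 * eps))
        # Clip the gradient norm, then ascend.
        norm = math.sqrt(sum(g * g for g in grads))
        if norm > 1.0:
            grads = [g / norm for g in grads]
        policy = [p + lr * g for p, g in zip(policy, grads)]
    return policy
```

The key point mirrors the slide: the gradient of future rewards and values flows through the imagined state sequence back into the actor's parameters, so no search over action sequences is needed.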
2. Learning Behavior in Imagination
PlaNet vs Dreamer
• For a given situation in the environment, PlaNet searches for the best action among many predictions for different
action sequences.
• Dreamer side-steps this expensive search by decoupling planning and acting. Once its actor network has been
trained on predicted sequences, it computes the actions for interacting with the environment without additional search.
In addition, Dreamer considers rewards beyond the planning horizon using a value function and leverages
backpropagation for efficient planning.
3. Act in the Environment
The agent encodes the history of the episode to compute the current model state and the next
action to execute in the environment.
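The interaction loop can be sketched as follows (the encoder, state update, and actor below are hypothetical stand-ins for Dreamer's networks): the episode history is summarized by a recurrent model state, so acting needs only a single forward pass per step, with no planning.

```python
import random

def encode(obs):
    return obs / 10.0                                 # stand-in for the conv encoder

def update_state(state, action, embedding):
    return 0.5 * state + 0.3 * action + embedding     # stand-in for the RSSM update

def actor(state):
    return -state                                     # stand-in for the actor network

def act_episode(observations, noise=0.1, seed=0):
    rng = random.Random(seed)
    state, action, actions = 0.0, 0.0, []
    for obs in observations:
        # Fold the new observation and previous action into the model state,
        # which summarizes the whole episode history so far.
        state = update_state(state, action, encode(obs))
        # Pick the next action directly from the actor, plus exploration noise.
        action = actor(state) + rng.gauss(0.0, noise)
        actions.append(action)
    return actions
```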
4. Explained
1. Transition Model
2. Reward Model
3. Policy
4. Objective
(Imagined Rewards)
Actor-Critic Method
From papers on Deep Multi-Agent (Taiki Fuji et al.)
5. Actor-Critic Model
(Parameterized by φ and ψ, respectively)
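The listed components, written out following the Dreamer paper's notation (θ for the world model, φ for the actor, ψ for the critic; the originals appeared as figures on the slide):

```latex
\begin{align*}
\text{Transition model:} \quad & q_\theta(s_\tau \mid s_{\tau-1}, a_{\tau-1}) \\
\text{Reward model:}     \quad & q_\theta(r_\tau \mid s_\tau) \\
\text{Policy (actor):}   \quad & a_\tau \sim q_\phi(a_\tau \mid s_\tau) \\
\text{Value (critic):}   \quad & v_\psi(s_\tau) \approx \mathbb{E}_{q_\theta, q_\phi}\Big[\textstyle\sum_{\tau'=\tau}^{t+H} \gamma^{\tau'-\tau} r_{\tau'}\Big] \\
\text{Objective:}        \quad & \max \; \mathbb{E}\Big[\textstyle\sum_{\tau=t}^{t+H} \gamma^{\tau-t} r_\tau\Big] \quad \text{(imagined rewards over horizon } H\text{)}
\end{align*}
```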
4. Explained
6. Value Estimation
7. Learning Objectives
Flow of Actor-Critic Method
From slides of Deep RL, Sergey Levine
w/ Imagined Trajectories
→ Exponentially decaying weights over longer imagined trajectories
→ Rewards beyond k steps estimated with the learned value model
→ Actor
→ Critic
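Written out following the Dreamer paper: V_N^k truncates imagination after k steps and bootstraps with the value model (rewards beyond k steps), and V_λ mixes these estimates with exponentially decaying weights; the actor maximizes the λ-returns while the critic regresses onto them.

```latex
\begin{align*}
V_N^k(s_\tau) &= \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{n=\tau}^{h-1} \gamma^{n-\tau} r_n \;+\; \gamma^{h-\tau} v_\psi(s_h)\Big],
\qquad h = \min(\tau + k,\; t + H) \\
V_\lambda(s_\tau) &= (1-\lambda) \sum_{n=1}^{H-1} \lambda^{n-1} V_N^n(s_\tau) \;+\; \lambda^{H-1} V_N^H(s_\tau) \\
\text{Actor:} \quad & \max_\phi \; \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{\tau=t}^{t+H} V_\lambda(s_\tau)\Big] \\
\text{Critic:} \quad & \min_\psi \; \mathbb{E}_{q_\theta, q_\phi}\Big[\sum_{\tau=t}^{t+H} \tfrac{1}{2}\,\big\|v_\psi(s_\tau) - V_\lambda(s_\tau)\big\|^2\Big]
\end{align*}
```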
1. Control Tasks
• Dreamer learns to solve 20 challenging continuous control tasks with
image inputs, 5 of which are displayed here.
The tasks are designed to pose a variety of challenges to the RL agent, including difficult-to-predict
collisions, sparse rewards, chaotic dynamics, small but relevant objects, high degrees
of freedom, and 3D perspectives.
• The visualizations show the same 64x64 images that the agent receives
from the environment.
2. Comparison
Dreamer outperforms the previous best model-free
(D4PG) and model-based (PlaNet) methods on the
benchmark of 20 tasks in terms of final performance,
data efficiency, and computation time.
3. Atari Games
Dreamer learns successful behaviors on Atari games and DeepMind Lab
levels, which feature discrete actions and visually more diverse scenes,
including 3D environments with multiple objects.
Conclusion
1. Learning behaviors from sequences predicted by world models alone can solve
challenging visual control tasks from image inputs, surpassing the performance
of previous model-free approaches.
2. Dreamer demonstrates that learning behaviors by backpropagating value gradients through
predicted sequences of compact model states is successful and robust, solving a diverse
collection of continuous and discrete control tasks.
My Questions
1. Is there any relation between the world model and the "common sense" mentioned by Yann LeCun?
2. Is there any evidence for the mechanism of human prediction and dreaming?
What we see is based on our brain's prediction of the future.
(A. Kitaoka, Kanzen, 2002)
Thank You for Listening!
Learning Behaviors by Latent Imagination
Image of a "DQN model" generated with the text-to-image tool PixRay.