2. About the Paper
Title
Deep Reinforcement Learning with Double Q-learning
[arXiv:1509.06461]
Authors
Hado van Hasselt, Arthur Guez, David Silver
Affiliation
Google DeepMind
Year
2015
3. Outline
How DDQN was Derived
DDQN
Experiment Environment
Results
Summary
Related Papers
4. How DDQN was Derived
Reinforcement Learning
Agent's Goal: Learn good policies for sequential decision problems
With policy π, the true value Q of an action a in state s is

Q_π(s, a) = E[ R_1 + γR_2 + ⋯ | S_0 = s, A_0 = a, π ]

The optimal value is then

Q_*(s, a) = max_π Q_π(s, a)
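To make the return concrete, a minimal Python sketch (not from the paper; the rewards and γ below are made-up values) that accumulates R_1 + γR_2 + ⋯ for a finite episode:

```python
# Minimal sketch: discounted return G = R_1 + γ·R_2 + γ²·R_3 + ...
# Rewards and gamma are illustrative values, not from the paper.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99*0.0 + 0.99**2 * 2.0 = 2.9602
```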
5. How DDQN was Derived
Q-learning (Watkins, 1989)
Q(s, a) ← Q(s, a) + α ( R_{t+1} + γ max_{a′} Q(s′, a′) − Q(s, a) )

where α is the learning rate.
The current Q value moves closer to (reward + discounted next Q value).
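A tabular sketch of this update (state/action counts and hyperparameters below are illustrative, not the paper's):

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # illustrative hyperparameters

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```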
6. How DDQN was Derived
Deep Q-learning (Mnih et al., 2015)
What if there are infinitely many states...
Q-learning can be viewed as a minimization problem.
A neural network can be used to minimize the error!
Y_t^DQN = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t⁻)

min_{θ_t} L(θ_t) = min_{θ_t} E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_t⁻) − Q(s, a; θ_t) )² ]
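A PyTorch sketch of this loss, assuming hypothetical `online_net` and `target_net` modules that map batched states to per-action Q values (terminal-state masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, s, a, r, s_next, gamma=0.99):
    # Q(s, a; θ_t): pick the Q value of the action actually taken
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Y_t^DQN: the target network both selects (max) and evaluates
        y = r + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```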
7. How DDQN was Derived
Deep Q-learning (Mnih et al., 2015) (Continued)
Experience replay
Store observed transitions in a memory bank
Sample from the memory bank randomly and train the network
Target network
Copy the online network θ_t to the target network θ_t⁻ every τ steps (see the sketch below)
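A minimal Python sketch of both stabilizers (capacity, batch size, and names are assumptions, not the paper's values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Memory bank of observed transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off

    def store(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.memory, batch_size)  # breaks correlation

# Target-network sync (PyTorch-style), run inside the training loop:
# if step % tau == 0:
#     target_net.load_state_dict(online_net.state_dict())
```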
8. How DDQN was Derived
Double Q-learning (van Hasselt, 2010)
Q-learning often OVERESTIMATES Q values because...
it uses the maximum action value every time to update Q values
it uses the same values both to select and to evaluate an action
Double Q-learning helps avoid overestimation!
Split the weights θ into a selector and an evaluator
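A small NumPy demo of the problem: even when all true action values are 0, taking the max over one set of noisy estimates is biased upward, while selecting with one set and evaluating with an independent set is not (a toy illustration, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Two independent noisy estimates of action values whose true value is 0.
est_a = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
est_b = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

single = est_a.max(axis=1).mean()                 # select AND evaluate with A
best = est_a.argmax(axis=1)                       # select with A...
double = est_b[np.arange(n_trials), best].mean()  # ...evaluate with B

print(f"single estimator bias: {single:+.3f}")  # ≈ +1.5, overestimation
print(f"double estimator bias: {double:+.3f}")  # ≈ 0.0, unbiased
```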
10. Double Q-learning (van Hasselt, 2010) (continued)
Q-learning target

Y_t^Q = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t)

Transform to

Y_t^Q = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t)

Use a different parameter set for evaluating the Q value

Y_t^DoubleQ = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t′)
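A tabular sketch of this scheme with two tables Q^A and Q^B, as in van Hasselt (2010); sizes and hyperparameters are illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative sizes
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # illustrative hyperparameters

def double_q_update(s, a, r, s_next):
    """Update one table at random; the other evaluates the selected action."""
    if np.random.rand() < 0.5:
        a_star = QA[s_next].argmax()              # select with Q^A ...
        target = r + gamma * QB[s_next, a_star]   # ... evaluate with Q^B
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = QB[s_next].argmax()              # select with Q^B ...
        target = r + gamma * QA[s_next, b_star]   # ... evaluate with Q^A
        QB[s, a] += alpha * (target - QB[s, a])
```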
12. DDQN
Double Deep Q-learning (DDQN)
A combination of DQN and Double Q-learning!!!
Use neural networks as the selector and the evaluator.
Easy to implement because DQN already uses a target network:
Online network θ_t = selector
Target network θ_t⁻ = evaluator
13. Double Deep Q-learning (DDQN) (continued)
Double Q-learning's target was described as

Y_t^DoubleQ = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t′)

Transform for DDQN

Y_t^DoubleDQN = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t⁻)

where θ_t is the online network and θ_t⁻ is the target network
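A PyTorch sketch of this target, again assuming hypothetical `online_net` (θ_t) and `target_net` (θ_t⁻) modules over batched tensors:

```python
import torch

def ddqn_target(online_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        a_star = online_net(s_next).argmax(dim=1)    # SELECT with θ_t (online)
        q_eval = target_net(s_next).gather(          # EVALUATE with θ_t^-
            1, a_star.unsqueeze(1)).squeeze(1)
        return r + gamma * q_eval                    # Y_t^DoubleDQN
```

Compared with the DQN target, the only change is that the argmax comes from the online network instead of the target network.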
22. Summary
DDQN > DQN in most environments.
Less overestimation of Q values.
Implementation is easy!
Go DDQN!!
23. Related Papers
Elhadji Amadou Oury Diallo et al.: "Learning Power of Coordination in Adversarial Multi-Agent with Distributed Double DQN".
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver: "Continuous control with deep reinforcement learning", 2015; arXiv:1509.02971 (http://arxiv.org/abs/1509.02971).
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot: "Dueling Network Architectures for Deep Reinforcement Learning", 2015; arXiv:1511.06581 (http://arxiv.org/abs/1511.06581).