2. About the Paper
Title
Deep Reinforcement Learning with Double Q-learning
[arXiv:1509.06461]
Authors
Hado van Hasselt, Arthur Guez, David Silver
Affiliation
Google DeepMind
Year
2015
3. Outline
How DDQN was Derived
DDQN
Experiment Environment
Results
Summary
Related Papers
4. How DDQN was Derived
Reinforcement Learning
Agent's Goal: Learn good policies for sequential decision problems
With policy π, the true value Q of an action a in state s is

Q_π(s, a) = E[ R_1 + γR_2 + ⋯ | S_0 = s, A_0 = a, π ]

The optimal value is then

Q_*(s, a) = max_π Q_π(s, a)
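To make the return concrete, a minimal Python sketch (not from the paper; the rewards and γ below are made-up values) that accumulates R_1 + γR_2 + ⋯ for a finite episode:

```python
# Minimal sketch: discounted return G = R_1 + γ·R_2 + γ²·R_3 + ...
# Rewards and gamma are illustrative values, not from the paper.

def discounted_return(rewards, gamma=0.99):
    g = 0.0
    for r in reversed(rewards):  # fold from the last reward backwards
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0]))  # 1.0 + 0.99*0.0 + 0.99**2 * 2.0 = 2.9602
```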
5. How DDQN was Derived
Q-learning (Watkins, 1989)
Q(s, a) ← Q(s, a) + α ( R_{t+1} + γ max_{a′} Q(s′, a′) − Q(s, a) )

where α is the learning rate.
The current Q value moves closer to (reward + discounted next Q value).
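A tabular sketch of this update (state/action counts and hyperparameters below are illustrative, not the paper's):

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative sizes
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # illustrative hyperparameters

def q_update(s, a, r, s_next):
    """One Q-learning step: move Q(s,a) toward r + gamma * max_a' Q(s',a')."""
    td_target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```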
6. How DDQN was Derived
Deep Q-learning (Mnih et al., 2015)
What if there are infinitely many states...
Q-learning can be viewed as a minimization problem.
A neural network can be used to minimize the error!
Y_t^DQN = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t⁻)

min_{θ_t} L(θ_t) = min_{θ_t} E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_t⁻) − Q(s, a; θ_t) )² ]
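A PyTorch sketch of this loss, assuming hypothetical `online_net` and `target_net` modules that map batched states to per-action Q values (terminal-state masking is omitted for brevity):

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, s, a, r, s_next, gamma=0.99):
    # Q(s, a; θ_t): pick the Q value of the action actually taken
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Y_t^DQN: the target network both selects (max) and evaluates
        y = r + gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)
```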
7. How DDQN was Derived
Deep Q-learning (Mnih et al., 2015) (Continued)
Experience replay
Store observed transitions in a memory bank
Sample from the memory bank randomly and train the network
Target network
Copy the online network θ_t to the target network θ_t⁻ every τ steps (see the sketch below)
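A minimal Python sketch of both stabilizers (capacity, batch size, and names are assumptions, not the paper's values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Memory bank of observed transitions (s, a, r, s_next, done)."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest transitions fall off

    def store(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size=32):
        return random.sample(self.memory, batch_size)  # breaks correlation

# Target-network sync (PyTorch-style), run inside the training loop:
# if step % tau == 0:
#     target_net.load_state_dict(online_net.state_dict())
```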
8. How DDQN was Derived
Double Q-learning (van Hasselt, 2010)
Q-learning often OVERESTIMATES Q values because...
it uses the maximum action value every time to update Q values
it uses the same values both to select and to evaluate an action
Double Q-learning helps avoid overestimation!
Split the weights θ into a selector and an evaluator
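A small NumPy demo of the problem: even when all true action values are 0, taking the max over one set of noisy estimates is biased upward, while selecting with one set and evaluating with an independent set is not (a toy illustration, not an experiment from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_trials = 10, 100_000

# Two independent noisy estimates of action values whose true value is 0.
est_a = rng.normal(0.0, 1.0, size=(n_trials, n_actions))
est_b = rng.normal(0.0, 1.0, size=(n_trials, n_actions))

single = est_a.max(axis=1).mean()                 # select AND evaluate with A
best = est_a.argmax(axis=1)                       # select with A...
double = est_b[np.arange(n_trials), best].mean()  # ...evaluate with B

print(f"single estimator bias: {single:+.3f}")  # ≈ +1.5, overestimation
print(f"double estimator bias: {double:+.3f}")  # ≈ 0.0, unbiased
```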
10. Double Q-learning (van Hasselt, 2010) (continued)
Q-learning target

Y_t^Q = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t)

Transform to

Y_t^Q = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t)

Use a different parameter set for evaluating the Q value

Y_t^DoubleQ = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t′)
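A tabular sketch of this scheme with two tables Q^A and Q^B, as in van Hasselt (2010); sizes and hyperparameters are illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative sizes
QA = np.zeros((n_states, n_actions))
QB = np.zeros((n_states, n_actions))
alpha, gamma = 0.1, 0.99             # illustrative hyperparameters

def double_q_update(s, a, r, s_next):
    """Update one table at random; the other evaluates the selected action."""
    if np.random.rand() < 0.5:
        a_star = QA[s_next].argmax()              # select with Q^A ...
        target = r + gamma * QB[s_next, a_star]   # ... evaluate with Q^B
        QA[s, a] += alpha * (target - QA[s, a])
    else:
        b_star = QB[s_next].argmax()              # select with Q^B ...
        target = r + gamma * QA[s_next, b_star]   # ... evaluate with Q^A
        QB[s, a] += alpha * (target - QB[s, a])
```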
12. DDQN
Double Deep Q-learning (DDQN)
A combination of DQN and Double Q-learning!!!
Use neural networks as the selector and the evaluator.
Easy to implement because DQN already uses a target network:
Online network θ_t = selector
Target network θ_t⁻ = evaluator
13. Double Deep Q-learning (DDQN) (continued)
Double Q-learning's target was described as

Y_t^DoubleQ = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t′)

Transform for DDQN

Y_t^DoubleDQN = R_{t+1} + γ Q(s′, argmax_{a′} Q(s′, a′; θ_t); θ_t⁻)

where θ_t is the online network and θ_t⁻ is the target network
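A PyTorch sketch of this target, again assuming hypothetical `online_net` (θ_t) and `target_net` (θ_t⁻) modules over batched tensors:

```python
import torch

def ddqn_target(online_net, target_net, r, s_next, gamma=0.99):
    with torch.no_grad():
        a_star = online_net(s_next).argmax(dim=1)    # SELECT with θ_t (online)
        q_eval = target_net(s_next).gather(          # EVALUATE with θ_t^-
            1, a_star.unsqueeze(1)).squeeze(1)
        return r + gamma * q_eval                    # Y_t^DoubleDQN
```

Compared with the DQN target, the only change is that the argmax comes from the online network instead of the target network.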
22. Summary
DDQN > DQN in most environments.
Less overestimation of Q values.
Implementation is easy!
Go DDQN!!
23. Related Papers
Elhadji Amadou Oury Diallo et al.: "Learning Power of Coordination in Adversarial Multi-Agent with Distributed Double DQN".
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver: "Continuous control with deep reinforcement learning", 2015; arXiv:1509.02971 (http://arxiv.org/abs/1509.02971).
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot: "Dueling Network Architectures for Deep Reinforcement Learning", 2015; arXiv:1511.06581 (http://arxiv.org/abs/1511.06581).