DQN (Deep Q-Network)

Value Iteration and Q-learning
• Model-free control: iteratively optimise value function and policy
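
As a point of reference for the Q-learning update named above, here is a minimal tabular sketch. The function name `q_learning_update`, the dictionary-based table, and the step size `alpha=0.1` are illustrative assumptions, not taken from the slides.

```python
from collections import defaultdict


def q_learning_update(Q, state, action, reward, next_state, actions, done,
                      alpha=0.1, gamma=0.99):
    """One tabular Q-learning step (hypothetical helper for illustration).

    Moves Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a').
    """
    # No bootstrapping from a terminal state.
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    td_error = td_target - Q[(state, action)]
    Q[(state, action)] += alpha * td_error
    return td_error


# Q-table that defaults to 0 for unseen (state, action) pairs.
Q = defaultdict(float)
actions = [0, 1]
td_error = q_learning_update(Q, "s0", 1, 1.0, "s1", actions, done=False)
```

The policy is improved implicitly: acting greedily (or epsilon-greedily) with respect to the updated table is the control step.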

Value Function Approximation
• A “lookup table” is not practical
• Generalize to unobserved states
• Handle large state/action spaces (and continuous states/actions)
• Transform into a supervised learning problem
• Model (hypothesis space)
• Loss/cost function
• Optimization
• i.i.d. assumption
• RL is unstable or even divergent when the action-value function Q is approximated with a nonlinear function such as a neural network
• Causes: states are correlated and the data distribution changes, combined with a complex model
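
The supervised-learning view above can be sketched with a linear approximator: the bootstrapped target plays the role of a label, and the squared TD error is the loss. This is a simplified illustration, not the deep network used by DQN; `q_value`, `semi_gradient_step`, and the learning rate are assumed names and values.

```python
def q_value(weights, features):
    """Linear model: Q(s, a) = w . phi(s, a)."""
    return sum(w * x for w, x in zip(weights, features))


def semi_gradient_step(weights, features, reward, next_features_per_action,
                       done, lr=0.1, gamma=0.99):
    """One regression step toward the TD target (hypothetical helper).

    The target is treated as a fixed label, so no gradient flows through it.
    Returns the updated weights and the loss 0.5 * error^2 before the update.
    """
    next_q = 0.0 if done else max(
        q_value(weights, f) for f in next_features_per_action)
    target = reward + gamma * next_q
    error = target - q_value(weights, features)
    # Gradient of 0.5 * error^2 w.r.t. the weights is -error * features.
    new_weights = [w + lr * error * x for w, x in zip(weights, features)]
    return new_weights, 0.5 * error ** 2


new_w, loss = semi_gradient_step(
    weights=[0.0, 0.0], features=[1.0, 0.0], reward=1.0,
    next_features_per_action=[], done=True)
```

Because consecutive samples come from the same trajectory and the target moves with the weights, the i.i.d. and fixed-label assumptions of ordinary regression do not hold, which is the instability the slide points to.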

Deep Q-Network
• First step towards “General Artificial Intelligence”
• DQN = Q-learning + Function Approximation + Deep Network
• Stabilize training with experience replay and target network
• End-to-end RL approach, and quite flexible
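
The two stabilizers mentioned above can be sketched as follows. This is a minimal illustration under assumptions: `ReplayBuffer` and `sync_target` are hypothetical names, and network parameters are stood in for by a plain dictionary rather than a real deep network.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are evicted."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling breaks the correlation between consecutive steps.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


def sync_target(online_params, target_params):
    """Hard update: copy the online parameters into the target network."""
    target_params.clear()
    target_params.update(online_params)


buffer = ReplayBuffer(capacity=3)
for t in range(5):
    buffer.push(t, 0, 0.0, t + 1, False)
batch = buffer.sample(2)
```

Calling `sync_target` only every so many updates keeps the regression target fixed between syncs; how often to sync is a tuning choice not specified here.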

Practical Tips
• Stable training: experience replay (buffer size 1M) + fixed target network
• Mini-batch updates
• Exploration vs. exploitation (E&E): epsilon-greedy with epsilon annealed from 1.0 to 0.1
• Input to the Q-network is the 4 most recent frames
• Frame skipping
• Discount factor of 0.99
• Use RMSProp instead of plain SGD
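
The epsilon schedule in the tips above might look like the sketch below. The linear shape of the decay and the `decay_steps` argument are assumptions; the slides only give the start and end values (1.0 and 0.1).

```python
import random


def linear_epsilon(step, decay_steps, start=1.0, end=0.1):
    """Linearly anneal epsilon from `start` to `end` over `decay_steps` steps."""
    fraction = min(max(step, 0) / decay_steps, 1.0)
    return start * (1.0 - fraction) + end * fraction


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


eps_mid = linear_epsilon(500, decay_steps=1000)   # halfway through the decay
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

After `decay_steps` the schedule stays at the final value, so the agent keeps a small amount of exploration for the rest of training.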

DQN variants
• Double DQN
• Prioritized Experience Replay
• Dueling Architecture
• Asynchronous Methods
• Continuous DQN

Double Q-learning
• Motivation: reduce overestimation by decomposing the
max operation in the target into action selection and
action evaluation
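
A tabular sketch of this decomposition, assuming two value tables updated at random: the table being updated selects the action, and the other table evaluates it. The function name and default hyperparameters are illustrative.

```python
import random
from collections import defaultdict


def double_q_update(QA, QB, state, action, reward, next_state, actions, done,
                    alpha=0.1, gamma=0.99, rng=random):
    """One Double Q-learning step (hypothetical helper). Returns the target."""
    # Randomly choose which table is updated on this step.
    if rng.random() < 0.5:
        select, evaluate = QA, QB
    else:
        select, evaluate = QB, QA
    if done:
        target = reward
    else:
        # Action selection uses the table being updated ...
        best = max(actions, key=lambda b: select[(next_state, b)])
        # ... while action evaluation uses the other table.
        target = reward + gamma * evaluate[(next_state, best)]
    select[(state, action)] += alpha * (target - select[(state, action)])
    return target


QA, QB = defaultdict(float), defaultdict(float)
double_q_update(QA, QB, "s0", 0, 1.0, "s1", [0, 1], done=False)
```

Because the evaluating table did not choose the action, an upward noise spike in one table is less likely to be propagated into the target.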

Double DQN
• From Double Q-learning to DDQN
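
In Double DQN the two roles are filled by networks DQN already has: the online network selects the next action and the target network evaluates it. The sketch below compares the two targets for a single transition; the function names and the example Q-values are made up for illustration.

```python
def dqn_target(reward, next_q_target, done, gamma=0.99):
    """Standard DQN target: the target network both selects and evaluates."""
    return reward if done else reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, done, gamma=0.99):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=lambda a: next_q_online[a])
    return reward + gamma * next_q_target[best_action]


next_q_online = [1.0, 2.0, 0.5]   # Q(s', .) from the online network
next_q_target = [1.5, 1.2, 3.0]   # Q(s', .) from the target network
y_dqn = dqn_target(0.0, next_q_target, done=False)
y_ddqn = double_dqn_target(0.0, next_q_online, next_q_target, done=False)
```

With these numbers the Double DQN target is smaller than the DQN target, because the action preferred by the online network is not the one the target network rates highest.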

Prioritized Experience Replay
• Motivation: replay informative transitions more frequently
• Key components
• Criterion of importance: TD error
• Stochastic prioritization instead of greedy prioritization
• Importance sampling to correct the bias introduced by non-uniform sampling
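
The three components above can be sketched in a few functions. This is a simplified proportional scheme over a plain list (no sum-tree); the exponents `alpha` and `beta`, the small constant `eps`, and normalizing weights by the largest weight in the batch are illustrative assumptions.

```python
import random


def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Priority p_i = (|td_error_i| + eps)^alpha, normalized to probabilities."""
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(probs, batch_size, rng=random):
    """Stochastic prioritization: every transition keeps a nonzero chance."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


def importance_weights(probs, indices, beta=0.4):
    """Weights (N * P(i))^(-beta), scaled so the largest weight is 1."""
    n = len(probs)
    weights = [(n * probs[i]) ** (-beta) for i in indices]
    max_w = max(weights)
    return [w / max_w for w in weights]


td_errors = [0.1, 2.0, 0.5, 0.0]
probs = sampling_probabilities(td_errors)
batch = sample_indices(probs, batch_size=2)
weights = importance_weights(probs, batch)
```

Each sampled transition's loss would then be multiplied by its weight, so frequently replayed transitions contribute smaller updates.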

Dueling Architecture - Motivation
• Motivation: for many states, estimating the state value matters more than estimating each state-action value
• Approximate the state value better, and leverage the advantage function

Dueling Architecture - Details
• Plugs into existing DQN algorithms (the output of the dueling network is still a Q function)
• Estimate the value function and the advantage function separately, and combine them to estimate the action-value function
• During back-propagation, the estimates of the value function and the advantage function are learned automatically, without separate targets
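
The combination step can be sketched as below, using the mean-subtracted form Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a'). The function name is hypothetical, and the two inputs stand in for the outputs of the value and advantage streams of the network.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar V(s) and per-action advantages into Q-values.

    Subtracting the mean advantage makes the split between V and A
    identifiable: shifting every advantage by a constant leaves Q unchanged.
    """
    mean_adv = sum(advantages) / len(advantages)
    return [state_value + (a - mean_adv) for a in advantages]


q = dueling_q_values(2.0, [1.0, 0.0, -1.0])
q_shifted = dueling_q_values(2.0, [2.0, 1.0, 0.0])
```

Since the output is an ordinary vector of Q-values, the rest of the DQN pipeline (targets, loss, replay) can stay as it is.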

Dueling Architecture - Performance
• Converges faster
• More robust to noise (the differences between Q-values for a given state are often small, so noise can make a nearly greedy policy switch abruptly)
• Achieves better performance on Atari games (the advantage grows when the number of actions is large)

More variants
• Continuous action control + DQN
• NAF: continuous variant of Q-learning algorithm
• DDPG: Deep DPG
• Asynchronous Methods + DQN
• multiple agents in parallel + parameter server

References
• Playing Atari with Deep Reinforcement Learning
• Human-level Control through Deep Reinforcement Learning
• Deep Reinforcement Learning with Double Q-learning
• Prioritized Experience Replay
• Dueling Network Architectures for Deep Reinforcement Learning
• Asynchronous Methods for Deep Reinforcement Learning
• Continuous Control with Deep Reinforcement Learning
• Continuous Deep Q-Learning with Model-based Acceleration
• Double Q-learning
• Deep Reinforcement Learning: An Overview
