Deep Reinforcement Learning with
Double Q-learning
Presenter: Takato Yamazaki
About the Paper
Title
Deep Reinforcement Learning with Double Q-learning
[arXiv:1509.06461]
Authors
Hado van Hasselt, Arthur Guez, David Silver
Affiliation
Google DeepMind
Year
2015
Outline
How DDQN was Derived
DDQN
Experiment Environment
Results
Summary
Related Papers
How DDQN was Derived
Reinforcement Learning
Agent's goal: learn good policies for sequential decision problems.
Under policy π, the true value Q of action a in state s is

    Q_π(s, a) = E[ R_1 + γR_2 + ⋯ ∣ S_0 = s, A_0 = a, π ]

The optimal value is then

    Q_*(s, a) = max_π Q_π(s, a)
How DDQN was Derived
Q-learning (Watkins, 1989)
    Q(s, a) ← Q(s, a) + α ( R_{t+1} + γ max_{a′} Q(s′, a′) − Q(s, a) )

where α is the learning rate.
The current Q value moves closer to (reward + discounted next Q value).
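The update above can be sketched for the tabular case. This is a minimal illustration; the grid size, learning rate, and the toy transition are invented for the example:

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

# Toy example: 3 states, 2 actions, all values start at zero.
Q = np.zeros((3, 2))
Q = q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # alpha * (r + gamma * 0 - 0) = 0.1
```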
How DDQN was Derived
Deep Q-learning (Mnih et al., 2015)
What if there are infinitely many states?
Q-learning can be cast as a minimization problem.
A neural network can be used to minimize the error!

    Y_t^{DQN} = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t^−)

    min_{θ_t} L(θ_t) = min_{θ_t} E[ ( R_{t+1} + γ max_{a′} Q(s′, a′; θ_t^−) − Q(s, a; θ_t) )² ]
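The DQN target and squared TD error can be sketched with plain numpy, treating the target network as an arbitrary function mapping a state to a vector of action values (the network stand-in and the numbers here are hypothetical):

```python
import numpy as np

def dqn_target(r, q_next_target, gamma=0.99, done=False):
    """Y^DQN = r + gamma * max_a' Q(s', a'; theta^-); bootstrapping stops at episode end."""
    return r if done else r + gamma * np.max(q_next_target)

# q_next_target stands in for Q(s', .; theta^-) from the target network.
q_next_target = np.array([0.5, 2.0, 1.0])
y = dqn_target(r=1.0, q_next_target=q_next_target)
q_sa = 2.5                # current estimate Q(s, a; theta)
loss = (y - q_sa) ** 2    # squared TD error, minimized w.r.t. theta
print(round(y, 2), round(loss, 4))  # 2.98 0.2304
```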
How DDQN was Derived
Deep Q-learning (Mnih et al., 2015) (Continued)
Experience replay
Store observed transitions to memory bank
Sample from memory bank randomly and train network
Target network
Copy the online network θ_t to the target network θ_t^− every τ steps
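Experience replay can be sketched as follows. This is a minimal illustration, not the DeepMind implementation; the capacity and batch size are arbitrary:

```python
import random
from collections import deque

class ReplayBuffer:
    """Store transitions; sample uniformly at random to decorrelate training data."""
    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # old transitions fall off the back

    def push(self, s, a, r, s_next, done):
        self.memory.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

buffer = ReplayBuffer()
for t in range(10):
    buffer.push(s=t, a=0, r=1.0, s_next=t + 1, done=False)
batch = buffer.sample(4)  # random minibatch for one gradient step
print(len(batch))  # 4
```

Each gradient step then trains the online network on a sampled minibatch, while the target network is only refreshed by copying the online weights every τ steps.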
How DDQN was Derived
Double Q-learning (van Hasselt, 2010)
Q-learning often OVERESTIMATES the Q values because...
it uses the maximum action value every time to update Q values
it uses the same values to select and to evaluate an action
Double Q-learning helps avoid overestimation!
Split the weights θ into a selector and an evaluator
Double Q-learning (van Hasselt, 2010) (continued)
Double Q-learning (van Hasselt, 2010) (continued)
Q-learning target:

    Y_t^Q = R_{t+1} + γ max_{a′} Q(s′, a′; θ_t)

Transform to:

    Y_t^Q = R_{t+1} + γ Q( s′, argmax_a Q(s′, a; θ_t); θ_t )

Use a different parameter θ_t′ for evaluating the Q-value:

    Y_t^{DoubleQ} = R_{t+1} + γ Q( s′, argmax_a Q(s′, a; θ_t); θ_t′ )
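In the tabular algorithm this is done with two tables, QA and QB: one selects the greedy action, the other evaluates it. A minimal sketch of one such update (the full van Hasselt (2010) algorithm flips a coin each step to swap the roles; the toy numbers are invented):

```python
import numpy as np

def double_q_update(QA, QB, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Update QA: QA SELECTS the next action, QB EVALUATES it."""
    a_star = np.argmax(QA[s_next])              # selector: argmax_a QA(s', a)
    td_target = r + gamma * QB[s_next, a_star]  # evaluator: QB(s', a*)
    QA[s, a] += alpha * (td_target - QA[s, a])

QA = np.zeros((3, 2))
QB = np.zeros((3, 2))
QB[2, 0] = 5.0  # the evaluator already has an estimate for state 2, action 0
double_q_update(QA, QB, s=0, a=1, r=1.0, s_next=2)
print(QA[0, 1])  # alpha * (r + gamma * QB[2, argmax QA[2]])
```

Because the selector's argmax is scored by an independently-trained evaluator, a single table's estimation noise no longer inflates the target.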
Double Q-learning (van Hasselt, 2010) (continued)
DDQN
Double Deep Q-learning (DDQN)
A combination of DQN and Double Q-learning!
The neural networks act as selector and evaluator.
Implementation is easy because DQN already provides a target network:
Online network θ_t = Selector
Target network θ_t^− = Evaluator
Double Deep Q-learning (DDQN) (continued)
Double Q-learning's target was

    Y_t^{DoubleQ} = R_{t+1} + γ Q( s′, argmax_a Q(s′, a; θ_t); θ_t′ )

Transform for DDQN:

    Y_t^{DoubleDQN} = R_{t+1} + γ Q( s′, argmax_a Q(s′, a; θ_t); θ_t^− )

where θ_t is the online network and θ_t^− is the target network.
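Treating the two networks as functions returning action-value vectors, the DDQN target can be sketched as follows (the network stand-ins and numbers are hypothetical):

```python
import numpy as np

def ddqn_target(r, q_next_online, q_next_target, gamma=0.99, done=False):
    """Y^DoubleDQN: the ONLINE net selects the action, the TARGET net evaluates it."""
    if done:
        return r
    a_star = np.argmax(q_next_online)         # selector: argmax_a Q(s', a; theta)
    return r + gamma * q_next_target[a_star]  # evaluator: Q(s', a*; theta^-)

# The online net prefers action 1; the target net would have preferred action 2.
# Evaluating the online net's choice with the target net damps the
# overestimation that a single max over either network would produce.
q_next_online = np.array([0.5, 2.0, 1.0])
q_next_target = np.array([0.6, 1.5, 3.0])
y = ddqn_target(r=1.0, q_next_online=q_next_online, q_next_target=q_next_target)
print(round(y, 3))  # 1.0 + 0.99 * 1.5 = 2.485
```

For comparison, the plain DQN target on the same numbers would be 1.0 + 0.99 * max(q_next_target) = 3.97.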
Experiment Environment
Atari 2600 Games, using the Arcade Learning Environment (ALE)
Experiment Environment
Network
Optimizer: RMSProp
Experiment Environment
Parameters (DQN, DDQN)
Discount value: γ = 0.99
Learning rate: α = 0.00025
Target network update: every 10000 steps
Exploration: epsilon-greedy method
Epsilon: ε = max(1 − t/1,000,000, 0.1)
Steps: 50,000,000 steps
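The annealing schedule above can be written as a one-liner (a sketch; t is the global step count, and the parameter names are my own):

```python
def epsilon(t, floor=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from 1.0 down to `floor` over the first million steps."""
    return max(1.0 - t / anneal_steps, floor)

print(epsilon(0))          # 1.0  -> pure exploration at the start
print(epsilon(500_000))    # 0.5  -> halfway through annealing
print(epsilon(5_000_000))  # 0.1  -> clamped at the floor thereafter
```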
Experiment Environment
Parameters (Tuned for DDQN)
Discount value: γ = 0.99
Learning rate: α = 0.00025
Target network update: every 30000 steps
Exploration: epsilon-greedy method
Epsilon: ε = max(1 − t/1,000,000, 0.01)
Steps: 50,000,000 steps
Results
DDQN is better than DQN
Value estimates: (1/T) Σ_{t=1}^{T} max_a Q(S_t, a; θ)
Results
More results
Results
More results (100 games each)
Results
More results
Summary
DDQN > DQN for most of the environments.
Less overestimation of values.
Implementation is easy!
Go DDQN!!
Related Papers
Elhadji Amadou Oury Diallo et al.: "Learning Power of Coordination in Adversarial Multi-Agent with Distributed Double DQN".
Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver: "Continuous control with deep reinforcement learning", 2015; arXiv:1509.02971.
Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot: "Dueling Network Architectures for Deep Reinforcement Learning", 2015; arXiv:1511.06581.
