Deep Q-Networks
Volodymyr Mnih
Deep RL Bootcamp, Berkeley
Recap: Q-Learning
■ Learning a parametric Q function: Q(s, a; θ).
■ Remember: Q*(s, a) = E[r + γ max_a′ Q*(s′, a′) | s, a].
■ Update: θ ← θ + α (r + γ max_a′ Q(s′, a′; θ) − Q(s, a; θ)) ∇_θ Q(s, a; θ).
■ For a tabular function, Q(s, a; θ) = θ_(s,a), we recover the familiar update (sketched in code below): Q(s, a) ← Q(s, a) + α (r + γ max_a′ Q(s′, a′) − Q(s, a)).
■ Converges to the optimal values (under suitable conditions).
■ Does it work with neural network Q functions?
■ Yes, but with some care.
Recap: (Tabular) Q-Learning
Recap: Approximate Q-Learning
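To make the recap concrete, here is a minimal tabular Q-learning sketch (illustrative only; the `env` interface with `reset`, `step`, and `sample_action`, and all hyperparameters, are assumptions, not part of the lecture):

```python
import numpy as np

def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (assumed env API)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy action selection
            a = env.sample_action() if np.random.rand() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # the familiar tabular update: move Q(s,a) toward the bootstrapped target
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```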
DQN
● High-level idea: make Q-learning look like supervised learning.
● Two main ideas for stabilizing Q-learning (both sketched in code after this slide):
● Apply Q-updates on batches of past experience instead of online:
○ Experience replay (Lin, 1993).
○ Previously used for better data efficiency.
○ Makes the data distribution more stationary.
● Use an older set of weights to compute the targets (target network):
○ Keeps the target function from changing too quickly.
“Human-Level Control Through Deep Reinforcement Learning”, Mnih, Kavukcuoglu, Silver et al. (2015)
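A compact sketch of both stabilizers, experience replay plus a target network (a minimal PyTorch illustration, not the DeepMind implementation; the network, optimizer, and transition format are assumptions):

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn.functional as F

class ReplayBuffer:
    """Store past transitions; sample uniform random minibatches for updates."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):                 # (s, a, r, s_next, done)
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    s, a, r, s_next, done = zip(*batch)
    s = torch.as_tensor(np.array(s), dtype=torch.float32)
    s_next = torch.as_tensor(np.array(s_next), dtype=torch.float32)
    a = torch.as_tensor(a, dtype=torch.int64)
    r = torch.as_tensor(r, dtype=torch.float32)
    done = torch.as_tensor(done, dtype=torch.float32)
    # targets come from the older, frozen weights (the target network)
    with torch.no_grad():
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.smooth_l1_loss(q, target)         # Huber loss on the Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C steps, refresh the target network from the online weights:
# target_net.load_state_dict(q_net.state_dict())
```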
Target Network Intuition
[Diagram: transition from state s to successor state s′.]
● Changing the value of one action will change the value of other actions and of similar states.
● The network can end up chasing its own tail because of bootstrapping.
● Somewhat surprising fact: bigger networks are less prone to this because they alias less.
DQN Training Algorithm
DQN Details
● Uses Huber loss instead of squared loss on the Bellman error: L_δ(x) = ½x² for |x| ≤ δ, and δ(|x| − ½δ) otherwise (quadratic near zero, linear in the tails, so gradients stay bounded; δ = 1 in DQN).
● Uses RMSProp instead of vanilla SGD.
○ Optimization in RL really matters.
● It helps to anneal the exploration rate.
○ Start ε at 1 and anneal it to 0.1 or 0.05 over the first million frames.
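Both details in a short sketch: Huber loss on the Bellman error and a linearly annealed ε (the schedule constants follow the slide; the function names are illustrative):

```python
import torch.nn.functional as F

def huber_bellman_loss(q_pred, q_target, delta=1.0):
    """Quadratic near zero, linear in the tails: keeps gradients bounded."""
    return F.huber_loss(q_pred, q_target, delta=delta)

def epsilon_schedule(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000):
    """Linearly anneal epsilon from 1.0 to 0.1 over the first million frames."""
    frac = min(step / anneal_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```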
DQN on ATARI
[Screenshots: Pong, Enduro, Beamrider, Q*bert]
• 49 ATARI 2600 games.
• From pixels to actions.
• The change in score is the reward.
• Same algorithm.
• Same function approximator, w/ 3M free parameters.
• Same hyperparameters.
• Roughly human-level performance on 29 out of 49 games.
ATARI Network Architecture
[Architecture diagram: 4x84x84 input (stack of 4 previous frames) → convolutional layer of 16 8x8 filters with rectified linear units → convolutional layer of 32 4x4 filters with rectified linear units → fully-connected layer of 256 rectified linear hidden units → fully-connected linear output layer.]
● Convolutional neural network architecture:
○ History of frames as input.
○ One output per action: the expected return for taking that action, Q(s, a).
○ Final results used a slightly bigger network (3 convolutional + 1 fully-connected hidden layers).
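The diagrammed network written out as a PyTorch sketch (filter counts and layer sizes follow the slide; the strides of 4 and 2 match the published DQN architecture):

```python
import torch.nn as nn

class AtariQNetwork(nn.Module):
    """4x84x84 frame stack in, one Q-value per action out."""
    def __init__(self, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 16, kernel_size=8, stride=4),   # 16 8x8 filters
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 9 * 9, 256),                  # 256 hidden units
            nn.ReLU(),
            nn.Linear(256, n_actions),                   # linear output layer
        )

    def forward(self, x):                                # x: (batch, 4, 84, 84)
        return self.net(x)
```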
Stability Techniques
Atari Results
“Human-Level Control Through Deep Reinforcement Learning”, Mnih, Kavukcuoglu, Silver et al. (2015)
DQN Playing ATARI
Action Values on Pong
Learned Value Functions
Sacrificing Immediate Rewards
DQN Source Code
● The DQN source code (in Lua+Torch) is available:
https://sites.google.com/a/deepmind.com/dqn/
Neural Fitted Q Iteration
● NFQ (Riedmiller, 2005) trains neural networks with Q-learning.
● Alternates between collecting new data and fitting a new Q-function to all previous experience with batch gradient descent.
● DQN can be seen as an online variant of NFQ.
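A self-contained fitted-Q sketch in the NFQ spirit (NFQ proper trains a fixed multilayer perceptron with Rprop; here scikit-learn's MLPRegressor stands in, states are assumed to be feature vectors, and the transition format is an assumption):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def fitted_q_iteration(transitions, n_actions, n_iters=20, gamma=0.99):
    """Refit Q on ALL stored experience each iteration (NFQ-style batch training).
    `transitions` is a list of (s, a, r, s_next, done) with vector-valued states."""
    s, a, r, s_next, done = map(np.array, zip(*transitions))
    X = np.hstack([s, a[:, None]])        # regress Q on (state, action) pairs
    q = MLPRegressor(hidden_layer_sizes=(64,), max_iter=500)
    y = r.astype(float)                   # first targets: immediate rewards
    for _ in range(n_iters):
        q.fit(X, y)                       # batch fit on everything collected so far
        # bootstrapped targets from the freshly fitted Q-function
        next_q = np.max([q.predict(np.hstack([s_next, np.full((len(s_next), 1), b)]))
                         for b in range(n_actions)], axis=0)
        y = r + gamma * (1 - done) * next_q
    return q
```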
Lin’s Networks
● Long-Ji Lin’s thesis “Reinforcement Learning for Robots using Neural
Networks” (1993) also trained neural nets with Q-learning.
● Introduced experience replay among other things.
● Lin’s networks did not share parameters among actions.
[Diagram: Lin's architecture uses a separate network per action, each with a single output Q(a1, s), Q(a2, s), Q(a3, s); DQN uses one shared network with one output per action, producing all Q(a1, s), Q(a2, s), Q(a3, s) jointly.]
Double DQN
● There is an upward bias in max_a Q(s, a; θ).
● DQN maintains two sets of weights, θ and θ⁻, so we can reduce the bias by using:
○ θ for selecting the best action.
○ θ⁻ for evaluating the best action.
● Double DQN loss: L(θ) = (r + γ Q(s′, argmax_a′ Q(s′, a′; θ); θ⁻) − Q(s, a; θ))².
“Deep Reinforcement Learning with Double Q-Learning”, van Hasselt et al. (2016)
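A sketch of the Double DQN target computation (tensor shapes and network objects are placeholders; θ is `q_net`, θ⁻ is `target_net`):

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """theta selects the argmax action; theta^- evaluates it."""
    with torch.no_grad():
        best_action = q_net(s_next).argmax(dim=1, keepdim=True)        # select with theta
        next_q = target_net(s_next).gather(1, best_action).squeeze(1)  # evaluate with theta^-
        return r + gamma * (1 - done.float()) * next_q
```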
Prioritized Experience Replay
● Replaying all transitions with equal probability is highly suboptimal.
● Replay transitions in proportion to the absolute Bellman error: |r + γ max_a′ Q(s′, a′; θ⁻) − Q(s, a; θ)|.
● Leads to much faster learning.
“Prioritized Experience Replay”, Schaul et al. (2016)
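A minimal proportional-prioritization sketch (it omits the paper's sum-tree for efficient sampling and the importance-sampling corrections; a plain list is enough for illustration):

```python
import numpy as np

class PrioritizedReplay:
    """Sample transitions with probability proportional to |Bellman error|^alpha."""
    def __init__(self, capacity=100_000, alpha=0.6, eps=1e-6):
        self.data, self.priorities = [], []
        self.capacity, self.alpha, self.eps = capacity, alpha, eps

    def add(self, transition, bellman_error):
        if len(self.data) >= self.capacity:
            self.data.pop(0); self.priorities.pop(0)
        self.data.append(transition)
        self.priorities.append((abs(bellman_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        p = np.asarray(self.priorities)
        idx = np.random.choice(len(self.data), size=batch_size, p=p / p.sum())
        return [self.data[i] for i in idx], idx

    def update_priorities(self, idx, bellman_errors):
        # refresh priorities after computing new Bellman errors for the batch
        for i, err in zip(idx, bellman_errors):
            self.priorities[i] = (abs(err) + self.eps) ** self.alpha
```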
Dueling DQN
● Value-Advantage decomposition of Q: Q(s, a) = V(s) + A(s, a).
● Dueling DQN (Wang et al., 2015) computes V(s) and A(s, a) in separate streams and recombines them into Q(s, a).
[Diagram: a standard DQN head outputs Q(s, a) directly; the dueling head splits into a V(s) stream and an A(s, a) stream that are combined into Q(s, a).]
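A sketch of the dueling head (it uses the paper's mean-subtracted combination, which keeps V and A identifiable; the layer widths are assumptions):

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split features into V(s) and A(s,a) streams, then recombine into Q(s,a)."""
    def __init__(self, feature_dim, n_actions, hidden=256):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)          # V(s): (batch, 1)
        a = self.advantage(features)      # A(s,a): (batch, n_actions)
        # subtract the mean advantage so the V and A streams are identifiable
        return v + a - a.mean(dim=1, keepdim=True)
```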
Atari Results
“Dueling Network Architectures for Deep Reinforcement Learning”, Wang et al. (2016)
Noisy Nets for Exploration
● Add noise to network parameters for better exploration [Fortunato, Azar, Piot et al. (2017)].
● Standard linear layer: y = wx + b.
● Noisy linear layer: y = (w + σ_w ⊙ ε_w)x + (b + σ_b ⊙ ε_b).
● ε_w and ε_b contain noise.
● σ_w and σ_b are learned parameters that determine the amount of noise.
“Noisy Networks for Exploration”, Fortunato, Azar, Piot et al. (2017)
Also see “Parameter Space Noise for Exploration”, Plappert et al. (2017)
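A sketch of a noisy linear layer (independent Gaussian noise per parameter for simplicity; the paper also gives a cheaper factorized-noise variant, and the initialization constants here are illustrative):

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """y = (w + sigma_w * eps_w) x + (b + sigma_b * eps_b); eps resampled each forward pass."""
    def __init__(self, in_features, out_features, sigma_init=0.017):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features).uniform_(-0.1, 0.1))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # learned noise scales sigma_w, sigma_b
        self.sigma_weight = nn.Parameter(torch.full((out_features, in_features), sigma_init))
        self.sigma_bias = nn.Parameter(torch.full((out_features,), sigma_init))

    def forward(self, x):
        eps_w = torch.randn_like(self.sigma_weight)   # fresh noise eps_w each call
        eps_b = torch.randn_like(self.sigma_bias)     # fresh noise eps_b each call
        weight = self.weight + self.sigma_weight * eps_w
        bias = self.bias + self.sigma_bias * eps_b
        return nn.functional.linear(x, weight, bias)
```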
Questions?