Q-learning with a Neural Network
Deep Reinforcement Learning
[course site]
Xavier Giro-i-Nieto
Associate Professor
Universitat Politecnica de Catalunya
@DocXavi
xavier.giro@upc.edu
Acknowledgments
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
Universitat Politècnica de Catalunya
Outline
1. Motivation
2. Q-Learning with Neural Networks
3. Deep Q-Networks (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with Deep Reinforcement Learning." arXiv preprint arXiv:1312.5602 (2013).
Deep Q-learning (DQN)
Flavours of (model-free) RL: value-based (e.g. DQN), actor-critic (e.g. A3C), and policy-based (e.g. REINFORCE).
Outline
1. Motivation
2. Function Approximation
3. Deep Q-Networks (DQN)
Goals of Reinforcement Learning
Reinforcement Learning with Neural Networks (NN)
An RL agent may learn a value function, a policy π, and/or a model (of the environment).
Solving the Optimal Policy
The optimal policy π* is the one that achieves the optimal value functions V*(s) and Q*(s,a).
Solving the Optimal Policy: Q-Learning
Tabular Q-Learning is feasible for small state-action spaces: the table Q(s,a) holds one row per state and one column per action. For example, the 4x4 FrozenLake grid world (S = start, F = frozen, H = hole, G = goal) has only 16 states:

S F F F
F H F H
F F F F
H F F G
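
For such a small grid, the whole method fits in a few lines. Below is a minimal tabular Q-learning sketch for a 16-state, 4-action grid world; the hyperparameters (α, γ, ε) are illustrative assumptions, not tuned values.

```python
import numpy as np

# Tabular Q-learning for a 4x4 grid world: 16 states x 4 actions.
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.99, 0.1   # assumed hyperparameters

def act(s):
    """Epsilon-greedy action selection over the table row for state s."""
    if np.random.rand() < epsilon:
        return np.random.randint(n_actions)
    return int(Q[s].argmax())

def update(s, a, r, s_next, done):
    """One Q-learning update: move Q(s,a) towards the TD target."""
    td_target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (td_target - Q[s, a])
```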
Q-Function estimation with a table
How would you compute the number of rows of a tabular Q-learning table Q(s,a) for Space Invaders?
Q-Function estimation with a table
Estimating the table Q(s,a) by exploring all possible states would require:
● generating all possible (valid) pixel combinations,
● running all possible actions,
● several times each to allow estimation.
... which is not scalable.
Q-Function approximation with NN
Deep neural networks are powerful function approximators, also of Q*(s,a):

Q(s,a,w) ≈ Q*(s,a), where w are the neural network parameters.
Q-Function approximation with NN
[Diagram: a network takes (state, action) as input and outputs a single Q value.]
Can you think of a more efficient way of estimating Q values? Hint: how many passes through the network do we need to sample an action with this architecture?
Q-Function approximation with NN
For discrete actions, practical implementations actually feed only the state into the NN and estimate a Q-value for each action, so a single forward pass scores every action; see the sketch below.
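
A minimal sketch of this architecture in PyTorch. The two-layer MLP and its hidden width are illustrative assumptions (the DQN paper uses a convolutional network over pixels).

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-network that maps a state to one Q-value per discrete action."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):            # state: (batch, state_dim)
        return self.net(state)           # Q-values: (batch, n_actions)

# Greedy action with a single forward pass, instead of one pass per action:
# action = q_net(state).argmax(dim=-1)
```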
Q-Learning with Neural Networks
#TD-Gammon: Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation, 6(2), 215-219.
Outline
1. Motivation
2. Function Approximation
3. Deep Q-Networks (DQN)
Deep Q-learning (DQN)
Deep Q-Network (DQN) was the first method to combine value-based RL (in particular, Q-learning) with deep neural networks.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Deep Q-learning (DQN)
[Architecture figure: the network outputs one Q-value per action; the number of actions is between 4 and 18, depending on the Atari game.]
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Deep Q-learning (DQN)
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Some challenges arise when combining Q-learning with neural networks trained
with gradient descent:
1. When updating Q(s,a) for a given (s,a) tuple, we will be changing the Q value
estimate for all (s,a). This did not happen in the tabular version of Q-learning.
2. Gradient-based techniques assume that the training samples are independent
and identically distributed (i.i.d.). This assumption is broken when the training
data is generated by an agent interacting with the environment.
Outline
1. Motivation
2. Function approximation
3. Deep Q-Networks (DQN)
○ Online & Target Networks
○ Replay Memory
4. Improvements to DQN
DQN: Online & Target Networks
● The parameters of the online policy Q-network determine the next training samples ➡ this can lead to bad feedback loops.
● A different, more stable target network is used to estimate the TD targets.
Source: Arthur Juliani, “Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond” (2016)
DQN: Online & Target Networks
The TD target is computed with the target network (wᵢ⁻), which is periodically updated by copying the parameters of the online (policy) network (wᵢ).
Source: Arthur Juliani, “Simple Reinforcement Learning with Tensorflow Part 4: Deep Q-Networks and Beyond” (2016)
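
A sketch of the periodic copy, reusing the QNetwork sketch from above; the copy period and the network sizes are assumed values.

```python
import copy

# Online network w_i and a frozen target copy w_i^- (sizes are assumptions).
online_net = QNetwork(state_dim=4, n_actions=2)
target_net = copy.deepcopy(online_net)
COPY_EVERY = 1000   # assumed copy period, in gradient steps

def maybe_sync(step: int) -> None:
    """Every COPY_EVERY steps, copy w_i into the target network w_i^-."""
    if step % COPY_EVERY == 0:
        target_net.load_state_dict(online_net.state_dict())
```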
Outline
1. Motivation
2. Function approximation
3. Deep Q-Networks (DQN)
○ Online & Target Networks
○ Replay Memory
4. Improvements to DQN
DQN: Replay Memory
● Learning from batches of consecutive samples is problematic because the samples are too correlated ➡ inefficient learning.
● Continually update a replay memory table of interactions (s, a, r, s’) as episodes are collected.
● Train the online network (wᵢ) with random minibatches of transitions from the replay memory, instead of consecutive samples.
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
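
A minimal replay memory sketch; the capacity is an assumed value, and the uniform random sampling matches the original DQN (PER, below, changes this).

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-capacity buffer of (s, a, r, s', done) transitions,
    sampled uniformly at random to break temporal correlations."""
    def __init__(self, capacity: int = 100_000):   # assumed capacity
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # tuples of s, a, r, s', done

    def __len__(self):
        return len(self.buffer)
```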
Deep Q-learning (DQN)
Algorithm:
1. Collect transitions (s, a, r, s’) and store them in a replay memory D
2. Sample a random mini-batch of transitions (s, a, r, s’) from the replay memory D
3. Compute the TD targets with respect to (wrt.) the old, frozen parameters w⁻
4. Optimise the MSE loss between the Q estimates and the TD targets using gradient descent
David Silver, UCL course on RL (Lecture 6)
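
The four steps above condense into a short update function. This is a hedged sketch in PyTorch: the batch layout (including a 0/1 done flag), γ, and the plain MSE loss are assumptions consistent with the slide.

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: TD targets from w^- (target net), MSE loss, SGD."""
    s, a, r, s_next, done = batch   # a: long action indices; done: 0/1 floats
    # Q(s,a; w) for the actions actually taken in the batch
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    # TD target r + gamma * max_a' Q(s',a'; w^-), frozen w.r.t. gradients
    with torch.no_grad():
        td_target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, td_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```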
Online demo
Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”
Reminder: taxonomy of DRL methods
Figure: OpenAI Spinning Up
Outline
1. Motivation
2. Q-Learning with Neural Networks
3. Deep Q-Networks (DQN)
Learn more
Volodymyr Mnih, UCL 2018
Beyond DQN by Víctor Campos (2020)
Multiple improvements have been proposed since the original DQN was
published:
● Double DQN
● Prioritized Experience Replay
● Dueling Networks
● Distributional DQN
● Noisy Networks
Double DQN
DQN suffers from overestimation bias. It can be reduced by using the online network to select the action whose Q(s,a) will be used for bootstrapping, while the target network still evaluates that action in the TD target Y.
van Hasselt et al., Deep Reinforcement Learning with Double Q-learning, AAAI 2016
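
A sketch of the Double DQN target computation; the function signature and γ are illustrative assumptions.

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.99):
    """Online network selects the argmax action; target network evaluates it."""
    with torch.no_grad():
        best_a = online_net(s_next).argmax(dim=1, keepdim=True)   # select
        q_eval = target_net(s_next).gather(1, best_a).squeeze(1)  # evaluate
        return r + gamma * (1 - done) * q_eval
```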
Prioritized Experience Replay (PER)
In the original DQN, all samples have the same probability of being sampled for replay (i.e. for learning).
One might want to replay interesting transitions more often. PER assigns a different replay priority to each transition, for instance based on the TD error: transitions with a larger TD error are sampled more often.
Schaul et al., Prioritized Experience Replay, ICLR 2016
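
A sketch of PER's sampling rule only (the proportional variant); α and the small ε floor are assumed constants, and the sum-tree data structure and importance-sampling weights of the full method are omitted.

```python
import numpy as np

ALPHA, EPS = 0.6, 1e-5   # assumed prioritization exponent and priority floor

def sample_indices(td_errors, batch_size):
    """Sample transition indices with probability proportional to
    (|TD error| + EPS) ** ALPHA."""
    priorities = (np.abs(td_errors) + EPS) ** ALPHA
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)
```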
Dueling Networks
Instead of estimating Q(s,a) directly, dueling networks estimate V(s) and A(s,a) and combine them to obtain an estimate of Q(s,a). This architecture helps with generalization across states.
[Figure: a standard DQN head outputs Q(s,a) directly; a dueling head splits into V(s) and A(s,a) streams that are combined into Q(s,a).]
Wang et al., Dueling Network Architectures for Deep Reinforcement Learning, ICML 2016
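
A sketch of the dueling head using the paper's mean-centered combination Q = V + A - mean(A); the layer sizes are illustrative assumptions.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Separate V(s) and A(s,a) streams combined into Q(s,a)."""
    def __init__(self, in_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                     # (B, 1)
        a = self.advantage(features)                 # (B, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)   # Q(s,a): (B, n_actions)
```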
Distributional DQN
Standard Q-learning estimates the expected return for each state and action. Distributional Q-learning instead tries to model the full return distribution, not only its expectation.
Bellemare et al., A Distributional Perspective on Reinforcement Learning, ICML 2017
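
A sketch of a categorical (C51-style) output head: a probability distribution over a fixed grid of return values per action, from which the expected Q is recovered. The atom count and value range are assumed values, and the distributional Bellman projection used for training is omitted.

```python
import torch
import torch.nn as nn

N_ATOMS, V_MIN, V_MAX = 51, -10.0, 10.0          # assumed support grid
atoms = torch.linspace(V_MIN, V_MAX, N_ATOMS)    # fixed return values

class DistributionalHead(nn.Module):
    """Outputs a categorical return distribution per action."""
    def __init__(self, in_dim: int, n_actions: int):
        super().__init__()
        self.n_actions = n_actions
        self.fc = nn.Linear(in_dim, n_actions * N_ATOMS)

    def forward(self, features):
        logits = self.fc(features).view(-1, self.n_actions, N_ATOMS)
        probs = logits.softmax(dim=-1)     # P(return = atom | s, a)
        q = (probs * atoms).sum(dim=-1)    # expected return per action
        return probs, q
```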
Noisy Networks
DQN explores with ε-greedy. Noisy DQN instead adds (learnable) noise to the Q-network parameters when sampling actions.
Fortunato et al., Noisy Networks for Exploration, ICLR 2018
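
A simplified noisy linear layer sketch: each weight gets a learnable noise scale σ, so exploration comes from the parameters rather than from ε. The initialization constants are assumptions, and the paper's factorized-Gaussian noise is replaced here by independent noise for brevity.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer with learnable per-weight Gaussian noise scales."""
    def __init__(self, in_dim: int, out_dim: int, sigma0: float = 0.017):
        super().__init__()
        self.mu_w = nn.Parameter(torch.randn(out_dim, in_dim) * 0.01)
        self.sigma_w = nn.Parameter(torch.full((out_dim, in_dim), sigma0))
        self.mu_b = nn.Parameter(torch.zeros(out_dim))
        self.sigma_b = nn.Parameter(torch.full((out_dim,), sigma0))

    def forward(self, x):
        # Fresh noise at every forward pass drives exploration.
        w = self.mu_w + self.sigma_w * torch.randn_like(self.sigma_w)
        b = self.mu_b + self.sigma_b * torch.randn_like(self.sigma_b)
        return x @ w.t() + b
```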
Rainbow: combining all improvements
Hessel et al., Rainbow: Combining Improvements in Deep Reinforcement Learning, AAAI 2018
What about continuous actions?
DQN derives a policy from a Q-value estimate greedily:
π(s) = argmax_{a∈A} Q(s,a)
For discrete actions, all Q(s,a) for a given state are computed and we take the action with the largest value.
How can we compute this argmax_{a∈A} operation when dealing with continuous (i.e. infinitely many) actions?
What about continuous actions?
DQN derives a policy from a Q-value estimate greedily:
π(s) = argmax_{a∈A} Q(s,a)
For discrete actions, all Q(s,a) for a given state are computed and we take the action with the largest value.
For continuous actions, we can train another neural network that predicts the action that maximizes Q(s,a).
Deep Deterministic Policy Gradient (DDPG)
[Diagram: a critic network takes (state, action) as input and outputs Q.]
Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016
Deep Deterministic Policy Gradient (DDPG)
[Diagram: an actor network maps the state to an action; the critic network takes the state and the actor's action and outputs Q.]
Lillicrap et al., Continuous control with deep reinforcement learning, ICLR 2016
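
A sketch of the two DDPG networks and the actor's objective; the layer sizes, the tanh action bound, and the loss expression are illustrative assumptions consistent with the diagrams.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a state to a continuous action, approximating argmax_a Q(s,a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, state):
        return self.net(state)

class Critic(nn.Module):
    """Takes (state, action) and returns a scalar Q-value."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

# The actor is trained to output actions the critic scores highly:
# actor_loss = -critic(s, actor(s)).mean()
```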