Xavier Giro-i-Nieto
xavier.giro@upc.edu
Associate Professor
Universitat Politècnica de Catalunya
Technical University of Catalonia
Reinforcement
Learning: MDP & DQN
Day 5 Lecture 2
#DLUPC
http://bit.ly/dlai2018
2
Acknowledgements
Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
3
Acknowledgements
Víctor Campos
victor.campos@bsc.es
PhD Candidate
Barcelona Supercomputing Center
Míriam Bellver
miriam.bellver@bsc.edu
PhD Candidate
Barcelona Supercomputing Center
4
Video lecture
Xavier Giró, DLAI 2017
5
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
6
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
Types of machine learning
Yann LeCun’s Black Forest cake
7
8
Types of machine learning
We can categorize three types of learning procedures:
1. Supervised Learning: 𝐲 = ƒ(𝐱). Predict the label y corresponding to observation x.
2. Unsupervised Learning: ƒ(𝐱). Estimate the distribution of observation x.
3. Reinforcement Learning (RL): 𝐲 = ƒ(𝐱), with future reward 𝐳. Predict the action y based on observation x, to maximize a future reward z.
9
10
Rich Sutton, “Temporal-Difference Learning” University of Alberta (2017)
11
Motivation
What is Reinforcement Learning (RL) ?
“a way of programming agents by reward and punishment without needing to
specify how the task is to be achieved”
[Kaelbling, Littman, & Moore, 96]
Kaelbling, Leslie Pack, Michael L. Littman, and Andrew W. Moore. "Reinforcement learning: A survey." Journal of
artificial intelligence research 4 (1996): 237-285.
12
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
13
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller.
"Playing atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
14
Architecture
Figure: UCL Course on RL by David Silver
15
Architecture
Figure: UCL Course on RL by David Silver
Environment
16
Architecture
Figure: UCL Course on RL by David Silver
Environment
state (st)
17
Architecture
Figure: UCL Course on RL by David Silver
Environment
state (st)
18
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
state (st)
19
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
state (st)
20
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st)
21
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st)
Note that the reward reaches the agent with a delay with respect to previous states and actions!
22
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st+1)
23
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st+1)
GOAL: Reach the highest possible score at the end of the game.
24
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st+1)
GOAL: Learn how to take actions to maximize the cumulative reward.
25
Architecture
Many problems can be formulated with an RL architecture.
Cart-Pole Problem
Objective: Balance a pole on top of a movable cart.
Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
26
Architecture
Slide credit: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Environment
Agent
state (st): angle, angular speed, position, horizontal velocity
action (At): horizontal force applied to the cart
reward (rt): 1 at each time step if the pole is upright
27
Architecture
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
Many problems can be formulated with an RL architecture.
Robot Locomotion
Objective: Make the robot move forward.
28
Architecture
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
Environment
Agent
state (st): angle and position of the joints
action (At): torques applied on the joints
reward (rt): 1 at each time step it is upright, plus forward movement
29
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized
advantage estimation." ICLR 2016 [project page]
30
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
○ Policy
○ Optimal Policy
○ Value Function
○ Q-value function
○ Optimal Q-value function
○ Bellman equation
○ Value iteration algorithm
4. Deep Q-learning
5. RL Frameworks
6. Learn more
31
Markov Decision Process (MDP)
Markov Decision Processes provide a formalism for reinforcement learning
problems.
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Markov property: the current state completely characterises the state of the world.
32
Markov Decision Process (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Defined by the tuple (S, A, R, P, γ): set of states S, set of actions A, reward distribution R, transition probabilities P, and discount factor γ.
33
Markov Decision Process (MDP)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
(S, A, R, P, γ)
At time t = 0, the environment samples the initial state s0 ~ p(s0). Then, at each time step t:
1. The agent selects action at.
2. The environment samples the next state st+1 ~ P(. | st, at).
3. The environment samples the reward rt ~ R(. | st, at).
34
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
(S, A, R, P, γ)
The agent selects action at according to a policy π.
A policy π is a function S ➝ A that specifies which action to take in each state.
35
MDP: Policy
A policy π is a function S ➝ A that specifies which action to take in each state.
The agent selects action at according to its policy π.
GOAL: Find the policy π* that maximizes the cumulative discounted reward.
36
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Many problems can be formulated with an RL architecture.
Grid World (a simple MDP)
Objective: reach one of the terminal states (greyed out) in the least number of actions.
37
MDP: Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Environment
Agent
action (At)
reward (rt)
state (st)
Each cell is a state. A negative “reward” (penalty) is given for each transition: rt = r = -1.
38
MDP: Policy: Random
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Example: Actions resulting from applying a random policy on this Grid World
problem.
39
MDP: Policy: Optimal Policy π*
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Exercise: Draw the actions resulting from applying an optimal policy π* in this Grid
World problem.
40
MDP: Policy: Optimal Policy π*
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Solution: Draw the actions resulting from applying an optimal policy π* in this Grid
World problem.
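The random-vs-optimal contrast can be made concrete with a tiny sketch, assuming a 1-D corridor variant of the Grid World with terminal states at both ends and r = -1 per transition (the layout and helper names are illustrative, not from the slides):

```python
import random

TERMINALS = {0, 5}           # greyed-out terminal states at both ends
ACTIONS = (-1, +1)           # move left / move right

def rollout(policy, start, max_steps=1000):
    """Cumulative reward from following `policy` until a terminal state."""
    s, total = start, 0
    for _ in range(max_steps):
        if s in TERMINALS:
            break
        s += policy(s)       # take the action chosen by the policy
        total += -1          # penalty for each transition
    return total

optimal = lambda s: -1 if s <= 2 else +1        # π*: head to the nearest terminal
rand = lambda s: random.choice(ACTIONS)         # random policy

print(rollout(optimal, start=2))  # -2: two steps to the terminal at state 0
```

The optimal policy reaches a terminal in the minimum number of steps, so its (negative) return upper-bounds whatever the random policy achieves from the same start.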
41
MDP: Policy: Optimal Policy π*
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How do we handle the randomness (initial state s0, transition probabilities, actions...)?
GOAL: Find the policy π* that maximizes the cumulative discounted reward:
At time t = 0, the environment samples the initial state s0 ~ p(s0). Then, at each time step t:
1. The agent selects action at ~ π(. | st).
2. The environment samples the next state st+1 ~ P(. | st, at).
3. The environment samples the reward rt ~ R(. | st, at).
42
MDP: Policy: Optimal Policy π*
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How do we handle the randomness (initial state s0, transition probabilities, actions)?
GOAL: Find the policy π* that maximizes the cumulative discounted reward.
The optimal policy π* will maximize the expected cumulative discounted reward:
π* = argmaxπ E[ Σt≥0 γ^t rt | π ], with initial state s0 ~ p(s0), selected action at ~ π(. | st), and sampled state st+1 ~ P(. | st, at).
43
MDP: Value Function Vπ(s)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How can we estimate how good state s is for a given policy π?
With the value function at state s, Vπ(s): the expected cumulative reward from following policy π from state s:
Vπ(s) = E[ Σt≥0 γ^t rt | s0 = s, π ]
44
MDP: Q-Value Function Qπ(s,a)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
How can we estimate how good a state-action pair (s,a) is for a given policy π?
With the Q-value function at state s and action a, Qπ(s,a): the expected cumulative reward from taking action a in state s, and then following policy π:
Qπ(s,a) = E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
45
MDP: Optimal Q-Value Function Q*(s,a)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The optimal Q-value function at state s and action a, Q*(s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair, i.e. the Q-value under the policy that maximizes the expected cumulative reward:
Q*(s,a) = maxπ E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
46
MDP: Optimal Q-Value Function Q*(s,a)
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q*(s,a) satisfies the following Bellman equation:
Q*(s,a) = E s' [ r + γ maxa' Q*(s',a') | s, a ]
The maximum expected cumulative reward for the considered pair (s,a) equals the reward r for that pair, plus the future reward: the maximum expected cumulative reward for the future pair (s',a'), scaled by the discount factor γ, in expectation across the possible future states s' (the randomness).
47
MDP: Bellman Equation
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q*(s,a) satisfies the following Bellman equation:
Q*(s,a) = E s' [ r + γ maxa' Q*(s',a') | s, a ]
GOAL: Find the policy π* that maximizes the cumulative discounted reward.
The optimal policy π* corresponds to taking, in any state, the best action according to Q*: select the action a' that maximizes the expected cumulative reward.
48
MDP: Solving the Optimal Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Value iteration algorithm: estimate the Bellman equation with an iterative update:
Qi+1(s,a) = E[ r + γ maxa' Qi(s',a') | s, a ]
The updated Q-value function Qi+1 is computed from the current Q-values of the future pairs (s',a'). The iterative estimate Qi(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
49
MDP: Solving the Optimal Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Exploring all possible states and actions is not scalable by itself, let alone iteratively.
E.g., for a video game it would require generating all possible screens of pixels and actions, as many times as the iterations needed to estimate Q*(s,a).
Qi+1(s,a) = E[ r + γ maxa' Qi(s',a') | s, a ]
The iterative estimate Qi(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
50
MDP: Solving the Optimal Policy
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Exploring all possible states and actions is not scalable. E.g., for a video game it would require generating all possible screens of pixels and actions.
Solution: use a deep neural network as a function approximator of Q*(s,a):
Q(s,a,Ө) ≈ Q*(s,a), where Ө are the neural network parameters.
51
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
52
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The function to approximate is a Q-function that satisfies the Bellman equation:
Q(s,a,Ө) ≈ Q*(s,a)
53
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The function to approximate is a Q-function that satisfies the Bellman equation:
Q(s,a,Ө) ≈ Q*(s,a)
Forward pass. Loss function:
Li(Өi) = E s,a [ ( yi - Q(s,a;Өi) )² ], with target yi = E s' [ r + γ maxa' Q(s',a';Өi-1) ]
Sample an (s,a) pair and predict its Q-value with Өi; sample a future state s' and predict the target Q-value with the previous parameters Өi-1.
54
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Train the DNN to approximate a Q-value function that satisfies the Bellman equation.
55
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
The reward must be computed during training.
56
Deep Q-learning
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Forward pass. Loss function:
Li(Өi) = E s,a [ ( yi - Q(s,a;Өi) )² ]
Backward pass. Gradient update (with respect to the Q-function parameters Ө):
∇Өi Li(Өi) = E s,a,s' [ ( r + γ maxa' Q(s',a';Өi-1) - Q(s,a;Өi) ) ∇Өi Q(s,a;Өi) ]
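One forward/backward step can be sketched in its simplest form, assuming the Q-function is a plain table Q(s,a;θ) = θ[s,a] (the degenerate case of a function approximator; all names here are illustrative, not the DQN implementation). The target uses a frozen copy of the parameters, and the update moves Q(s,a;θ) toward the target, which reduces the squared loss:

```python
GAMMA, LR = 0.9, 0.5
theta = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}   # θ: one entry per (s,a)

def td_step(theta, theta_old, s, a, r, s_next):
    """One gradient step on the squared loss (target - Q(s,a;θ))^2."""
    # Target computed with the frozen parameters θ_{i-1}:
    target = r + GAMMA * max(theta_old[(s_next, a2)] for a2 in (0, 1))
    td_error = target - theta[(s, a)]       # (negative half-)derivative of the loss
    theta[(s, a)] += LR * td_error          # gradient update on θ_i
    return td_error

theta_old = dict(theta)                     # frozen copy playing the role of θ_{i-1}
err = td_step(theta, theta_old, s=0, a=1, r=1.0, s_next=1)
print(err)  # 1.0 on the first update, since all Q-values start at 0
```

In the DQN of Mnih et al., θ parameterizes a convolutional network instead of a table, but the target/loss structure of the step is the same.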
57
Deep Q-learning: Deep Q-Network DQN
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q(s,a,Ө) ≈ Q*(s,a)
58
Deep Q-learning: Deep Q-Network DQN
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q(s,a,Ө) ≈ Q*(s,a)
A single feedforward pass computes the Q-values for all actions from the current state (efficient).
59
Deep Q-learning: Deep Q-Network DQN
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al.
"Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
The number of actions is between 4 and 18, depending on the Atari game.
60
Deep Q-learning: Deep Q-Network DQN
Slide concept: Serena Yeung, “Deep Reinforcement Learning”. Stanford University CS231n, 2017.
Q(st, ⬅), Q(st, ➡), Q(st, ⬆), Q(st, ⬇)
61
Deep Q-learning: Demo
Andrej Karpathy, “ConvNetJS Deep Q Learning Demo”
62
Deep Q-learning: Demo
63
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
64
RL Frameworks
OpenAI Gym + keras-rl
+
keras-rl
keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently on either CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box.
Slide credit: Míriam Bellver
65
RL Frameworks
OpenAI
Universe
environment
66
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
67
Learn more
Deep Learning TV, “Reinforcement learning - Ep. 30”
Siraj Raval, Deep Q Learning for Video Games
68
Learn more
David Silver, UCL COMP050, Reinforcement Learning
69
Learn more
Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning,
Berkeley.
Slides: “Reinforcement Learning - Policy Optimization” OpenAI / UC Berkeley (2017)
70
Learn more
Nando de Freitas, “Machine Learning” (University of
Oxford)
71
Homework
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I.,
Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree
search. Nature, 529(7587), pp.484-489
72
Homework (exam material!!)
Greg Kohs, “AlphaGo” (2017). [@ Netflix]
73
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
74
Final Questions

 

Plus de Universitat Politècnica de Catalunya (20)

Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
Deep Generative Learning for All - The Gen AI Hype (Spring 2024)
 
Deep Generative Learning for All
Deep Generative Learning for AllDeep Generative Learning for All
Deep Generative Learning for All
 
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
The Transformer in Vision | Xavier Giro | Master in Computer Vision Barcelona...
 
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-NietoTowards Sign Language Translation & Production | Xavier Giro-i-Nieto
Towards Sign Language Translation & Production | Xavier Giro-i-Nieto
 
The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021The Transformer - Xavier Giró - UPC Barcelona 2021
The Transformer - Xavier Giró - UPC Barcelona 2021
 
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
Learning Representations for Sign Language Videos - Xavier Giro - NIST TRECVI...
 
Open challenges in sign language translation and production
Open challenges in sign language translation and productionOpen challenges in sign language translation and production
Open challenges in sign language translation and production
 
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in VideosGeneration of Synthetic Referring Expressions for Object Segmentation in Videos
Generation of Synthetic Referring Expressions for Object Segmentation in Videos
 
Discovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in MinecraftDiscovery and Learning of Navigation Goals from Pixels in Minecraft
Discovery and Learning of Navigation Goals from Pixels in Minecraft
 
Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...Learn2Sign : Sign language recognition and translation using human keypoint e...
Learn2Sign : Sign language recognition and translation using human keypoint e...
 
Intepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural NetworksIntepretability / Explainable AI for Deep Neural Networks
Intepretability / Explainable AI for Deep Neural Networks
 
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
Convolutional Neural Networks - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
Self-Supervised Audio-Visual Learning - Xavier Giro - UPC TelecomBCN Barcelon...
 
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
Attention for Deep Learning - Xavier Giro - UPC TelecomBCN Barcelona 2020
 
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
Generative Adversarial Networks GAN - Xavier Giro - UPC TelecomBCN Barcelona ...
 
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
Language and Vision with Deep Learning - Xavier Giró - ACM ICMR 2020 (Tutorial)
 
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
Image Segmentation with Deep Learning - Xavier Giro & Carles Ventura - ISSonD...
 
Curriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object SegmentationCurriculum Learning for Recurrent Video Object Segmentation
Curriculum Learning for Recurrent Video Object Segmentation
 
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
Deep Self-supervised Learning for All - Xavier Giro - X-Europe 2020
 
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
Deep Learning Representations for All - Xavier Giro-i-Nieto - IRI Barcelona 2020
 

Dernier

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptxAnupama Kate
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfadriantubila
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Onlineanilsa9823
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFxolyaivanovalion
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusTimothy Spann
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlkumarajju5765
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxolyaivanovalion
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxfirstjob4
 

Dernier (20)

Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx100-Concepts-of-AI by Anupama Kate .pptx
100-Concepts-of-AI by Anupama Kate .pptx
 
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdfAccredited-Transport-Cooperatives-Jan-2021-Web.pdf
Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service OnlineCALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
CALL ON ➥8923113531 🔝Call Girls Chinhat Lucknow best sexual service Online
 
Halmar dropshipping via API with DroFx
Halmar  dropshipping  via API with DroFxHalmar  dropshipping  via API with DroFx
Halmar dropshipping via API with DroFx
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girlCall Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
Call Girls 🫤 Dwarka ➡️ 9711199171 ➡️ Delhi 🫦 Two shot with one girl
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Introduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptxIntroduction-to-Machine-Learning (1).pptx
Introduction-to-Machine-Learning (1).pptx
 

Deep Reinforcement Learning: MDP & DQN - Xavier Giro-i-Nieto - UPC Barcelona 2018

13
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. "Playing Atari with deep reinforcement learning." arXiv preprint arXiv:1312.5602 (2013).
14
Architecture
Figure: UCL Course on RL by David Silver
15
Architecture
Figure: UCL Course on RL by David Silver
Environment
16
Architecture
Figure: UCL Course on RL by David Silver
Environment
state (st)
17
Architecture
Figure: UCL Course on RL by David Silver
Environment
state (st)
18
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
state (st)
19
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
state (st)
20
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st)
21
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st)
Reward is given to the agent delayed with respect to previous states and actions!
22
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st+1)
23
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st+1)
GOAL: Reach the highest score possible at the end of the game.
24
Architecture
Figure: UCL Course on RL by David Silver
Environment
Agent
action (At)
reward (rt)
state (st+1)
GOAL: Learn how to take actions to maximize cumulative reward.
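The agent-environment loop above can be sketched in a few lines of Python. This is a minimal illustration, not part of the lecture: `ToyEnv` is a made-up environment where the reward is delayed, exactly as slide 21 warns (zero at every step, +1 only on reaching the goal).

```python
import random

class ToyEnv:
    """Hypothetical 1-D environment: the agent starts at 0 and must reach +3.
    Reward is delayed: 0 at every step, +1 only on reaching the goal."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):  # action is -1 or +1
        self.state += action
        done = self.state == 3
        reward = 1.0 if done else 0.0  # reward arrives only at the end
        return self.state, reward, done

def random_agent(state):
    # A placeholder agent that ignores the state entirely
    return random.choice([-1, 1])

env = ToyEnv()
state, total_reward = env.reset(), 0.0
for t in range(100):  # episode loop: s_t -> a_t -> (r_t, s_{t+1})
    action = random_agent(state)
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

The goal of RL is to replace `random_agent` with a policy that maximizes the cumulative reward collected in this loop.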
25
Architecture
Multiple problems can be formulated with an RL architecture.
Cart-Pole Problem
Objective: Balance a pole on top of a movable cart
Slide credit: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
26
Architecture
Environment / Agent / action (At) / reward (rt) / state (st)
State: angle, angular speed, position, horizontal velocity
Action: horizontal force applied on the cart
Reward: 1 at each time step if the pole is upright
Slide credit: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
27
Architecture
Multiple problems can be formulated with an RL architecture.
Robot Locomotion
Objective: Make the robot move forward
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
28
Architecture
Environment / Agent / action (At) / reward (rt) / state (st)
State: angle and position of the joints
Action: torques applied on the joints
Reward: 1 at each time step upright + forward movement
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
29
Schulman, John, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. "High-dimensional continuous control using generalized advantage estimation." ICLR 2016 [project page]
30
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
○ Policy
○ Optimal Policy
○ Value Function
○ Q-value function
○ Optimal Q-value function
○ Bellman equation
○ Value iteration algorithm
4. Deep Q-learning
5. RL Frameworks
6. Learn more
31
Markov Decision Process (MDP)
Markov Decision Processes provide a formalism for reinforcement learning problems.
Markov property: the current state completely characterises the state of the world.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
32
Markov Decision Process (MDP)
An MDP is defined by the tuple (S, A, R, P, γ): states, actions, reward distribution, transition probabilities, and discount factor.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
33
Markov Decision Process (MDP)
(S, A, R, P, γ)
Environment samples initial state s0 ~ p(s0)
Agent selects action at
Environment samples next state st+1 ~ P(· | st, at)
Environment samples reward rt ~ R(· | st, at)
state (st) / action (at) / reward (rt)
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
34
MDP: Policy
A policy π is a function S ➝ A that specifies which action to take in each state.
Agent selects action at according to policy π.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
35
MDP: Policy
A policy π is a function S ➝ A that specifies which action to take in each state.
GOAL: Find the policy π* that maximizes the cumulative discounted reward.
36
MDP: Policy
Multiple problems can be formulated with an RL architecture.
Grid World (a simple MDP)
Objective: reach one of the terminal states (greyed out) in the least number of actions.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
37
MDP: Policy
Environment / Agent / action (At) / reward (rt) / state (st)
Each cell is a state.
A negative "reward" (penalty) for each transition: rt = r = -1
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
38
MDP: Policy: Random
Example: Actions resulting from applying a random policy on this Grid World problem.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
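The Grid World MDP and a random-policy rollout can be sketched in plain Python. The 4×4 layout, the terminal corners, and the per-transition reward of -1 are assumptions matching the slides' description, not code from the lecture.

```python
import random

# Grid World sketch (assumed 4x4 layout, terminal states at two opposite
# corners, reward -1 for every transition, as described on the slides).
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

def step(state, action):
    """Deterministic transition; moves that would leave the grid keep the state."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr, nc = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    return (nr, nc), -1.0  # every transition costs -1

def random_policy(state):
    return random.choice(list(ACTIONS))

random.seed(0)
state, ret = (2, 1), 0.0
for _ in range(10_000):  # rollout of the random policy until a terminal state
    if state in TERMINALS:
        break
    action = random_policy(state)
    state, reward = step(state, action)
    ret += reward
```

The random policy eventually reaches a terminal state, but accumulates a large penalty along the way; the optimal policy π* on the following slides minimizes the number of -1 transitions.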
39
MDP: Policy: Optimal Policy π*
Exercise: Draw the actions resulting from applying an optimal policy π* in this Grid World problem.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
40
MDP: Policy: Optimal Policy π*
Solution: Actions resulting from applying an optimal policy π* in this Grid World problem.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
41
MDP: Policy: Optimal Policy π*
GOAL: Find the policy π* that maximizes the cumulative discounted reward.
How do we handle the randomness (initial state s0, transition probabilities, actions...)?
Environment samples initial state s0 ~ p(s0)
Agent selects action at ~ π(· | st)
Environment samples next state st+1 ~ P(· | st, at)
Environment samples reward rt ~ R(· | st, at)
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
42
MDP: Policy: Optimal Policy π*
How do we handle the randomness (initial state s0, transition probabilities, actions)?
The optimal policy π* maximizes the expected cumulative discounted reward:
π* = argmaxπ E[ Σt≥0 γ^t rt | π ], with s0 ~ p(s0), at ~ π(· | st), st+1 ~ p(· | st, at)
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
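The cumulative discounted reward Σt γ^t rt inside the expectation above is a simple sum that can be computed directly (the reward sequence here is made up for illustration):

```python
# Discounted return G = sum_t gamma^t * r_t, the quantity whose expectation
# the optimal policy pi* maximizes.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A delayed reward of 10 four steps in the future is worth 10 * 0.9**3 today.
G = discounted_return([1.0, 0.0, 0.0, 10.0], gamma=0.9)
```

The discount factor γ < 1 makes near-term rewards worth more than distant ones and keeps the infinite-horizon sum finite.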
43
MDP: Value Function Vπ(s)
How to estimate how good state s is for a given policy π?
With the value function at state s, Vπ(s): the expected cumulative reward from following policy π from state s.
Vπ(s) = E[ Σt≥0 γ^t rt | s0 = s, π ]
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
44
MDP: Q-Value Function Qπ(s,a)
How to estimate how good a state-action pair (s,a) is for a given policy π?
With the Q-value function at state s and action a, Qπ(s,a): the expected cumulative reward from taking action a in state s, and then following policy π.
Qπ(s,a) = E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
45
MDP: Optimal Q-Value Function Q*(s,a)
The optimal Q-value function at state s and action a, Q*(s,a), is the maximum expected cumulative reward achievable from a given (state, action) pair: choose the policy that maximizes the expected cumulative reward.
Q*(s,a) = maxπ E[ Σt≥0 γ^t rt | s0 = s, a0 = a, π ]
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
46
MDP: Optimal Q-Value Function Q*(s,a)
Q*(s,a) satisfies the following Bellman equation:
Q*(s,a) = Es'[ r + γ maxa' Q*(s',a') | s, a ]
That is: the reward for the considered pair (s,a), plus the discounted maximum expected cumulative reward for the future pair (s',a'), in expectation across possible future states s' (randomness).
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
47
MDP: Bellman Equation
Q*(s,a) satisfies the following Bellman equation:
Q*(s,a) = Es'[ r + γ maxa' Q*(s',a') | s, a ]
The optimal policy π* corresponds to taking the best action in any state according to Q*: select the action a' that maximizes the expected cumulative reward.
GOAL: Find the policy π* that maximizes the cumulative discounted reward.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
48
MDP: Solving the Optimal Policy
Value iteration algorithm: estimate the Bellman equation with an iterative update.
Qi+1(s,a) = Es'[ r + γ maxa' Qi(s',a') | s, a ]
The iterative estimation Qi(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
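The value-iteration update Qi+1(s,a) = r + γ maxa' Qi(s',a') can be run in full for a small tabular MDP. This sketch uses a tiny deterministic 4-state chain (an assumption for brevity; the slides use a Grid World), so the expectation over s' is trivial:

```python
# Tabular Q-value iteration on a 4-state chain: states 0..3, state 3 terminal,
# actions move left (-1) or right (+1), reward -1 per transition until terminal.
GAMMA = 0.9
STATES = [0, 1, 2, 3]
ACTIONS = [-1, +1]

def step(s, a):
    s2 = min(max(s + a, 0), 3)
    r = 0.0 if s2 == 3 else -1.0
    return s2, r

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
for _ in range(50):  # iterate the Bellman update until (near) convergence
    newQ = {}
    for s in STATES:
        for a in ACTIONS:
            s2, r = step(s, a)
            future = 0.0 if s2 == 3 else GAMMA * max(Q[(s2, b)] for b in ACTIONS)
            newQ[(s, a)] = r + future  # Q_{i+1}(s,a) = r + gamma * max_a' Q_i(s',a')
    Q = newQ

# The optimal policy acts greedily with respect to the converged Q-values.
greedy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in STATES}
```

Every non-terminal state ends up preferring the action that moves toward the terminal state, which is exactly the argmax-over-Q* policy described on the Bellman-equation slide.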
49
MDP: Solving the Optimal Policy
Exploring all possible states and actions is not scalable by itself, let alone iteratively.
E.g. in a video game, it would require generating all possible pixels and actions… as many times as the iterations needed to estimate Q*(s,a).
The iterative estimation Qi(s,a) will converge to the optimal Q*(s,a) as i ➝ ∞.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
50
MDP: Solving the Optimal Policy
Exploring all possible states and actions is not scalable. E.g. in a video game, it would require generating all possible pixels and actions.
Solution: Use a deep neural network as a function approximator of Q*(s,a):
Q(s,a,Ө) ≈ Q*(s,a), where Ө are the neural network parameters.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
51
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
52
Deep Q-learning
The function to approximate is a Q-function that satisfies the Bellman equation:
Q(s,a,Ө) ≈ Q*(s,a)
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
53
Deep Q-learning
The function to approximate is a Q-function that satisfies the Bellman equation: Q(s,a,Ө) ≈ Q*(s,a)
Forward Pass
Loss function: Li(Өi) = E(s,a)[ (yi - Q(s,a,Өi))² ]
where the target yi = Es'[ r + γ maxa' Q(s',a',Өi-1) ] is predicted with the previous parameters Өi-1, for a sampled (s,a) pair and a sampled future state s', and Q(s,a,Өi) is the Q-value predicted with the current parameters Өi.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
54
Deep Q-learning
Train the DNN to approximate a Q-value function that satisfies the Bellman equation.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
55
Deep Q-learning
The reward must be computed during training.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
56
Deep Q-learning
Forward Pass
Loss function: Li(Өi) = E(s,a)[ (yi - Q(s,a,Өi))² ]
Backward Pass
Gradient update (with respect to the Q-function parameters Ө):
∇Өi Li(Өi) = E(s,a),s'[ (r + γ maxa' Q(s',a',Өi-1) - Q(s,a,Өi)) ∇Өi Q(s,a,Өi) ]
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
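One forward/backward step of this loss can be sketched with a tiny linear Q-function standing in for the deep network (all numbers are made up; a real DQN uses a convolutional network and batches of sampled transitions):

```python
# One DQN-style update: target y = r + gamma * max_a' Q(s', a'; theta_old),
# loss L = (y - Q(s, a; theta))^2, then a gradient-descent step on theta.
GAMMA, LR = 0.99, 0.1

def q_values(theta, s):
    # theta[a] is the weight vector for action a; Q(s, a) = theta[a] . s
    return [sum(w * x for w, x in zip(theta[a], s)) for a in range(len(theta))]

theta = [[0.5, -0.2], [0.1, 0.3]]      # 2 actions, 2-dim state (made-up weights)
theta_old = [row[:] for row in theta]  # frozen parameters theta_{i-1} for the target

s, a, r, s2 = [1.0, 0.0], 0, 1.0, [0.0, 1.0]   # one sampled transition (s, a, r, s')
y = r + GAMMA * max(q_values(theta_old, s2))   # forward pass: TD target
td_error = y - q_values(theta, s)[a]           # y - Q(s, a; theta_i)
# Backward pass for the linear model: dL/dtheta[a] = -2 * td_error * s
theta[a] = [w + LR * 2 * td_error * x for w, x in zip(theta[a], s)]
```

After the step, Q(s, a; Ө) has moved toward the target y, which is the effect the gradient update on the slide produces for the deep network.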
57
Deep Q-learning: Deep Q-Network DQN
Q(s,a,Ө) ≈ Q*(s,a)
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
58
Deep Q-learning: Deep Q-Network DQN
Q(s,a,Ө) ≈ Q*(s,a)
A single feedforward pass computes the Q-values for all actions from the current state (efficient).
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
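The "one output per action" idea can be illustrated with a toy two-layer network (made-up weights; a real DQN stacks convolutional layers before the fully connected head):

```python
# A Q-network head outputs one Q-value per action in a single forward pass,
# so acting greedily needs just one argmax over the output vector.
def forward(state, W1, W2):
    # ReLU hidden layer, then a linear output layer with one unit per action
    hidden = [max(0.0, sum(w * x for w, x in zip(row, state))) for row in W1]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

W1 = [[1.0, 0.0], [0.0, 1.0]]                            # 2-dim state -> 2 hidden units
W2 = [[0.2, 0.8], [0.9, -0.1], [0.4, 0.4], [-0.3, 0.6]]  # 4 actions (e.g. ⬅ ➡ ⬆ ⬇)

q = forward([1.0, 2.0], W1, W2)  # Q(s, a) for every action, computed at once
best_action = max(range(len(q)), key=q.__getitem__)
```

Contrast this with feeding each (state, action) pair through the network separately, which would need |A| forward passes per decision.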
59
Deep Q-learning: Deep Q-Network DQN
Mnih, Volodymyr, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves et al. "Human-level control through deep reinforcement learning." Nature 518, no. 7540 (2015): 529-533.
Number of actions between 4 and 18, depending on the Atari game.
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
60
Deep Q-learning: Deep Q-Network DQN
Q(st, ⬅), Q(st, ➡), Q(st, ⬆), Q(st, ⬇)
Slide concept: Serena Yeung, "Deep Reinforcement Learning". Stanford University CS231n, 2017.
61
Deep Q-learning: Demo
Andrej Karpathy, "ConvNetJS Deep Q Learning Demo"
63
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
64
RL Frameworks
OpenAI Gym + keras-rl
keras-rl implements some state-of-the-art deep reinforcement learning algorithms in Python and seamlessly integrates with the deep learning library Keras. Just like Keras, it works with either Theano or TensorFlow, which means that you can train your algorithm efficiently either on CPU or GPU. Furthermore, keras-rl works with OpenAI Gym out of the box.
Slide credit: Míriam Bellver
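OpenAI Gym environments expose a small `reset`/`step` interface that the training loop revolves around. As a sketch that runs without installing gym, `StandInEnv` below is a hypothetical environment mimicking that classic interface (`reset() -> obs`, `step(a) -> (obs, reward, done, info)`):

```python
import random

class StandInEnv:
    """Hypothetical stand-in for a gym environment: a fixed 10-step episode
    paying reward 1.0 per step, exposing the classic reset/step interface."""
    def reset(self):
        self.t = 0
        return 0.0

    def step(self, action):
        self.t += 1
        done = self.t >= 10   # episode ends after 10 steps
        return float(self.t), 1.0, done, {}

env = StandInEnv()
obs, total, done = env.reset(), 0.0, False
while not done:
    action = random.choice([0, 1])  # a real agent (e.g. keras-rl's DQN) would act on obs
    obs, reward, done, info = env.step(action)
    total += reward
```

With the real library, `StandInEnv()` would be replaced by something like a Gym environment instance, and keras-rl's agents drive this same loop internally.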
66
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more
67
Learn more
Deep Learning TV, "Reinforcement learning - Ep. 30"
Siraj Raval, Deep Q Learning for Video Games
68
Learn more
David Silver, UCL COMP050, Reinforcement Learning
69
Learn more
Pieter Abbeel and John Schulman, CS 294-112 Deep Reinforcement Learning, Berkeley.
Slides: "Reinforcement Learning - Policy Optimization" OpenAI / UC Berkeley (2017)
70
Learn more
Nando de Freitas, "Machine Learning" (University of Oxford)
71
Homework
Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., Van Den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M. and Dieleman, S., 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), pp.484-489.
72
Homework (exam material!)
Greg Kohs, "AlphaGo" (2017). [@ Netflix]
73
Outline
1. Motivation
2. Architecture
3. Markov Decision Process (MDP)
4. Deep Q-learning
5. RL Frameworks
6. Learn more