7. Quick Reminder
Learning by experience
Episodic tasks and continuing tasks
An action yields a reward and results in a new state

8-10. Quick Reminder
Value-based methods try to reach an optimal policy by using a value function
The Q function of an optimal policy maximizes the expected return G
Epsilon-greedy policies address the exploration-exploitation trade-off
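As a refresher, epsilon-greedy action selection might look like this minimal sketch (function and variable names are ours, not from the slides):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon take a random action (exploration),
    otherwise take the action with the highest Q-value (exploitation)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```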
16. No more given definitions… figuring it out by ourselves
17-20. 1 - How to replace the Q-table?
Neural networks can be trained to fit a linear or a non-linear function.
Replace the Q-value function with a deep neural network (see the sketch below)
The output layer holds the Q-value of each possible action
⟹ make the action space discrete
The input layer takes the desired representation of the state
⟹ a pipeline of convolutions
⟹ a Dense layer
⟹ a sequence of LSTM/RNN cells
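A minimal sketch of such a Q-network, assuming PyTorch (layer sizes and names are placeholders, not from the slides):

```python
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector to one Q-value per discrete action,
    replacing the rows of a Q-table."""
    def __init__(self, state_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64),   # input: state representation
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),   # output: one Q-value per action
        )

    def forward(self, state):
        return self.net(state)
```

For image states, the Dense stack would be swapped for the convolution pipeline mentioned above; for sequential inputs, LSTM/RNN cells.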
22-24. 1 - How to replace the Q-table ✅
2 - How to optimize the model
How did we use to update the Q-table?
The agent took an action (a) in a state (S)
The agent moved to the next state (S’) and received a reward (r)
The discount rate 𝞭 reflects how much we care about the future reward
𝞭 ≈ 1 ⟹ we care more about the future reward
𝞭 ≈ 0 ⟹ we care mostly about the immediate reward
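As a quick worked example (the numbers are ours): with 𝞭 = 0.9, a reward sequence of (1, 1, 1) is worth 1 + 0.9·1 + 0.9²·1 = 2.71, while 𝞭 = 0 keeps only the immediate reward of 1.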
25. 1 - How to replace the Q-table ✅
2 - How to optimize the model
How did we use to update the Q-table?
Q'(S, a) ⟸ Q(S, a) + 𝛼 [ r(S, a, S’) + 𝞭 max Q(S’, a') − Q(S, a) ]
A good Q-value for an action (a) given a state (S) is a good approximation of the expected discounted cumulative reward.
⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a')
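The same update as a minimal tabular sketch (Q is a table such as a list of lists; alpha and delta stand for 𝛼 and 𝞭, and all names are illustrative):

```python
def q_update(Q, s, a, r, s_next, alpha, delta):
    """One Q-learning step: move Q[s][a] toward the target
    r + delta * max over a' of Q[s_next][a']."""
    target = r + delta * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```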
26-31. 1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a')
[Diagram: the Q-network maps the state to Q-values; action selection picks an action; the environment returns a new state S’ and a reward R; the loss between the predicted Q(S, a) and r(S, a, S’) + 𝞭 max Q(S’, a') goes to the optimizer, which produces the new network weights.]
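One optimization step from that diagram might look like this sketch, assuming PyTorch and a batch of transition tensors (all names are placeholders; note that at this stage the same network produces both the prediction and the target):

```python
import torch
import torch.nn.functional as F

def dqn_step(q_net, optimizer, batch, delta):
    """Regress Q(S, a) toward r + delta * max over a' of Q(S', a')."""
    states, actions, rewards, next_states, dones = batch
    # Q-values the network currently predicts for the actions taken
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # TD target built from the observed reward and the next state
    with torch.no_grad():
        q_next = q_net(next_states).max(dim=1).values
        target = rewards + delta * q_next * (1 - dones)  # dones: 0/1 floats
    loss = F.mse_loss(q_pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```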
32. 1 - How to replace the Q-table ✅
2 - How to optimize the model
How are we going to update the Q-network's weights?
⟹ Q(S, a) ≈ r(S, a, S’) + 𝞭 max Q(S’, a')
[Diagram: for n episodes, feed the state to the network, update the weights, then move on to the new state.]
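Spelled out, that outer loop could look like the following sketch (assuming a Gym-style env whose step() returns (next_state, reward, done, info), and reusing the epsilon_greedy helper from earlier; everything here is illustrative):

```python
import torch
import torch.nn.functional as F

def train(env, q_net, optimizer, n_episodes, epsilon, delta):
    """For n episodes: act, observe, update the weights, move to the new state."""
    for _ in range(n_episodes):
        state = torch.as_tensor(env.reset(), dtype=torch.float32)
        done = False
        while not done:
            with torch.no_grad():
                q_values = q_net(state)
            action = epsilon_greedy(q_values.tolist(), epsilon)
            next_state, reward, done, _ = env.step(action)
            next_state = torch.as_tensor(next_state, dtype=torch.float32)
            # single-transition version of the batched update above
            q_pred = q_net(state)[action]
            with torch.no_grad():
                target = reward + delta * q_net(next_state).max() * (not done)
            loss = F.mse_loss(q_pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            state = next_state
```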
34-37. [Animation: the prediction Q(S, a) chases the target r(S, a, S’) + 𝞭 max Q(S’, a'), which itself moves whenever the same network is updated.]
38-41. 1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - TD-target
We use a local network and a target network
The target network gets slightly updated based on the local network's weights
The local network predicts the Q-values for the current state
The target network predicts the Q-values for the next state
42-43. 1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
How are we going to update the Q-network's weights?
3 - TD-target
[Diagram: the local network maps the state to Q-values and selects the action; the environment returns the new state S’ and the reward R; the target network supplies r(S, a, S’) + 𝞭 max Q(S’, a'); the loss against the local network's Q(S, a) drives the optimizer, which updates the local network's weights.]
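The "slightly updated" target network is often implemented as a soft (Polyak) update with a small blending factor tau; a minimal sketch under that assumption:

```python
def soft_update(local_net, target_net, tau=0.001):
    """Blend a small fraction tau of the local network's weights into the
    target network: theta_target <- tau*theta_local + (1 - tau)*theta_target."""
    for t_param, l_param in zip(target_net.parameters(), local_net.parameters()):
        t_param.data.copy_(tau * l_param.data + (1.0 - tau) * t_param.data)
```

In the loss, target_net(next_states) would then replace q_net(next_states) in the earlier dqn_step sketch, so the target no longer moves with every optimizer step.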
44-46. Someone to answer the question, please?
In supervised learning, why does shuffling the dataset give better accuracy?
Without shuffling, the data may be sorted, or similar data points will lie next to each other, which leads to slow convergence.
In our case we have a strong correlation between consecutive states, which may lead to inefficient learning.
47-48. 1 - How to replace the Q-table ✅
2 - How to optimize the model ✅
3 - TD-target ✅
4 - Experience replay ✅
How to break the correlation between states?
Stack the experiences in a buffer
After the specified number of steps, pick N random experiences
Proceed with optimizing the loss on that learning batch (see the sketch below)
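A minimal replay buffer matching those three steps (capacity and names are placeholders):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores experiences and returns random mini-batches,
    breaking the correlation between consecutive states."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences drop out first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, n):
        batch = random.sample(self.buffer, n)  # N random experiences
        return list(zip(*batch))               # columns: S, a, r, S', done
```

Sampling uniformly at random is exactly what answers the shuffling question above: the mini-batch no longer consists of consecutive, correlated states.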