
# Continuous control

Reinforcement Learning, Continuous Control, DeepX, Model-free, TRPO, DDPG, NAF, A3C

Published in: Engineering
### Continuous control

1. Model-free Continuous Control Reinforcement Learning (初谷怜慈)
2. Self-introduction • American football -> Univ. of Tokyo Warriors • University of Tokyo, Graduate School of Information Science and Technology, 2nd-year master's student • DeepX, senior engineer • Research -> reinforcement learning, especially for real-world environments • twitter -> @Reiji_Hatsu • github -> rarilurelo
3. What is reinforcement learning? [Slides 3-6 build up one diagram: the Environment sends a state and a reward to the Agent, and the Agent sends an action back to the Environment. This interaction loop is described as an MDP or POMDP.]
7. Formulation: the agent samples A_t ~ π(a|S_t); the environment returns S_{t+1} ~ P(s'|S_t, A_t) and r_{t+1} = r(S_t, A_t, S_{t+1}).
8. Formulation: A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model π to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-free.
9. Formulation: A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model both π and P to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-based.
10. Example: DQN learns the Q function, and ε-greedy is the policy π: π(a|s) = argmax_a Q(s,a) if ε < u, a random action if u < ε, where u ~ uniform(0,1).
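The ε-greedy rule above can be sketched as follows (a minimal illustration, not the deck's own code; the function name `epsilon_greedy` is made up for this sketch):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action,
    otherwise pick argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0 this is purely greedy; with ε = 1 it is purely random exploration.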
11. What is Continuous Control? Atari invaders: stick directions and buttons (discrete actions). Robot arm: torques (continuous actions).
12. What is Continuous Control? For the robot arm, assume π is Gaussian (a Gaussian policy): the action is sampled from N(μ(s), σ(s)), where μ(s) and σ(s) are represented by a neural network.
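A minimal sketch of sampling from such a Gaussian policy, assuming for simplicity that μ(s) and σ(s) are linear functions of the state rather than a full neural network (all names here are illustrative):

```python
import numpy as np

def gaussian_policy_sample(state, W_mu, W_log_sigma, rng):
    """Sample an action from N(mu(s), sigma(s)^2), where mu and log sigma
    are linear in the state (standing in for a neural network)."""
    mu = W_mu @ state
    sigma = np.exp(W_log_sigma @ state)   # exp keeps sigma positive
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps               # reparameterized sample
```

Parameterizing log σ instead of σ is a common trick to keep the standard deviation positive without constraints.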
13. Overview of reinforcement learning's complexity [chart: state-space complexity vs. action-space complexity, each axis running from discrete to continuous]
14. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
15. Definition
   Q^π(s,a) = E_{p,π}[Σ_t γ^t r_t | s, a] = E_p[r + γ E_π[Q^π(s', a')]]  (Bellman equation: be in s, take a, then act according to π)
   V^π(s) = E_{p,π}[Σ_t γ^t r_t | s] = E_{p,π}[r + γ V^π(s')]  (be in s, then act according to π)
   A^π(s,a) = Q^π(s,a) - V^π(s)  (advantage function: the true influence of a)
16. Q-learning: consider the optimal policy π*.
   Q^{π*}(s,a) = E_p[r + γ E_{π*}[Q^{π*}(s', a')]]
   π*(a|s) = 1 if a = argmax_a Q^{π*}(s,a), 0 otherwise
   Q^{π*}(s,a) = E_p[r + γ max_{a'} Q^{π*}(s', a')]
   For function approximation, minimize (lhs - rhs)^2.
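In tabular form, minimizing the squared Bellman error amounts to moving Q(s,a) toward the target r + γ max_{a'} Q(s',a'). A sketch (the function name and the list-of-lists Q table are assumptions of this illustration):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q[s][a] toward the
    target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```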
17. NAF (Normalized Advantage Function): in DQN we can take max_a Q directly. How can we get max_a Q with a continuous action?
18. NAF (Normalized Advantage Function)
   Q(s,a) = A(s,a) + V(s)
   A(s,a) = -1/2 (a - μ(s))^T P(s) (a - μ(s)), where P(s) is a positive definite matrix.
   Because the quadratic form is positive definite, A(s,a) ≤ 0 with its maximum 0 at a = μ(s), so max_a Q(s,a) = 0 + V(s): we can get max Q as V!
   Minimize (r + γ max Q - Q)^2 w.r.t. all parameters.
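The quadratic advantage can be sketched as below, assuming P(s) is built as L L^T from a lower-triangular L so that it is positive definite (as in NAF); the function and argument names are illustrative:

```python
import numpy as np

def naf_q(state_value, mu, L, a):
    """Q(s,a) = V(s) - 0.5 (a - mu)^T P (a - mu) with P = L L^T
    positive definite, so Q is maximized at a = mu and max_a Q = V(s)."""
    P = L @ L.T
    d = a - mu
    return state_value - 0.5 * d @ P @ d
```

Evaluating at a = μ(s) recovers V(s) exactly; any other action gives a strictly smaller Q.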
19. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
20. Formulation (recap): A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model π to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-free.
21. Policy gradient: a more direct approach than Q-learning.
   J = E_{π_θ}[Σ_{τ=0}^∞ γ^τ r_τ], with π_θ(a|s) = N(μ_θ(s), σ_θ(s)).
   ∇_θ J is what we want.
22. Policy gradient
   ∇_θ J = ∇_θ E_{π_θ}[Σ_{τ=0}^∞ γ^τ r_τ]
   = ∇_θ E_{s_0~ρ, s'~p}[Π_t π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (write the expectation over trajectories explicitly)
   = E_{s_0~ρ, s'~p}[∇_θ Π_t π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (differentiate w.r.t. θ)
   = E_{s~ρ}[Π_t π_θ(a_t|s_t) (∇_θ Π_t π_θ(a_t|s_t)) / (Π_t π_θ(a_t|s_t)) Σ_{τ=0}^∞ γ^τ r_τ]  (multiply π into numerator and denominator)
   = E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (logarithmic differentiation)
   = E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Σ_{τ=t}^∞ γ^τ r_τ]  (causality: rewards before t do not depend on a_t)
   Approximated by MC sampling.
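For a 1-D Gaussian policy the score function is ∇_μ log π(a|s) = (a - μ)/σ², so the final MC estimate above can be sketched as follows (an illustration, not the deck's code; `reinforce_grad_mu` is a made-up name):

```python
import numpy as np

def reinforce_grad_mu(actions, returns, mu, sigma):
    """MC estimate of dJ/dmu for a 1-D Gaussian policy:
    E[grad_mu log pi(a) * R], with grad_mu log pi(a) = (a - mu) / sigma^2."""
    score = (np.asarray(actions) - mu) / sigma**2
    return float(np.mean(score * np.asarray(returns)))
```

For example, with actions drawn from N(0, 1) and reward R = a, the true gradient dE[R]/dμ is 1, and the MC estimate converges to it.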
23.-24. Intuition [figures]
25. Property of Policy gradient
   • unbiased estimate - stable
   • on-policy, high-variance estimate - needs a large batch size (or A3C-style asynchronous training) - low sample efficiency
   • on-policy vs. off-policy - the current policy can be updated only with the current policy's samples (on-policy) - the current policy can be updated with any policy's samples (off-policy)
26. High variance
   • In policy gradient, we have to estimate Σ_{τ=0}^∞ γ^τ R_τ.
   • This estimate has high variance - long time sequences - the environment's state transition probability
   • There are several methods to reduce variance.
27. Actor-critic method: the critic evaluates how good π(a_t|s_t) was; by causality this depends only on rewards from time t onward.
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Q^π(s_t, a_t)]
   This reduces variance, but gives a biased estimate.
28.-30. Bias-variance [figures]
31. Baseline
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) (Q^π(s_t,a_t) - b(s_t))]
   The baseline does not bias the estimate: E_{π_θ}[∇_θ log π_θ(a_t|s_t) b(s_t)] = b(s_t) ∇_θ ∫ π_θ(a|s_t) da = 0.
   b = V is a good choice, because Q and V are correlated!
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) A^π(s_t, a_t)]
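The variance-reduction claim can be checked numerically: subtracting a baseline from the returns leaves the mean of the per-sample gradient terms (approximately) unchanged while shrinking their variance. A sketch with a 1-D Gaussian policy; all names are illustrative:

```python
import numpy as np

def pg_terms(actions, returns, mu, sigma, baseline=0.0):
    """Per-sample policy-gradient terms grad_mu log pi(a) * (R - b)
    for a 1-D Gaussian policy."""
    score = (np.asarray(actions) - mu) / sigma**2
    return score * (np.asarray(returns) - baseline)
```

With a ~ N(0, 1) and R = 5 + a, choosing b ≈ 5 (roughly the value function) leaves the mean gradient intact but cuts the variance sharply.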
32. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
33. Value gradient (ρ is the state transition distribution)
   J = E_ρ[Q^π(s,a)]
   DPG: ∇_θ J = E_ρ[∇_a Q(s,a)|_{a=μ(s)} ∇_θ μ(s)]
   SVG: ∇_θ J = E_ρ[∇_a Q(s,a)|_{a=μ(s)+εσ(s)} ∇_θ(μ(s) + εσ(s))]
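The DPG chain rule can be sketched for a linear deterministic policy μ(s) = θ s; here `grad_a_Q` stands in for the critic's action gradient and is an assumption of this sketch:

```python
import numpy as np

def dpg_grad(theta, state, grad_a_Q):
    """Deterministic policy gradient for a linear policy mu(s) = theta @ s:
    grad_theta J = grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu(s) (chain rule)."""
    a = theta @ state
    return np.outer(grad_a_Q(state, a), state)   # shape (action_dim, state_dim)
```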
34. Similarity to GANs: the policy (generator) is updated by the gradient of the Q function (discriminator).
35. Property of Value gradient
   • biased estimate - depends on the function approximation of Q - less stable
   • off-policy, low-variance estimate - high sample efficiency
36. Recent approaches • TRPO • A3C • Q-Prop
37. TRPO (Trust Region Policy Optimization)
   • Problem with policy gradient (on-policy) methods - a large step size may make the policy diverge - once the policy becomes bad, it is updated with bad samples
   • Choose the step size carefully - an update should not cause a large change - KL constraint
38. TRPO
   L_{π_old}(π) = E_{π_old}[(π / π_old) A^{π_old}(s,a)]  (a variant of PG)
   constraint: KL(π_old || π) < C
   max L_{π_old}(π) - λ KL(π_old || π)  (Lagrange multiplier λ)
   Make a linear approximation to L and a quadratic approximation to KL:
   max g^T(θ - θ_old) - (λ/2)(θ - θ_old)^T F (θ - θ_old), where F = ∂²KL/∂θ².
39. TRPO
   θ - θ_old = (1/λ) F^{-1} g
   Finally, the natural gradient is obtained. Solved with the conjugate gradient method plus a line search.
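TRPO solves F x = g with the conjugate gradient method, using only matrix-vector products F v rather than forming or inverting F. A generic CG sketch under that assumption (illustrative implementation, not the paper's code):

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only matrix-vector products mvp(v) = F v,
    as TRPO does to obtain the natural-gradient direction F^{-1} g."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an n-dimensional symmetric positive definite F, CG converges in at most n iterations; in practice TRPO truncates after a small fixed number.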
40. A3C
   • Asynchronous Advantage Actor-Critic
   • Advantage Actor-Critic (A2C) is a variant of policy gradient
   • Asynchronous updates - no need for a large batch - no need for experience replay
41. A3C [figure]
42. Q-Prop • on-policy + off-policy • policy gradient + value gradient -> stability and sample efficiency
43. Two main ideas • First-order Taylor expansion • Control variate
44. First-order Taylor expansion
45. The value gradient appears
46. Can we compute these?
47. Control variate