
# Continuous control

Reinforcement Learning, Continuous Control, DeepX, Model-free, TRPO, DDPG, NAF, A3C

Published in: Engineering
### Continuous control

1. Model-free Continuous Control Reinforcement Learning (初谷怜慈)
2. Self-introduction • American football -> Univ. of Tokyo Warriors • University of Tokyo, Graduate School of Information Science and Technology, 2nd-year master's student • DeepX, senior engineer • Research -> reinforcement learning, especially for real-world environments • twitter -> @Reiji_Hatsu • github -> rarilurelo
3. What is reinforcement learning? [Slides 3-6 build up one diagram: the Environment sends a state and a reward to the Agent, and the Agent sends an action back to the Environment. This interaction loop is described as an MDP or POMDP.]
7. Formulation: the agent samples A_t ~ π(a|S_t); the environment returns S_{t+1} ~ P(s'|S_t, A_t) and r_{t+1} = r(S_t, A_t, S_{t+1}).
8. Formulation: A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model π to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-free.
9. Formulation: A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model both π and P to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-based.
10. Example: DQN learns the Q function, and ε-greedy is the policy π: π(a|s) = argmax_a Q(s,a) if ε < u, a random action if u < ε, where u ~ uniform(0,1).
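The ε-greedy rule above can be sketched as follows (a minimal illustration, not the deck's own code; the function name `epsilon_greedy` is made up for this sketch):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action,
    otherwise pick argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0 this is purely greedy; with ε = 1 it is purely random exploration.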
11. What is Continuous Control? Atari invaders: stick directions and buttons (discrete actions). Robot arm: torques (continuous actions).
12. What is Continuous Control? For the robot arm, assume π is Gaussian (a Gaussian policy): the action is sampled from N(μ(s), σ(s)), where μ(s) and σ(s) are represented by a neural network.
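A minimal sketch of sampling from such a Gaussian policy, assuming for simplicity that μ(s) and σ(s) are linear functions of the state rather than a full neural network (all names here are illustrative):

```python
import numpy as np

def gaussian_policy_sample(state, W_mu, W_log_sigma, rng):
    """Sample an action from N(mu(s), sigma(s)^2), where mu and log sigma
    are linear in the state (standing in for a neural network)."""
    mu = W_mu @ state
    sigma = np.exp(W_log_sigma @ state)   # exp keeps sigma positive
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps               # reparameterized sample
```

Parameterizing log σ instead of σ is a common trick to keep the standard deviation positive without constraints.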
13. Overview of reinforcement learning's complexity [chart: state-space complexity vs. action-space complexity, each axis running from discrete to continuous]
14. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
15. Definition
   Q^π(s,a) = E_{p,π}[Σ_t γ^t r_t | s, a] = E_p[r + γ E_π[Q^π(s', a')]]  (Bellman equation: be in s, take a, then act according to π)
   V^π(s) = E_{p,π}[Σ_t γ^t r_t | s] = E_{p,π}[r + γ V^π(s')]  (be in s, then act according to π)
   A^π(s,a) = Q^π(s,a) - V^π(s)  (advantage function: the true influence of a)
16. Q-learning: consider the optimal policy π*.
   Q^{π*}(s,a) = E_p[r + γ E_{π*}[Q^{π*}(s', a')]]
   π*(a|s) = 1 if a = argmax_a Q^{π*}(s,a), 0 otherwise
   Q^{π*}(s,a) = E_p[r + γ max_{a'} Q^{π*}(s', a')]
   For function approximation, minimize (lhs - rhs)^2.
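In tabular form, minimizing the squared Bellman error amounts to moving Q(s,a) toward the target r + γ max_{a'} Q(s',a'). A sketch (the function name and the list-of-lists Q table are assumptions of this illustration):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q[s][a] toward the
    target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```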
17. NAF (Normalized Advantage Function): in DQN we can take max_a Q directly. How can we get max_a Q with a continuous action?
18. NAF (Normalized Advantage Function)
   Q(s,a) = A(s,a) + V(s)
   A(s,a) = -1/2 (a - μ(s))^T P(s) (a - μ(s)), where P(s) is a positive definite matrix.
   Because the quadratic form is positive definite, A(s,a) ≤ 0 with its maximum 0 at a = μ(s), so max_a Q(s,a) = 0 + V(s): we can get max Q as V!
   Minimize (r + γ max Q - Q)^2 w.r.t. all parameters.
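The quadratic advantage can be sketched as below, assuming P(s) is built as L L^T from a lower-triangular L so that it is positive definite (as in NAF); the function and argument names are illustrative:

```python
import numpy as np

def naf_q(state_value, mu, L, a):
    """Q(s,a) = V(s) - 0.5 (a - mu)^T P (a - mu) with P = L L^T
    positive definite, so Q is maximized at a = mu and max_a Q = V(s)."""
    P = L @ L.T
    d = a - mu
    return state_value - 0.5 * d @ P @ d
```

Evaluating at a = μ(s) recovers V(s) exactly; any other action gives a strictly smaller Q.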
19. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
20. Formulation (recap): A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model π to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-free.
21. Policy gradient: a more direct approach than Q-learning.
   J = E_{π_θ}[Σ_{τ=0}^∞ γ^τ r_τ], with π_θ(a|s) = N(μ_θ(s), σ_θ(s)).
   ∇_θ J is what we want.
22. Policy gradient
   ∇_θ J = ∇_θ E_{π_θ}[Σ_{τ=0}^∞ γ^τ r_τ]
   = ∇_θ E_{s_0~ρ, s'~p}[Π_t π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (write the expectation over trajectories explicitly)
   = E_{s_0~ρ, s'~p}[∇_θ Π_t π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (differentiate w.r.t. θ)
   = E_{s~ρ}[Π_t π_θ(a_t|s_t) (∇_θ Π_t π_θ(a_t|s_t)) / (Π_t π_θ(a_t|s_t)) Σ_{τ=0}^∞ γ^τ r_τ]  (multiply π into numerator and denominator)
   = E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (logarithmic differentiation)
   = E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Σ_{τ=t}^∞ γ^τ r_τ]  (causality: rewards before t do not depend on a_t)
   Approximated by MC sampling.
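For a 1-D Gaussian policy the score function is ∇_μ log π(a|s) = (a - μ)/σ², so the final MC estimate above can be sketched as follows (an illustration, not the deck's code; `reinforce_grad_mu` is a made-up name):

```python
import numpy as np

def reinforce_grad_mu(actions, returns, mu, sigma):
    """MC estimate of dJ/dmu for a 1-D Gaussian policy:
    E[grad_mu log pi(a) * R], with grad_mu log pi(a) = (a - mu) / sigma^2."""
    score = (np.asarray(actions) - mu) / sigma**2
    return float(np.mean(score * np.asarray(returns)))
```

For example, with actions drawn from N(0, 1) and reward R = a, the true gradient dE[R]/dμ is 1, and the MC estimate converges to it.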
23.-24. Intuition [figures]
25. Property of Policy gradient
   • unbiased estimate - stable
   • on-policy, high-variance estimate - needs a large batch size (or A3C-style asynchronous training) - low sample efficiency
   • on-policy vs. off-policy - the current policy can be updated only with the current policy's samples (on-policy) - the current policy can be updated with any policy's samples (off-policy)
26. High variance
   • In policy gradient, we have to estimate Σ_{τ=0}^∞ γ^τ R_τ.
   • This estimate has high variance - long time sequences - the environment's state transition probability
   • There are several methods to reduce variance.
27. Actor-critic method: the critic evaluates how good π(a_t|s_t) was; by causality this depends only on rewards from time t onward.
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Q^π(s_t, a_t)]
   This reduces variance, but gives a biased estimate.
28.-30. Bias-variance [figures]
31. Baseline
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) (Q^π(s_t,a_t) - b(s_t))]
   The baseline does not bias the estimate: E_{π_θ}[∇_θ log π_θ(a_t|s_t) b(s_t)] = b(s_t) ∇_θ ∫ π_θ(a|s_t) da = 0.
   b = V is a good choice, because Q and V are correlated!
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) A^π(s_t, a_t)]
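The variance-reduction claim can be checked numerically: subtracting a baseline from the returns leaves the mean of the per-sample gradient terms (approximately) unchanged while shrinking their variance. A sketch with a 1-D Gaussian policy; all names are illustrative:

```python
import numpy as np

def pg_terms(actions, returns, mu, sigma, baseline=0.0):
    """Per-sample policy-gradient terms grad_mu log pi(a) * (R - b)
    for a 1-D Gaussian policy."""
    score = (np.asarray(actions) - mu) / sigma**2
    return score * (np.asarray(returns) - baseline)
```

With a ~ N(0, 1) and R = 5 + a, choosing b ≈ 5 (roughly the value function) leaves the mean gradient intact but cuts the variance sharply.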
32. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
33. Value gradient (ρ is the state transition distribution)
   J = E_ρ[Q^π(s,a)]
   DPG: ∇_θ J = E_ρ[∇_a Q(s,a)|_{a=μ(s)} ∇_θ μ(s)]
   SVG: ∇_θ J = E_ρ[∇_a Q(s,a)|_{a=μ(s)+εσ(s)} ∇_θ(μ(s) + εσ(s))]
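The DPG chain rule can be sketched for a linear deterministic policy μ(s) = θ s; here `grad_a_Q` stands in for the critic's action gradient and is an assumption of this sketch:

```python
import numpy as np

def dpg_grad(theta, state, grad_a_Q):
    """Deterministic policy gradient for a linear policy mu(s) = theta @ s:
    grad_theta J = grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu(s) (chain rule)."""
    a = theta @ state
    return np.outer(grad_a_Q(state, a), state)   # shape (action_dim, state_dim)
```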
34. Similarity to GANs: the policy (generator) is updated by the gradient of the Q function (discriminator).
35. Property of Value gradient
   • biased estimate - depends on the function approximation of Q - less stable
   • off-policy, low-variance estimate - high sample efficiency
36. Recent approaches • TRPO • A3C • Q-Prop
37. TRPO (Trust Region Policy Optimization)
   • Problem with policy gradient (on-policy) methods - a large step size may make the policy diverge - once the policy becomes bad, it is updated with bad samples
   • Choose the step size carefully - an update should not cause a large change - KL constraint
38. TRPO
   L_{π_old}(π) = E_{π_old}[(π / π_old) A^{π_old}(s,a)]  (a variant of PG)
   constraint: KL(π_old || π) < C
   max L_{π_old}(π) - λ KL(π_old || π)  (Lagrange multiplier λ)
   Make a linear approximation to L and a quadratic approximation to KL:
   max g^T(θ - θ_old) - (λ/2)(θ - θ_old)^T F (θ - θ_old), where F = ∂²KL/∂θ².
39. TRPO
   θ - θ_old = (1/λ) F^{-1} g
   Finally, the natural gradient is obtained. Solved with the conjugate gradient method plus a line search.
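TRPO solves F x = g with the conjugate gradient method, using only matrix-vector products F v rather than forming or inverting F. A generic CG sketch under that assumption (illustrative implementation, not the paper's code):

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only matrix-vector products mvp(v) = F v,
    as TRPO does to obtain the natural-gradient direction F^{-1} g."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an n-dimensional symmetric positive definite F, CG converges in at most n iterations; in practice TRPO truncates after a small fixed number.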
40. A3C
   • Asynchronous Advantage Actor-Critic
   • Advantage Actor-Critic (A2C) is a variant of policy gradient
   • Asynchronous updates - no need for a large batch - no need for experience replay
41. A3C [figure]
42. Q-Prop • on-policy + off-policy • policy gradient + value gradient -> stability and sample efficiency
43. Two main ideas • First-order Taylor expansion • Control variate
44. First-order Taylor expansion
45. The value gradient appears
46. Can we compute these?
47. Control variate