
# Continuous control

Reinforcement Learning, Continuous Control, DeepX, Model-free, TRPO, DDPG, NAF, A3C

Published in: Engineering
### Continuous control

1. Model-free Continuous Control Reinforcement Learning (初谷怜慈)
2. Self-introduction • American football -> Univ. of Tokyo Warriors • University of Tokyo, Graduate School of Information Science and Technology, 2nd-year master's student • DeepX, senior engineer • Research -> reinforcement learning, especially for real-world environments • twitter -> @Reiji_Hatsu • github -> rarilurelo
3. What is reinforcement learning? [Slides 3-6 build up one diagram: the Environment sends a state and a reward to the Agent, and the Agent sends an action back to the Environment. This interaction loop is described as an MDP or POMDP.]
7. Formulation: the agent samples A_t ~ π(a|S_t); the environment returns S_{t+1} ~ P(s'|S_t, A_t) and r_{t+1} = r(S_t, A_t, S_{t+1}).
8. Formulation: A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model π to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-free.
9. Formulation: A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model both π and P to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-based.
10. Example: DQN learns the Q function, and ε-greedy is the policy π: π(a|s) = argmax_a Q(s,a) if ε < u, a random action if u < ε, where u ~ uniform(0,1).
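The ε-greedy rule above can be sketched as follows (a minimal illustration, not the deck's own code; the function name `epsilon_greedy` is made up for this sketch):

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action,
    otherwise pick argmax_a Q(s, a)."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With ε = 0 this is purely greedy; with ε = 1 it is purely random exploration.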
11. What is Continuous Control? Atari invaders: stick directions and buttons (discrete actions). Robot arm: torques (continuous actions).
12. What is Continuous Control? For the robot arm, assume π is Gaussian (a Gaussian policy): the action is sampled from N(μ(s), σ(s)), where μ(s) and σ(s) are represented by a neural network.
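A minimal sketch of sampling from such a Gaussian policy, assuming for simplicity that μ(s) and σ(s) are linear functions of the state rather than a full neural network (all names here are illustrative):

```python
import numpy as np

def gaussian_policy_sample(state, W_mu, W_log_sigma, rng):
    """Sample an action from N(mu(s), sigma(s)^2), where mu and log sigma
    are linear in the state (standing in for a neural network)."""
    mu = W_mu @ state
    sigma = np.exp(W_log_sigma @ state)   # exp keeps sigma positive
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps               # reparameterized sample
```

Parameterizing log σ instead of σ is a common trick to keep the standard deviation positive without constraints.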
13. Overview of reinforcement learning's complexity [chart: state-space complexity vs. action-space complexity, each axis running from discrete to continuous]
14. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
15. Definition
   Q^π(s,a) = E_{p,π}[Σ_t γ^t r_t | s, a] = E_p[r + γ E_π[Q^π(s', a')]]  (Bellman equation: be in s, take a, then act according to π)
   V^π(s) = E_{p,π}[Σ_t γ^t r_t | s] = E_{p,π}[r + γ V^π(s')]  (be in s, then act according to π)
   A^π(s,a) = Q^π(s,a) - V^π(s)  (advantage function: the true influence of a)
16. Q-learning: consider the optimal policy π*.
   Q^{π*}(s,a) = E_p[r + γ E_{π*}[Q^{π*}(s', a')]]
   π*(a|s) = 1 if a = argmax_a Q^{π*}(s,a), 0 otherwise
   Q^{π*}(s,a) = E_p[r + γ max_{a'} Q^{π*}(s', a')]
   For function approximation, minimize (lhs - rhs)^2.
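In tabular form, minimizing the squared Bellman error amounts to moving Q(s,a) toward the target r + γ max_{a'} Q(s',a'). A sketch (the function name and the list-of-lists Q table are assumptions of this illustration):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """One tabular Q-learning step: move Q[s][a] toward the
    target r + gamma * max_a' Q[s'][a']."""
    target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
    return Q[s][a]
```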
17. NAF (Normalized Advantage Function): in DQN we can take max_a Q directly. How can we get max_a Q with a continuous action?
18. NAF (Normalized Advantage Function)
   Q(s,a) = A(s,a) + V(s)
   A(s,a) = -1/2 (a - μ(s))^T P(s) (a - μ(s)), where P(s) is a positive definite matrix.
   Because the quadratic form is positive definite, A(s,a) ≤ 0 with its maximum 0 at a = μ(s), so max_a Q(s,a) = 0 + V(s): we can get max Q as V!
   Minimize (r + γ max Q - Q)^2 w.r.t. all parameters.
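The quadratic advantage can be sketched as below, assuming P(s) is built as L L^T from a lower-triangular L so that it is positive definite (as in NAF); the function and argument names are illustrative:

```python
import numpy as np

def naf_q(state_value, mu, L, a):
    """Q(s,a) = V(s) - 0.5 (a - mu)^T P (a - mu) with P = L L^T
    positive definite, so Q is maximized at a = mu and max_a Q = V(s)."""
    P = L @ L.T
    d = a - mu
    return state_value - 0.5 * d @ P @ d
```

Evaluating at a = μ(s) recovers V(s) exactly; any other action gives a strictly smaller Q.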
19. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
20. Formulation (recap): A_t ~ π(a|S_t), S_{t+1} ~ P(s'|S_t, A_t), r_{t+1} = r(S_t, A_t, S_{t+1}). Model π to get π* = argmax_π E_π[Σ_{τ=0}^∞ γ^τ r_τ] -> model-free.
21. Policy gradient: a more direct approach than Q-learning.
   J = E_{π_θ}[Σ_{τ=0}^∞ γ^τ r_τ], with π_θ(a|s) = N(μ_θ(s), σ_θ(s)).
   ∇_θ J is what we want.
22. Policy gradient
   ∇_θ J = ∇_θ E_{π_θ}[Σ_{τ=0}^∞ γ^τ r_τ]
   = ∇_θ E_{s_0~ρ, s'~p}[Π_t π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (write the expectation over trajectories explicitly)
   = E_{s_0~ρ, s'~p}[∇_θ Π_t π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (differentiate w.r.t. θ)
   = E_{s~ρ}[Π_t π_θ(a_t|s_t) (∇_θ Π_t π_θ(a_t|s_t)) / (Π_t π_θ(a_t|s_t)) Σ_{τ=0}^∞ γ^τ r_τ]  (multiply π into numerator and denominator)
   = E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Σ_{τ=0}^∞ γ^τ r_τ]  (logarithmic differentiation)
   = E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Σ_{τ=t}^∞ γ^τ r_τ]  (causality: rewards before t do not depend on a_t)
   Approximated by MC sampling.
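For a 1-D Gaussian policy the score function is ∇_μ log π(a|s) = (a - μ)/σ², so the final MC estimate above can be sketched as follows (an illustration, not the deck's code; `reinforce_grad_mu` is a made-up name):

```python
import numpy as np

def reinforce_grad_mu(actions, returns, mu, sigma):
    """MC estimate of dJ/dmu for a 1-D Gaussian policy:
    E[grad_mu log pi(a) * R], with grad_mu log pi(a) = (a - mu) / sigma^2."""
    score = (np.asarray(actions) - mu) / sigma**2
    return float(np.mean(score * np.asarray(returns)))
```

For example, with actions drawn from N(0, 1) and reward R = a, the true gradient dE[R]/dμ is 1, and the MC estimate converges to it.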
23.-24. Intuition [figures]
25. Property of Policy gradient
   • unbiased estimate - stable
   • on-policy, high-variance estimate - needs a large batch size (or A3C-style asynchronous training) - low sample efficiency
   • on-policy vs. off-policy - the current policy can be updated only with the current policy's samples (on-policy) - the current policy can be updated with any policy's samples (off-policy)
26. High variance
   • In policy gradient, we have to estimate Σ_{τ=0}^∞ γ^τ R_τ.
   • This estimate has high variance - long time sequences - the environment's state transition probability
   • There are several methods to reduce variance.
27. Actor-critic method: the critic evaluates how good π(a_t|s_t) was; by causality this depends only on rewards from time t onward.
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) Q^π(s_t, a_t)]
   This reduces variance, but gives a biased estimate.
28.-30. Bias-variance [figures]
31. Baseline
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) (Q^π(s_t,a_t) - b(s_t))]
   The baseline does not bias the estimate: E_{π_θ}[∇_θ log π_θ(a_t|s_t) b(s_t)] = b(s_t) ∇_θ ∫ π_θ(a|s_t) da = 0.
   b = V is a good choice, because Q and V are correlated!
   ∇_θ J ≈ E_{π_θ}[Σ_t ∇_θ log π_θ(a_t|s_t) A^π(s_t, a_t)]
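The variance-reduction claim can be checked numerically: subtracting a baseline from the returns leaves the mean of the per-sample gradient terms (approximately) unchanged while shrinking their variance. A sketch with a 1-D Gaussian policy; all names are illustrative:

```python
import numpy as np

def pg_terms(actions, returns, mu, sigma, baseline=0.0):
    """Per-sample policy-gradient terms grad_mu log pi(a) * (R - b)
    for a 1-D Gaussian policy."""
    score = (np.asarray(actions) - mu) / sigma**2
    return score * (np.asarray(returns) - baseline)
```

With a ~ N(0, 1) and R = 5 + a, choosing b ≈ 5 (roughly the value function) leaves the mean gradient intact but cuts the variance sharply.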
32. Learning methods of continuous policy • NAF • Policy gradient • Value gradient
33. Value gradient (ρ is the state transition distribution)
   J = E_ρ[Q^π(s,a)]
   DPG: ∇_θ J = E_ρ[∇_a Q(s,a)|_{a=μ(s)} ∇_θ μ(s)]
   SVG: ∇_θ J = E_ρ[∇_a Q(s,a)|_{a=μ(s)+εσ(s)} ∇_θ(μ(s) + εσ(s))]
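The DPG chain rule can be sketched for a linear deterministic policy μ(s) = θ s; here `grad_a_Q` stands in for the critic's action gradient and is an assumption of this sketch:

```python
import numpy as np

def dpg_grad(theta, state, grad_a_Q):
    """Deterministic policy gradient for a linear policy mu(s) = theta @ s:
    grad_theta J = grad_a Q(s, a)|_{a=mu(s)} * grad_theta mu(s) (chain rule)."""
    a = theta @ state
    return np.outer(grad_a_Q(state, a), state)   # shape (action_dim, state_dim)
```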
34. Similarity to GANs: the policy (generator) is updated by the gradient of the Q function (discriminator).
35. Property of Value gradient
   • biased estimate - depends on the function approximation of Q - less stable
   • off-policy, low-variance estimate - high sample efficiency
36. Recent approaches • TRPO • A3C • Q-Prop
37. TRPO (Trust Region Policy Optimization)
   • Problem with policy gradient (on-policy) methods - a large step size may make the policy diverge - once the policy becomes bad, it is updated with bad samples
   • Choose the step size carefully - an update should not cause a large change - KL constraint
38. TRPO
   L_{π_old}(π) = E_{π_old}[(π / π_old) A^{π_old}(s,a)]  (a variant of PG)
   constraint: KL(π_old || π) < C
   max L_{π_old}(π) - λ KL(π_old || π)  (Lagrange multiplier λ)
   Make a linear approximation to L and a quadratic approximation to KL:
   max g^T(θ - θ_old) - (λ/2)(θ - θ_old)^T F (θ - θ_old), where F = ∂²KL/∂θ².
39. TRPO
   θ - θ_old = (1/λ) F^{-1} g
   Finally, the natural gradient is obtained. Solved with the conjugate gradient method plus a line search.
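TRPO solves F x = g with the conjugate gradient method, using only matrix-vector products F v rather than forming or inverting F. A generic CG sketch under that assumption (illustrative implementation, not the paper's code):

```python
import numpy as np

def conjugate_gradient(mvp, g, iters=10, tol=1e-10):
    """Solve F x = g using only matrix-vector products mvp(v) = F v,
    as TRPO does to obtain the natural-gradient direction F^{-1} g."""
    x = np.zeros_like(g)
    r = g.copy()          # residual g - F x (x starts at 0)
    p = r.copy()          # search direction
    rs = r @ r
    for _ in range(iters):
        Fp = mvp(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

For an n-dimensional symmetric positive definite F, CG converges in at most n iterations; in practice TRPO truncates after a small fixed number.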
40. A3C
   • Asynchronous Advantage Actor-Critic
   • Advantage Actor-Critic (A2C) is a variant of policy gradient
   • Asynchronous updates - no need for a large batch - no need for experience replay
41. A3C [figure]
42. Q-Prop • on-policy + off-policy • policy gradient + value gradient -> stability and sample efficiency
43. Two main ideas • First-order Taylor expansion • Control variate
44. First-order Taylor expansion
45. The value gradient appears
46. Can we compute these?
47. Control variate