
Introduction to Deep Reinforcement Learning (2020 Deep Learning Foundation Course: "Reinforcement Learning")

Introduction to Deep Reinforcement Learning. This deck is a re-edited version of the lecture materials for the reinforcement-learning session, taught by Matsushima, of the "Deep Learning Foundation Course" held in June 2020. The materials are published by Matsushima, who prepared them; the lab reportedly has no plans to publish the other lecture sessions.


9. Scale of game search spaces: numbers on the order of $10^{360}$, $10^{60}$, $10^{120}$, and $10^{220}$ appear (roughly the standard figures for Go, Othello, chess, and shogi, respectively).
12. Notation: the state at time $t$ is $s_t$ and the action is $a_t$.
13. Markov decision process (MDP): in state $s_t$ the agent takes action $a_t$; the next state is drawn from the transition probability $p(s_{t+1} \mid s_t, a_t)$, and rewards $r_t, r_{t+1}, r_{t+2}, \dots$ are emitted along the way. [Figure: graphical model of the interaction, with actions $a_{t-1}, a_t, a_{t+1}$ and states $s_t, s_{t+1}$.]
14. Partially observable case: the agent does not see the state $s_t$ directly but receives an observation $o_t$ together with the reward. [Figure: the same graphical model with observations $o_t, o_{t+1}$ alongside rewards $r_{t+1}, r_{t+2}$ and actions $a_{t-1}, a_t, a_{t+1}$.]
15. A trajectory is the sequence $\{s_0, a_0, r_1, s_1, a_1, r_2, \dots\}$ of states, actions, and rewards ($r_{t+1}$ follows $s_t, a_t$). The return $R_t$ summarizes the future rewards $\{r_{t+1}, r_{t+2}, \dots\}$ as a discounted sum: $$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1},$$ where $\gamma$ $(0 \le \gamma \le 1)$ is the discount factor. The objective is to maximize the expected return $\mathbb{E}[R_t]$.
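As a quick sanity check of this definition, here is a minimal Python sketch; the reward sequence and $\gamma$ are made-up values, not from the slides:

```python
def discounted_return(rewards, gamma=0.9):
    """R_t = sum_k gamma^k * r_{t+k+1} for a finite reward sequence."""
    ret = 0.0
    for k, r in enumerate(rewards):
        ret += (gamma ** k) * r
    return ret

# Rewards r_{t+1}, r_{t+2}, ... observed after time t (illustrative values).
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_return(rewards))  # 0.9^2 * 1 + 0.9^4 * 5 = 4.0905
```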
16. A policy $\pi(s)$ maps a state $s$ to an action $a$. The state-value function $V^{\pi}(s)$ is the value of state $s$ under policy $\pi$; the action-value function $Q^{\pi}(s, a)$ is the value of taking action $a$ in state $s$ and following $\pi$ thereafter.
17. Both are expectations of the return $R_t$: $$V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s\right]$$ $$Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a] = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$$
18. The advantage function $A^{\pi} = Q^{\pi} - V^{\pi}$ measures how much better taking action $a$ in state $s$ is than the average behavior of $\pi$ in that state.
19. The $(s, a)$ pairs used for learning may come from a behavior policy $\mu$ different from the policy being learned (the on-policy/off-policy distinction).
23. Temporal-difference (TD) learning updates the estimate toward a target: $Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \delta_t$, with TD error $\delta_t = y_t - Q^{\pi}(s_t, a_t)$, where $\alpha$ is the step size (learning rate) and $y_t$ is the target.
24. SARSA uses the on-policy target $y_t^{\mathrm{SARSA}} = r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1})$: $$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \left( r_{t+1} + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) - Q^{\pi}(s_t, a_t) \right)$$ where $a_{t+1}$ is the action actually chosen by $\pi$; each update uses the tuple $\{s_t, a_t, r_t, s_{t+1}, a_{t+1}\}$.
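A minimal tabular sketch of this update; the table sizes and the sample transition are made-up values:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One tabular SARSA update: on-policy target r + gamma * Q[s', a'],
    where a' is the action the current policy actually took in s'."""
    td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# One illustrative transition (s, a, r, s', a').
sarsa_update(Q, s=0, a=1, r=1.0, s_next=2, a_next=0)
```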
25. Q-learning replaces the sampled next action with a maximization, giving the off-policy target $y_t^{Q\text{-}\mathrm{learning}} = r_{t+1} + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a')$: $$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a') - Q^{\pi}(s_t, a_t) \right)$$
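The corresponding sketch; the only change from SARSA is that the target maximizes over next actions instead of using the sampled $a_{t+1}$:

```python
import numpy as np

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update: the target maxes over next actions,
    so no sampled next action is needed (off-policy)."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Same illustrative transition as above, minus a'.
q_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
```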
26. $\epsilon$-greedy exploration: with probability $1 - \epsilon$ take the greedy action $a = \arg\max_{a'} Q(s, a')$, and with probability $\epsilon$ take a uniformly random action.
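A direct translation into code; the seed and the $\epsilon$ value are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, s, epsilon=0.1):
    """With probability epsilon pick a uniformly random action,
    otherwise the greedy action argmax_a' Q[s, a']."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

Q = np.zeros((5, 2))
a = epsilon_greedy(Q, s=0)
```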
27. The Bellman equation for $Q^{\pi}$ follows from the definition $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a]$ by peeling off the first reward (the inner sum in the third line is simply the return $R_{t+1}$ from the next state):

$$\begin{aligned}
Q^{\pi}(s, a) &= \mathbb{E}_{\pi, p}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\middle|\, s_t = s, a_t = a\right] \\
&= \mathbb{E}_{\pi, p}\left[r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\middle|\, s_t = s, a_t = a\right] \\
&= \mathbb{E}_{s' \sim p}\left[r(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi}\, \mathbb{E}_{\pi, p}\left[\sum_{k=0}^{\infty} \gamma^k r_{(t+1)+k+1} \,\middle|\, s_{t+1} = s', a_{t+1} = a'\right]\right] \\
&= \mathbb{E}_{s' \sim p}\left[r(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}(s', a')\right]\right]
\end{aligned}$$
28. For the greedy policy $\pi$, the expectation over $a' \sim \pi$ becomes a maximization, yielding the Bellman optimality equation: $$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim p}\left[r(s, a, s') + \gamma\, \mathbb{E}_{a' \sim \pi}\left[Q^{\pi}(s', a')\right]\right] = \mathbb{E}_{s' \sim p}\left[r(s, a, s') + \gamma \max_{a'} Q^{\pi}(s', a')\right]$$
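When the dynamics $p$ are known, this backup can be iterated directly (value iteration on $Q$). The two-state MDP below is entirely made up for illustration:

```python
import numpy as np

# Made-up 2-state, 2-action MDP: P[s, a, s'] transitions, R[s, a, s'] rewards.
P = np.zeros((2, 2, 2))
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [1.0, 0.0]; P[1, 1] = [0.0, 1.0]
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0  # reward 1 for landing in state 1

gamma, Q = 0.9, np.zeros((2, 2))
for _ in range(200):
    # Bellman optimality backup: Q(s,a) = E_{s'~p}[ r + gamma * max_a' Q(s',a') ]
    Q = np.sum(P * (R + gamma * np.max(Q, axis=1)), axis=2)
print(Q)  # converges to the optimal action values
```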
31. [Figure (tile coding, after Sutton and Barto): a point $s_0$ in a continuous 2D state space is covered by four overlapping tilings (Tiling 1 through Tiling 4); the four active tiles/features that overlap the point are used to represent it.]
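A minimal sketch of the idea in that figure, with made-up tiling parameters: each offset tiling contributes one active feature index for a given 2D point.

```python
def tile_features(x, y, n_tilings=4, tiles_per_dim=8, low=0.0, high=1.0):
    """Active tile index in each of n_tilings offset tilings for a 2D point.
    Nearby points share many (but not all) active tiles, which is what
    gives tile coding its generalization."""
    tile_w = (high - low) / tiles_per_dim
    active = []
    for t in range(n_tilings):
        offset = t * tile_w / n_tilings  # each tiling shifted diagonally
        ix = int((x - low + offset) // tile_w) % tiles_per_dim
        iy = int((y - low + offset) // tile_w) % tiles_per_dim
        active.append(t * tiles_per_dim ** 2 + iy * tiles_per_dim + ix)
    return active

print(tile_features(0.52, 0.13))  # four active feature indices, one per tiling
```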
32. For large or continuous state spaces, $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ for a policy $\pi$ are represented with parameterized function approximators rather than tables.
33. The tabular Q-learning update, restated: $$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a') - Q^{\pi}(s_t, a_t) \right)$$
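Combining the two previous slides: a sketch of semi-gradient Q-learning with a linear approximator $Q(s, a) = w_a^{\top} \phi(s)$; the feature dimensions and the sample transition are assumptions for illustration.

```python
import numpy as np

n_features, n_actions = 16, 3
W = np.zeros((n_actions, n_features))  # Q(s, a) = W[a] @ phi(s)

def semi_gradient_q_update(phi, a, r, phi_next, alpha=0.01, gamma=0.99):
    """Semi-gradient Q-learning: the max in the target is treated as a
    constant, so the gradient of Q(s, a) = W[a] @ phi is simply phi."""
    target = r + gamma * np.max(W @ phi_next)
    td_error = target - W[a] @ phi
    W[a] += alpha * td_error * phi
    return td_error

# One illustrative transition with random feature vectors.
rng = np.random.default_rng(0)
semi_gradient_q_update(rng.random(n_features), a=1, r=1.0,
                       phi_next=rng.random(n_features))
```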
34. Atari screens are $210 \times 160 \times 3$ pixel arrays, i.e. on the order of $255^{210 \times 160 \times 3}$ distinct screens, far too many for a table.
35. A difficulty with the update $$Q^{\pi}(s_t, a_t) \leftarrow Q^{\pi}(s_t, a_t) + \alpha \left( r_{t+1} + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a') - Q^{\pi}(s_t, a_t) \right)$$ is that the target $y_t^{Q\text{-}\mathrm{learning}} = r_{t+1} + \gamma \max_{a'} Q^{\pi}(s_{t+1}, a')$ contains $\max_{a'} Q^{\pi}(s_{t+1}, a')$, the very function being updated, so the target shifts with every update.
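DQN's standard stabilization for this (an assumption about what the slide goes on to describe) is to compute targets with a separate, periodically synchronized target network. A PyTorch sketch with made-up layer sizes:

```python
import copy
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
target_net = copy.deepcopy(q_net)  # frozen copy used only to compute targets

def dqn_targets(s_next, r, done, gamma=0.99):
    """y = r + gamma * max_a' Q_target(s', a'), with Q_target held fixed."""
    with torch.no_grad():
        max_q_next = target_net(s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * max_q_next

# Illustrative batch of 8 transitions.
y = dqn_targets(torch.randn(8, 4), torch.zeros(8), torch.zeros(8))
# Every so often: target_net.load_state_dict(q_net.state_dict())
```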
36. In the Bellman optimality form $$Q^{\pi}(s, a) = \mathbb{E}_{s' \sim p}\left[r(s, a, s') + \gamma \max_{a'} Q^{\pi}(s', a')\right]$$ the expectation is only over the dynamics $p$, not over the policy $\pi$, so transitions gathered under any behavior policy can be reused for updates.
37. Rewards are clipped to the range $[-1, 1]$ (reward clipping).
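In code this is a one-liner; the sample values are arbitrary:

```python
import numpy as np

def clip_reward(r):
    """Clip the raw game reward into [-1, 1]."""
    return float(np.clip(r, -1.0, 1.0))

print(clip_reward(100.0), clip_reward(-3.0), clip_reward(0.5))  # 1.0 -1.0 0.5
```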
38. Policy gradient methods optimize a parameterized policy $\pi_{\theta}$ directly. The objective is $\mathcal{J}(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}}[f^{\pi_{\theta}}(\cdot)]$ over the parameters $\theta$, where $f^{\pi_{\theta}}(\cdot)$ may be the return $R_t$ or a value function such as $V^{\pi}$ or $Q^{\pi}$; its gradient is $$\nabla_{\theta} \mathcal{J}(\pi_{\theta}) = \mathbb{E}_{\pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta} \cdot f^{\pi_{\theta}}(\cdot)\right]$$
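A minimal REINFORCE-style gradient step for a softmax policy over discrete actions; all sizes and sample values here are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 3
theta = np.zeros((n_actions, n_features))  # softmax policy parameters

def policy(phi):
    logits = theta @ phi
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(phi, a):
    """For a softmax policy, grad_theta log pi(a|s) = (onehot(a) - pi(s)) phi^T."""
    return np.outer(np.eye(n_actions)[a] - policy(phi), phi)

# One (features, action, return) sample with made-up values.
phi, a, R_t = rng.random(n_features), 1, 2.5
theta += 0.01 * R_t * grad_log_pi(phi, a)  # ascend E[grad log pi * R_t]
```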
39. Taking $f^{\pi_{\theta}}(\cdot) = R_t$ gives REINFORCE, but the Monte Carlo return $R_t$ makes the gradient estimate high-variance. Subtracting a baseline $b(s)$, i.e. using $f^{\pi_{\theta}}(\cdot) = R_t - b(s)$, reduces the variance; as long as $b(s)$ does not depend on the action, the gradient stays unbiased.
40. Actor-critic methods use learned value functions $V^{\pi}, Q^{\pi}$ (the critic) in place of the Monte Carlo return $R_t$ for the policy $\pi$ (the actor): substituting $Q^{\pi}$ for $R_t$ and $V^{\pi}$ for the baseline $b(s)$ gives the advantage $A^{\pi} = Q^{\pi} - V^{\pi}$ as the weighting term.
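A sketch of the resulting update with a linear critic and a TD-error estimate of the advantage; everything here is illustrative, and the actor step reuses `grad_log_pi` from the previous sketch:

```python
import numpy as np

n_features = 4
w = np.zeros(n_features)  # linear critic: V(s) = w @ phi(s)

def advantage(phi, r, phi_next, gamma=0.99):
    """TD-error estimate of the advantage: A ~ r + gamma * V(s') - V(s)."""
    return r + gamma * (w @ phi_next) - (w @ phi)

rng = np.random.default_rng(0)
phi, phi_next, r = rng.random(n_features), rng.random(n_features), 1.0
A = advantage(phi, r, phi_next)
w += 0.05 * A * phi  # critic: semi-gradient TD step on V
# Actor step (see the REINFORCE sketch): theta += 0.01 * A * grad_log_pi(phi, a)
```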
42. Action selection via $a = \arg\max_{a'} Q(s, a')$.
43. Action selection via $a = \arg\max_{a'} Q(s, a')$, revisited.
44. [Y. LeCun, "How Much Information Does the Machine Need to Predict?"] "Pure" reinforcement learning (the cherry): the machine predicts a scalar reward given once in a while; a few bits for some samples. Supervised learning (the icing): the machine predicts a category or a few numbers for each input, predicting human-supplied data; 10 to 10,000 bits per sample. Unsupervised/predictive learning (the cake): the machine predicts any part of its input for any observed part, e.g. future frames in videos; millions of bits per sample. ("Yes, I know, this picture is slightly offensive to RL folks. But I'll make it up.")
