Ce diaporama a bien été signalé.
Nous utilisons votre profil LinkedIn et vos données d’activité pour vous proposer des publicités personnalisées et pertinentes. Vous pouvez changer vos préférences de publicités à tout moment.

Reinforcement Learning 13. Policy Gradient Methods

196 vues

Publié le

A summary of Chapter 6: Policy Gradient Methods of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book in Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html

Check my website for more slides of books and papers!

Publié dans : Technologie
  • Soyez le premier à commenter

  • Soyez le premier à aimer ceci

Reinforcement Learning 13. Policy Gradient Methods

  1. 1. Chapter 13: Policy Gradient Methods Seungjae Ryan Lee
  2. 2. Preview: Policy Gradients ● Action-value Methods ○ Learn values of actions and select actions with estimated action values ○ Policy derived from action-value estimates ● Policy Gradient Methods ○ Learn parameterized policy that can select action without a value function ○ Can still use value function to learn the policy parameter
  3. 3. Policy Gradient Methods ● Define a performance measure J(θ) to maximize ● Learn policy parameter θ through approximate gradient ascent Stochastic estimate of J(θ)
  4. 4. Soft-max in Action Preferences ● Numerical preference h(s, a, θ) for each state-action pair ● Action selection through soft-max
  5. 5. Soft-max in Action Preferences: Advantages 1. Approximate policy can approach deterministic policy ○ No “limit” like ε-greedy methods ○ Using soft-max on action values cannot approach deterministic policy
  6. 6. Soft-max in Action Preferences: Advantages 2. Allow stochastic policy ○ Best approximate policy can be stochastic in problems with significant function approximation ● Consider small corridor with -1 reward on each step ○ States are indistinguishable ○ Action transition is reversed in the second state
  7. 7. Soft-max in Action Preferences: Advantages 2. Allow stochastic policy ○ ε-greedy methods (ε=0.1) cannot find optimal policy
  8. 8. Soft-max in Action Preferences: Advantages 3. Policy may be simpler to approximate ○ Differs among problems http://proceedings.mlr.press/v48/simsek16.pdf
  9. 9. Theoretical Advantage of Policy Gradient Methods ● Smooth transition of policy for parameter changes ● Allows for stronger convergence guarantees h = 1 h = 0.9 0.4750.525 h = 1 h = 1.1 0.475 0.525 Q = 1 Q = 0.9 0.050.95 Q = 1 Q = 1.1 0.5 0.95
  10. 10. Policy Gradient Theorem ● Define performance measure as value of the start state ● Want to compute w.r.t. policy parameter
  11. 11. Policy Gradient Theorem Episodic: Average episode length Continuing: 1 On-policy state distribution
  12. 12. Policy Gradient Theorem: Proof
  13. 13. Policy Gradient Theorem: Proof Product rule
  14. 14. Policy Gradient Theorem: Proof
  15. 15. Policy Gradient Theorem: Proof
  16. 16. Policy Gradient Theorem: Proof Unrolling
  17. 17. Policy Gradient Theorem: Proof
  18. 18. Policy Gradient Theorem: Proof
  19. 19. Policy Gradient Theorem: Proof
  20. 20. Policy Gradient Theorem: Proof
  21. 21. Policy Gradient Theorem: Proof
  22. 22. Stochastic Gradient Descent ● Need samples with expectation
  23. 23. Stochastic Gradient Descent is an on-policy state distribution of
  24. 24. Stochastic Gradient Descent
  25. 25. Stochastic Gradient Descent
  26. 26. Stochastic Gradient Descent: REINFORCE
  27. 27. REINFORCE (1992) ● Sample return like Monte Carlo ● Increment proportional to return ● Increment inverse proportional to action probability ○ Prevent frequent actions dominating due to frequent updates
  28. 28. REINFORCE: Pseudocode Eligibility vector
  29. 29. REINFORCE: Results
  30. 30. REINFORCE with Baseline ● REINFORCE ○ Good theoretical convergence ○ Bad convergence speed due to high variance Baseline
  31. 31. REINFORCE with Baseline: Pseudocode Learn state value with MC to use as baseline
  32. 32. REINFORCE with Baseline: Results
  33. 33. ● Learn approximations for both policy (Actor) and value function (Critic) ● Critic vs Baseline in REINFORCE ○ Critic is used for bootstrapping ○ Bootstrapping introduces bias and relies on state representation ○ Bootstrapping reduces variance and accelerates learning Actor-Critic Methods Actor Critic
  34. 34. One-step Actor Critic ● Replace return with one-step return ● Replace baseline with approximated value function (Critic) ○ Learned with semi-gradient TD(0) REINFORCE: One-step AC:
  35. 35. One-step Actor-Critic Update Critic (value function) parameters Update Actor (policy) parameters
  36. 36. Actor-Critic with Eligibility Traces
  37. 37. Average Reward for Continuing Problems ● Measure performance in terms of average reward
  38. 38. Actor-Critic for Continuing Problems
  39. 39. Policy Gradient Theorem Proof (Continuing Case) 1. Same procedure: 2. Rearrange equation:
  40. 40. Policy Gradient Theorem Proof (Continuing Case) 3. Sum over all states weighted by state-distribution a. Nothing changes since neither side depend on s and
  41. 41. Policy Gradient Theorem Proof (Continuing Case) 4. Use ergodicity:
  42. 42. ● Policy based methods can handle continuous action spaces ● Learn statistics of the probability distribution ○ ex) mean and variance of Gaussian ● Choose action from the learned distribution ○ ex) Gaussian distribution Policy Parameterization for Continuous Actions
  43. 43. Policy Parametrization to Gaussian Distribution ● Divide policy parameter vector into mean and variance: ● Approximate mean with linear function: ● Approximate variance with exponential of linear function: ○ Guaranteed positive ● All PG algorithms can be applied to the parameter vector
  44. 44. Summary ● Policy Gradient methods have many advantages over action-value methods ○ Represent stochastic policy and approach deterministic policies ○ Learn appropriate levels of exploration ○ Handle continuous action spaces ○ Compute effect of policy parameter on performance with Policy Gradient Theorem ● Actor-Critic estimates value function for bootstrapping ○ Introduces bias but is often desirable due to lower variance ○ Similar to preferring TD over MC “Policy Gradient methods provide a significantly different set of strengths and weaknesses than action-value methods.”
  45. 45. Thank you! Original content from ● Reinforcement Learning: An Introduction by Sutton and Barto You can find more content in ● github.com/seungjaeryanlee ● www.endtoend.ai