
# Reinforcement Learning 13. Policy Gradient Methods

A summary of Chapter 13: Policy Gradient Methods of the book 'Reinforcement Learning: An Introduction' by Sutton and Barto. You can find the full book on Professor Sutton's website: http://incompleteideas.net/book/the-book-2nd.html

Check my website for more slide decks on books and papers!
https://www.endtoend.ai

1. Chapter 13: Policy Gradient Methods (Seungjae Ryan Lee)
2. Preview: Policy Gradients
   - Action-value methods
     - Learn the values of actions and select actions using the estimated action values
     - The policy is derived from the action-value estimates
   - Policy gradient methods
     - Learn a parameterized policy that can select actions without a value function
     - A value function can still be used to learn the policy parameter
3. Policy Gradient Methods
   - Define a performance measure J(θ) to maximize
   - Learn the policy parameter θ through approximate gradient ascent, using a stochastic estimate of the gradient of J(θ)
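In the book's notation, the approximate gradient-ascent update reads (the hat denotes a stochastic estimate whose expectation approximates the gradient of the performance measure):

```latex
\theta_{t+1} = \theta_t + \alpha \,\widehat{\nabla J(\theta_t)},
\qquad
\mathbb{E}\big[\widehat{\nabla J(\theta_t)}\big] \approx \nabla J(\theta_t)
```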
4. Soft-max in Action Preferences
   - Numerical preference h(s, a, θ) for each state-action pair
   - Action selection through soft-max
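A minimal sketch of soft-max action selection over preferences. The linear form h(s, a, θ) = θᵀx(s, a) and the toy feature vectors are illustrative assumptions, not fixed by the slide:

```python
import numpy as np

def softmax_policy(theta, features):
    """Action probabilities from preferences h(s, a, theta) = theta . x(s, a).

    `features` is a (num_actions, d) array of state-action feature vectors
    x(s, a); the linear preference form is an assumption for illustration.
    """
    h = features @ theta          # one preference per action
    h = h - h.max()               # stabilize the exponentials
    exp_h = np.exp(h)
    return exp_h / exp_h.sum()

# Two actions with 2-d one-hot features: probabilities follow the
# preference differences, never hard-coding a greedy action.
theta = np.array([1.0, 0.0])
x = np.array([[1.0, 0.0],    # x(s, a0)
              [0.0, 1.0]])   # x(s, a1)
pi = softmax_policy(theta, x)
```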
5. Soft-max in Action Preferences: Advantages
   1. The approximate policy can approach a deterministic policy
      - No exploration floor, unlike ε-greedy methods
      - A soft-max over action values cannot approach a deterministic policy
6. Soft-max in Action Preferences: Advantages
   2. Allows a stochastic policy
      - The best approximate policy can be stochastic in problems with significant function approximation
      - Consider a small corridor with -1 reward on each step
        - The states are indistinguishable to the agent
        - The action transitions are reversed in the second state
7. Soft-max in Action Preferences: Advantages
   2. Allows a stochastic policy (continued)
      - ε-greedy methods (ε = 0.1) cannot find the optimal policy
8. Soft-max in Action Preferences: Advantages
   3. The policy may be simpler to approximate
      - Which is simpler differs among problems: http://proceedings.mlr.press/v48/simsek16.pdf
9. Theoretical Advantage of Policy Gradient Methods
   - The policy changes smoothly as the parameters change: raising one action's preference from h = 0.9 to h = 1.1 past the other action's h = 1 moves its soft-max probability only from 0.475 to 0.525, while the same change in an action value flips its ε-greedy probability abruptly from 0.05 to 0.95
   - This allows for stronger convergence guarantees
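The corridor claim can be checked with a quick simulation. The dynamics follow the short-corridor gridworld of Example 13.1 in the book; the specific probabilities compared (0.1 as what ε-greedy would force, 0.59 as roughly the optimal stochastic policy) are illustrative:

```python
import random

def episode_return(p_right, rng, max_steps=1000):
    """One episode in the short-corridor gridworld (Example 13.1).

    States 0, 1, 2 with the goal at 3; reward -1 per step; in state 1
    the action transitions are reversed. The policy takes `right` with
    probability p_right in every state, since the states are
    indistinguishable to the agent."""
    s, ret = 0, 0
    for _ in range(max_steps):
        ret -= 1
        right = rng.random() < p_right
        if s == 0:
            s = 1 if right else 0
        elif s == 1:
            s = 0 if right else 2   # transitions reversed here
        else:
            s = 3 if right else 1
        if s == 3:
            return ret
    return ret

rng = random.Random(0)
avg = {p: sum(episode_return(p, rng) for _ in range(2000)) / 2000
       for p in (0.1, 0.59, 0.9)}
# The stochastic policy p = 0.59 beats both near-deterministic choices,
# which is all ε-greedy can offer in this problem.
```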
10. Policy Gradient Theorem
    - Define the performance measure as the value of the start state
    - Want to compute the gradient of the performance w.r.t. the policy parameter
11. Policy Gradient Theorem
    - The constant of proportionality is the average episode length in the episodic case and 1 in the continuing case
    - The states are weighted by the on-policy state distribution
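The theorem itself, in the book's notation (μ is the on-policy state distribution; the constant of proportionality is as described above):

```latex
\nabla J(\theta) \propto \sum_{s} \mu(s) \sum_{a} q_\pi(s, a)\, \nabla \pi(a \mid s, \theta)
```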
12.–21. Policy Gradient Theorem: Proof
    - Expand ∇v_π(s) with the product rule, then unroll the resulting recursion over successor states
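The two named steps correspond to the following derivation for the episodic case (the reward term of q_π drops out of ∇q_π because it does not depend on θ):

```latex
\begin{aligned}
\nabla v_\pi(s)
  &= \nabla \Big[ \sum_a \pi(a \mid s)\, q_\pi(s, a) \Big] \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a)
       + \pi(a \mid s)\, \nabla q_\pi(s, a) \Big]
     && \text{(product rule)} \\
  &= \sum_a \Big[ \nabla \pi(a \mid s)\, q_\pi(s, a)
       + \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, \nabla v_\pi(s') \Big] \\
  &= \sum_{x \in \mathcal{S}} \sum_{k=0}^{\infty}
       \Pr(s \to x, k, \pi) \sum_a \nabla \pi(a \mid x)\, q_\pi(x, a)
     && \text{(unrolling)}
\end{aligned}
```

Setting s to the start state and normalizing the expected visit counts then gives the weighting by the on-policy state distribution μ.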
22. Stochastic Gradient Descent
    - Need sample quantities whose expectation matches the gradient
23. Stochastic Gradient Descent
    - The state S_t is drawn from the on-policy state distribution of π
26. Stochastic Gradient Descent: REINFORCE
27. REINFORCE (1992)
    - Sample the return as in Monte Carlo methods
    - The increment is proportional to the return
    - The increment is inversely proportional to the action probability
      - Prevents frequently selected actions from dominating just because they are updated more often
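The resulting update shows both properties at once: the step is proportional to the return G_t and inversely proportional to the probability of the action taken:

```latex
\theta_{t+1}
= \theta_t + \alpha\, \gamma^t G_t\,
  \frac{\nabla \pi(A_t \mid S_t, \theta_t)}{\pi(A_t \mid S_t, \theta_t)}
= \theta_t + \alpha\, \gamma^t G_t\, \nabla \ln \pi(A_t \mid S_t, \theta_t)
```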
28. REINFORCE: Pseudocode
    - The quantity ∇ln π(A_t | S_t, θ) is called the eligibility vector
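The pseudocode can be sketched in code. The toy task here (one-step episodes with two actions, where action 1 pays +1 and action 0 pays 0) is an illustrative assumption, not from the slides; the policy is a soft-max over two scalar preferences:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)          # action preferences h(a) = theta[a]
alpha = 0.1

def pi(theta):
    """Soft-max distribution over the two actions."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(500):
    p = pi(theta)
    a = rng.choice(2, p=p)               # sample an action from the policy
    G = 1.0 if a == 1 else 0.0           # return of this one-step episode
    grad_log = -p                        # grad of log soft-max: e_a - pi
    grad_log[a] += 1.0
    theta += alpha * G * grad_log        # REINFORCE: alpha * G * grad ln pi

# The probability of the rewarding action approaches 1: the soft-max
# policy becomes near-deterministic without any exploration floor.
```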
29. REINFORCE: Results
30. REINFORCE with Baseline
    - REINFORCE has good theoretical convergence but converges slowly in practice due to high variance
    - Subtracting a baseline from the return reduces the variance without biasing the update
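With a baseline b(s), the theorem and the update become the following; any b that does not depend on a leaves the expectation unchanged, since Σ_a b(s) ∇π(a|s, θ) = b(s) ∇1 = 0:

```latex
\nabla J(\theta) \propto \sum_s \mu(s) \sum_a
  \big( q_\pi(s, a) - b(s) \big)\, \nabla \pi(a \mid s, \theta),
\qquad
\theta_{t+1} = \theta_t + \alpha \big( G_t - b(S_t) \big)\,
  \nabla \ln \pi(A_t \mid S_t, \theta_t)
```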
31. REINFORCE with Baseline: Pseudocode
    - Learn the state value with Monte Carlo to use as the baseline
32. REINFORCE with Baseline: Results
33. Actor-Critic Methods
    - Learn approximations of both the policy (Actor) and the value function (Critic)
    - The Critic vs. the baseline in REINFORCE
      - The Critic is used for bootstrapping
      - Bootstrapping introduces bias and relies on the quality of the state representation
      - Bootstrapping reduces variance and accelerates learning
34. One-step Actor-Critic
    - Replace the full return with the one-step return
    - Replace the baseline with the approximated value function (the Critic)
      - Learned with semi-gradient TD(0)
35. One-step Actor-Critic
    - Update the Critic (value function) parameters, then the Actor (policy) parameters
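In the book's notation, both updates are driven by the same TD error δ_t:

```latex
\begin{aligned}
\delta_t &= R_{t+1} + \gamma\, \hat{v}(S_{t+1}, \mathbf{w})
            - \hat{v}(S_t, \mathbf{w}) \\
\mathbf{w} &\leftarrow \mathbf{w}
            + \alpha^{\mathbf{w}}\, \delta_t\, \nabla \hat{v}(S_t, \mathbf{w}) \\
\theta &\leftarrow \theta
            + \alpha^{\theta}\, \delta_t\, \nabla \ln \pi(A_t \mid S_t, \theta)
\end{aligned}
```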
36. Actor-Critic with Eligibility Traces
37. Average Reward for Continuing Problems
    - Measure performance in terms of the average reward per time step
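The average-reward performance measure and the corresponding differential return are:

```latex
r(\pi) = \lim_{h \to \infty} \frac{1}{h} \sum_{t=1}^{h}
  \mathbb{E}\big[ R_t \mid S_0,\, A_{0:t-1} \sim \pi \big]
= \sum_s \mu(s) \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r,
\qquad
G_t = \sum_{k=0}^{\infty} \big( R_{t+k+1} - r(\pi) \big)
```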
38. Actor-Critic for Continuing Problems
39. Policy Gradient Theorem Proof (Continuing Case)
    1. Follow the same procedure as in the episodic case
    2. Rearrange the equation
40. Policy Gradient Theorem Proof (Continuing Case)
    3. Sum over all states, weighted by the state distribution μ
       - Nothing changes, since neither side depends on s
41. Policy Gradient Theorem Proof (Continuing Case)
    4. Use ergodicity
42. Policy Parameterization for Continuous Actions
    - Policy-based methods can handle continuous action spaces
    - Learn the statistics of a probability distribution, e.g. the mean and variance of a Gaussian
    - Choose the action by sampling from the learned distribution
43. Policy Parameterization to a Gaussian Distribution
    - Divide the policy parameter vector into a mean part and a variance part
    - Approximate the mean with a linear function
    - Approximate the variance with the exponential of a linear function, which is guaranteed positive
    - All policy gradient algorithms can then be applied to this parameter vector
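A minimal sketch of the Gaussian parameterization. The feature vector and the parameter values are illustrative assumptions; the exponential is applied to the standard deviation here, which keeps it positive just as the slide requires:

```python
import numpy as np

def gaussian_policy_sample(theta_mu, theta_sigma, x, rng):
    """Sample a continuous action from a Gaussian policy.

    mean   mu(s)    = theta_mu . x(s)          (linear, as on the slide)
    stdev  sigma(s) = exp(theta_sigma . x(s))  (exponential keeps it > 0)
    """
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)
    return rng.normal(mu, sigma), mu, sigma

rng = np.random.default_rng(0)
x = np.array([1.0, 0.5])                 # feature vector x(s), illustrative
a, mu, sigma = gaussian_policy_sample(np.array([0.2, -0.4]),
                                      np.array([-1.0, 0.0]), x, rng)
# sigma = exp(-1.0) > 0 regardless of the parameter values.
```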
44. Summary
    - Policy gradient methods have many advantages over action-value methods
      - They can represent stochastic policies and approach deterministic policies
      - They learn appropriate levels of exploration
      - They handle continuous action spaces
      - The Policy Gradient Theorem gives the effect of the policy parameter on performance
    - Actor-Critic methods estimate a value function for bootstrapping
      - This introduces bias but is often desirable for its lower variance
      - Similar to preferring TD methods over MC methods
    - "Policy gradient methods provide a significantly different set of strengths and weaknesses than action-value methods."
45. Thank you!
    - Original content from Reinforcement Learning: An Introduction by Sutton and Barto
    - You can find more content at github.com/seungjaeryanlee and www.endtoend.ai