Trust Region Policy Optimization, Schulman et al, 2015.
옥찬호
utilForever@gmail.com
Introduction
• Most algorithms for policy optimization can be classified into one of three broad categories:
• (1) Policy iteration methods (Alternate between estimating the value
function under the current policy and improving the policy)
• (2) Policy gradient methods (Use an estimator of the gradient of the
expected return)
• (3) Derivative-free optimization methods (Treat the return as a black box
function to be optimized in terms of the policy parameters) such as
• The Cross-Entropy Method (CEM)
• The Covariance Matrix Adaptation (CMA)
Introduction
• General derivative-free stochastic optimization methods such as CEM and CMA are preferred
on many problems, because they achieve good results while being simple to understand and implement.
• For example, while Tetris is a classic benchmark problem for
approximate dynamic programming (ADP) methods, stochastic
optimization methods are difficult to beat on this task.
Introduction
• For continuous control problems, methods like CMA have been
successful at learning control policies for challenging tasks like
locomotion when provided with hand-engineered policy
classes with low-dimensional parameterizations.
• The inability of ADP and gradient-based methods to
consistently beat gradient-free random search is unsatisfying, since
gradient-based optimization algorithms enjoy much better sample
complexity guarantees than gradient-free methods.
Introduction
• Continuous gradient-based optimization has been very
successful at learning function approximators for supervised
learning tasks with huge numbers of parameters, and
extending their success to reinforcement learning would allow
for efficient training of complex and powerful policies.
Preliminaries
• Markov Decision Process (MDP): (𝒮, 𝒜, 𝑃, 𝑟, 𝜌0, 𝛾)
• 𝒮 is a finite set of states
• 𝒜 is a finite set of actions
• 𝑃 : 𝒮 × 𝒜 × 𝒮 → ℝ is the transition probability distribution
• 𝑟 : 𝒮 → ℝ is the reward function
• 𝜌0 : 𝒮 → ℝ is the distribution of the initial state 𝑠0
• 𝛾 ∈ (0, 1) is the discount factor
• Policy
• 𝜋 : 𝒮 × 𝒜 → [0, 1]
Preliminaries
How to measure policy’s performance?
Preliminaries
• Expected cumulative discounted reward
• Action value, Value/Advantage functions
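The definitions themselves appeared only as images in the slides; as stated in the TRPO paper, they are:

$$\eta(\pi) = \mathbb{E}_{s_0, a_0, \ldots}\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t)\Big], \quad \text{where } s_0 \sim \rho_0,\; a_t \sim \pi(\cdot \mid s_t),\; s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$$

$$Q_\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}, a_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big], \qquad V_\pi(s_t) = \mathbb{E}_{a_t, s_{t+1}, \ldots}\Big[\sum_{l=0}^{\infty} \gamma^l r(s_{t+l})\Big]$$

$$A_\pi(s, a) = Q_\pi(s, a) - V_\pi(s)$$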
Preliminaries
• Lemma 1. Kakade & Langford (2002)
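The lemma, which appeared as an image on the slide, states (per the paper):

$$\eta(\tilde\pi) = \eta(\pi) + \mathbb{E}_{\tau \sim \tilde\pi}\Big[\sum_{t=0}^{\infty} \gamma^t A_\pi(s_t, a_t)\Big]$$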
• Proof
Preliminaries
• Discounted (unnormalized) visitation frequencies
• Rewrite Lemma 1
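Both expressions were shown as images; from the paper, the visitation frequencies and the rewritten lemma are:

$$\rho_\pi(s) = P(s_0 = s) + \gamma P(s_1 = s) + \gamma^2 P(s_2 = s) + \cdots$$

$$\eta(\tilde\pi) = \eta(\pi) + \sum_s \rho_{\tilde\pi}(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a)$$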
Preliminaries
• This equation implies that any policy update $\pi \to \tilde\pi$ that has
a nonnegative expected advantage at every state $s$,
i.e., $\sum_a \tilde\pi(a \mid s)\, A_\pi(s, a) \ge 0$, is guaranteed to increase
the policy performance $\eta$, or leave it constant in the case that
the expected advantage is zero everywhere.
Preliminaries
• However, in the approximate setting, it will typically be
unavoidable, due to estimation and approximation error, that
there will be some states $s$ for which the expected advantage
is negative, that is, $\sum_a \tilde\pi(a \mid s)\, A_\pi(s, a) < 0$.
• The complex dependency of $\rho_{\tilde\pi}(s)$ on $\tilde\pi$ makes the equation
difficult to optimize directly.
Preliminaries
• Instead, we introduce the following local approximation to 𝜂:
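The approximation itself was shown as an image; from the paper, it is:

$$L_\pi(\tilde\pi) = \eta(\pi) + \sum_s \rho_\pi(s) \sum_a \tilde\pi(a \mid s)\, A_\pi(s, a)$$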
• Note that $L_\pi$ uses the visitation frequency $\rho_\pi$ rather than $\rho_{\tilde\pi}$,
ignoring changes in state visitation density due to changes in
the policy.
Preliminaries
• However, if we have a parameterized policy $\pi_\theta$, where
$\pi_\theta(a \mid s)$ is a differentiable function of the parameter vector $\theta$,
then $L_{\pi_\theta}$ matches $\eta$ to first order.
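The first-order matching conditions referred to here appeared as an image; per the paper, for any parameter value $\theta_0$:

$$L_{\pi_{\theta_0}}(\pi_{\theta_0}) = \eta(\pi_{\theta_0}), \qquad \nabla_\theta L_{\pi_{\theta_0}}(\pi_\theta)\big|_{\theta = \theta_0} = \nabla_\theta \eta(\pi_\theta)\big|_{\theta = \theta_0}$$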
• This equation implies that a sufficiently small step $\pi_{\theta_0} \to \tilde\pi$ that
improves $L_{\pi_{\theta_\text{old}}}$ will also improve $\eta$, but does not give us any
guidance on how big of a step to take.
Preliminaries
• Conservative Policy Iteration
• Provide explicit lower bounds on the improvement of 𝜂.
• Let $\pi_\text{old}$ denote the current policy, and let $\pi' = \arg\max_{\pi'} L_{\pi_\text{old}}(\pi')$.
The new policy $\pi_\text{new}$ was defined to be the following mixture:
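The mixture policy, shown as an image on the slide, is defined in the paper as:

$$\pi_\text{new}(a \mid s) = (1 - \alpha)\,\pi_\text{old}(a \mid s) + \alpha\,\pi'(a \mid s)$$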
Preliminaries
• Conservative Policy Iteration
• The following lower bound:
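The bound itself appeared as an image; as quoted in the paper, the conservative policy iteration result is:

$$\eta(\pi_\text{new}) \ge L_{\pi_\text{old}}(\pi_\text{new}) - \frac{2\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \text{where } \epsilon = \max_s \big|\mathbb{E}_{a \sim \pi'(\cdot \mid s)}[A_{\pi_\text{old}}(s, a)]\big|$$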
• However, this policy class is unwieldy and restrictive in practice,
and it is desirable for a practical policy update scheme to be applicable to
all general stochastic policy classes.
Monotonic Improvement Guarantee
• The policy improvement bound in the above equation can be
extended to general stochastic policies, by
• (1) Replacing $\alpha$ with a distance measure between $\pi$ and $\tilde\pi$
• (2) Changing the constant $\epsilon$ appropriately
Monotonic Improvement Guarantee
• The particular distance measure we use is
the total variation divergence for
discrete probability distributions $p, q$:
$$D_\text{TV}(p \,\|\, q) = \frac{1}{2} \sum_i |p_i - q_i|$$
• Define $D_\text{TV}^\text{max}(\pi, \tilde\pi)$ as
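The definition, which appeared as an image on the slide, is (from the paper):

$$D_\text{TV}^\text{max}(\pi, \tilde\pi) = \max_s D_\text{TV}\big(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\big)$$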
Monotonic Improvement Guarantee
• Theorem 1. Let $\alpha = D_\text{TV}^\text{max}(\pi_\text{old}, \pi_\text{new})$.
Then the following bound holds:
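The bound in Theorem 1, shown as an image on the slide, is:

$$\eta(\pi_\text{new}) \ge L_{\pi_\text{old}}(\pi_\text{new}) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\,\alpha^2, \qquad \text{where } \epsilon = \max_{s,a} |A_{\pi_\text{old}}(s, a)|$$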
Monotonic Improvement Guarantee
• We note the following relationship between the total variation
divergence and the KL divergence:
$$D_\text{TV}(p \,\|\, q)^2 \le D_\text{KL}(p \,\|\, q)$$
• Let $D_\text{KL}^\text{max}(\pi, \tilde\pi) = \max_s D_\text{KL}\big(\pi(\cdot \mid s) \,\|\, \tilde\pi(\cdot \mid s)\big)$.
The following bound then follows directly from Theorem 1:
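The resulting bound, shown as an image on the slide, is:

$$\eta(\tilde\pi) \ge L_\pi(\tilde\pi) - C\, D_\text{KL}^\text{max}(\pi, \tilde\pi), \qquad \text{where } C = \frac{4\epsilon\gamma}{(1-\gamma)^2}$$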
Monotonic Improvement Guarantee
• Algorithm 1
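The algorithm box itself appeared as an image; as described in the paper, Algorithm 1 initializes $\pi_0$, and at each iteration $i$ it computes all advantage values $A_{\pi_i}(s, a)$ and solves the penalized surrogate maximization

$$\pi_{i+1} = \arg\max_\pi \Big[ L_{\pi_i}(\pi) - \frac{4\epsilon\gamma}{(1-\gamma)^2}\, D_\text{KL}^\text{max}(\pi_i, \pi) \Big]$$

until convergence.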
Monotonic Improvement Guarantee
• Algorithm 1 is guaranteed to generate a monotonically
improving sequence of policies $\eta(\pi_0) \le \eta(\pi_1) \le \eta(\pi_2) \le \cdots$.
• Let $M_i(\pi) = L_{\pi_i}(\pi) - C\, D_\text{KL}^\text{max}(\pi_i, \pi)$. Then,
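The inequality chain following "Then," appeared as an image; from the paper:

$$\eta(\pi_{i+1}) \ge M_i(\pi_{i+1}), \qquad \eta(\pi_i) = M_i(\pi_i),$$
$$\text{so} \quad \eta(\pi_{i+1}) - \eta(\pi_i) \ge M_i(\pi_{i+1}) - M_i(\pi_i)$$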
• Thus, by maximizing 𝑀𝑖 at each iteration, we guarantee that
the true objective 𝜂 is non-decreasing.
• This algorithm is a type of minorization-maximization (MM) algorithm.
Optimization of Parameterized Policies
• Overload our previous notation to use functions of 𝜃.
• $\eta(\theta) := \eta(\pi_\theta)$
• $L_\theta(\tilde\theta) := L_{\pi_\theta}(\pi_{\tilde\theta})$
• $D_\text{KL}(\theta \,\|\, \tilde\theta) := D_\text{KL}(\pi_\theta \,\|\, \pi_{\tilde\theta})$
• Use 𝜃old to denote the previous policy parameters
Optimization of Parameterized Policies
• The preceding section showed that
$$\eta(\theta) \ge L_{\theta_\text{old}}(\theta) - C\, D_\text{KL}^\text{max}(\theta_\text{old}, \theta)$$
• Thus, by performing the following maximization,
we are guaranteed to improve the true objective $\eta$:
$$\underset{\theta}{\text{maximize}}\; \Big[ L_{\theta_\text{old}}(\theta) - C\, D_\text{KL}^\text{max}(\theta_\text{old}, \theta) \Big]$$
Optimization of Parameterized Policies
• In practice, if we used the penalty coefficient 𝐶 recommended
by the theory above, the step sizes would be very small.
• One way to take larger steps in a robust way is to use a
constraint on the KL divergence between the new policy and
the old policy, i.e., a trust region constraint:
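The trust-region problem, shown as an image on the slide, is:

$$\underset{\theta}{\text{maximize}}\; L_{\theta_\text{old}}(\theta) \quad \text{subject to} \quad D_\text{KL}^\text{max}(\theta_\text{old}, \theta) \le \delta$$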
Optimization of Parameterized Policies
• This problem imposes a constraint that the KL divergence is
bounded at every point in the state space. While it is motivated
by the theory, this problem is impractical to solve due to the
large number of constraints.
• Instead, we can use a heuristic approximation which considers
the average KL divergence:
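The average KL divergence, shown as an image on the slide, is defined in the paper as:

$$\bar{D}_\text{KL}^{\rho}(\theta_1, \theta_2) := \mathbb{E}_{s \sim \rho}\Big[ D_\text{KL}\big(\pi_{\theta_1}(\cdot \mid s) \,\|\, \pi_{\theta_2}(\cdot \mid s)\big) \Big]$$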
Optimization of Parameterized Policies
• We therefore propose solving the following optimization
problem to generate a policy update:
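The proposed update, shown as an image on the slide, is:

$$\underset{\theta}{\text{maximize}}\; L_{\theta_\text{old}}(\theta) \quad \text{subject to} \quad \bar{D}_\text{KL}^{\rho_{\theta_\text{old}}}(\theta_\text{old}, \theta) \le \delta$$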
Sample-Based Estimation
• How can the objective and constraint functions be
approximated using Monte Carlo simulation?
• We seek to solve the following optimization problem, obtained
by expanding $L_{\theta_\text{old}}$ in the previous equation:
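The expanded problem, shown as an image on the slide, is:

$$\underset{\theta}{\text{maximize}}\; \sum_s \rho_{\theta_\text{old}}(s) \sum_a \pi_\theta(a \mid s)\, A_{\theta_\text{old}}(s, a) \quad \text{subject to} \quad \bar{D}_\text{KL}^{\rho_{\theta_\text{old}}}(\theta_\text{old}, \theta) \le \delta$$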
Sample-Based Estimation
• We replace
• $\sum_s \rho_{\theta_\text{old}}(s)\,[\ldots] \;\to\; \frac{1}{1-\gamma}\,\mathbb{E}_{s \sim \rho_{\theta_\text{old}}}[\ldots]$
• $A_{\theta_\text{old}} \;\to\; Q_{\theta_\text{old}}$
• $\sum_a \pi_\theta(a \mid s)\, A_{\theta_\text{old}}(s, a) \;\to\; \mathbb{E}_{a \sim q}\Big[\frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, A_{\theta_\text{old}}(s, a)\Big]$
(We replace the sum over the actions by an importance sampling estimator, with $q$ denoting the sampling distribution.)
Sample-Based Estimation
• Now, our optimization problem in the previous equation is exactly
equivalent to the following one, written in terms of
expectations:
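The equivalent problem in expectation form, shown as an image on the slide, is:

$$\underset{\theta}{\text{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_\text{old}},\, a \sim q}\Big[ \frac{\pi_\theta(a \mid s)}{q(a \mid s)}\, Q_{\theta_\text{old}}(s, a) \Big] \quad \text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_\text{old}}}\Big[ D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s)\big) \Big] \le \delta$$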
• Constrained Optimization → Sample-based Estimation
Sample-Based Estimation
• All that remains is to replace the expectations by sample
averages and replace the 𝑄 value by an empirical estimate.
The following sections describe two different schemes for
performing this estimation.
• Single Path: Typically used for policy gradient estimation (Bartlett & Baxter,
2011), and is based on sampling individual trajectories.
• Vine: Constructing a rollout set and then performing multiple actions from
each state in the rollout set.
Practical Algorithm
1. Use the single path or vine procedures to collect a set of state-
action pairs along with Monte Carlo estimates of their 𝑄-values.
2. By averaging over samples, construct the estimated objective and
constraint in the equation above.
3. Approximately solve this constrained optimization problem to
update the policy’s parameter vector 𝜃. We use the conjugate
gradient algorithm followed by a line search, which is altogether
only slightly more expensive than computing the gradient itself.
See Appendix C for details.
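To make step 3 concrete, the following is a minimal sketch (assuming NumPy; not the authors' code) of a conjugate-gradient step followed by a backtracking line search. The `conjugate_gradient` and `trpo_step` helpers and the toy quadratic `surrogate`, `kl`, `A`, and `g` objects are hypothetical stand-ins chosen only so the script runs end to end; a real implementation would compute the surrogate, the KL divergence, and the Fisher-vector products from policy rollouts, as described on the next slide.

```python
# Sketch of TRPO's update step: conjugate gradient + backtracking line search.
import numpy as np

def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
    """Approximately solve F x = g, where fvp(v) returns the product F v."""
    x = np.zeros_like(g)
    r = g.copy()                 # residual g - F x (x starts at 0)
    p = r.copy()                 # search direction
    r_dot_r = r @ r
    for _ in range(iters):
        fvp_p = fvp(p)
        alpha = r_dot_r / (p @ fvp_p)
        x += alpha * p
        r -= alpha * fvp_p
        new_r_dot_r = r @ r
        if new_r_dot_r < tol:
            break
        p = r + (new_r_dot_r / r_dot_r) * p
        r_dot_r = new_r_dot_r
    return x

def trpo_step(surrogate, kl, grad, fvp, theta_old, delta=0.01, backtrack=10):
    """Scale the CG direction to the KL limit, then backtrack until the
    surrogate improves and the KL constraint holds."""
    step_dir = conjugate_gradient(fvp, grad)
    # Choose the full step so that (1/2) s^T F s = delta.
    shs = step_dir @ fvp(step_dir)
    full_step = np.sqrt(2.0 * delta / (shs + 1e-8)) * step_dir
    old_value = surrogate(theta_old)
    for i in range(backtrack):
        theta_new = theta_old + (0.5 ** i) * full_step
        if surrogate(theta_new) > old_value and kl(theta_old, theta_new) <= delta:
            return theta_new
    return theta_old  # no acceptable step found; keep the old parameters

# Toy usage: quadratic surrogate and quadratic KL, with F taken as the KL Hessian.
A = np.diag([2.0, 0.5])                       # stand-in Fisher/KL Hessian
g = np.array([1.0, -0.5])                     # stand-in surrogate gradient
surrogate = lambda th: g @ th - 0.5 * th @ A @ th
kl = lambda th0, th1: 0.5 * (th1 - th0) @ A @ (th1 - th0)
theta = trpo_step(surrogate, kl, g, lambda v: A @ v, np.zeros(2))
print(theta)
```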
Practical Algorithm
• With regard to (3), we construct the Fisher information matrix
(FIM) by analytically computing the Hessian of KL divergence,
rather than using the covariance matrix of the gradients.
• That is, we estimate $A_{ij}$ as the first expression below, rather than the second.
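The two expressions, which appeared as images on the slide, are (as given in the paper) the analytic KL-Hessian estimator

$$\hat A_{ij} = \frac{1}{N} \sum_{n=1}^{N} \frac{\partial^2}{\partial \theta_i\, \partial \theta_j} D_\text{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s_n) \,\|\, \pi_\theta(\cdot \mid s_n)\big)$$

rather than the covariance-of-gradients estimator

$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \theta_i} \log \pi_\theta(a_n \mid s_n)\, \frac{\partial}{\partial \theta_j} \log \pi_\theta(a_n \mid s_n).$$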
• This analytic estimator has computational benefits in the
large-scale setting, since it removes the need to store a dense
Hessian or all policy gradients from a batch of trajectories.
Experiments
We designed experiments to investigate the following questions:
1. What are the performance characteristics of the single path and vine
sampling procedures?
2. TRPO is related to prior methods but makes several changes, most notably by
using a fixed KL divergence constraint rather than a fixed penalty coefficient. How does
this affect the performance of the algorithm?
3. Can TRPO be used to solve challenging large-scale problems? How does
TRPO compare with other methods when applied to large-scale problems,
with regard to final performance, computation time, and sample complexity?
Experiments
• Simulated Robotic Locomotion
(Results were presented as figures in the original slides.)
Experiments
• Playing Games from Images
• To evaluate TRPO on a partially observed task with complex observations, we
trained policies for playing Atari games, using raw images as input.
• Challenging points
• In addition to the high dimensionality, challenging elements of these games include delayed rewards (no
immediate penalty is incurred when a life is lost in Breakout or Space Invaders)
• Complex sequences of behavior (Q*bert requires a character to hop on 21 different platforms)
• Non-stationary image statistics (Enduro involves a changing and flickering background)
Experiments
• Playing Games from Images
(Results were presented as figures and tables in the original slides.)
Discussion
• We proposed and analyzed trust region methods for optimizing
stochastic control policies.
• We proved monotonic improvement for an algorithm that
repeatedly optimizes a local approximation to the expected
return of the policy with a KL divergence penalty.
Discussion
• In the domain of robotic locomotion, we successfully learned
controllers for swimming, walking and hopping in a physics
simulator, using general purpose neural networks and
minimally informative rewards.
• At the intersection of the two experimental domains we
explored, there is the possibility of learning robotic control
policies that use vision and raw sensory data as input, providing
a unified scheme for training robotic controllers that perform
both perception and control.
Discussion
• By combining our method with model learning, it would also be
possible to substantially reduce its sample complexity, making
it applicable to real-world settings where samples are
expensive.
Summary
(The summary slide was presented as a diagram in the original.)
References
• https://www.slideshare.net/WoongwonLee/trpo-87165690
• https://reinforcement-learning-kr.github.io/2018/06/24/5_trpo/
• https://www.youtube.com/watch?v=XBO4oPChMfI
• https://www.youtube.com/watch?v=CKaN5PgkSBc
Thank you!