Proximal Policy Optimization Algorithms,
Schulman et al, 2017
옥찬호
utilForever@gmail.com
Introduction
• In recent years, several different approaches have been
proposed for reinforcement learning with neural network
function approximators. The leading contenders are deep Q-
learning, “vanilla” policy gradient methods, and trust region /
natural policy gradient methods.
• However, there is room for improvement in developing a
method that is scalable (to large models and parallel
implementations), data efficient, and robust (i.e., successful
on a variety of problems without hyperparameter tuning).
Introduction
• DQN
• fails on many simple problems and is poorly understood
• A3C – “Vanilla” policy gradient methods
• have poor data efficiency and robustness
• TRPO
• relatively complicated
• is not compatible with architectures that include noise (such as dropout) or
parameter sharing (between the policy and value function, or with auxiliary
tasks)
Introduction
• We propose a novel objective with clipped probability ratios,
which forms a pessimistic estimate (i.e., lower bound) of the
performance of the policy.
• This paper seeks to improve the current state of affairs by
introducing an algorithm that attains the data efficiency and
reliable performance of TRPO, while using only first-order
optimization.
Introduction
• To optimize policies, we alternate between sampling data from
the policy and performing several epochs of optimization on
the sampled data.
Introduction
• Our experiments compare the performance of several versions
of the surrogate objective, and find that the version with
clipped probability ratios performs best.
• We also compare PPO to several previous algorithms from the
literature.
• On continuous control tasks, it performs better than the algorithms we
compare against.
• On Atari, it performs significantly better (in terms of sample complexity) than
A2C and similarly to ACER, though it is much simpler.
Background: Policy Optimization
• Policy gradient methods work by computing an estimator of
the policy gradient and plugging it into a stochastic gradient
ascent algorithm. The most commonly used gradient estimator
has the form

$$ \hat{g} = \hat{\mathbb{E}}_t\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right] $$

• where $\pi_\theta$ is a stochastic policy and $\hat{A}_t$ is an estimator of the advantage
function at timestep $t$. Here, the expectation $\hat{\mathbb{E}}_t[\ldots]$ indicates the empirical
average over a finite batch of samples, in an algorithm that alternates
between sampling and optimization.
Background: Policy Optimization
• Implementations that use automatic differentiation software
work by constructing an objective function whose gradient is
the policy gradient estimator; the estimator ො𝑔 is obtained by
differentiating the objective (a minimal sketch of this construction follows below):

$$ L^{PG}(\theta) = \hat{\mathbb{E}}_t\left[ \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right] $$
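As a concrete illustration (not part of the original slides), the sketch below builds such an objective for a toy categorical policy, assuming PyTorch; the network shape, batch tensors, and variable names are illustrative assumptions. Differentiating the (negated) objective produces the gradient estimator $\hat{g}$ described above.

```python
# Minimal sketch of L^PG (assumed PyTorch; toy categorical policy, synthetic batch).
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))  # hypothetical policy net

states = torch.randn(64, 4)              # stand-in for sampled states s_t
actions = torch.randint(0, 2, (64,))     # stand-in for sampled actions a_t
advantages = torch.randn(64)             # stand-in for advantage estimates A_hat_t

dist = torch.distributions.Categorical(logits=policy(states))
objective = (dist.log_prob(actions) * advantages).mean()   # E_t[ log pi_theta(a_t|s_t) * A_hat_t ]

(-objective).backward()   # the gradients now hold the estimator g_hat (up to sign)
```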
Background: Policy Optimization
• While it is appealing to perform multiple steps of optimization
on this loss $L^{PG}$ using the same trajectory, doing so is not
well-justified, and empirically it often leads to destructively large
policy updates.
Background: Policy Optimization
• In TRPO, an objective function (the “surrogate” objective) is
maximized subject to a constraint on the size of the policy
update. Specifically,
$$ \underset{\theta}{\text{maximize}}\;\; \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t \right] \quad \text{subject to}\quad \hat{\mathbb{E}}_t\left[ \mathrm{KL}\left[ \pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right] \le \delta $$

• Here, $\theta_\text{old}$ is the vector of policy parameters before the update
(a small computational sketch of the surrogate and KL terms follows below).
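To make the two quantities concrete, here is a small sketch (NumPy, discrete actions, made-up array names) that evaluates the empirical surrogate objective and the mean-KL term from batches of old and new action probabilities; the conjugate-gradient machinery TRPO uses to solve the constrained problem (discussed on the next slide) is not shown.

```python
# Sketch (NumPy, discrete actions): empirical surrogate and mean-KL constraint term.
import numpy as np

def surrogate_and_mean_kl(pi_new, pi_old, actions, advantages):
    """pi_new, pi_old: (N, A) action probabilities; actions, advantages: (N,)."""
    rows = np.arange(len(actions))
    ratio = pi_new[rows, actions] / pi_old[rows, actions]   # pi_theta / pi_theta_old
    surrogate = np.mean(ratio * advantages)                 # E_t[ ratio * A_hat_t ]
    mean_kl = np.mean(np.sum(pi_old * (np.log(pi_old) - np.log(pi_new)), axis=1))
    return surrogate, mean_kl   # TRPO: maximize surrogate subject to mean_kl <= delta

rng = np.random.default_rng(0)
pi_old, pi_new = rng.dirichlet(np.ones(3), size=5), rng.dirichlet(np.ones(3), size=5)
print(surrogate_and_mean_kl(pi_new, pi_old, rng.integers(0, 3, size=5), rng.normal(size=5)))
```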
Background: Policy Optimization
• This problem can efficiently be approximately solved using the
conjugate gradient algorithm, after making a linear
approximation to the objective and a quadratic approximation
to the constraint.
Background: Policy Optimization
• The theory justifying TRPO actually suggests using a penalty
instead of a constraint, i.e., solving the unconstrained
optimization problem for some coefficient $\beta$:

$$ \underset{\theta}{\text{maximize}}\;\; \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \mathrm{KL}\left[ \pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right] $$
Background: Policy Optimization
• This follows from the fact that a certain surrogate objective
(which computes the max KL over states instead of the mean)
forms a lower bound (i.e., a pessimistic bound) on the
performance of the policy $\pi$.
• TRPO uses a hard constraint rather than a penalty because it is
hard to choose a single value of $\beta$ that performs well across
different problems, or even within a single problem, where the
characteristics change over the course of learning.
Clipped Surrogate Objective
• Let $r_t(\theta)$ denote the probability ratio $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$,
so $r_t(\theta_\text{old}) = 1$. TRPO maximizes a "surrogate" objective

$$ L^{CPI}(\theta) = \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t \right] = \hat{\mathbb{E}}_t\left[ r_t(\theta)\, \hat{A}_t \right] $$
• The superscript CPI refers to conservative policy iteration,
where this objective was proposed.
Clipped Surrogate Objective
• Without a constraint, maximization of $L^{CPI}$ would lead to an
excessively large policy update; hence, we now consider how
to modify the objective, to penalize changes to the policy that
move $r_t(\theta)$ away from 1.
• The main objective we propose is the following:

$$ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\!\left( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}\!\left( r_t(\theta),\, 1-\epsilon,\, 1+\epsilon \right) \hat{A}_t \right) \right] $$

• where $\epsilon$ is a hyperparameter, say, $\epsilon = 0.2$ (a minimal implementation sketch follows below).
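A minimal sketch of $L^{CLIP}$, assuming NumPy and per-timestep log-probabilities under the new and old policies (array names are illustrative):

```python
# Sketch (NumPy): the clipped surrogate objective L^CLIP over a batch of timesteps.
import numpy as np

def l_clip(logp_new, logp_old, advantages, epsilon=0.2):
    ratio = np.exp(logp_new - logp_old)                      # r_t(theta)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)   # clip(r_t, 1-eps, 1+eps)
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```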
Clipped Surrogate Objective
• The motivation for this objective is as follows.
• The first term inside the min is $L^{CPI}(\theta)$.
• The second term, $\mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t$, modifies the surrogate objective
by clipping the probability ratio, which removes the incentive for moving
$r_t(\theta)$ outside of the interval $[1-\epsilon, 1+\epsilon]$.
• Finally, we take the minimum of the clipped and unclipped objective, so the
final objective is a lower bound (i.e., a pessimistic bound) on the unclipped
objective (a small numeric check of this behavior follows below).
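A small numeric check of one term of the objective (NumPy, illustrative values only); this mirrors the behavior plotted in Figure 1 on the next slide:

```python
# Sketch (NumPy): one timestep's term of L^CLIP as a function of r, for A > 0 and A < 0.
import numpy as np

def clip_term(r, advantage, eps=0.2):
    return np.minimum(r * advantage, np.clip(r, 1 - eps, 1 + eps) * advantage)

r = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
print(clip_term(r, advantage=+1.0))  # [0., 0.5, 1., 1.2, 1.2]   -> upside capped at 1+eps
print(clip_term(r, advantage=-1.0))  # [-0.8, -0.8, -1., -1.5, -2.] -> large harmful moves stay penalized
```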
Clipped Surrogate Objective
Figure 1: Plots showing one term (i.e., a single timestep) of the surrogate function
$L^{CLIP}$ as a function of the probability ratio $r$, for positive advantages (left) and negative
advantages (right). The red circle on each plot shows the starting point for the
optimization, i.e., $r = 1$. Note that $L^{CLIP}$ sums many of these terms.
Clipped Surrogate Objective
Figure 2: Surrogate objectives, as we interpolate between the initial policy parameter
$\theta_\text{old}$ and the updated policy parameter, which we compute after one iteration of PPO.
The updated policy has a KL divergence of about 0.02 from the initial policy, and this is
the point at which $L^{CLIP}$ is maximal.
Adaptive KL Penalty Coefficient
• Another approach, which can be used as an alternative to the
clipped surrogate objective, or in addition to it, is to use a
penalty on KL divergence, and to adapt the penalty coefficient
so that we achieve some target value of the KL divergence
$d_\text{targ}$ each policy update.
• In our experiments, we found that the KL penalty performed worse than the
clipped surrogate objective; however, we've included it here because it's an
important baseline.
Adaptive KL Penalty Coefficient
• In the simplest instantiation of this algorithm, we perform the
following steps in each policy update:
• Using several epochs of minibatch SGD, optimize the KL-penalized objective

$$ L^{KLPEN}(\theta) = \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}\, \hat{A}_t - \beta\, \mathrm{KL}\left[ \pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right] $$

• Compute $d = \hat{\mathbb{E}}_t\left[ \mathrm{KL}\left[ \pi_{\theta_\text{old}}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t) \right] \right]$
• If $d < d_\text{targ} / 1.5$, then $\beta \leftarrow \beta / 2$
• If $d > d_\text{targ} \times 1.5$, then $\beta \leftarrow \beta \times 2$
• The updated $\beta$ is used for the next policy update (see the sketch of this rule below).
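The update rule in the last three bullets translates directly into a few lines (plain Python; names are illustrative):

```python
# Sketch: the adaptive KL-penalty coefficient update from the slide.
def update_beta(beta, d, d_targ):
    if d < d_targ / 1.5:
        return beta / 2.0    # KL well below target: weaken the penalty
    if d > d_targ * 1.5:
        return beta * 2.0    # KL well above target: strengthen the penalty
    return beta

print(update_beta(beta=1.0, d=0.005, d_targ=0.01))  # -> 0.5
```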
Algorithm
• Most techniques for computing variance-reduced advantage-function
estimators make use of a learned state-value function $V(s)$, for example:
• Generalized advantage estimation
• The finite-horizon estimators
Algorithm
• If using a neural network architecture that shares parameters
between the policy and value function, we must use a loss function
that combines the policy surrogate and a value function error term.
• This objective can further be augmented by adding an entropy bonus
to ensure sufficient exploration, as suggested in past work (A3C).
Algorithm
• Combining these terms, we obtain the following objective,
which is (approximately) maximized each iteration:

$$ L_t^{CLIP+VF+S}(\theta) = \hat{\mathbb{E}}_t\left[ L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t) \right] $$

• where $c_1, c_2$ are coefficients, $S$ denotes an entropy bonus,
and $L_t^{VF}$ is a squared-error loss $\left( V_\theta(s_t) - V_t^\text{targ} \right)^2$
(a minimal sketch of this combined loss follows below).
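A minimal sketch of the combined loss for a shared actor-critic network, assuming PyTorch and discrete actions; the architecture, coefficient values, and tensor names are illustrative assumptions rather than the paper's exact setup.

```python
# Sketch (assumed PyTorch, shared actor-critic, discrete actions): the combined
# objective L^{CLIP+VF+S} expressed as a single quantity to minimize.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim=4, n_actions=2):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.policy_head = nn.Linear(64, n_actions)   # shared trunk, two heads
        self.value_head = nn.Linear(64, 1)

    def forward(self, obs):
        h = self.trunk(obs)
        return self.policy_head(h), self.value_head(h).squeeze(-1)

def combined_loss(model, obs, actions, logp_old, advantages, v_targ,
                  eps=0.2, c1=0.5, c2=0.01):
    logits, values = model(obs)
    dist = torch.distributions.Categorical(logits=logits)
    ratio = torch.exp(dist.log_prob(actions) - logp_old)                         # r_t(theta)
    clip_obj = torch.min(ratio * advantages,
                         torch.clamp(ratio, 1 - eps, 1 + eps) * advantages)      # L_t^CLIP
    value_loss = (values - v_targ).pow(2)                                        # L_t^VF
    entropy = dist.entropy()                                                     # S[pi_theta](s_t)
    # Maximizing clip_obj - c1 * value_loss + c2 * entropy == minimizing its negation.
    return -(clip_obj - c1 * value_loss + c2 * entropy).mean()
```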
Algorithm
• One style of policy gradient implementation, popularized in
A3C and well-suited for use with recurrent neural networks,
runs the policy for $T$ timesteps (where $T$ is much less than the
episode length), and uses the collected samples for an update.
Algorithm
• This style requires an advantage estimator that does not look
beyond timestep $T$. The estimator used by A3C is

$$ \hat{A}_t = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t-1} r_{T-1} + \gamma^{T-t} V(s_T) $$

• where $t$ specifies the time index in $[0, T]$,
within a given length-$T$ trajectory segment (a small sketch of this estimator follows below).
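A sketch of this estimator (NumPy), computing the bootstrapped return $r_t + \gamma r_{t+1} + \cdots + \gamma^{T-t} V(s_T)$ for every $t$ in one backward pass and subtracting $V(s_t)$; array names are illustrative.

```python
# Sketch (NumPy): finite-horizon advantage estimates for all t in a length-T segment.
import numpy as np

def nstep_advantages(rewards, values, last_value, gamma=0.99):
    """rewards, values: length-T arrays for t = 0..T-1; last_value: V(s_T)."""
    T = len(rewards)
    advantages = np.empty(T)
    running_return = last_value                       # bootstrap with V(s_T)
    for t in reversed(range(T)):
        running_return = rewards[t] + gamma * running_return
        advantages[t] = running_return - values[t]    # A_hat_t = return_t - V(s_t)
    return advantages
```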
Algorithm
• Generalizing this choice, we can use a truncated version of
generalized advantage estimation, which reduces to the previous
equation when $\lambda = 1$ (a small sketch of truncated GAE follows below):

$$ \hat{A}_t = \delta_t + (\gamma\lambda)\, \delta_{t+1} + \cdots + (\gamma\lambda)^{T-t-1}\, \delta_{T-1}, \quad \text{where } \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$
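A sketch of the truncated estimator (NumPy); setting `lam=1.0` makes it coincide with the finite-horizon estimator above (names are illustrative).

```python
# Sketch (NumPy): truncated generalized advantage estimation over a length-T segment.
import numpy as np

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """rewards, values: length-T arrays for t = 0..T-1; last_value: V(s_T)."""
    T = len(rewards)
    advantages = np.empty(T)
    next_value, running = last_value, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]   # delta_t
        running = delta + gamma * lam * running               # sum_k (gamma*lam)^k delta_{t+k}
        advantages[t] = running
        next_value = values[t]
    return advantages
```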
Algorithm
• A proximal policy optimization (PPO) algorithm that uses
fixed-length trajectory segments is shown below.
• Each iteration, each of $N$ (parallel) actors collects $T$ timesteps of data.
• Then we construct the surrogate loss on these $NT$ timesteps of data, and
optimize it with minibatch SGD (or usually, for better performance, Adam)
for $K$ epochs (a sketch of this data flow follows below).
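To show the data flow only, here is a schematic loop (NumPy); the collected arrays and the gradient step are placeholders, not a working agent: the $N \cdot T$ pooled samples are reshuffled each of $K$ epochs and consumed in minibatches.

```python
# Schematic sketch (NumPy): PPO's epoch/minibatch structure over N*T collected timesteps.
import numpy as np

N, T, K, minibatch_size = 8, 128, 4, 64
rng = np.random.default_rng(0)

# Placeholder stand-ins for the pooled rollout data from N actors x T timesteps.
dataset = {name: rng.normal(size=N * T) for name in ("logp_old", "advantages", "v_targ")}

for epoch in range(K):
    order = rng.permutation(N * T)                        # reshuffle once per epoch
    for start in range(0, N * T, minibatch_size):
        idx = order[start:start + minibatch_size]
        minibatch = {name: arr[idx] for name, arr in dataset.items()}
        # ...evaluate the clipped surrogate (plus value and entropy terms) on `minibatch`
        # and take one SGD/Adam step on the policy and value parameters here.
```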
Experiments
• First, we compare several different surrogate objectives under
different hyperparameters. Here, we compare the surrogate
objective $L^{CLIP}$ to several natural variations and ablated
versions.
• No clipping or penalty: $L_t(\theta) = r_t(\theta)\, \hat{A}_t$
• Clipping: $L_t(\theta) = \min\!\left( r_t(\theta)\, \hat{A}_t,\; \mathrm{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\, \hat{A}_t \right)$
• KL penalty (fixed or adaptive): $L_t(\theta) = r_t(\theta)\, \hat{A}_t - \beta\, \mathrm{KL}\left[ \pi_{\theta_\text{old}},\, \pi_\theta \right]$
Experiments
(Figure-only slides: result plots for the comparisons described above; the figures are not reproduced in this text version.)
Conclusion
• We have introduced proximal policy optimization, a family of
policy optimization methods that use multiple epochs of
stochastic gradient ascent to perform each policy update.
Conclusion
• These methods have the stability and reliability of trust-region
methods but are much simpler to implement, requiring only a
few lines of code change to a vanilla policy gradient (A3C)
implementation; they are applicable in more general settings (for
example, when using a joint architecture for the policy and
value function) and have better overall performance.
References
• https://reinforcement-learning-kr.github.io/2018/06/22/7_ppo/
• https://lynnn.tistory.com/73
• https://jay.tech.blog/2018/10/09/trpo%EC%99%80-ppo/
• https://talkingaboutme.tistory.com/entry/RL-Policy-Gradient-Algorithms
Thank you!