Adversarially Guided Actor-Critic,
Y. Flet-Berliac et al, 2021
• Markov Decision Process
• Policy Iteration
• Policy Gradient
• Actor-Critic
• Entropy
• KL Divergence
• PPO (Proximal Policy Optimization)
• GAN (Generative Adversarial Network)
• Generalization and exploration in RL still represent key
challenges that leave most current methods ineffective.
• First, a battery of recent studies (Farebrother et al., 2018; Zhang et al.,
2018a; Song et al., 2020; Cobbe et al., 2020) indicates that current RL
methods fail to generalize correctly even when agents have been trained in a
diverse set of environments.
• Second, exploration has been extensively studied in RL; however, most hard-
exploration problems use the same environment for training and evaluation.
• Hence, since a well-designed exploration strategy should
maximize the information received from a trajectory about an
environment, the exploration capabilities may not be
appropriately assessed if that information is memorized.
• In this work, we propose Adversarially Guided Actor-Critic
(AGAC), which reconsiders the actor-critic framework by
introducing a third protagonist: the adversary.
• Its role is to predict the actor’s actions correctly. Meanwhile, the actor must
not only find the optimal actions to maximize the sum of expected returns,
but also counteract the predictions of the adversary.
• This formulation is lightly inspired by adversarial methods, specifically
generative adversarial networks (GANs) (Goodfellow et al., 2014).
• Such a link between GANs and actor-critic methods has been
formalized by Pfau & Vinyals (2016); however, in the context
of a third protagonist, we draw a different analogy.
• The adversary can be interpreted as playing the role of a discriminator that
must predict the actions of the actor, and the actor can be considered as
playing the role of a generator that behaves to deceive the predictions of the
adversary.
• This approach has the advantage, as with GANs, that the optimization
procedure generates a diversity of meaningful data, corresponding to
sequences of actions in AGAC.
• The contributions of this work are as follows:
• (i) we propose a novel actor-critic formulation inspired by adversarial
learning (AGAC)
• (ii) we empirically analyze AGAC on key reinforcement learning aspects such
as diversity, exploration and stability
• (iii) we demonstrate significant gains in performance on several sparse-
reward hard-exploration tasks including procedurally-generated tasks
Backgrounds and Notations
• Markov Decision Process (MDP)
$M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$
• $\mathcal{S}$ : The state space
• $\mathcal{A}$ : The action space
• $\mathcal{P}$ : The transition kernel
• $R$ : The bounded reward function
• $\gamma \in [0, 1)$ : The discount factor
• Let $\pi$ denote a stochastic policy mapping states to distributions
over actions. We place ourselves in the infinite-horizon setting.
• i.e., we seek a policy that optimizes $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
• The value of a state is the quantity $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$ and the
value of a state-action pair $Q^\pi(s, a)$ of performing action $a$ in state $s$ and then
following policy $\pi$ is defined as: $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$.
• The advantage function, which quantifies how much better an action $a$ is than the
average action in state $s$, is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
• Finally, the entropy $\mathcal{H}^\pi$ of a policy is calculated as:
$$\mathcal{H}^\pi(s) = \mathbb{E}_{\pi(\cdot \mid s)}\left[-\log \pi(\cdot \mid s)\right].$$
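The advantage and entropy definitions can be checked on a single state with three actions. This is a minimal sketch in plain Python; the policy `pi`, the action values `q` and the helper names are illustrative assumptions, not taken from the paper.

```python
import math

def entropy(probs):
    """Entropy H(s) = E_{a~pi}[-log pi(a|s)] of a categorical policy."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def state_value(probs, q_values):
    """V(s) = E_{a~pi}[Q(s, a)]."""
    return sum(p * q for p, q in zip(probs, q_values))

def advantages(probs, q_values):
    """A(s, a) = Q(s, a) - V(s), one entry per action."""
    v = state_value(probs, q_values)
    return [q - v for q in q_values]

# Hypothetical numbers for one state with three actions.
pi = [0.5, 0.25, 0.25]
q = [1.0, 0.0, 2.0]

adv = advantages(pi, q)          # advantage of each action
h = entropy(pi)                  # entropy of the policy in this state
mean_adv = sum(p * a for p, a in zip(pi, adv))  # always 0 by construction
```

The expected advantage under the policy is zero by definition, which is why a positive advantage means "better than the average action".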
• An actor-critic algorithm is composed of two main components:
a policy and a value predictor.
• In deep RL, both the policy and the value function are obtained via
parametric estimators: we denote 𝜃 and 𝜙 their respective parameters.
• The policy is updated via policy gradient, while the value is usually updated
via temporal difference or Monte Carlo rollouts.
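• For a sequence of transitions $(s_t, a_t, r_t, s_{t+1})_{t \in [0, N]}$, we use the following
policy gradient loss (including the commonly used entropic penalty):
$$\mathcal{L}_{PG} = -\frac{1}{N} \sum_{t'=t}^{t+N} \left( A_{t'} \log \pi(a_{t'} \mid s_{t'}, \theta) + \alpha \mathcal{H}^\pi(s_{t'}, \theta) \right)$$
where $\alpha$ is the entropy coefficient and $A_t$ is the generalized advantage
estimator (Schulman et al., 2016) defined as:
$$A_t = \sum_{t'=t}^{t+N} (\gamma\lambda)^{t'-t} \left( r_{t'} + \gamma V_{\phi_\text{old}}(s_{t'+1}) - V_{\phi_\text{old}}(s_{t'}) \right)$$
with $\lambda$ a fixed hyperparameter and $V_{\phi_\text{old}}$ the value function estimator at the
previous optimization iteration.
In $\mathcal{L}_{PG}$, the advantage term $A_{t'}$ comes from the critic and the term $\alpha \mathcal{H}^\pi$ is the entropy term.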
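The generalized advantage estimator is usually computed with a backward recursion rather than the explicit double sum. The sketch below assumes a finite rollout with one extra bootstrap value for the final next state; the function name and the default values of `gamma` and `lam` are illustrative assumptions.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one rollout.

    rewards: r_t for t = 0..n-1
    values:  V_phi_old(s_t) for t = 0..n (one extra bootstrap entry)
    Returns A_t = sum_{t' >= t} (gamma * lam)^(t' - t) * delta_{t'},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    """
    n = len(rewards)
    adv = [0.0] * n
    running = 0.0
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        adv[t] = running
    return adv

# With lam = 1 and a zero critic, A_t reduces to the discounted return.
example = gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], gamma=0.5, lam=1.0)
```

With `lam=0` the same function returns the one-step temporal-difference errors, which shows how $\lambda$ trades bias for variance.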
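• To estimate the value function, we solve the non-linear regression problem
$$\operatorname{minimize}_\phi \sum_{t'=t}^{t+N} \left( V_\phi(s_{t'}) - \hat{V}_{t'} \right)^2$$
where $\hat{V}_{t'} = A_{t'} + V_{\phi_\text{old}}(s_{t'})$.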
Adversarially Guided AC
• To foster diversified behavior in its trajectories, AGAC
introduces a third protagonist to the actor-critic framework:
the adversary.
• The role of the adversary is to accurately predict the actor’s actions, by
minimizing the discrepancy between its action distribution 𝜋adv and the
distribution induced by the policy 𝜋.
• Meanwhile, in addition to finding the optimal actions to maximize the sum of
expected returns, the actor must also counteract the adversary’s predictions
by maximizing the discrepancy between 𝜋 and 𝜋adv.
• This discrepancy, used as a form of exploration bonus, is
defined as the difference of action log-probabilities, whose
expectation is the Kullback-Leibler divergence:
$$D_\text{KL}\left(\pi(\cdot \mid s) \,\|\, \pi_\text{adv}(\cdot \mid s)\right) = \mathbb{E}_{\pi(\cdot \mid s)}\left[\log \pi(\cdot \mid s) - \log \pi_\text{adv}(\cdot \mid s)\right]$$
• Formally, for each state-action pair $(s_t, a_t)$ in a trajectory, an
action-dependent bonus $\log \pi(a_t \mid s_t) - \log \pi_\text{adv}(a_t \mid s_t)$ is added
to the advantage.
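The relation between the per-action bonus and the KL divergence can be sketched for a discrete action space. The two distributions below are made-up numbers and the function names are illustrative assumptions.

```python
import math

def log_prob_bonus(pi, pi_adv, action):
    """Action-dependent bonus: log pi(a|s) - log pi_adv(a|s)."""
    return math.log(pi[action]) - math.log(pi_adv[action])

def kl_divergence(pi, pi_adv):
    """D_KL(pi || pi_adv): the expectation of the bonus under pi."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(pi, pi_adv) if p > 0.0)

# Hypothetical actor and adversary distributions in one state.
pi = [0.7, 0.2, 0.1]
pi_adv = [0.4, 0.4, 0.2]

bonuses = [log_prob_bonus(pi, pi_adv, a) for a in range(len(pi))]
expected_bonus = sum(p * b for p, b in zip(pi, bonuses))
kl = kl_divergence(pi, pi_adv)
```

Here the bonus is positive only for the first action, the one the adversary underestimates, while the expected bonus equals the (non-negative) KL divergence.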
• The value target of the critic is modified to include the action-
independent equivalent, which is the KL-divergence
$D_\text{KL}\left(\pi(\cdot \mid s) \,\|\, \pi_\text{adv}(\cdot \mid s)\right)$.
• In addition to the parameters $\theta$ (resp. $\theta_\text{old}$ the parameter of the
policy at the previous iteration) and $\phi$ defined above (resp.
$\phi_\text{old}$ that of the critic), we denote $\psi$ (resp. $\psi_\text{old}$) that of the
adversary.
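• AGAC minimizes the following loss:
$$\mathcal{L}_\text{AGAC} = \mathcal{L}_\text{PG} + \beta_V \mathcal{L}_V + \beta_\text{adv} \mathcal{L}_\text{adv}$$
• In the new objective $\mathcal{L}_\text{PG} = -\frac{1}{N} \sum_{t=0}^{N} \left( A_t^\text{AGAC} \log \pi(a_t \mid s_t, \theta) + \alpha \mathcal{H}^\pi(s_t, \theta) \right)$,
AGAC modifies $A_t$ as:
$$A_t^\text{AGAC} = A_t + c \left( \log \pi(a_t \mid s_t, \theta_\text{old}) - \log \pi_\text{adv}(a_t \mid s_t, \psi_\text{old}) \right)$$
where $c$ is a varying hyperparameter that controls the dependence on the
action log-probability difference.
To encourage exploration without preventing asymptotic stability, $c$ is
linearly annealed during the course of training.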
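A minimal sketch of the modified advantage with a linearly annealed coefficient follows. The slides do not state the start and end values of $c$ or the annealing horizon, so `c_start`, `c_end` and `total_steps` are assumptions, as are the function names.

```python
import math

def linear_anneal(c_start, c_end, step, total_steps):
    """Linearly interpolate from c_start to c_end, clamped to the schedule."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return c_start + frac * (c_end - c_start)

def agac_advantage(advantage, log_pi_old, log_pi_adv_old, c):
    """A_t^AGAC = A_t + c * (log pi_old(a_t|s_t) - log pi_adv_old(a_t|s_t))."""
    return advantage + c * (log_pi_old - log_pi_adv_old)

# Hypothetical schedule: c goes from 1.0 to 0.0 over 100 updates.
c_mid = linear_anneal(1.0, 0.0, step=50, total_steps=100)

# An action the actor takes with probability 0.5 but the adversary
# predicts with probability 0.25 receives a positive bonus.
a_mod = agac_advantage(0.5, math.log(0.5), math.log(0.25), c_mid)
```

Once $c$ reaches zero the modified advantage coincides with the usual one, so the bonus only shapes behavior during the annealing period.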
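• $\mathcal{L}_V$ is the objective function of the critic defined as:
$$\mathcal{L}_V = \frac{1}{N} \sum_{t=0}^{N} \left( V_\phi(s_t) - \left( \hat{V}_t + c\, D_\text{KL}\left(\pi(\cdot \mid s_t, \theta_\text{old}) \,\|\, \pi_\text{adv}(\cdot \mid s_t, \psi_\text{old})\right) \right) \right)^2$$
• Finally, $\mathcal{L}_\text{adv}$ is the objective function of the adversary:
$$\mathcal{L}_\text{adv} = \frac{1}{N} \sum_{t=0}^{N} D_\text{KL}\left(\pi(\cdot \mid s_t, \theta_\text{old}) \,\|\, \pi_\text{adv}(\cdot \mid s_t, \psi)\right)$$
• These are the three equations that our method modifies in the traditional
actor-critic framework. The terms $\beta_V$ and $\beta_\text{adv}$ are fixed hyperparameters.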
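The critic and adversary objectives can be sketched for discrete actions as plain batch averages. This is an illustration under assumptions: the functions below average over the batch length (a simplification of the $\frac{1}{N}\sum_{t=0}^{N}$ normalization), take precomputed probabilities rather than network outputs, and use hypothetical names.

```python
import math

def kl_divergence(p, q):
    """D_KL(p || q) for two categorical distributions."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)

def critic_loss(values, value_targets, kls, c):
    """Mean squared error against the KL-augmented target V_hat_t + c * KL_t."""
    errors = [(v - (vt + c * d)) ** 2
              for v, vt, d in zip(values, value_targets, kls)]
    return sum(errors) / len(errors)

def adversary_loss(actor_probs, adversary_probs):
    """Mean KL between the (fixed) actor policy and the adversary's prediction."""
    kls = [kl_divergence(p, q) for p, q in zip(actor_probs, adversary_probs)]
    return sum(kls) / len(kls)

# Hypothetical batch of two states with two actions each.
actor = [[0.5, 0.5], [0.9, 0.1]]
adversary = [[0.25, 0.75], [0.9, 0.1]]
l_adv = adversary_loss(actor, adversary)
l_v = critic_loss([1.0, 2.0], [1.0, 1.0], [0.0, 1.0], c=1.0)
```

The adversary loss vanishes only when the adversary reproduces the actor's distribution in every state, which is the prediction task described above.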
• Under the proposed actor-critic formulation, the probability of
sampling an action is increased if the modified advantage is
positive, i.e.
• (i) the corresponding return is larger than the predicted value
• (ii) the action log-probability difference is large
• More precisely, our method favors transitions whose actions
were less accurately predicted than the average action, i.e.
$\log \pi(a \mid s) - \log \pi_\text{adv}(a \mid s) \geq D_\text{KL}\left(\pi(\cdot \mid s) \,\|\, \pi_\text{adv}(\cdot \mid s)\right)$.
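• This is particularly visible for $\lambda \to 1$, in which case the
generalized advantage is $A_t = G_t - V_{\phi_\text{old}}(s_t)$, resulting in the
appearance of both aforementioned mirrored terms in the
modified advantage:
$$A_t^\text{AGAC} = G_t - \hat{V}_t^{\phi_\text{old}} + c \left( \log \pi(a_t \mid s_t) - \log \pi_\text{adv}(a_t \mid s_t) - \hat{D}_\text{KL}^{\phi_\text{old}}\left(\pi(\cdot \mid s_t) \,\|\, \pi_\text{adv}(\cdot \mid s_t)\right) \right)$$
with $G_t$ the observed return, $\hat{V}_t^{\phi_\text{old}}$ the estimated return and
$\hat{D}_\text{KL}^{\phi_\text{old}}\left(\pi(\cdot \mid s_t) \,\|\, \pi_\text{adv}(\cdot \mid s_t)\right)$ the estimated KL-divergence.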
• To avoid instability, in practice the adversary is a separate
estimator, updated with a smaller learning rate than the actor.
This way, it represents a delayed and more steady version of
the actor’s policy, which prevents the agent from having to
constantly adapt or focus solely on fooling the adversary.
Building Motivation
• We provide an interpretation of AGAC by studying the
dynamics of attraction and repulsion between the actor and
the adversary.
• To simplify, we study the equivalent of AGAC in a policy
iteration (PI) scheme. PI being the dynamic programming
scheme underlying the standard actor-critic, we have reasons
to think that some of our findings translate to the original
AGAC algorithm.
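• In PI, the quantity of interest is the action-value, which AGAC
would modify as:
$$Q_{\pi_k}^\text{AGAC} = Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right)$$
with $\pi_k$ the policy at iteration $k$.
• Incorporating the entropic penalty, the new policy $\pi_{k+1}$ satisfies:
$$\pi_{k+1} = \operatorname{argmax}_\pi \mathcal{J}_\text{PI}(\pi) = \operatorname{argmax}_\pi \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q_{\pi_k}^\text{AGAC}(s, a) - \alpha \log \pi(a \mid s) \right]$$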
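• We can rewrite this objective:
$$\begin{aligned}
\mathcal{J}_\text{PI}(\pi) &= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q_{\pi_k}^\text{AGAC}(s, a) - \alpha \log \pi(a \mid s) \right] \\
&= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q_{\pi_k}(s, a) + c \left( \log \pi_k(a \mid s) - \log \pi_\text{adv}(a \mid s) \right) - \alpha \log \pi(a \mid s) \right] \\
&= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q_{\pi_k}(s, a) + c \left( \log \pi_k(a \mid s) - \log \pi(a \mid s) + \log \pi(a \mid s) - \log \pi_\text{adv}(a \mid s) \right) - \alpha \log \pi(a \mid s) \right] \\
&= \mathbb{E}_s \left[ \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q_{\pi_k}(s, a) \right] - c\, D_\text{KL}\left(\pi(\cdot \mid s) \,\|\, \pi_k(\cdot \mid s)\right) + c\, D_\text{KL}\left(\pi(\cdot \mid s) \,\|\, \pi_\text{adv}(\cdot \mid s)\right) + \alpha \mathcal{H}\left(\pi(\cdot \mid s)\right) \right]
\end{aligned}$$
In the last line, the term $-c\, D_\text{KL}(\pi \,\|\, \pi_k)$ makes $\pi_k$ attractive, the term $+c\, D_\text{KL}(\pi \,\|\, \pi_\text{adv})$ makes $\pi_\text{adv}$ repulsive, and the entropy term enforces stochastic policies.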
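The rewriting can be checked numerically for a single state: the direct form and the KL-plus-entropy form should agree for any strictly positive policies. The distributions, action values and coefficients below are made-up numbers, and the function names are illustrative assumptions.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def kl_divergence(p, q):
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def objective_direct(pi, pi_k, pi_adv, q, c, alpha):
    """E_{a~pi}[Q + c (log pi_k - log pi_adv) - alpha log pi]."""
    return sum(p * (qa + c * (math.log(pk) - math.log(pa)) - alpha * math.log(p))
               for p, pk, pa, qa in zip(pi, pi_k, pi_adv, q))

def objective_decomposed(pi, pi_k, pi_adv, q, c, alpha):
    """E_{a~pi}[Q] - c KL(pi || pi_k) + c KL(pi || pi_adv) + alpha H(pi)."""
    expected_q = sum(p * qa for p, qa in zip(pi, q))
    return (expected_q
            - c * kl_divergence(pi, pi_k)
            + c * kl_divergence(pi, pi_adv)
            + alpha * entropy(pi))

# Hypothetical single-state example with three actions.
pi = [0.2, 0.5, 0.3]
pi_k = [0.3, 0.4, 0.3]
pi_adv = [0.6, 0.3, 0.1]
q = [1.0, 0.5, 0.8]

direct = objective_direct(pi, pi_k, pi_adv, q, c=0.5, alpha=0.1)
decomposed = objective_decomposed(pi, pi_k, pi_adv, q, c=0.5, alpha=0.1)
```

The two values differ only by floating-point rounding, since the $\log \pi$ terms added and subtracted inside the bracket cancel exactly.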
• Thus, in the PI scheme, AGAC finds a policy that maximizes Q-
values, while at the same time remaining close to the current
policy and far from a mixture of the previous policies (i.e.,
𝜋𝑘−1, 𝜋𝑘−2, 𝜋𝑘−3, …).
• Note that we experimentally observe that our method
performs better with a smaller learning rate for the adversarial
network than that of the other networks, which could imply
that a stable repulsive term is beneficial.
• This optimization problem is strongly concave in $\pi$ (thanks to
the entropy term), and is state-wise a Legendre-Fenchel
transform. Its solution is given by:
$$\pi_{k+1} \propto \left( \frac{\pi_k}{\pi_\text{adv}} \right)^{c/\alpha} \exp\left( \frac{Q_{\pi_k}}{\alpha} \right)$$
• This result gives us some insight into the behavior of the objective
function. Notably, in our example, if $\pi_\text{adv}$ is fixed and $c = \alpha$, we
recover a KL-regularized PI scheme (Geist et al., 2019) with the
modified reward $r - c \log \pi_\text{adv}$.
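The closed form $\pi_{k+1} \propto (\pi_k / \pi_\text{adv})^{c/\alpha} \exp(Q_{\pi_k}/\alpha)$ can be sanity-checked on one state by comparing its objective value against randomly drawn policies. This is a numerical sketch, not a proof; all numbers, the random sampling scheme and the function names are assumptions made for illustration.

```python
import math
import random

def objective(pi, pi_k, pi_adv, q, c, alpha):
    """Single-state J_PI: E_{a~pi}[Q + c (log pi_k - log pi_adv) - alpha log pi]."""
    return sum(p * (qa + c * (math.log(pk) - math.log(pa)) - alpha * math.log(p))
               for p, pk, pa, qa in zip(pi, pi_k, pi_adv, q))

def closed_form_policy(pi_k, pi_adv, q, c, alpha):
    """Softmax of (Q + c (log pi_k - log pi_adv)) / alpha, computed stably."""
    logits = [(qa + c * (math.log(pk) - math.log(pa))) / alpha
              for pk, pa, qa in zip(pi_k, pi_adv, q)]
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def random_policy(rng, n):
    """A strictly positive random distribution over n actions."""
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical single-state example.
pi_k = [0.5, 0.3, 0.2]
pi_adv = [0.6, 0.3, 0.1]
q = [1.0, 0.5, 0.8]
c, alpha = 0.5, 0.1

best = closed_form_policy(pi_k, pi_adv, q, c, alpha)
best_value = objective(best, pi_k, pi_adv, q, c, alpha)
rng = random.Random(0)
gaps = [best_value - objective(random_policy(rng, 3), pi_k, pi_adv, q, c, alpha)
        for _ in range(1000)]
```

No sampled policy beats the closed form. Action 2, which the adversary underestimates relative to $\pi_k$, receives the most mass even though action 0 has the highest $Q$-value, which illustrates the repulsive effect of the adversary.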
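$$\pi_{k+1} = \operatorname{argmax}_\pi \mathcal{J}_\text{PI}(\pi) \propto \left( \frac{\pi_k}{\pi_\text{adv}} \right)^{c/\alpha} \exp\left( \frac{Q_{\pi_k}}{\alpha} \right)$$
with the objective function:
$$\mathcal{J}_\text{PI}(\pi) = \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot \mid s)} \left[ Q_{\pi_k}(s, a) + c \left( \log \pi_k(a \mid s) - \log \pi_\text{adv}(a \mid s) \right) - \alpha \log \pi(a \mid s) \right]$$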
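• We first consider a simpler optimization problem:
$$\operatorname{argmax}_\pi \; \langle \pi, Q_{\pi_k} \rangle + \alpha \mathcal{H}(\pi)$$
whose solution is known (Vieillard et al., 2020a, Appendix A).
• The expression for the maximizer is the $\alpha$-scaled softmax:
$$\pi^* = \frac{\exp\left( Q_{\pi_k} / \alpha \right)}{\left\langle 1, \exp\left( Q_{\pi_k} / \alpha \right) \right\rangle}$$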
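• We now turn towards the optimization problem of interest,
which we can rewrite as:
$$\operatorname{argmax}_\pi \; \left\langle \pi, Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right) \right\rangle + \alpha \mathcal{H}(\pi)$$
• By the simple change of variable $\tilde{Q}_{\pi_k} = Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right)$,
we can reuse the previous solution (replacing $Q_{\pi_k}$ by $\tilde{Q}_{\pi_k}$).
• With the simplification:
$$\exp\left( \frac{Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right)}{\alpha} \right) = \left( \frac{\pi_k}{\pi_\text{adv}} \right)^{c/\alpha} \exp\left( \frac{Q_{\pi_k}}{\alpha} \right)$$
we obtain the result.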
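• In all of the experiments, we use PPO (Schulman et al., 2017)
as the base algorithm and build on it to incorporate our method.
$$\mathcal{L}_\text{PG} = -\frac{1}{N} \sum_{t'=t}^{t+N} \min\left( \frac{\pi(a_{t'} \mid s_{t'}, \theta)}{\pi(a_{t'} \mid s_{t'}, \theta_\text{old})} A_{t'}^\text{AGAC}, \; \operatorname{clip}\left( \frac{\pi(a_{t'} \mid s_{t'}, \theta)}{\pi(a_{t'} \mid s_{t'}, \theta_\text{old})}, 1 - \epsilon, 1 + \epsilon \right) A_{t'}^\text{AGAC} \right)$$
with $A_{t'}^\text{AGAC}$ the modified advantage given in the first equation, $N$ the temporal length
considered for one update of parameters and $\epsilon$ the clipping
parameter. For comparison, the standard PPO clipped objective is:
$$L^\text{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \; \operatorname{clip}\left( r_t(\theta), 1 - \epsilon, 1 + \epsilon \right) \hat{A}_t \right) \right]$$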
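The clipped surrogate can be sketched on precomputed log-probabilities. The function below is a plain-Python illustration of the formula, not the authors' implementation; its name, the default `epsilon=0.2` and the batch-mean normalization are assumptions. In AGAC the `advantages` argument would hold the modified advantages.

```python
import math

def ppo_clipped_loss(log_probs, old_log_probs, advantages, epsilon=0.2):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for lp, old_lp, adv in zip(log_probs, old_log_probs, advantages):
        ratio = math.exp(lp - old_lp)  # pi(a|s, theta) / pi(a|s, theta_old)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)

# Ratio of 2 with a positive advantage: the gain is capped at 1 + epsilon.
capped = ppo_clipped_loss([math.log(2.0)], [0.0], [1.0])
# Ratio of 2 with a negative advantage: the penalty is not capped.
uncapped = ppo_clipped_loss([math.log(2.0)], [0.0], [-1.0])
```

The asymmetry between the two calls is the point of the `min`: the objective never rewards moving the ratio further outside the clipping range, but it still penalizes moves that make things worse.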
• Similar to RIDE (Raileanu & Rocktäschel, 2019), we also
discount PPO by episodic state visitation counts.
• The actor, critic and adversary use the convolutional
architecture of the Nature paper of DQN (Mnih et al., 2015)
with different hidden sizes.
• The three neural networks are optimized using Adam (Kingma
& Ba, 2015). Our method does not use RNNs in its architecture;
instead, in all our experiments, we use frame stacking.
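• At each training step, we perform a stochastic optimization
step to minimize $\mathcal{L}_\text{AGAC}$ using stop-gradient:
• $\theta \leftarrow \text{Adam}(\theta, \nabla_\theta \mathcal{L}_\text{PG}, \eta_1)$
• $\phi \leftarrow \text{Adam}(\phi, \nabla_\phi \mathcal{L}_V, \eta_1)$
• $\psi \leftarrow \text{Adam}(\psi, \nabla_\psi \mathcal{L}_\text{adv}, \eta_2)$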
• In this section, we describe our experimental study in which we
investigate:
• (i) whether the adversarial bonus alone (e.g. without episodic state visitation
count) is sufficient to outperform other methods in VizDoom, a sparse-
reward task with high-dimensional observations
• (ii) whether AGAC succeeds in partially-observable and procedurally-
generated environments with high sparsity in the rewards, compared to
other methods
• (iii) how well AGAC is capable of exploring in environments without extrinsic
reward
• (iv) the training stability of our method. In all of the experiments, lines are
average performances and shaded areas represent one standard deviation
• Environments
• Baselines
• RIDE (Raileanu & Rocktäschel, 2019)
• Count: Count-Based Exploration (Bellemare et al., 2016b)
• RND (Burda et al., 2018)
• ICM (Pathak et al., 2017)
• AMIGo (Campero et al., 2021)
• Adversarially-based Exploration (No episodic count)
• Hard-Exploration Tasks with Partially-Observable Environments
• Training Stability
• Exploration in Reward-Free Environment
• Visualizing Coverage and Diversity
• This paper introduced AGAC, a modification to the traditional
actor-critic framework: an adversary network is added as a third
protagonist.
• The mechanics of AGAC have been discussed from a policy iteration
point of view, and we provided theoretical insight into the inner
workings of the proposed algorithm: the adversary forces the agent
to remain close to the current policy while moving away from the
previous ones. In a nutshell, the influence of the adversary makes
the actor conservatively diversified.
Thank you!

Contenu connexe


Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithmJie-Han Chen
Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningDongmin Lee
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginnersgokulprasath06
A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmSupun Abeysinghe
Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...
Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...
Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...Jeromy Anglim
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoHridyesh Bisht
Algorithms Design Patterns
Algorithms Design PatternsAlgorithms Design Patterns
Algorithms Design PatternsAshwin Shiv
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Dongmin Lee
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learningJie-Han Chen
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratchJie-Han Chen
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learningDongHyun Kwak
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313Slideshare
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learningJie-Han Chen
A sensitivity analysis of contribution-based cooperative co-evolutionary algo...
A sensitivity analysis of contribution-based cooperative co-evolutionary algo...A sensitivity analysis of contribution-based cooperative co-evolutionary algo...
A sensitivity analysis of contribution-based cooperative co-evolutionary algo...Borhan Kazimipour

Tendances (20)

Actor critic algorithm
Actor critic algorithmActor critic algorithm
Actor critic algorithm
Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement Learning
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginners
A brief introduction to Searn Algorithm
A brief introduction to Searn AlgorithmA brief introduction to Searn Algorithm
A brief introduction to Searn Algorithm
Policy gradient
Policy gradientPolicy gradient
Policy gradient
Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...
Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...
Presentation based on "Hierarchical Bayesian Models of Subtask Learning. Angl...
Machine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demoMachine learning Algorithms with a Sagemaker demo
Machine learning Algorithms with a Sagemaker demo
Algorithms Design Patterns
Algorithms Design PatternsAlgorithms Design Patterns
Algorithms Design Patterns
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Va...
Temporal difference learning
Temporal difference learningTemporal difference learning
Temporal difference learning
Deep reinforcement learning from scratch
Deep reinforcement learning from scratchDeep reinforcement learning from scratch
Deep reinforcement learning from scratch
Reinforcement learning
Reinforcement learningReinforcement learning
Reinforcement learning
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
Reinforcement learning 7313
Reinforcement learning 7313Reinforcement learning 7313
Reinforcement learning 7313
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
An introduction to reinforcement learning
An introduction to  reinforcement learningAn introduction to  reinforcement learning
An introduction to reinforcement learning
A sensitivity analysis of contribution-based cooperative co-evolutionary algo...
A sensitivity analysis of contribution-based cooperative co-evolutionary algo...A sensitivity analysis of contribution-based cooperative co-evolutionary algo...
A sensitivity analysis of contribution-based cooperative co-evolutionary algo...

Similaire à Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021

On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...
On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...
On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...SYRTO Project
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based accelerationHye-min Ahn
Factor anaysis scale dimensionality
Factor anaysis scale dimensionalityFactor anaysis scale dimensionality
Factor anaysis scale dimensionalityCarlo Magno
Factor analysis in R by Aman Chauhan
Factor analysis in R by Aman ChauhanFactor analysis in R by Aman Chauhan
Factor analysis in R by Aman ChauhanAman Chauhan
Regressioin mini case
Regressioin mini caseRegressioin mini case
Regressioin mini caseveesingh
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Flavian Vasile
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...Predicting an Applicant Status Using Principal Component, Discriminant and Lo...
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...inventionjournals
Brown bag 2012_fall
Brown bag 2012_fallBrown bag 2012_fall
Brown bag 2012_fallXiaolei Zhou
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...ArchiLab 7
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesSho Takase
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learningSKS
Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...
Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...
Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...Mahdi Cherif
Marketing Research Project
Marketing Research ProjectMarketing Research Project
Marketing Research ProjectPranav B. Gujjar
Measuring the behavioral component of financial fluctuaction. An analysis bas...
Measuring the behavioral component of financial fluctuaction. An analysis bas...Measuring the behavioral component of financial fluctuaction. An analysis bas...
Measuring the behavioral component of financial fluctuaction. An analysis bas...SYRTO Project
Local Coordination in Online Distributed Constraint Optimization Problems
Local Coordination in Online Distributed Constraint Optimization ProblemsLocal Coordination in Online Distributed Constraint Optimization Problems
Local Coordination in Online Distributed Constraint Optimization ProblemsAntonio Maria Fiscarelli
Local coordination in online distributed constraint optimization problems - P...
Local coordination in online distributed constraint optimization problems - P...Local coordination in online distributed constraint optimization problems - P...
Local coordination in online distributed constraint optimization problems - P...Antonio Maria Fiscarelli
Empirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender SystemsEmpirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender SystemsUniversity of Bergen
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Naoki Hayashi

Similaire à Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021 (20)

On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...
On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...
On the (ab)use of Omega? - Caporin M., Costola M., Jannin G., Maillet B. Dece...
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
1118_Seminar_Continuous_Deep Q-Learning with Model based acceleration
Factor anaysis scale dimensionality
Factor anaysis scale dimensionalityFactor anaysis scale dimensionality
Factor anaysis scale dimensionality
Factor analysis in R by Aman Chauhan
Factor analysis in R by Aman ChauhanFactor analysis in R by Aman Chauhan
Factor analysis in R by Aman Chauhan
Regressioin mini case
Regressioin mini caseRegressioin mini case
Regressioin mini case
Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2Modern Recommendation for Advanced Practitioners part2
Modern Recommendation for Advanced Practitioners part2
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...Predicting an Applicant Status Using Principal Component, Discriminant and Lo...
Predicting an Applicant Status Using Principal Component, Discriminant and Lo...
Brown bag 2012_fall
Brown bag 2012_fallBrown bag 2012_fall
Brown bag 2012_fall
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Sakanashi, h.; kakazu, y. (1994): co evolving genetic algorithm with filtered...
Harnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic RulesHarnessing Deep Neural Networks with Logic Rules
Harnessing Deep Neural Networks with Logic Rules
Reinforcement learning
Reinforcement  learningReinforcement  learning
Reinforcement learning
Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...
Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...
Rank Monotonicity in Centrality Measures (A report about Quality guarantees f...
Marketing Research Project
Marketing Research ProjectMarketing Research Project
Marketing Research Project
Measuring the behavioral component of financial fluctuaction. An analysis bas...
Measuring the behavioral component of financial fluctuaction. An analysis bas...Measuring the behavioral component of financial fluctuaction. An analysis bas...
Measuring the behavioral component of financial fluctuaction. An analysis bas...
Local Coordination in Online Distributed Constraint Optimization Problems
Local Coordination in Online Distributed Constraint Optimization ProblemsLocal Coordination in Online Distributed Constraint Optimization Problems
Local Coordination in Online Distributed Constraint Optimization Problems
Local coordination in online distributed constraint optimization problems - P...
Local coordination in online distributed constraint optimization problems - P...Local coordination in online distributed constraint optimization problems - P...
Local coordination in online distributed constraint optimization problems - P...
Empirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender SystemsEmpirical Evaluation of Active Learning in Recommender Systems
Empirical Evaluation of Active Learning in Recommender Systems
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...
Bayesian Generalization Error and Real Log Canonical Threshold in Non-negativ...


Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021

  • 1. Adversarially Guided Actor-Critic, Y. Flet-Berliac et al, 2021 옥찬호
  • 2. Prerequisites
      • Markov Decision Process
      • Policy Iteration
      • Policy Gradient
      • Actor-Critic
      • Entropy
      • KL Divergence
      • PPO (Proximal Policy Optimization)
      • GAN (Generative Adversarial Network)
  • 3. Introduction
      • Generalization and exploration in RL still represent key challenges that leave most current methods ineffective.
      • First, a battery of recent studies (Farebrother et al., 2018; Zhang et al., 2018a; Song et al., 2020; Cobbe et al., 2020) indicates that current RL methods fail to generalize correctly even when agents have been trained in a diverse set of environments.
      • Second, exploration has been extensively studied in RL; however, most hard-exploration problems use the same environment for training and evaluation.
  • 4. Introduction
      • Hence, since a well-designed exploration strategy should maximize the information received from a trajectory about an environment, the exploration capabilities may not be appropriately assessed if that information is memorized.
  • 5. Introduction
      • In this work, we propose Adversarially Guided Actor-Critic (AGAC), which reconsiders the actor-critic framework by introducing a third protagonist: the adversary.
      • Its role is to predict the actor's actions correctly. Meanwhile, the actor must not only find the optimal actions to maximize the sum of expected returns, but also counteract the predictions of the adversary.
      • This formulation is lightly inspired by adversarial methods, specifically generative adversarial networks (GANs) (Goodfellow et al., 2014).
  • 6. Introduction
      • Such a link between GANs and actor-critic methods has been formalized by Pfau & Vinyals (2016); however, in the context of a third protagonist, we draw a different analogy.
      • The adversary can be interpreted as playing the role of a discriminator that must predict the actions of the actor, and the actor can be considered as playing the role of a generator that behaves to deceive the predictions of the adversary.
      • This approach has the advantage, as with GANs, that the optimization procedure generates a diversity of meaningful data, corresponding to sequences of actions in AGAC.
  • 7. Introduction
      • The contributions of this work are as follows:
      • (i) we propose a novel actor-critic formulation inspired by adversarial learning (AGAC)
      • (ii) we empirically analyze AGAC on key reinforcement learning aspects such as diversity, exploration and stability
      • (iii) we demonstrate significant gains in performance on several sparse-reward hard-exploration tasks, including procedurally-generated tasks
  • 8. Background and Notation
      • Markov Decision Process (MDP): $M = (\mathcal{S}, \mathcal{A}, \mathcal{P}, R, \gamma)$
      • $\mathcal{S}$: the state space
      • $\mathcal{A}$: the action space
      • $\mathcal{P}$: the transition kernel
      • $R$: the bounded reward function
      • $\gamma \in [0, 1)$: the discount factor
  • 9. Background and Notation
      • Let $\pi$ denote a stochastic policy mapping states to distributions over actions. We place ourselves in the infinite-horizon setting, i.e., we seek a policy that optimizes $J(\pi) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$.
      • The value of a state is $V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s\right]$, and the value $Q^\pi(s, a)$ of performing action $a$ in state $s$ and then following policy $\pi$ is $Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s, a_0 = a\right]$.
      • The advantage function, which quantifies how much better an action $a$ is than the average action in state $s$, is $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$.
      • Finally, the entropy of a policy is $\mathcal{H}^\pi(s) = \mathbb{E}_{\pi(\cdot|s)}\left[-\log \pi(\cdot|s)\right]$.
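The quantities on this slide can be illustrated with a small tabular sketch. It assumes a finite reward sequence and a discrete action set; the function names are hypothetical and are not taken from the slides or the paper.

```python
import math

def discounted_return(rewards, gamma):
    """Discounted sum of a finite reward sequence: sum_t gamma^t * r_t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

def entropy(probs):
    """Entropy E_pi[-log pi] of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def advantages(q_values, probs):
    """A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a|s) * Q(s, a)."""
    state_value = sum(p * q for p, q in zip(probs, q_values))
    return [q - state_value for q in q_values]
```

By construction the advantages average to zero under the policy, which is why a positive advantage means "better than the average action".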
  • 10. Background and Notation
      • An actor-critic algorithm is composed of two main components: a policy and a value predictor.
      • In deep RL, both the policy and the value function are obtained via parametric estimators: we denote by $\theta$ and $\phi$ their respective parameters.
      • The policy is updated via policy gradient, while the value is usually updated via temporal difference or Monte Carlo rollouts.
  • 11. Background and Notation
      • For a sequence of transitions $\{s_t, a_t, r_t, s_{t+1}\}_{t \in [0, N]}$, we use the following policy gradient loss (including the commonly used entropic penalty):
        $$\mathcal{L}_\text{PG} = -\frac{1}{N} \sum_{t'=t}^{t+N} \left( A_{t'} \log \pi(a_{t'} | s_{t'}, \theta) + \alpha \mathcal{H}^\pi(s_{t'}, \theta) \right)$$
        where $\alpha$ is the entropy coefficient and $A_t$ is the generalized advantage estimator (Schulman et al., 2016), defined as:
        $$A_t = \sum_{t'=t}^{t+N} (\gamma\lambda)^{t'-t} \left( r_{t'} + \gamma V_{\phi_\text{old}}(s_{t'+1}) - V_{\phi_\text{old}}(s_{t'}) \right)$$
        with $\lambda$ a fixed hyperparameter and $V_{\phi_\text{old}}$ the value function estimator at the previous optimization iteration.
      • Slide labels on the terms of the loss: Actor, Critic, Entropy.
  • 12. Background and Notation
      • To estimate the value function, we solve the non-linear regression problem $\min_\phi \sum_{t'=t}^{t+N} \left( V_\phi(s_{t'}) - \hat{V}_{t'} \right)^2$, where $\hat{V}_{t'} = A_{t'} + V_{\phi_\text{old}}(s_{t'})$.
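The generalized advantage estimator and the regression targets above can be sketched as follows. This is a minimal illustration that assumes a single rollout truncated at its last step, with `values` holding one extra bootstrap entry; the backward recursion is the usual way to evaluate the sum and the function names are hypothetical.

```python
def gae(rewards, values, gamma, lam):
    """Generalized advantage estimates for one rollout.

    `values` must have len(rewards) + 1 entries: V(s_0), ..., V(s_T),
    where the last entry bootstraps the tail of the rollout.
    """
    estimates = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        estimates[t] = running
    return estimates

def value_targets(advantage_estimates, values):
    """Regression targets V_hat_t = A_t + V_old(s_t)."""
    return [a + v for a, v in zip(advantage_estimates, values)]
```

With `lam = 1` the targets reduce to the bootstrapped discounted returns, and with `lam = 0` each advantage is just the one-step TD residual.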
  • 13. Adversarially Guided AC
      • To foster diversified behavior in its trajectories, AGAC introduces a third protagonist to the actor-critic framework: the adversary.
      • The role of the adversary is to accurately predict the actor's actions, by minimizing the discrepancy between its action distribution $\pi_\text{adv}$ and the distribution induced by the policy $\pi$.
      • Meanwhile, in addition to finding the optimal actions to maximize the sum of expected returns, the actor must also counteract the adversary's predictions by maximizing the discrepancy between $\pi$ and $\pi_\text{adv}$.
  • 14. Adversarially Guided AC
  • 15. Adversarially Guided AC
      • This discrepancy, used as a form of exploration bonus, is defined as the difference of action log-probabilities, whose expectation is the Kullback-Leibler divergence:
        $$D_\text{KL}(\pi(\cdot|s) \,\|\, \pi_\text{adv}(\cdot|s)) = \mathbb{E}_{\pi(\cdot|s)}\left[ \log \pi(\cdot|s) - \log \pi_\text{adv}(\cdot|s) \right]$$
      • Formally, for each state-action pair $(s_t, a_t)$ in a trajectory, an action-dependent bonus $\log \pi(a_t|s_t) - \log \pi_\text{adv}(a_t|s_t)$ is added to the advantage.
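The relation between the per-action bonus and the KL divergence can be checked numerically. The sketch below assumes discrete action distributions given as lists of strictly positive probabilities; the function names are hypothetical.

```python
import math

def log_prob_bonus(pi, pi_adv, action):
    """Action-dependent bonus: log pi(a|s) - log pi_adv(a|s)."""
    return math.log(pi[action]) - math.log(pi_adv[action])

def kl_divergence(pi, pi_adv):
    """D_KL(pi || pi_adv): the expectation of the bonus under pi."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(pi, pi_adv) if p > 0.0)
```

The bonus is positive for actions the adversary under-predicts and negative for actions it over-predicts, while its average under the actor's policy is the non-negative KL divergence.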
  • 16. Adversarially Guided AC
      • The value target of the critic is modified to include the action-independent equivalent, which is the KL-divergence $D_\text{KL}(\pi(\cdot|s) \,\|\, \pi_\text{adv}(\cdot|s))$.
      • In addition to the parameters $\theta$ (resp. $\theta_\text{old}$, the parameters of the policy at the previous iteration) and $\phi$ defined above (resp. $\phi_\text{old}$, those of the critic), we denote by $\psi$ (resp. $\psi_\text{old}$) those of the adversary.
  • 17. Adversarially Guided AC
      • AGAC minimizes the following loss: $\mathcal{L}_\text{AGAC} = \mathcal{L}_\text{PG} + \beta_V \mathcal{L}_V + \beta_\text{adv} \mathcal{L}_\text{adv}$
      • In the new objective $\mathcal{L}_\text{PG} = -\frac{1}{N} \sum_{t=0}^{N} \left( A_t^\text{AGAC} \log \pi(a_t | s_t, \theta) + \alpha \mathcal{H}^\pi(s_t, \theta) \right)$, AGAC modifies $A_t$ as:
        $$A_t^\text{AGAC} = A_t + c \left( \log \pi(a_t | s_t, \theta_\text{old}) - \log \pi_\text{adv}(a_t | s_t, \psi_\text{old}) \right)$$
        where $c$ is a varying hyperparameter that controls the dependence on the action log-probability difference. To encourage exploration without preventing asymptotic stability, $c$ is linearly annealed during the course of training.
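A minimal sketch of the modified advantage with a linearly annealed coefficient is given below. The slide only states that $c$ is linearly annealed; the schedule endpoints (`c_start`, and `c_end = 0.0` by default) and all function names are assumptions made for illustration.

```python
def annealed_c(step, total_steps, c_start, c_end=0.0):
    """Linear interpolation of c from c_start to c_end (endpoints are assumed)."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return c_start + fraction * (c_end - c_start)

def agac_advantages(advantages, logp_actor_old, logp_adv_old, c):
    """A_t^AGAC = A_t + c * (log pi_old(a_t|s_t) - log pi_adv_old(a_t|s_t))."""
    return [a + c * (lp - la)
            for a, lp, la in zip(advantages, logp_actor_old, logp_adv_old)]
```

Both log-probabilities come from the previous-iteration parameters, so the bonus acts as a fixed quantity during the policy update rather than something the gradient flows through.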
  • 18. Adversarially Guided AC
      • $\mathcal{L}_V$ is the objective function of the critic, defined as:
        $$\mathcal{L}_V = \frac{1}{N} \sum_{t=0}^{N} \left( V_\phi(s_t) - \left( \hat{V}_t + c\, D_\text{KL}(\pi(\cdot | s_t, \theta_\text{old}) \,\|\, \pi_\text{adv}(\cdot | s_t, \psi_\text{old})) \right) \right)^2$$
      • Finally, $\mathcal{L}_\text{adv}$ is the objective function of the adversary:
        $$\mathcal{L}_\text{adv} = \frac{1}{N} \sum_{t=0}^{N} D_\text{KL}(\pi(\cdot | s_t, \theta_\text{old}) \,\|\, \pi_\text{adv}(\cdot | s_t, \psi))$$
      • These are the three equations that our method modifies in the traditional actor-critic framework. The terms $\beta_V$ and $\beta_\text{adv}$ are fixed hyperparameters.
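The critic and adversary objectives can be written out for a batch of tabular distributions. This is a sketch under the assumption that each loss is averaged over the samples provided; the function names are hypothetical.

```python
import math

def kl_divergence(pi, pi_adv):
    """D_KL(pi || pi_adv) for discrete distributions."""
    return sum(p * (math.log(p) - math.log(q))
               for p, q in zip(pi, pi_adv) if p > 0.0)

def critic_loss(values, value_targets, kls, c):
    """Mean squared error of V(s_t) against the shifted target V_hat_t + c * KL_t."""
    errors = [(v - (target + c * kl)) ** 2
              for v, target, kl in zip(values, value_targets, kls)]
    return sum(errors) / len(errors)

def adversary_loss(actor_dists_old, adversary_dists):
    """Mean KL between the (fixed) actor distributions and the adversary's."""
    kls = [kl_divergence(p, q) for p, q in zip(actor_dists_old, adversary_dists)]
    return sum(kls) / len(kls)
```

The adversary loss is zero exactly when the adversary reproduces the actor's distribution in every sampled state.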
  • 19. Adversarially Guided AC
      • Under the proposed actor-critic formulation, the probability of sampling an action is increased if the modified advantage is positive, i.e.
      • (i) the corresponding return is larger than the predicted value
      • (ii) the action log-probability difference is large
      • More precisely, our method favors transitions whose actions were less accurately predicted than the average action, i.e. $\log \pi(a|s) - \log \pi_\text{adv}(a|s) \geq D_\text{KL}(\pi(\cdot|s) \,\|\, \pi_\text{adv}(\cdot|s))$.
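The condition in the last bullet can be made concrete for a single state. The sketch below, with a hypothetical helper name and strictly positive probabilities assumed, lists the actions whose log-probability difference is at least the KL divergence, i.e. the ones the bonus term favors.

```python
import math

def favoured_actions(pi, pi_adv):
    """Indices a with log pi(a) - log pi_adv(a) >= D_KL(pi || pi_adv)."""
    bonuses = [math.log(p) - math.log(q) for p, q in zip(pi, pi_adv)]
    kl = sum(p * b for p, b in zip(pi, bonuses))
    return [a for a, b in enumerate(bonuses) if b >= kl]
```

For example, with `pi = [0.7, 0.3]` and `pi_adv = [0.5, 0.5]`, only the first action is favored, because the adversary under-predicts it.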
  • 20. Adversarially Guided AC
      • This is particularly visible for $\lambda \to 1$, in which case the generalized advantage is $A_t = G_t - V_{\phi_\text{old}}(s_t)$, resulting in the appearance of both aforementioned mirrored terms in the modified advantage:
        $$A_t^\text{AGAC} = G_t - \hat{V}_t^{\phi_\text{old}} + c \left( \log \pi(a_t|s_t) - \log \pi_\text{adv}(a_t|s_t) - \hat{D}_\text{KL}^{\phi_\text{old}}(\pi(\cdot|s_t) \,\|\, \pi_\text{adv}(\cdot|s_t)) \right)$$
        with $G_t$ the observed return, $\hat{V}_t^{\phi_\text{old}}$ the estimated return and $\hat{D}_\text{KL}^{\phi_\text{old}}(\pi(\cdot|s_t) \,\|\, \pi_\text{adv}(\cdot|s_t))$ the estimated KL-divergence.
  • 21. Adversarially Guided AC
      • To avoid instability, in practice the adversary is a separate estimator, updated with a smaller learning rate than the actor. This way, it represents a delayed and steadier version of the actor's policy, which prevents the agent from having to constantly adapt or focus solely on fooling the adversary.
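  • 22. Building Motivation
      • We provide an interpretation of AGAC by studying the dynamics of attraction and repulsion between the actor and the adversary.
      • To simplify, we study the equivalent of AGAC in a policy iteration (PI) scheme. PI being the dynamic programming scheme underlying the standard actor-critic, we have reasons to think that some of our findings translate to the original AGAC algorithm.
  • 23. Building Motivation
      • In PI, the quantity of interest is the action-value, which AGAC would modify as $Q_{\pi_k}^\text{AGAC} = Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right)$, with $\pi_k$ the policy at iteration $k$.
      • Incorporating the entropic penalty, the new policy $\pi_{k+1}$ satisfies:
        $$\pi_{k+1} = \operatorname{argmax}_\pi \mathcal{J}_\text{PI}(\pi) = \operatorname{argmax}_\pi \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_{\pi_k}^\text{AGAC}(s, a) - \alpha \log \pi(a|s) \right]$$
  • 24. Building Motivation
      • We can rewrite this objective:
        $$\begin{aligned}
        \mathcal{J}_\text{PI}(\pi) &= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_{\pi_k}^\text{AGAC}(s, a) - \alpha \log \pi(a|s) \right] \\
        &= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_{\pi_k}(s, a) + c \left( \log \pi_k(a|s) - \log \pi_\text{adv}(a|s) \right) - \alpha \log \pi(a|s) \right] \\
        &= \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_{\pi_k}(s, a) + c \left( \log \pi_k(a|s) - \log \pi(a|s) + \log \pi(a|s) - \log \pi_\text{adv}(a|s) \right) - \alpha \log \pi(a|s) \right] \\
        &= \mathbb{E}_s \left[ \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_{\pi_k}(s, a) \right] - c\, D_\text{KL}(\pi(\cdot|s) \,\|\, \pi_k(\cdot|s)) + c\, D_\text{KL}(\pi(\cdot|s) \,\|\, \pi_\text{adv}(\cdot|s)) + \alpha \mathcal{H}(\pi(\cdot|s)) \right]
        \end{aligned}$$
      • In the last line, the term $-c\, D_\text{KL}(\pi \,\|\, \pi_k)$ makes $\pi_k$ attractive, the term $+c\, D_\text{KL}(\pi \,\|\, \pi_\text{adv})$ makes $\pi_\text{adv}$ repulsive, and the entropy term enforces stochastic policies.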
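The rewriting above can be sanity-checked numerically for a single state. The sketch below assumes discrete distributions with strictly positive probabilities and evaluates both the direct form and the decomposed form of the objective; all names are hypothetical.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def entropy(p):
    """Shannon entropy of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p)

def objective_direct(pi, pi_k, pi_adv, q, c, alpha):
    """E_{a~pi}[Q + c (log pi_k - log pi_adv) - alpha log pi] at one state."""
    return sum(p * (qa + c * (math.log(pk) - math.log(pa)) - alpha * math.log(p))
               for p, pk, pa, qa in zip(pi, pi_k, pi_adv, q))

def objective_decomposed(pi, pi_k, pi_adv, q, c, alpha):
    """E[Q] - c KL(pi || pi_k) + c KL(pi || pi_adv) + alpha H(pi) at one state."""
    expected_q = sum(p * qa for p, qa in zip(pi, q))
    return expected_q - c * kl(pi, pi_k) + c * kl(pi, pi_adv) + alpha * entropy(pi)
```

The two functions agree up to floating-point error for any valid inputs, which mirrors the algebra of adding and subtracting $\log \pi(a|s)$.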
  • 25. Building Motivation
      • Thus, in the PI scheme, AGAC finds a policy that maximizes Q-values, while at the same time remaining close to the current policy and far from a mixture of the previous policies (i.e., $\pi_{k-1}, \pi_{k-2}, \pi_{k-3}, \ldots$).
      • Note that we experimentally observe that our method performs better with a smaller learning rate for the adversarial network than that of the other networks, which could imply that a stable repulsive term is beneficial.
  • 26. Building Motivation
      • This optimization problem is strongly concave in $\pi$ (thanks to the entropy term), and is state-wise a Legendre-Fenchel transform. Its solution is given by:
        $$\pi_{k+1} \propto \left( \frac{\pi_k}{\pi_\text{adv}} \right)^{c/\alpha} \exp\left( \frac{Q_{\pi_k}}{\alpha} \right)$$
      • This result gives us some insight into the behavior of the objective function. Notably, in our example, if $\pi_\text{adv}$ is fixed and $c = \alpha$, we recover a KL-regularized PI scheme (Geist et al., 2019) with the modified reward $r - c \log \pi_\text{adv}$.
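The closed-form update can be illustrated at a single state with a few discrete actions. The sketch below computes the normalized solution in log space and re-evaluates the PI objective so the result can be compared against other candidate policies; it assumes strictly positive `pi_k` and `pi_adv`, and the function names are hypothetical.

```python
import math

def agac_policy_update(pi_k, pi_adv, q, c, alpha):
    """pi_{k+1} proportional to (pi_k / pi_adv)^(c/alpha) * exp(Q / alpha)."""
    logits = [(c / alpha) * (math.log(pk) - math.log(pa)) + qa / alpha
              for pk, pa, qa in zip(pi_k, pi_adv, q)]
    shift = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def pi_objective(pi, pi_k, pi_adv, q, c, alpha):
    """E_{a~pi}[Q + c (log pi_k - log pi_adv) - alpha log pi] at one state."""
    return sum(p * (qa + c * (math.log(pk) - math.log(pa)) - alpha * math.log(p))
               for p, pk, pa, qa in zip(pi, pi_k, pi_adv, q) if p > 0.0)
```

When `c == alpha`, the Q-values are all equal and the adversary is uniform, the update returns `pi_k` unchanged, which matches the reading of $\pi_k$ as an attractive point.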
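  • 27. Proof
      • We want to show that
        $$\pi_{k+1} = \operatorname{argmax}_\pi \mathcal{J}_\text{PI}(\pi) \propto \left( \frac{\pi_k}{\pi_\text{adv}} \right)^{c/\alpha} \exp\left( \frac{Q_{\pi_k}}{\alpha} \right)$$
        with the objective function:
        $$\mathcal{J}_\text{PI}(\pi) = \mathbb{E}_s \mathbb{E}_{a \sim \pi(\cdot|s)} \left[ Q_{\pi_k}(s, a) + c \left( \log \pi_k(a|s) - \log \pi_\text{adv}(a|s) \right) - \alpha \log \pi(a|s) \right]$$
  • 28. Proof
      • We first consider a simpler optimization problem, $\operatorname{argmax}_\pi \langle \pi, Q_{\pi_k} \rangle + \alpha \mathcal{H}(\pi)$, whose solution is known (Vieillard et al., 2020a, Appendix A).
      • The expression for the maximizer is the $\alpha$-scaled softmax:
        $$\pi^* = \frac{\exp\left( Q_{\pi_k} / \alpha \right)}{\left\langle \mathbf{1}, \exp\left( Q_{\pi_k} / \alpha \right) \right\rangle}$$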
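For readers who want the intermediate step behind the softmax maximizer, the standard Lagrange-multiplier argument is sketched below for a single state with finitely many actions. It is not taken from the slides; $\mu$ is the multiplier for the normalization constraint, and $Q(a)$ abbreviates $Q_{\pi_k}(s, a)$.

```latex
\begin{aligned}
L(\pi, \mu) &= \sum_a \pi(a)\, Q(a) - \alpha \sum_a \pi(a) \log \pi(a)
               + \mu \Big( 1 - \sum_a \pi(a) \Big) \\
\frac{\partial L}{\partial \pi(a)} &= Q(a) - \alpha \big( \log \pi(a) + 1 \big) - \mu = 0 \\
\Rightarrow \quad \pi(a) &= \exp\!\Big( \frac{Q(a)}{\alpha} \Big)
                            \exp\!\Big( -1 - \frac{\mu}{\alpha} \Big) \\
\sum_a \pi(a) = 1 \quad \Rightarrow \quad
\pi^*(a) &= \frac{\exp\big( Q(a) / \alpha \big)}{\sum_{a'} \exp\big( Q(a') / \alpha \big)}
\end{aligned}
```

Because the objective is strictly concave in $\pi$ for $\alpha > 0$ and the stationary point has strictly positive entries, it is the unique maximizer on the simplex.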
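  • 29. Proof
      • We now turn to the optimization problem of interest, which we can rewrite as $\operatorname{argmax}_\pi \left\langle \pi, Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right) \right\rangle + \alpha \mathcal{H}(\pi)$.
      • By the simple change of variable $\tilde{Q}_{\pi_k} = Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right)$, we can reuse the previous solution (replacing $Q_{\pi_k}$ by $\tilde{Q}_{\pi_k}$).
      • With the simplification:
        $$\exp\left( \frac{Q_{\pi_k} + c \left( \log \pi_k - \log \pi_\text{adv} \right)}{\alpha} \right) = \left( \frac{\pi_k}{\pi_\text{adv}} \right)^{c/\alpha} \exp\left( \frac{Q_{\pi_k}}{\alpha} \right)$$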
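  • 30. Implementation
      • In all of the experiments, we use PPO (Schulman et al., 2017) as the base algorithm and build on it to incorporate our method. Hence,
        $$\mathcal{L}_\text{PG} = -\frac{1}{N} \sum_{t'=t}^{t+N} \min\left( \frac{\pi(a_{t'} | s_{t'}, \theta)}{\pi(a_{t'} | s_{t'}, \theta_\text{old})} A_{t'}^\text{AGAC},\ \operatorname{clip}\left( \frac{\pi(a_{t'} | s_{t'}, \theta)}{\pi(a_{t'} | s_{t'}, \theta_\text{old})}, 1 - \epsilon, 1 + \epsilon \right) A_{t'}^\text{AGAC} \right)$$
        with $A_{t'}^\text{AGAC}$ the modified advantage defined on slide 17, $N$ the temporal length considered for one update of parameters and $\epsilon$ the clipping parameter.
      • For reference, the PPO objective is $L^\text{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta) \hat{A}_t,\ \operatorname{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]$.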
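The clipped loss can be sketched for a batch of scalar log-probabilities and precomputed modified advantages. This is an illustration in plain Python rather than the authors' implementation; the function name is hypothetical and the loss is averaged over the samples provided.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon):
    """Negative mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for new, old, advantage in zip(logp_new, logp_old, advantages):
        ratio = math.exp(new - old)  # pi(a|s, theta) / pi(a|s, theta_old)
        clipped = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
        total += min(ratio * advantage, clipped * advantage)
    return -total / len(advantages)
```

With a positive advantage the gain from increasing the ratio is capped at `1 + epsilon`, whereas with a negative advantage the unclipped (more pessimistic) term is kept.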
  • 31. Implementation
      • Similar to RIDE (Raileanu & Rocktäschel, 2019), we also discount PPO by episodic state visitation counts.
      • The actor, critic and adversary use the convolutional architecture from the DQN Nature paper (Mnih et al., 2015) with different hidden sizes.
      • The three neural networks are optimized using Adam (Kingma & Ba, 2015). Our method does not use RNNs in its architecture; instead, in all our experiments, we use frame stacking.
  • 32. Implementation
  • 33. Implementation
      • At each training step, we perform a stochastic optimization step to minimize $\mathcal{L}_\text{AGAC}$ using stop-gradient:
      • $\theta \leftarrow \text{Adam}(\theta, \nabla_\theta \mathcal{L}_\text{PG}, \eta_1)$
      • $\phi \leftarrow \text{Adam}(\phi, \nabla_\phi \mathcal{L}_V, \eta_1)$
      • $\psi \leftarrow \text{Adam}(\psi, \nabla_\psi \mathcal{L}_\text{adv}, \eta_2)$
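To make the adversary's update concrete, here is a deliberately simplified sketch: a tabular softmax adversary for a single state, trained with plain gradient descent instead of Adam. Both simplifications are assumptions made for illustration. The actor's distribution is treated as a constant, which plays the role of the stop-gradient, and the gradient of $D_\text{KL}(\pi \,\|\, \operatorname{softmax}(z))$ with respect to the adversary logits $z$ is $\operatorname{softmax}(z) - \pi$.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    weights = [math.exp(l - shift) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

def adversary_sgd_step(adv_logits, actor_probs, learning_rate):
    """One gradient-descent step on KL(actor || softmax(adv_logits)).

    The actor distribution is a fixed target here (no gradient flows to it).
    """
    adv_probs = softmax(adv_logits)
    return [z - learning_rate * (q - p)
            for z, q, p in zip(adv_logits, adv_probs, actor_probs)]
```

A smaller `learning_rate` makes the adversary track the actor more slowly, which corresponds to the "delayed and steadier version of the actor's policy" described on slide 21.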
  • 34. Experiments
      • In this section, we describe our experimental study in which we investigate:
      • (i) whether the adversarial bonus alone (e.g. without episodic state visitation count) is sufficient to outperform other methods in VizDoom, a sparse-reward task with high-dimensional observations
      • (ii) whether AGAC succeeds in partially-observable and procedurally-generated environments with high sparsity in the rewards, compared to other methods
  • 35. Experiments
      • In this section, we describe our experimental study in which we investigate:
      • (iii) how well AGAC is capable of exploring in environments without extrinsic reward
      • (iv) the training stability of our method.
      • In all of the experiments, lines are average performances and shaded areas represent one standard deviation.
  • 36. Experiments: Environments
  • 37. Experiments: Baselines
      • RIDE (Raileanu & Rocktäschel, 2019)
      • Count, i.e. Count-Based Exploration (Bellemare et al., 2016b)
      • RND (Burda et al., 2018)
      • ICM (Pathak et al., 2017)
      • AMIGo (Campero et al., 2021)
  • 38. Experiments: Adversarially-based Exploration (No episodic count)
  • 39. Experiments: Hard-Exploration Tasks with Partially-Observable Environments
  • 40. Experiments: Training Stability
  • 41. Experiments: Exploration in Reward-Free Environment
  • 42. Experiments: Visualizing Coverage and Diversity
  • 43. Experiments: Visualizing Coverage and Diversity
  • 44. Discussion
      • This paper introduced AGAC, a modification to the traditional actor-critic framework: an adversary network is added as a third protagonist.
      • The mechanics of AGAC have been discussed from a policy iteration point of view, and we provided theoretical insight into the inner workings of the proposed algorithm: the adversary forces the agent to remain close to the current policy while moving away from the previous ones. In a nutshell, the influence of the adversary makes the actor conservatively diversified.