Agent57: Outperforming the Atari Human Benchmark,
Badia, A. P. et al, 2020
옥찬호
utilForever@gmail.com
Introduction
• Arcade Learning Environment (ALE)
• An interface to a diverse set of Atari 2600 game environments designed to be
engaging and challenging for human players
• The Atari 2600 games are well suited for evaluating general competency in
AI agents for three main reasons
• 1) Varied enough to claim generality
• 2) Each interesting enough to be representative of settings that might be faced in practice
• 3) Each created by an independent party to be free of experimenter’s bias
• Previous achievements of Deep RL in ALE
• DQN (Mnih et al., 2015) : 100% HNS on 23 games
• Rainbow (M. Hessel et al., 2017) : 100% HNS on 53 games
• R2D2 (Kapturowski et al., 2018) : 100% HNS on 52 games
• MuZero (Schrittwieser et al., 2019) : 100% HNS on 51 games
→ Despite all efforts, no single RL algorithm has been able to achieve over
100% HNS on all 57 Atari games with one set of hyperparameters.
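Here HNS stands for the human normalized score: per game, scores are rescaled so that a random policy sits at 0% and the average human at 100%,
$$\text{HNS} = \frac{\text{score}_{\text{agent}} - \text{score}_{\text{random}}}{\text{score}_{\text{human}} - \text{score}_{\text{random}}} \times 100\%.$$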
• How to measure Artificial General Intelligence?
• Atari-57 5th percentile performance
• Challenging issues of Deep RL in ALE
• 1) Long-term credit assignment
• This problem is particularly hard when rewards are delayed and credit needs to be assigned over
long sequences of actions.
• The game of Skiing is a canonical example due to its peculiar reward structure.
• The goal of the game is to run downhill through all gates as fast as possible.
• A penalty of five seconds is given for each missed gate.
• The reward, given only at the end, is proportional to the time elapsed.
• Long-term credit assignment is needed to understand why an action taken early in the game (e.g. missing a gate) has a negative impact on the obtained reward.
• Challenging issues of Deep RL in ALE
• 2) Exploration in large high dimensional state spaces
• Games like Private Eye, Montezuma’s Revenge, Pitfall! or Venture are widely considered hard
exploration games as hundreds of actions may be required before a first positive reward is seen.
• In order to succeed, the agents need to keep exploring the environment despite the apparent
impossibility of finding positive rewards.
• Challenging issues of Deep RL in ALE
• Exploration algorithms in deep RL generally fall into three categories:
• Randomized value functions
• Unsupervised policy learning
• Intrinsic motivation: NGU
• Other work combines handcrafted features, domain-specific knowledge or
privileged pre-training to side-step the exploration problem.
• In summary, our contributions are as follows:
• 1) A new parameterization of the state-action value function that decomposes the
contributions of the intrinsic and extrinsic rewards. As a result, we significantly
increase the training stability over a large range of intrinsic reward scales.
• 2) A meta-controller: an adaptive mechanism to select which of the policies
(parameterized by exploration rate and discount factors) to prioritize throughout
the training process. This allows the agent to control the exploration/exploitation
trade-off by dedicating more resources to one or the other.
• In summary, our contributions are as follows:
• 3) Finally, we demonstrate for the first time performance that is above the human
baseline across all Atari 57 games. As part of these experiments, we also find that
simply re-tuning the backprop through time window to be twice the previously
published window for R2D2 led to superior long-term credit assignment (e.g., in
Solaris) while still maintaining or improving overall performance on the remaining
games.
Background: NGU
• Our work builds on top of the Never Give Up (NGU) agent,
which combines two ideas:
• Curiosity-driven exploration
• Distributed deep RL agents, in particular R2D2
• NGU computes an intrinsic reward in order to encourage
exploration. This reward is defined by combining
• Per-episode novelty
• Life-long novelty
• Per-episode novelty, $r_t^{\text{episodic}}$, rapidly vanishes over the course of an episode, and it is computed by comparing observations to the contents of an episodic memory.
• The life-long novelty, $\alpha_t$, slowly vanishes throughout training, and it is computed by using a parametric model (both signals are sketched below).
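To make these two signals concrete, here is a heavily simplified sketch of how they could be computed. The embedding network, the k-nearest-neighbour kernel and the RND-style predictor that NGU actually uses are only hinted at; EpisodicMemory, rnd_error and the constants are illustrative names and values, not the paper's API.

    import numpy as np

    class EpisodicMemory:
        """Holds the embeddings seen so far in the current episode (cleared at episode start)."""
        def __init__(self):
            self.embeddings = []

        def episodic_novelty(self, emb, k=10, eps=1e-3):
            # Per-episode novelty: large for observations unlike anything stored
            # this episode, shrinking as similar embeddings accumulate.
            if not self.embeddings:
                self.embeddings.append(emb)
                return 1.0
            sq_dists = np.sort([np.sum((e - emb) ** 2) for e in self.embeddings])[:k]
            similarity = np.sum(eps / (sq_dists + eps))
            self.embeddings.append(emb)
            return 1.0 / np.sqrt(similarity + 1e-8)

    def lifelong_novelty(rnd_error, running_mean, running_std):
        # Life-long novelty alpha_t: a normalized prediction error from a
        # parametric model (RND-like); it slowly shrinks as training progresses.
        return 1.0 + (rnd_error - running_mean) / (running_std + 1e-8)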
• With this, the intrinsic reward $r_t^i$ is defined as follows:
$$r_t^i = r_t^{\text{episodic}} \cdot \min\{\max\{\alpha_t, 1\}, L\}$$
where $L = 5$ is a chosen maximum reward scaling.
• This leverages the long-term novelty provided by $\alpha_t$, while $r_t^{\text{episodic}}$ continues to encourage the agent to explore within an episode (see the one-line sketch below).
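As a minimal sketch (reusing the hypothetical helpers above), the combination is just a clipped product:

    def intrinsic_reward(r_episodic, alpha_t, L=5.0):
        # r_t^i = r_t^episodic * min(max(alpha_t, 1), L)
        return r_episodic * min(max(alpha_t, 1.0), L)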
• At time $t$, NGU adds $N$ different scales of the same intrinsic reward, $\beta_j r_t^i$ (with $\beta_j \in \mathbb{R}^+$, $j \in \{0, \ldots, N-1\}$), to the extrinsic reward provided by the environment, $r_t^e$, to form $N$ potential total rewards $r_{j,t} = r_t^e + \beta_j r_t^i$.
• Consequently, NGU aims to learn the $N$ optimal state-action value functions $Q^*_{r_j}$, one associated with each reward function $r_{j,t}$.
• The exploration rates $\beta_j$ are parameters that control the degree of exploration.
• Higher values will encourage exploratory policies.
• Smaller values will encourage exploitative policies.
• Additionally, for purposes of learning long-term credit assignment, each $Q^*_{r_j}$ has its own associated discount factor $\gamma_j$.
• Since the intrinsic reward is typically much denser than the extrinsic reward, the pairs $\{(\beta_j, \gamma_j)\}_{j=0}^{N-1}$ are chosen so as to allow for long horizons (high values of $\gamma_j$) for exploitative policies (small values of $\beta_j$) and short horizons (low values of $\gamma_j$) for exploratory policies (high values of $\beta_j$), as sketched below.
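The exact schedules for $\beta_j$ and $\gamma_j$ are specified in the paper's appendix; the sketch below only illustrates the qualitative pairing described above (small $\beta$ with long horizon, large $\beta$ with shorter horizon) with a simple interpolation, not the published schedule.

    def make_policy_family(n=32, beta_max=0.3, gamma_min=0.99, gamma_max=0.997):
        # Illustrative pairing only: arm 0 is the most exploitative
        # (beta close to 0, long horizon); arm n-1 is the most exploratory.
        family = []
        for j in range(n):
            frac = j / (n - 1)
            beta = beta_max * frac                                # grows with j
            gamma = gamma_max - (gamma_max - gamma_min) * frac    # shrinks with j
            family.append((beta, gamma))
        return family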
• To learn the state-action value functions $Q^*_{r_j}$, NGU trains a recurrent neural network $Q(x, a, j; \theta)$, where $j$ is a one-hot vector indexing one of the $N$ implied MDPs (in particular the pair $(\beta_j, \gamma_j)$), $x$ is the current observation, $a$ is an action, and $\theta$ are the parameters of the network (including the recurrent state).
• In practice, NGU can be unstable and fail to learn an appropriate approximation of $Q^*_{r_j}$ for all the state-action value functions in the family, even in simple environments.
• This is especially the case when the scale and sparseness of $r_t^e$ and $r_t^i$ are both different, or when one reward is more noisy than the other.
• We conjecture that learning a common state-action value
function for a mix of rewards is difficult when the rewards are
very different in nature.
• Our agent is a deep distributed RL agent, in the lineage of
R2D2 and NGU.
• It decouples the data collection and the learning processes by
having many actors feed data to a central prioritized replay
buffer.
• A learner can then sample training data from this buffer.
• More precisely, the replay buffer contains sequences of transitions that are removed regularly in a FIFO manner.
• These sequences come from actor processes that interact with independent copies of the environment, and they are prioritized based on temporal-difference errors.
• The priorities are initialized by the actors and updated by the
learner with the updated state-action value function
𝑄 𝑥, 𝑎, 𝑗; 𝜃 .
• According to those priorities, the learner samples sequences of
transitions from the replay buffer to construct an RL loss.
• Then, it updates the parameters of the neural network
𝑄 𝑥, 𝑎, 𝑗; 𝜃 by minimizing the RL loss to approximate the
optimal state-action value function.
• Finally, each actor shares the same network architecture as the
learner but with different weights.
• We refer to the parameters of the $l$-th actor as $\theta_l$. The learner weights $\theta$ are sent to the actor frequently, which allows it to update its own weights $\theta_l$.
• Each actor uses a different value $\epsilon_l$, which is employed to follow an $\epsilon_l$-greedy policy based on the current estimate of the state-action value function $Q(x, a, j; \theta_l)$.
• In particular, at the beginning of each episode and in each actor, NGU uniformly selects a pair $(\beta_j, \gamma_j)$ (a rough actor loop is sketched below).
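Under these assumptions, one actor's episode loop can be sketched roughly as follows; env, q_network and the way $\epsilon_l$ is scheduled per actor are placeholders, not the paper's actual components.

    import random

    def run_actor_episode(env, q_network, theta_l, epsilon_l, policy_family):
        # NGU: pick one (beta_j, gamma_j) arm uniformly at the start of each episode.
        j = random.randrange(len(policy_family))
        obs = env.reset()
        trajectory, done = [], False
        while not done:
            if random.random() < epsilon_l:
                action = env.sample_random_action()                  # explore
            else:
                action = q_network.argmax_action(obs, j, theta_l)    # greedy w.r.t. Q(x, a, j; theta_l)
            next_obs, extrinsic_reward, done = env.step(action)
            trajectory.append((obs, action, extrinsic_reward, j))
            obs = next_obs
        return trajectory  # shipped to the central prioritized replay buffer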
Improvements to NGU
• State-Action Value Function Parameterization
$$Q(x, a, j; \theta) = Q(x, a, j; \theta^e) + \beta_j\, Q(x, a, j; \theta^i)$$
• $Q(x, a, j; \theta^e)$: extrinsic component
• $Q(x, a, j; \theta^i)$: intrinsic component
• Weights: $\theta^e$ and $\theta^i$ (separate), with identical architecture and $\theta = \theta^e \cup \theta^i$
• Rewards: $r^e$ and $r^i$ (separate), but with the same target policy $\pi(x) = \operatorname{argmax}_{a \in \mathcal{A}} Q(x, a, j; \theta)$ (see the sketch below)
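A minimal sketch of this decomposition, assuming two value heads with identical architecture; DecomposedQ, make_head and the forward signature are illustrative, not the released implementation.

    import torch.nn as nn

    class DecomposedQ(nn.Module):
        """Q(x, a, j; theta) = Q(x, a, j; theta_e) + beta_j * Q(x, a, j; theta_i)."""
        def __init__(self, make_head, betas):
            super().__init__()
            self.q_extrinsic = make_head()   # theta_e
            self.q_intrinsic = make_head()   # theta_i, identical architecture
            self.betas = betas               # one exploration rate per policy index j

        def forward(self, obs, j):
            q_e = self.q_extrinsic(obs, j)   # [batch, num_actions], trained on r^e
            q_i = self.q_intrinsic(obs, j)   # [batch, num_actions], trained on r^i
            return q_e + self.betas[j] * q_i

        def greedy_action(self, obs, j):
            # Shared target policy: greedy with respect to the combined value.
            return self.forward(obs, j).argmax(dim=-1)

Both heads would be trained on the same sampled sequences but each with its own reward (extrinsic or intrinsic), which is what stabilizes training across a large range of intrinsic reward scales.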
• State-Action Value Function Parameterization
• We show that this optimization of separate state-action values is equivalent to the optimization of the original single state-action value function with reward $r^e + \beta_j r^i$ in Appendix B.
• Even though the theoretical objective being optimized is the same, the
parameterization is different: we use two different neural networks to
approximate each one of these state-action values.
• Adaptive Exploration over a Family of Policies
• The core idea of NGU is to jointly train a family of policies with different
degrees of exploratory behavior using a single network architecture.
• In this way, training these exploratory policies plays the role of a set of
auxiliary tasks that can help train the shared architecture even in the absence
of extrinsic rewards.
• A major limitation of this approach is that all policies are trained equally,
regardless of their contribution to the learning progress.
• We propose to incorporate a meta-controller that can
adaptively select which policies to use both at training and
evaluation time. This carries two important consequences.
• Meta-controller: adaptively selecting policies
• 1) By selecting which policies to prioritize during training, we can allocate
more of the capacity of the network to better represent the state-action
value function of the policies that are most relevant for the task at hand.
• Note that this is likely to change throughout the training process, naturally
building a curriculum to facilitate training.
• Meta-controller: adaptively selecting policies
• Policies are represented by pairs of exploration rate and discount factor,
(𝛽𝑗, 𝛾𝑗), which determine the discounted cumulative rewards to maximize.
• It is natural to expect policies with higher 𝛽𝑗 and lower 𝛾𝑗 to make more
progress early in training, while the opposite would be expected as training
progresses.
• Meta-controller: adaptively selecting policies
• 2) This mechanism also provides a natural way of choosing the best policy in
the family to use at evaluation time.
• Considering a wide range of values of $\gamma_j$ with $\beta_j \approx 0$ provides a way of automatically adjusting the discount factor on a per-task basis.
• This significantly increases the generality of the approach.
• We propose to implement the meta-controller using
a nonstationary multi-arm bandit algorithm running
independently on each actor.
• The reason for this choice, as opposed to a global meta-
controller, is that each actor follows a different 𝜖𝑙-greedy
policy which may alter the choice of the optimal arm.
• The multi-armed bandit problem is a problem in which a fixed
limited set of resources must be allocated between competing
(alternative) choices in a way that maximizes their expected
gain, when each choice's properties are only partially known at
the time of allocation, and may become better understood as
time passes or by allocating resources to the choice.
• Nonstationary multi-armed bandit algorithm
• Each arm $j$ of the $N$-armed bandit is linked to a policy in the family and corresponds to a pair $(\beta_j, \gamma_j)$. At the beginning of each episode, say the $k$-th episode, the meta-controller chooses an arm $J_k$ setting which policy will be executed ($J_k$ is a random variable).
• Then the $l$-th actor acts $\epsilon_l$-greedily with respect to the corresponding state-action value function, $Q(x, a, J_k; \theta_l)$, for the whole episode.
• The undiscounted extrinsic episode returns, denoted $R_k^e(J_k)$, are used as a reward signal to train the multi-armed bandit algorithm of the meta-controller.
• Nonstationary multi-armed bandit algorithm
• The reward signal $R_k^e(J_k)$ is non-stationary, as the agent changes throughout training. Thus, a classical bandit algorithm such as Upper Confidence Bound (UCB) will not be able to adapt to the changes of the reward through time.
• Therefore, we employ a simplified sliding-window UCB (SW-UCB) with $\epsilon_{\text{UCB}}$-greedy exploration.
• With probability $1 - \epsilon_{\text{UCB}}$, this algorithm runs a slight modification of classic UCB on a sliding window of size $\tau$; with probability $\epsilon_{\text{UCB}}$, it selects a random arm (see the sketch below).
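A compact sketch of the kind of sliding-window UCB with $\epsilon_{\text{UCB}}$-greedy exploration the meta-controller could run on each actor; the window size, bonus coefficient and tie-breaking here are illustrative assumptions, not the paper's exact hyperparameters.

    import math
    import random
    from collections import deque

    class SlidingWindowUCB:
        def __init__(self, num_arms, window=90, epsilon_ucb=0.5, bonus=1.0):
            self.num_arms = num_arms
            self.epsilon_ucb = epsilon_ucb
            self.bonus = bonus
            self.history = deque(maxlen=window)   # only the most recent (arm, return) pairs

        def select_arm(self):
            pulled = {arm for arm, _ in self.history}
            for arm in range(self.num_arms):
                if arm not in pulled:             # play arms absent from the window first
                    return arm
            if random.random() < self.epsilon_ucb:
                return random.randrange(self.num_arms)   # epsilon_UCB-greedy exploration
            counts = [0] * self.num_arms
            sums = [0.0] * self.num_arms
            for arm, ret in self.history:
                counts[arm] += 1
                sums[arm] += ret
            total = len(self.history)
            scores = [sums[a] / counts[a] + self.bonus * math.sqrt(math.log(total) / counts[a])
                      for a in range(self.num_arms)]
            return max(range(self.num_arms), key=lambda a: scores[a])

        def update(self, arm, episode_return):
            # Reward signal: the undiscounted extrinsic episode return R_k^e(J_k).
            self.history.append((arm, episode_return))

Each actor would keep one such bandit over its N (beta_j, gamma_j) arms, call select_arm() at the start of an episode and update() with that episode's extrinsic return at the end.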
Experiments
Conclusions
• We present the first deep reinforcement learning agent with
performance above the human benchmark on all 57 Atari
games.
• The agent is able to balance the learning of the different skills that are required to perform well on such a diverse set of games:
• Exploration
• Exploitation
• Long-term credit assignment
• To do that, we propose simple improvements to an existing
agent, Never Give Up, which has good performance on hard-
exploration games, but in itself does not have strong overall
performance across all 57 games.
• 1) Using a different parameterization of the state-action value function
• 2) Using a meta-controller to dynamically adapt the novelty preference and
discount
• 3) Using a longer backprop-through-time window to learn from, together with the Retrace algorithm
References
• https://deepmind.com/blog/article/Agent57-Outperforming-the-
human-Atari-benchmark
• https://seungeonbaek.tistory.com/4
• https://seolhokim.github.io/deeplearning/2020/04/08/NGU-
review/
• https://openreview.net/pdf?id=Sye57xStvB
• https://en.wikipedia.org/wiki/Multi-armed_bandit
• https://arxiv.org/abs/0805.3415
Thank you!