Continuous Control With Deep Reinforcement Learning,
Lillicrap et al, 2015.
옥찬호
utilForever@gmail.com
Introduction
• Deep Q Network (DQN) solves problems with...
• High-dimensional observation spaces
• Discrete and low-dimensional action spaces
• Limitation of DQN
• For example, a 7-degree-of-freedom system with
the coarsest discretization 𝑎𝑖 ∈ {−𝑘, 0, 𝑘} for each
joint leads to an action space with dimensionality
3⁷ = 2187 (a quick check follows below).
→ How do we solve the problem of high-dimensional action spaces?
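A quick check of that count; a throwaway sketch where the 7-DoF arm and 𝑘 are just the slide's example:

```python
from itertools import product

# Coarsest discretization {-k, 0, k} for each of the 7 joints (illustrative k).
k = 1.0
per_joint = (-k, 0.0, k)
discrete_actions = list(product(per_joint, repeat=7))

print(len(discrete_actions))  # 3 ** 7 = 2187 discrete actions
```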
Introduction
• Use Deterministic Policy Gradient (DPG) algorithm!
• However, a naive application of this actor-critic method with
neural function approximators is unstable for challenging problems.
→ How?
• Combine actor-critic approach with insights from Deep Q Network (DQN)
• Use replay buffer to minimize correlations between samples.
• Use target Q network to give consistent targets.
• Use batch normalization.
Background
• The action-value function is used in many reinforcement
learning algorithms. It describes the expected return after
taking an action 𝑎𝑡 in state 𝑠𝑡 and thereafter following policy 𝜋.
• The action-value function can be rewritten recursively using the Bellman equation.
• If the target policy is deterministic, the inner expectation over actions disappears (see the reconstruction below).
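The equations this slide refers to, reconstructed from the paper's definitions (the return 𝑅𝑡 is the discounted sum of future rewards):

$$Q^{\pi}(s_t, a_t) = \mathbb{E}\big[ R_t \mid s_t, a_t \big]$$

$$Q^{\pi}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1}}\big[ r(s_t, a_t) + \gamma\, \mathbb{E}_{a_{t+1} \sim \pi}\big[ Q^{\pi}(s_{t+1}, a_{t+1}) \big] \big]$$

$$Q^{\mu}(s_t, a_t) = \mathbb{E}_{r_t,\, s_{t+1}}\big[ r(s_t, a_t) + \gamma\, Q^{\mu}\big(s_{t+1}, \mu(s_{t+1})\big) \big]$$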
Background
• Q-learning, a commonly used off-policy algorithm, uses the
greedy policy 𝜇(𝑠) = argmaxₐ 𝑄(𝑠, 𝑎). We consider function
approximators parameterized by 𝜃^𝑄, which we optimize by
minimizing the loss:
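The loss referred to here, reconstructed from the paper (note that 𝑦𝑡 is computed with the same network, which is what the target network later addresses):

$$L(\theta^Q) = \mathbb{E}\big[ \big( Q(s_t, a_t \mid \theta^Q) - y_t \big)^2 \big], \qquad y_t = r(s_t, a_t) + \gamma\, Q\big(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q\big)$$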
Algorithm
• It is not possible to straightforwardly apply Q-learning
to continuous action spaces.
• In continuous spaces, finding the greedy policy requires
an optimization over 𝑎𝑡 at every timestep:
𝜇(𝑠) = argmaxₐ 𝑄(𝑠, 𝑎)
• This optimization is too slow to be practical with large,
unconstrained function approximators and nontrivial action spaces.
• Instead, here we used an actor-critic approach
based on the DPG algorithm.
Algorithm
• The DPG algorithm maintains a parameterized actor function
𝜇(𝑠 | 𝜃^𝜇) which specifies the current policy by deterministically
mapping states to a specific action.
• Critic network (Value): learned using the Bellman equation, as in DQN.
• Actor network (Policy): updated by applying the chain rule to the expected return with respect to the actor parameters (see the gradient below).
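The actor update referred to here is the deterministic policy gradient, reconstructed from the paper:

$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\big[ \nabla_a Q(s, a \mid \theta^Q)\big|_{a = \mu(s)}\; \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu}) \big]$$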
Replay buffer
• As with Q learning, introducing non-linear function approximators means
that convergence is no longer guaranteed. However, such approximators
appear essential in order to learn and generalize on large state spaces.
• NFQCA uses the same update rules as DPG but with neural network function
approximators, and uses batch learning for stability, which is intractable for large
networks. A minibatch version of NFQCA that does not reset the policy at
each update, as would be required to scale to large networks, is equivalent to the original DPG.
• Our contribution here is to provide modifications to DPG, inspired by the
success of DQN, which allow it to use neural network function approximators
to learn in large state and action spaces online. We refer to our algorithm as
Deep DPG (DDPG, Algorithm 1).
Replay buffer
• As in DQN, we used a replay buffer to address these issues. The replay buffer
is a finite sized cache ℛ. Transitions were sampled from the environment
according to the exploration policy and the tuple (𝑠𝑡, 𝑎𝑡, 𝑟𝑡, 𝑠𝑡+1) was stored in
the replay buffer. When the replay buffer was full, the oldest samples were
discarded. At each timestep the actor and critic are updated by sampling a
minibatch uniformly from the buffer (a minimal sketch follows). Because DDPG is an off-policy
algorithm, the replay buffer can be large, allowing the algorithm to benefit
from learning across a set of uncorrelated transitions.
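A minimal sketch of such a buffer (the capacity and batch size here are illustrative, not the paper's values):

```python
import random
from collections import deque

class ReplayBuffer:
    """Finite-sized cache R; the oldest transitions are discarded when it is full."""

    def __init__(self, capacity=1_000_000):
        self.buffer = deque(maxlen=capacity)  # deque drops the oldest item automatically

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size=64):
        # Uniformly sample a minibatch of stored transitions.
        return random.sample(self.buffer, batch_size)
```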
Soft target update
• Directly implementing Q learning (equation 4) with neural networks proved
to be unstable in many environments. Since the network 𝑄(𝑠, 𝑎 | 𝜃^𝑄) being
updated is also used in calculating the target value (equation 5), the Q
update is prone to divergence.
• Our solution is similar to the target network used in (Mnih et al., 2013) but
modified for actor-critic and using “soft” target updates, rather than directly
copying the weights.
Soft target update
• We create a copy of the actor and critic networks, 𝑄′(𝑠, 𝑎 | 𝜃^𝑄′) and 𝜇′(𝑠 | 𝜃^𝜇′)
respectively, that are used for calculating the target values. The weights of
these target networks are then updated by having them slowly track the
learned networks: 𝜃′ ← 𝜏𝜃 + (1 − 𝜏)𝜃′ with 𝜏 ≪ 1. This means that the target
values are constrained to change slowly, greatly improving the stability of
learning.
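A minimal sketch of the soft update, assuming each network's weights are a list of NumPy arrays:

```python
import numpy as np

def soft_update(target_weights, learned_weights, tau=0.001):
    """theta' <- tau * theta + (1 - tau) * theta', applied to every weight array."""
    for w_target, w in zip(target_weights, learned_weights):
        w_target[...] = tau * w + (1.0 - tau) * w_target

# Example with a single weight matrix per "network".
theta = [np.ones((2, 2))]
theta_target = [np.zeros((2, 2))]
soft_update(theta_target, theta)   # the target moves 0.1% of the way toward theta
print(theta_target[0])
```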
Batch normalization
• When learning from low dimensional feature vector observations, the
different components of the observation may have different physical units
(for example, positions versus velocities) and the ranges may vary across
environments. This can make it difficult for the network to learn effectively
and may make it difficult to find hyper-parameters which generalize across
environments with different scales of state values.
Batch normalization
• One approach to this problem is to manually scale the features so they are in
similar ranges across environments and units. We address this issue by
adapting a recent technique from deep learning called batch normalization.
This technique normalizes each dimension across the samples in a minibatch
to have zero mean and unit variance (a minimal sketch follows below).
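A minimal sketch of that normalization over a minibatch (the learnable scale/shift and the running statistics used at test time are omitted):

```python
import numpy as np

def batch_normalize(x, eps=1e-5):
    """Normalize each observation dimension across the minibatch (rows of x)
    to zero mean and unit variance."""
    mean = x.mean(axis=0, keepdims=True)  # per-dimension mean over the batch
    var = x.var(axis=0, keepdims=True)    # per-dimension variance over the batch
    return (x - mean) / np.sqrt(var + eps)

# Positions and velocities with very different units end up on a comparable scale.
observations = np.array([[0.1, 250.0],
                         [0.2, 240.0],
                         [0.3, 260.0]])
print(batch_normalize(observations))
```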
Noise process
• A major challenge of learning in continuous action spaces is exploration. An
advantage of off-policy algorithms such as DDPG is that we can treat the
problem of exploration independently from the learning algorithm.
• We constructed an exploration policy 𝜇′ by adding noise sampled from
a noise process 𝒩 to our actor policy: 𝜇′(𝑠𝑡) = 𝜇(𝑠𝑡 | 𝜃𝑡^𝜇) + 𝒩, where 𝒩 can be
chosen to suit the environment.
• As detailed in the supplementary materials we used an Ornstein-Uhlenbeck
process (Uhlenbeck & Ornstein, 1930) to generate temporally correlated
exploration for exploration efficiency in physical control problems with inertia
(similar use of autocorrelated noise was introduced in (Wawrzynski, 2015)).
Noise process
• Ornstein–Uhlenbeck process
• It is a mean-reverting stochastic process. 𝜃 is a parameter that controls how
quickly the process returns to the mean, 𝜇 is the mean, 𝜎 is the volatility of
the process, and 𝑊𝑡 is a Wiener process.
• Therefore, each noise sample is temporally correlated with the previous one.
• The reason for using a temporally correlated noise process is that it is
more effective for exploration in inertial environments such as physical
control (a minimal sketch follows below).
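A minimal sketch of the discretized process, 𝑑𝑥𝑡 = 𝜃(𝜇 − 𝑥𝑡)𝑑𝑡 + 𝜎 𝑑𝑊𝑡 (the parameter values here are illustrative, not necessarily the paper's):

```python
import numpy as np

class OrnsteinUhlenbeckNoise:
    """Temporally correlated noise: dx = theta * (mu - x) * dt + sigma * dW."""

    def __init__(self, size, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(size, mu)

    def sample(self):
        dw = np.sqrt(self.dt) * np.random.randn(*self.x.shape)  # Wiener increment
        self.x = self.x + self.theta * (self.mu - self.x) * self.dt + self.sigma * dw
        return self.x

noise = OrnsteinUhlenbeckNoise(size=1)
exploration_action = 0.3 + noise.sample()  # mu(s_t) + N, with 0.3 standing in for mu(s_t)
```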
Algorithm
• Network structure
[Network structure diagram: Actor Network 1 maps 𝑠 to 𝑎; Critic Network 1 maps (𝑠, 𝑎) to 𝑄(𝑠, 𝑎); Critic Network 2 takes the next state 𝑠′ and is used for the target value.]
Algorithm
• Critic network
𝐿 = (1/𝑁) ∑ᵢ (𝑦ᵢ − 𝑄(𝑠ᵢ, 𝑎ᵢ | 𝜃^𝑄))²
[Same network structure diagram as on the previous slide.]
Algorithm
• Actor network
𝜕𝑄(𝑠, 𝑎, 𝑤)/𝜕𝑎 · 𝜕𝑎/𝜕𝜃
[Same network structure diagram as on the previous slides, annotated with 𝜕𝑄(𝑠, 𝑎, 𝑤)/𝜕𝑎 at the critic and 𝜕𝑎/𝜕𝜃 at the actor.]
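A minimal sketch of the critic and actor updates described on these slides, using PyTorch; the architectures, dimensions, and hyperparameters are illustrative, not the paper's:

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, gamma = 3, 1, 0.99  # illustrative sizes

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                      nn.Linear(64, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                       nn.Linear(64, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s_next):
    """One update on a minibatch of transitions; s, a, r, s_next are (batch, dim) tensors."""
    # Critic: minimize L = (1/N) * sum_i (y_i - Q(s_i, a_i | theta_Q))^2,
    # with y computed from the slowly moving target networks.
    with torch.no_grad():
        a_next = actor_target(s_next)
        y = r + gamma * critic_target(torch.cat([s_next, a_next], dim=1))
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = F.mse_loss(q, y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: follow dQ/da * da/dtheta by maximizing Q(s, mu(s)), i.e. minimizing
    # its negation; the gradient flows through the critic into the actor.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    # The soft target update from the earlier slide would follow here.
```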
Algorithm
Results
Conclusion
• The work combines insights from recent advances in deep learning
and reinforcement learning, resulting in an algorithm that robustly
solves challenging problems across a variety of domains with
continuous action spaces, even when using raw pixels for
observations.
• As with most reinforcement learning algorithms, the use of non-
linear function approximators nullifies any convergence guarantees;
however, our experimental results demonstrate stable learning
without the need for any modifications between environments.
Conclusion
• Interestingly, all of our experiments used substantially fewer steps
of experience than DQN required to find solutions in the Atari domain.
• Most notably, as with most model-free reinforcement learning approaches,
DDPG requires a large number of training episodes to find solutions.
However, we believe that a robust model-free approach may be an
important component of larger systems which may attack these
limitations.
References
• https://nervanasystems.github.io/coach/components/agents/policy_optimization/ddpg.html
• https://youtu.be/h2WSVBAC1t4
• https://reinforcement-learning-kr.github.io/2018/06/27/2_dpg/
• https://reinforcement-learning-kr.github.io/2018/06/26/3_ddpg/
Thank you!