Discovering Reinforcement Learning Algorithms
Oh et al., NeurIPS 2020
Presenter: Jisang Yoon
Graduate School of Information, Yonsei Univ.
Machine Learning & Computational Finance Lab.
INDEX
1. Introduction
2. LPG
3. Details in LPG Architecture
4. Experiments
1 Introduction
1. Introduction
In RL, a growing line of meta-learning research (also known as "learning to learn") shows that, given a suitable value function, an agent can learn a policy update rule on its own and apply it to unseen tasks.
Could an agent instead discover, from scratch, a rule that optimizes RL learning by itself?
1. Introduction
This study contributes:
1. It demonstrates the feasibility of a model discovering, on its own, how to learn both the agent's policy and a semantic prediction vector, while achieving good performance.
2. It places no assumptions on the semantic prediction vector, further minimizing user-specified design and bringing the model closer to true meta-learning.
3. An RL learning algorithm discovered on simple tasks shows meaningful performance even on complex tasks.
2 LPG
2. LPG
Learned Policy Gradient (LPG)
1. After a few actions, the player learns when to jump in particular situations.
2. Over many decisions, the player picks up game strategies such as monster speeds and shortcuts.
3. After the game ends, the player thinks about how to learn more strategies in order to score higher.
4. The player applies this know-how for acquiring strategies to other games.
2. LPG
Learned Policy Gradient (LPG)
2. LPG
Learned Policy Gradient (LPG)
There are TWO learnable models:
- the LPG, parameterized by η (a backward LSTM)
- the agent, parameterized by θ
Final objective: find the optimized η
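A rough sketch of these two components in code follows; the layer sizes, heads, and input width are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

m, n_actions, obs_dim = 30, 4, 16   # m from the slides; the rest assumed

class Agent(nn.Module):
    """Agent with parameters theta, producing (pi_theta, y_theta)."""
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(obs_dim, 64)
        self.policy_head = nn.Linear(64, n_actions)   # -> pi_theta(.|s)
        self.pred_head = nn.Linear(64, m)              # -> y_theta(s)

    def forward(self, s):
        h = torch.relu(self.body(s))
        pi = torch.softmax(self.policy_head(h), dim=-1)
        y = torch.softmax(self.pred_head(h), dim=-1)   # categorical, in [0,1]^m
        return pi, y

class LPG(nn.Module):
    """LPG with parameters eta: a backward LSTM over per-step statistics."""
    def __init__(self, x_dim=5, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(x_dim, hidden)
        self.pi_hat_head = nn.Linear(hidden, 1)
        self.y_hat_head = nn.Linear(hidden, m)

    def forward(self, xs):                             # xs: [T, 1, x_dim]
        h, _ = self.lstm(torch.flip(xs, dims=[0]))     # run backward in time
        pi_hat = self.pi_hat_head(h).squeeze(-1)       # target for the policy
        y_hat = torch.sigmoid(self.y_hat_head(h))      # target in [0,1]^m
        return pi_hat, y_hat
```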
2. LPG
Learned Policy Gradient (LPG)
① Using its parameters θ, the agent outputs two values:
1. the policy π_θ, the distribution from which the action for the problem is sampled
2. the prediction y_θ, an estimate of the information used to select actions
(Analogy: the choices made during the game and the criteria behind them.)
2. LPG
Learned Policy Gradient (LPG)
② The agent takes actions for T time-steps to form a trajectory, then θ is updated to move the agent closer to the targets π̂ and ŷ that the LPG produces to guide the agent's learning.
(Analogy: acquiring many strategies through many actions.)
2. LPG
Learned Policy Gradient (LPG)
③ For each of many environments, an agent is trained every T time-steps; once all environments finish, the LPG's η is updated so that the total reward is maximized.
(Analogy: learning the know-how that raises the game score.)
3 Details in LPG Architecture
3. Details in LPG Architecture
1) LPG Architecture
2) Agent Update (θ)
3) LPG Update (η)
4) Balancing Agent Hyperparameters for Stabilisation (α)
Objective:
$\eta^* = \arg\max_{\eta}\ \mathbb{E}_{\mathcal{E}\sim p(\mathcal{E})}\,\mathbb{E}_{\theta_0\sim p(\theta_0)}[G]$
- $p(\mathcal{E})$ : the distribution over environments $\mathcal{E}$
- $p(\theta_0)$ : the distribution over initial agent parameters $\theta$
- $G$ : the sum of rewards over the whole lifetime
3. Details in LPG Architecture
1) LPG Architecture
The LPG is a backward LSTM with
input $x_t = [r_t, d_t, \pi_\theta(a_t|s_t), \varphi(y_\theta(s_t)), \varphi(y_\theta(s_{t+1}))]$
output $\hat{\pi} \in \mathbb{R}$, $\hat{y} \in [0,1]^m$
- $r_t$ : reward
- $d_t$ : whether the episode has terminated (binary value)
- $\pi_\theta(a_t|s_t)$ : policy from the agent
- $y_\theta \in [0,1]^m$ : $m$-dimensional categorical prediction vector ($m$ = 30 used)
- $\varphi$ : shared neural network (dim 16 → dim 1)
Because the input to the LPG is not the action itself but the probability of the action given the state, the LPG can be applied to diverse environments.
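A minimal sketch of how one step of this input could be assembled; the exact shape of φ is an assumption based on the dimensions above.

```python
import torch
import torch.nn as nn

m = 30                                   # prediction dimension from the slides
phi = nn.Sequential(nn.Linear(m, 16), nn.ReLU(), nn.Linear(16, 1))

def build_lpg_input(r_t, d_t, pi_a_given_s, y_s, y_s_next):
    # One time-step of the LPG input:
    # x_t = [r_t, d_t, pi_theta(a_t|s_t), phi(y_theta(s_t)), phi(y_theta(s_{t+1}))]
    return torch.cat([
        torch.tensor([r_t, float(d_t), pi_a_given_s]),
        phi(y_s),        # scalar embedding of y_theta(s_t)
        phi(y_s_next),   # scalar embedding of y_theta(s_{t+1})
    ])

x_t = build_lpg_input(1.0, 0, 0.25, torch.rand(m), torch.rand(m))
print(x_t.shape)  # torch.Size([5])
```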
3. Details in LPG Architecture
2) Agent Update (θ)
$\hat{\pi}$ updates θ directly, pushing the agent to take the target policy $\hat{\pi}$; $\hat{y}$ updates θ indirectly, pushing the prediction to represent the state semantically, like a value function.
θ is updated after forming a trajectory of T time-steps (T = 20 used):
$\Delta\theta \propto \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a|s)\,\hat{\pi} \;-\; \alpha_y \nabla_\theta D_{\mathrm{KL}}\!\left(y_\theta(s)\,\|\,\hat{y}\right)\right]$
(The first term acts as a categorical cross entropy on the policy; the second is a KL-divergence on the prediction.)
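The same update can be written as a surrogate loss whose θ-gradient matches Δθ; a hedged sketch, where α_y (the KL weight) is one of the balanced hyperparameters α introduced later and all names are illustrative.

```python
import torch

def agent_surrogate_loss(log_pi_a, y_s, pi_hat, y_hat, alpha_y=0.5):
    # log_pi_a: log pi_theta(a|s) for the taken action (requires grad)
    # y_s: m-dim prediction y_theta(s) (requires grad)
    # pi_hat, y_hat: LPG outputs, treated as fixed targets (hence .detach())
    pg_term = log_pi_a * pi_hat.detach()                 # policy term
    kl_term = torch.sum(y_s * (torch.log(y_s) - torch.log(y_hat.detach())))
    return -(pg_term - alpha_y * kl_term)                # minimize the negative
```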
3. Details in LPG Architecture
3) LPG Update (η)
Ideally ∆η would be computed after the agent has trained all the way from θ₀ to θ_N, but because of memory constraints ∆η is computed after every K agent updates, with K < N (K = 5 used).
(e.g., with T = 20 and K = 5: θ_n → θ_{n+1} is updated every 20 time-steps; after 20×5 = 100 time-steps, once θ_{n+5} has been updated, G and ∆η are computed; this repeats until the environment lifetime ends.)
objective: $\eta^* = \arg\max_{\eta}\ \mathbb{E}_{\mathcal{E}\sim p(\mathcal{E})}\,\mathbb{E}_{\theta_0\sim p(\theta_0)}[G]$
gradient: $\Delta\eta \propto \mathbb{E}_{\mathcal{E}}\,\mathbb{E}_{\theta}\!\left[\nabla_\eta \log \pi_{\theta_N}(a|s)\,G\right]$
3. Details in LPG Architecture
3) LPG Update (η)
Regularization terms are added to the gradient above for stable learning:
$\Delta\eta \propto \mathbb{E}_{\mathcal{E}}\,\mathbb{E}_{\theta}\big[\nabla_\eta \log \pi_{\theta_N}(a|s)\,G + \beta_0 \nabla_\eta \mathcal{H}(\pi_{\theta_N}) + \beta_1 \nabla_\eta \mathcal{H}(y_{\theta_N}) - \beta_2 \nabla_\eta \lVert\hat{\pi}\rVert_2^2 - \beta_3 \nabla_\eta \lVert\hat{y}\rVert_2^2\big]$
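A hedged sketch of a surrogate meta-loss whose η-gradient matches this regularized Δη; the β values below are placeholders, not the paper's settings.

```python
import torch

def lpg_meta_loss(log_pi_thetaN, G, H_pi, H_y, pi_hat, y_hat,
                  betas=(0.01, 0.001, 0.01, 0.001)):
    # Minimizing this loss ascends the regularized Delta-eta direction above.
    b0, b1, b2, b3 = betas                      # coefficient values assumed
    return -(log_pi_thetaN * G                  # return-weighted policy term
             + b0 * H_pi                        # policy entropy bonus H(pi_thetaN)
             + b1 * H_y                         # prediction entropy bonus H(y_thetaN)
             - b2 * (pi_hat ** 2).mean()        # L2 penalty on pi_hat
             - b3 * (y_hat ** 2).sum(-1).mean())  # L2 penalty on y_hat
```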
3. Details in LPG Architecture
4) Balancing Agent Hyperparameters for Stabilisation (α)
Because many environments are trained at once, applying the same hyperparameters (e.g., learning rate) to all of them makes training unstable, so the hyperparameters are set dynamically:
$\eta^* = \arg\max_{\eta}\ \mathbb{E}_{\mathcal{E}\sim p(\mathcal{E})}\,\max_{\alpha}\ \mathbb{E}_{\theta_0\sim p(\Theta)}[G]$, with $\alpha \sim p(\alpha\,|\,\mathcal{E})$
For each given environment ℰ, the probability of sampling hyperparameters that raise G is increased (α = learning rate and KL-divergence weight are used).
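One simple way such a p(α|ℰ) could be maintained is a score-based bandit over candidate settings; this is an illustrative assumption, not the paper's exact scheme.

```python
import math
import random

class HyperparamBandit:
    """Samples candidate (learning rate, KL weight) pairs in proportion to a
    softmax over scores; settings whose lifetimes beat a baseline return G
    become more likely to be sampled again."""
    def __init__(self, candidates, step=0.1):
        self.candidates = candidates
        self.scores = [0.0] * len(candidates)
        self.step = step

    def sample(self):
        exps = [math.exp(s) for s in self.scores]
        z = sum(exps)
        r, acc = random.random(), 0.0
        for i, e in enumerate(exps):
            acc += e / z
            if r <= acc:
                return i, self.candidates[i]
        return len(self.candidates) - 1, self.candidates[-1]

    def update(self, idx, G, baseline):
        # raise the sampling probability of settings that beat the baseline
        self.scores[idx] += self.step * (G - baseline)

bandit = HyperparamBandit([(1e-3, 0.5), (1e-4, 0.5), (1e-3, 1.0)])
idx, (lr, kl_weight) = bandit.sample()
bandit.update(idx, G=10.0, baseline=8.0)
```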
3. Details in LPG Architecture
Ablation Study Result
3. Details in LPG Architecture
[Diagram: the distributed meta-training loop. Each lifetime (N timesteps; the figure labels 940 environments and 64 agents) is created by sampling ℰ ~ p(ℰ), θ ~ p(θ₀), α ~ p(α|ℰ). The agent interacts with its environment over $x_1 \to \cdots \to x_{20} \to \cdots \to x_{100} \to \cdots \to x_N$, updating the agent parameter θ every 20 time-steps; the LPG gradient is computed and saved every 5 agent updates; the LPG parameter η is then updated using the updates averaged across lifetimes, and p(α|ℰ) is updated as well.]
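A structural sketch of that loop with all components stubbed out; T = 20 and K = 5 follow the slides, everything else is illustrative.

```python
import random

T, K = 20, 5   # agent-update period and meta-update period from the slides

def rollout(theta):                       # stub: T time-steps of interaction
    return [random.random() for _ in range(T)]

def agent_update(theta, traj, alpha):     # stub: one Delta-theta step
    return theta + alpha * sum(traj) / len(traj)

def one_lifetime(saved_grads, alpha=0.01, n_agent_updates=25):
    theta, total_reward = 0.0, 0.0
    for k in range(n_agent_updates):
        traj = rollout(theta)             # T time-steps -> one agent update
        total_reward += sum(traj)
        theta = agent_update(theta, traj, alpha)
        if (k + 1) % K == 0:              # every K updates: compute & save
            saved_grads.append(total_reward)  # stand-in for the eta-gradient
    return total_reward

grads = []
for _ in range(4):                        # a few lifetimes (run sequentially here)
    one_lifetime(grads)
eta_step = sum(grads) / len(grads)        # update eta with the averaged gradients
print(eta_step)
```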
4 Experiments
4. Experiments
Setting
- Baselines
1. A2C
2. LPG-V (learns only π̂, with ŷ given by a value function trained with TD(λ))
- Training Environments
1. Tabular grid worlds
2. Random grid worlds
3. Delayed chain MDP
4. Experiments
Specialising in Training Environments
4. Experiments
What does the prediction (y) look like?
4. Experiments
Does the prediction (y) capture true values and beyond?
Does the prediction (y) converge?
4. Experiments
Ablation Study
4. Experiments
Generalising from Toy Environments to Atari Games
Selected results