1
Off-Policy Meta-Reinforcement Learning
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
Guided Meta-Policy Search
Presenter: Tatsuya Matsushima @__tmats__ , Matsuo Lab
• This session covers two recent arXiv papers on off-policy meta-reinforcement learning
  • Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables [Rakelly+ 2019] (2019/3/19)
  • Guided Meta-Policy Search [Mendonca+ 2019] (2019/4/1)
• Both papers address the poor sample efficiency of MAML-style on-policy meta-training by making
meta-training off-policy
2
3
Meta-learning
• Background reference: the ibisforest Wiki entry http://ibisforest.org/index.php?%E3%83%A1%E3%82%BF%E5%AD%A6%E7%BF%92
• See also the earlier DL輪読会 presentation "Meta-Learning Probabilistic Inference for Prediction"
  • https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-for-prediction-126167192
4
MAML
MAML (Model Agnostic Meta-Learning) [Finn+ 2017]
• Learns an initialization of the model parameters from which a new task can be adapted with only a few gradient steps
• The meta-objective involves second-order gradients (a gradient through a gradient step)
  • First-order approximations also exist [Nichol+2018]
• At meta-test time, the learned initialization is fine-tuned by gradient descent on the new task's data
5
\min_\theta \sum_{\mathcal{T}} \mathcal{L}\big(\theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}}_{\mathcal{T}}),\, \mathcal{D}^{\mathrm{val}}_{\mathcal{T}}\big) = \min_\theta \sum_{\mathcal{T}} \mathcal{L}(\phi_{\mathcal{T}}, \mathcal{D}^{\mathrm{val}}_{\mathcal{T}})

where θ is the meta-learned initialization and ϕ_𝒯 is the task-adapted parameter. At meta-test time the adaptation is

\phi_{\mathcal{T}_{\mathrm{test}}} = \theta - \alpha \nabla_\theta \mathcal{L}(\theta, \mathcal{D}^{\mathrm{tr}}_{\mathcal{T}_{\mathrm{test}}})
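As a concrete illustration of this objective, here is a minimal sketch on toy sine-wave regression. It is my own illustration, not code from any of the papers; the task sampler, network sizes, step sizes and iteration counts are arbitrary assumptions.

```python
# Toy MAML sketch: inner step adapts theta on D_tr, outer step backprops through that adaptation on D_val.
import math
import torch

def sample_task():
    """A task = a sine wave with random amplitude/phase; returns a function that samples (x, y) batches."""
    amp = torch.rand(1) * 4.0 + 0.1
    phase = torch.rand(1) * math.pi
    def batch(n=10):
        x = torch.rand(n, 1) * 10.0 - 5.0
        return x, amp * torch.sin(x + phase)
    return batch

def forward(params, x):
    """Tiny 2-layer MLP applied with explicit parameters, so adapted parameters phi can be plugged in."""
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def loss(params, x, y):
    return ((forward(params, x) - y) ** 2).mean()

# meta-parameters theta (the shared initialization)
theta = [(torch.randn(1, 40) * 0.5).requires_grad_(), torch.zeros(40, requires_grad=True),
         (torch.randn(40, 1) * 0.1).requires_grad_(), torch.zeros(1, requires_grad=True)]
meta_opt = torch.optim.Adam(theta, lr=1e-3)
alpha = 0.01  # inner-loop step size

for it in range(1000):
    meta_opt.zero_grad()
    meta_loss = 0.0
    for _ in range(4):                                   # sum over sampled tasks T
        task = sample_task()
        x_tr, y_tr = task(); x_val, y_val = task()
        # inner step: phi_T = theta - alpha * grad_theta L(theta, D_tr); create_graph keeps 2nd-order terms
        grads = torch.autograd.grad(loss(theta, x_tr, y_tr), theta, create_graph=True)
        phi = [p - alpha * g for p, g in zip(theta, grads)]
        # outer objective: L(phi_T, D_val), differentiated w.r.t. theta through the inner step
        meta_loss = meta_loss + loss(phi, x_val, y_val)
    meta_loss.backward()
    meta_opt.step()
```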
MAML for reinforcement learning
• The loss is simply the RL loss (the negative expected return), used in both the inner and outer steps
• MAML-style meta-RL has been extended further, e.g., to model-based RL [Nagabandi+ 2018] and to structured exploration [Gupta+ 2018]
• See also the earlier DL輪読会 presentation "Meta Reinforcement Learning"
  • https://www.slideshare.net/DeepLearningJP2016/dl-130067084
6
\mathcal{L}_{\mathrm{RL}}(\phi, \mathcal{D}_{\mathcal{T}_i}) = -\frac{1}{|\mathcal{D}_{\mathcal{T}_i}|} \sum_{s_t, a_t \in \mathcal{D}} r_i(s_t, a_t) = -\mathbb{E}_{s_t, a_t \sim \pi_\phi, q_{\mathcal{T}_i}}\left[\frac{1}{H} \sum_{t=1}^{H} r_i(s_t, a_t)\right]
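To make the two equivalent forms of this loss concrete, here is a tiny numeric check with toy reward tensors (my own illustration; the rewards are random stand-ins, not real rollouts).

```python
# The MAML loss for RL is the negative average reward of the transitions in D_Ti, which equals
# the negative expected per-step reward over trajectories of length H.
import torch

H, n_traj = 100, 4
rewards = torch.randn(n_traj, H)                            # stand-in for r_i(s_t, a_t) along each trajectory

loss_batch_form = -rewards.sum() / rewards.numel()          # -(1/|D_Ti|) * sum over (s_t, a_t) in D of r_i
loss_expectation_form = -(rewards.sum(dim=1) / H).mean()    # -E[ (1/H) * sum_t r_i(s_t, a_t) ]
assert torch.isclose(loss_batch_form, loss_expectation_form)
```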
(Background) On-policy vs. Off-policy
On-policy methods
• The data used for an update must come from the current policy being improved
• Data therefore has to be re-collected after every policy update, which is sample-inefficient
• e.g., exploring with an ε-greedy version of the current policy
Off-policy methods
• The data used for an update may come from a different (e.g., older) policy
• Previously collected data in a replay buffer can be reused, so sample efficiency is much better
※ MAML-style meta-RL collects fresh on-policy data for meta-training; the question in this talk is whether
meta-training can instead reuse data collected by other policies (= be made off-policy)
7
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
8
Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables
• https://arxiv.org/abs/1903.08254 (Submitted on 19 Mar 2019)
• Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, Sergey Levine
• UC Berkeley (BAIR)
  • Authors who are familiar names in deep RL at UC Berkeley
• Code is available
  • https://github.com/katerakelly/oyster
  • Built on rlkit, a PyTorch RL library from BAIR
9
TL;DR
• PEARL: a method that performs meta-reinforcement learning off-policy
• The task is captured by a probabilistic latent variable inferred from experience (the context)
• The context encoder is permutation invariant over transitions
• 20-100x better meta-training sample efficiency than prior meta-RL methods
10
Problems with prior methods (e.g., MAML)
• Both meta-training and adaptation are on-policy, so sample efficiency is poor
• Naively making MAML's meta-training off-policy conflicts with the on-policy adaptation performed at meta-test time
• It is also unclear how to adapt (and explore) efficiently on a new task from only a small amount of experience
11
12
• Combines off-policy RL (soft actor-critic, SAC [Haarnoja+ 2018]) with probabilistic inference of a
context variable that captures the task (PEARL)
• Meta-training and adaptation are disentangled
  • During meta-training, the policy is trained off-policy while an encoder learns to infer the context from collected transitions
  • At meta-test time, the context is inferred from a small amount of experience and the context-conditioned policy adapts to the new task
• The policy can thus be trained off-policy during meta-training, while the context distribution stays
consistent between meta-training and the (on-policy) meta-test phase
13
Problem setting
• Tasks are MDPs drawn from a task distribution p(𝒯)
• Tasks share the state and action spaces but may differ in initial-state distribution, dynamics, and reward function
• Context: the transitions collected so far in the current task
  • one transition: c^𝒯_n = (s_n, a_n, r_n, s'_n)
  • the accumulated context: c^𝒯 = c^𝒯_{1:N}
• The same task distribution p(𝒯) is used for meta-training and meta-testing
14
\mathcal{T} \sim p(\mathcal{T}), \quad \mathcal{T} = \{p(s_0),\, p(s_{t+1} \mid s_t, a_t),\, r(s_t, a_t)\}

c^{\mathcal{T}}_n = (s_n, a_n, r_n, s'_n), \quad c^{\mathcal{T}} = c^{\mathcal{T}}_{1:N}
The latent context variable z
• The policy conditions on an inferred latent variable z in order to adapt to the task
• z is inferred from the context by an inference network (encoder) q_ϕ(z|c)
• The encoder is trained with a variational objective: maximize task performance while keeping q_ϕ(z|c)
close to a prior p(z) (a KL term that acts as an information bottleneck)
• The prior p(z) is a unit Gaussian
• Sampling z from the posterior keeps behaviour consistent between meta-train and meta-test
15
z: latent context variable; q_ϕ(z|c): inference network (encoder) with parameters ϕ; p(z): unit Gaussian prior

\mathbb{E}_{\mathcal{T}}\Big[\,\mathbb{E}_{z \sim q_\phi(z \mid c^{\mathcal{T}})}\big[\,R(\mathcal{T}, z) + \beta\, D_{\mathrm{KL}}\big(q_\phi(z \mid c^{\mathcal{T}}) \,\|\, p(z)\big)\,\big]\Big]
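A minimal sketch of the KL-regularized part of this objective, under my own assumptions (a linear encoder, arbitrary dimensions and β); this is not the authors' rlkit code. The encoder outputs a diagonal Gaussian q_ϕ(z|c), z is drawn with the reparameterization trick, and β·KL(q_ϕ(z|c) ‖ p(z)) against a unit-Gaussian prior is added to the training loss.

```python
import torch
from torch import nn
from torch.distributions import Normal, kl_divergence

latent_dim, context_dim, beta = 5, 20, 0.1          # arbitrary sizes / KL weight
encoder = nn.Linear(context_dim, 2 * latent_dim)     # stand-in for the encoder, outputs [mu, log_std]

def sample_z_and_kl(context_features: torch.Tensor):
    """context_features: (batch, context_dim) summary of the context c."""
    mu, log_std = encoder(context_features).chunk(2, dim=-1)
    posterior = Normal(mu, log_std.exp())
    prior = Normal(torch.zeros_like(mu), torch.ones_like(mu))
    z = posterior.rsample()                           # reparameterized sample, differentiable w.r.t. encoder
    kl = kl_divergence(posterior, prior).sum(-1)      # per-sample KL(q(z|c) || p(z))
    return z, beta * kl.mean()

z, kl_term = sample_z_and_kl(torch.randn(8, context_dim))
# kl_term is added to the encoder's training loss alongside the performance term R(T, z).
```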
Design of the context encoder
• Since the MDP is fully specified by its transitions, the task can be inferred from the context regardless of the order of the transitions
• The encoder is therefore made permutation invariant over the context
• The inference network is a product of independent per-transition factors
• Each factor is Gaussian, so the posterior is also Gaussian and can be computed in closed form
16
c_n = \{s_n, a_n, s'_n, r_n\}

q_\phi(z \mid c_{1:N}) \propto \prod_{n=1}^{N} \Psi_\phi(z \mid c_n), \qquad \Psi_\phi(z \mid c_n) = \mathcal{N}\big(f^{\mu}_\phi(c_n),\, f^{\sigma}_\phi(c_n)\big)
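A sketch of this permutation-invariant encoder with diagonal Gaussians (my own simplified stand-in for f_ϕ, with arbitrary sizes): each transition produces one Gaussian factor, and the factors are combined with the closed-form product of Gaussians, whose result does not depend on the order of the transitions.

```python
import torch
from torch import nn

latent_dim, transition_dim = 5, 12                    # arbitrary sizes
factor_net = nn.Sequential(                           # stand-in for f_phi = (f_mu, f_sigma)
    nn.Linear(transition_dim, 64), nn.ReLU(), nn.Linear(64, 2 * latent_dim))

def infer_posterior(transitions: torch.Tensor):
    """transitions: (N, transition_dim) context c_{1:N}; returns mean/variance of q(z|c_{1:N})."""
    mu, log_var = factor_net(transitions).chunk(2, dim=-1)   # per-transition factors Psi_phi(z|c_n)
    var = log_var.exp()
    precision = 1.0 / var                                     # product of Gaussians: precisions add
    post_var = 1.0 / precision.sum(dim=0)
    post_mu = post_var * (precision * mu).sum(dim=0)
    return post_mu, post_var

# Shuffling the transitions leaves the posterior unchanged (permutation invariance):
c = torch.randn(10, transition_dim)
mu1, var1 = infer_posterior(c)
mu2, var2 = infer_posterior(c[torch.randperm(10)])
assert torch.allclose(mu1, mu2, atol=1e-5) and torch.allclose(var1, var2, atol=1e-5)
```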
Data sampling for off-policy meta-training
• The policy and value functions are trained off-policy, with batches drawn from the whole replay buffer ℬ
• i.e., the actor and critic use standard off-policy (SAC-style) updates
• The context fed to the encoder q_ϕ(z|c), however, is drawn by a separate sampler 𝒮_c restricted to recently collected data
• Keeping the context distribution close to on-policy avoids a mismatch with meta-test time, where the
context is collected on-policy (a sketch of the two samplers follows the symbol definitions below)
17
q_ϕ(z|c): context encoder;  ℬ: replay buffer;  𝒮_c: sampler used to draw context batches
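A schematic sketch of the two samplers (my own simplification, not the authors' implementation): RL batches for the actor and critic come from the full replay buffer ℬ, while context batches for the encoder come from a sampler 𝒮_c that only keeps recently collected transitions.

```python
import random
from collections import deque

replay_buffer = deque(maxlen=1_000_000)     # B: everything collected for this task
recent_buffer = deque(maxlen=5_000)         # S_c: only recent (close to on-policy) transitions

def add_transition(transition):
    replay_buffer.append(transition)
    recent_buffer.append(transition)        # old transitions fall out of S_c automatically

def sample_rl_batch(batch_size=256):
    return random.sample(replay_buffer, batch_size)      # trains the actor and critic

def sample_context_batch(batch_size=100):
    return random.sample(recent_buffer, batch_size)      # conditions / trains the encoder q_phi(z|c)

# populate with dummy transitions so the samplers can be called
for i in range(10_000):
    add_transition({"s": i, "a": 0, "r": 0.0, "s_next": i + 1})
rl_batch, context_batch = sample_rl_batch(), sample_context_batch()
```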
Off-policy training with SAC
• PEARL builds on Soft Actor-Critic (SAC) [Haarnoja+ 2018], with the actor and critic conditioned on the context variable z
• SAC is an off-policy actor-critic method for maximum-entropy RL (maxEntRL)
• z is sampled from q_ϕ(z|c) with the reparameterization trick, so gradients of the actor and critic losses can flow into the encoder
• critic loss and actor loss:
18
\mathcal{L}_{\mathrm{critic}} = \mathbb{E}_{(s, a, r, s') \sim \mathcal{B},\; z \sim q_\phi(z|c)}\Big[\big(Q_\theta(s, a, z) - (r + \bar{V}(s', \bar{z}))\big)^2\Big]

(the bars denote the target value network and a stop-gradient on z in the target)

\mathcal{L}_{\mathrm{actor}} = \mathbb{E}_{s \sim \mathcal{B},\, a \sim \pi_\theta}\left[ D_{\mathrm{KL}}\!\left( \pi_\theta(a \mid s, z) \,\Big\|\, \frac{\exp\big(Q_\theta(s, a, z)\big)}{\mathcal{Z}_\theta(s)} \right)\right]
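A simplified sketch of the context-conditioned critic loss above (my own stand-in networks and sizes, discount factor omitted as in the slide's formula; not the authors' rlkit code): Q takes (s, a, z), and the Bellman target uses a target value network without propagating gradients.

```python
import torch
from torch import nn

obs_dim, act_dim, latent_dim = 8, 2, 5       # arbitrary sizes
q_net = nn.Sequential(nn.Linear(obs_dim + act_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))
target_v_net = nn.Sequential(nn.Linear(obs_dim + latent_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def critic_loss(s, a, r, s_next, z):
    """s, a, r, s_next: an off-policy batch from the replay buffer B; z: sample from q_phi(z|c)."""
    q = q_net(torch.cat([s, a, z], dim=-1))
    with torch.no_grad():                                     # stop-gradient for the target (and z in it)
        target = r + target_v_net(torch.cat([s_next, z], dim=-1))
    return ((q - target) ** 2).mean()

batch = 16
loss = critic_loss(torch.randn(batch, obs_dim), torch.randn(batch, act_dim),
                   torch.randn(batch, 1), torch.randn(batch, obs_dim),
                   torch.randn(batch, latent_dim))
loss.backward()   # gradients reach q_net; in PEARL the critic loss also trains the encoder through z
```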
19
• Evaluated on six continuous-control meta-RL benchmarks built on MuJoCo
  • Half-Cheetah, Humanoid, Ant, Walker (two task variants each for Half-Cheetah and Ant)
  • Tasks vary in reward function or in dynamics
• PEARL adapts to held-out tasks and reaches the asymptotic performance of prior methods with
20-100x fewer meta-training samples
• In the learning curves: x-axis = number of meta-training samples, y-axis = average return
20
Exploration via posterior sampling
• Compared against an on-policy meta-RL method with latent-variable exploration (MAESN [Gupta+ 2018])
• on a sparse-reward navigation task
• At meta-test time, the agent explores by sampling z (initially from the prior) and refines the posterior
as context accumulates
• The inferred context gradually narrows onto the correct task
• PEARL adapts with far fewer samples than MAESN
21
Ablation Study: context encoder architecture
• On Half-Cheetah-Vel
• The permutation-invariant encoder is compared with RNN encoders
  • RNN-tran: the RNN is run over de-correlated transitions sampled from the buffer
  • RNN-traj: the RNN is run over whole trajectories in order
• The permutation-invariant encoder performs best
22
Ablation Study: how the context is sampled
• On Half-Cheetah-Vel
• Different ways of drawing the context batches during meta-training are compared
  • off-policy: context sampled from the entire (off-policy) replay buffer
  • off-policy RL-batch: context sampled from the same batch used to train the policy
• Sampling the context from recently collected data (PEARL) works best
23
Ablation Study: probabilistic vs. deterministic context
• On the sparse-reward navigation task
• With a deterministic context (no posterior uncertainty over z), the agent cannot explore by posterior
sampling and performance degrades
• Modeling uncertainty in the context matters under sparse rewards
24
25
• PEARL: an off-policy meta-RL method
• Disentangling task inference (the context encoder) from the context-conditioned policy is what makes
off-policy meta-training possible
• Greatly improves meta-training sample efficiency
26
Guided Meta-Policy Search
27
Guided Meta-Policy Search
• https://arxiv.org/abs/1904.00956 (Submitted on 1 Apr 2019)
• Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, Chelsea Finn
• UC Berkeley (BAIR)
• …
• Code is available
• https://github.com/RussellM2020/GMPS
• Website
• https://sites.google.com/berkeley.edu/guided-metapolicy-search
28
TL;DR
• GMPS: a method that makes meta-reinforcement learning off-policy
• During meta-training, per-task expert policies are first obtained with standard RL
• The meta-objective is then replaced with supervised imitation learning (behaviour cloning) of those experts
• Meta-training thus splits into two phases: task learning and meta-learning
29
Problems with prior methods (e.g., MAML)
• Meta-training and adaptation are on-policy, so sample efficiency is poor
• [Rakelly+ 2019] (PEARL) addresses this with context inference rather than gradient-based adaptation
• GMPS instead aims to keep meta-training sample-efficient while preserving MAML-style gradient adaptation at meta-test time
30
31
• The meta-objective during meta-training is replaced with supervised imitation (behaviour cloning)
• Meta-training has two phases
  • task learning: train a policy for each meta-training task with ordinary RL
    • these policies serve as the experts that the adapted meta-policy should match at meta-test time
  • meta-learning: train the meta-policy by supervised learning at the meta level, imitating the experts
32
Problem setting (same as [Rakelly+ 2019])
• Tasks 𝒯 are drawn from a distribution p(𝒯)
• Each task is an MDP that may differ in initial-state distribution, dynamics, and reward
• The same p(𝒯) is used for meta-training and meta-testing
33
\mathcal{T} \sim p(\mathcal{T}), \quad \mathcal{T} = \{p(s_0),\, p(s_{t+1} \mid s_t, a_t),\, r(s_t, a_t)\}
Phase 1: task learning
• For each meta-training task 𝒯_i, train an expert policy π*_i by minimizing an ordinary RL loss ℒ_RL(ϕ_i, 𝒟_i)
• Any RL algorithm (including off-policy ones) can be used for this phase
Phase 2: meta-learning
• Adaptation is performed exactly as in MAML: a policy-gradient step on data from the new task
• but MAML's outer RL objective is replaced by imitation of the experts (behaviour cloning)
34
𝒯_i: the i-th meta-training task; {π*_i}: the per-task expert policies; ϕ_i: parameters adapted to 𝒯_i;
ℒ_RL(ϕ_i, 𝒟_i): the RL loss used in task learning

\mathcal{L}_{\mathrm{BC}}(\phi_i, \mathcal{D}_i) \triangleq -\sum_{(s_t, a_t) \in \mathcal{D}} \log \pi_\phi(a_t \mid s_t)
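A sketch of ℒ_BC with a Gaussian policy (my own assumption; the policy head and sizes are arbitrary): the negative log-likelihood of the expert's actions under π_ϕ.

```python
import torch
from torch import nn
from torch.distributions import Normal

obs_dim, act_dim = 8, 2
policy = nn.Linear(obs_dim, 2 * act_dim)    # stand-in for pi_phi, outputs [mean, log_std]

def bc_loss(expert_states: torch.Tensor, expert_actions: torch.Tensor) -> torch.Tensor:
    mean, log_std = policy(expert_states).chunk(2, dim=-1)
    dist = Normal(mean, log_std.exp())
    # L_BC = - sum_t log pi_phi(a_t | s_t)   (averaged over the batch here)
    return -dist.log_prob(expert_actions).sum(-1).mean()

loss = bc_loss(torch.randn(32, obs_dim), torch.randn(32, act_dim))
loss.backward()
```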
Phase 2: meta-learning
• Demonstration data D*_i is collected from each expert π*_i during meta-training
• The meta-policy π_θ still collects a small amount of its own data 𝒟^tr_i for the inner adaptation step
• The meta-objective is behaviour cloning: after the inner RL step, the adapted policy should reproduce
the expert's actions on 𝒟^val_i ∼ D*_i
• Because the adapted policy is repeatedly rolled out and re-supervised with expert data, the compounding
error that plain behaviour cloning suffers from is mitigated
35
𝒯_i: meta-training task; π*_i: its expert policy; D*_i: demonstration data from π*_i; θ: meta-policy parameters;
ϕ_i = θ − α∇_θ ℒ_RL(θ, 𝒟^tr_i): the task-adapted parameters

\min_\theta \sum_{\mathcal{T}_i} \sum_{\mathcal{D}^{\mathrm{val}}_i \sim \mathcal{D}^*_i} \mathbb{E}_{\mathcal{D}^{\mathrm{tr}}_i \sim \pi_\theta}\Big[\mathcal{L}_{\mathrm{BC}}\big(\theta - \alpha \nabla_\theta \mathcal{L}_{\mathrm{RL}}(\theta, \mathcal{D}^{\mathrm{tr}}_i),\, \mathcal{D}^{\mathrm{val}}_i\big)\Big]
Benefits of the decomposition
• Meta-training separates cleanly into task learning and meta-learning
• The per-task expert policies can be obtained in any way (not necessarily by RL)
• Information available only at meta-training time can be exploited when training the experts
  • e.g., reward shaping
• At meta-test time, adaptation is the same gradient-based procedure as MAML
36
Expert policies via a contextual policy
• Rather than training a completely separate policy per task, the task-learning phase trains a single
contextual policy conditioned on a task descriptor ω (e.g., the goal or a task ID)
• ω is available only during meta-training, not at meta-test time
• The contextual policy is trained with soft actor-critic (SAC) [Haarnoja+ 2018]
37
π_θ(a_t | s_t, ω): contextual expert policy;  ω: task descriptor, available only at meta-training time
Optimizing the meta-objective
• Because the meta-objective is behaviour cloning, it is a supervised loss: many outer gradient steps can be
taken on the same expert data
• The inner adaptation step remains a policy-gradient step
• To reuse trajectories collected earlier, the inner step uses an importance-weighted policy gradient
(reweighting from the policy that collected the data to the current meta-policy)
• The outer step is a behaviour-cloning gradient step toward the expert data (formulas and a sketch follow)
38
θ: meta-policy parameters; ϕ_i: parameters adapted to task 𝒯_i; π_θ: the meta-policy; A_i: the advantage on
task 𝒯_i; π_{θ_init}: the policy that collected the stored trajectories

Inner (adaptation) step, importance-weighted so that stored trajectories can be reused:

\phi_i = \theta + \alpha\, \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\frac{\pi_\theta(\tau)}{\pi_{\theta_{\mathrm{init}}}(\tau)}\, \nabla_\theta \log \pi_\theta(\tau)\, A_i(\tau)\right]

Outer (meta) step, by behaviour cloning toward the expert data:

\theta \leftarrow \theta - \beta\, \nabla_\theta \mathcal{L}_{\mathrm{BC}}(\phi_i, \mathcal{D}^{\mathrm{val}}_i)
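A sketch of the importance-weighted inner step with a toy categorical policy (everything here, including the dummy logged trajectories and sizes, is my own illustration; GMPS itself uses continuous policies and real rollouts): per-trajectory weights π_θ(τ)/π_θinit(τ) reweight a REINFORCE-style gradient so that trajectories stored in a buffer can be reused.

```python
import torch

A, S = 4, 8                                         # toy categorical policy over A actions, S state features
theta = torch.randn(S, A, requires_grad=True)       # current meta-parameters
theta_init = theta.detach().clone()                 # snapshot of the policy that filled the buffer

def traj_log_prob(params, states, actions):
    """Sum of log pi(a_t|s_t) over each trajectory; the dynamics terms cancel in the ratio."""
    logits = states @ params
    return torch.distributions.Categorical(logits=logits).log_prob(actions).sum(dim=1)

# dummy logged trajectories: T trajectories of horizon H (stand-ins for real rollouts)
T, H = 16, 10
states = torch.randn(T, H, S)
actions = torch.randint(0, A, (T, H))
advantages = torch.randn(T)                         # stand-in for A_i(tau), one value per trajectory

lp_theta = traj_log_prob(theta, states, actions)          # log pi_theta(tau)
lp_init = traj_log_prob(theta_init, states, actions)      # log pi_theta_init(tau)
weights = (lp_theta - lp_init).exp().detach()             # importance weights pi_theta / pi_theta_init

# surrogate whose gradient is E[ w * grad_theta log pi_theta(tau) * A_i(tau) ]
surrogate = (weights * lp_theta * advantages).mean()
alpha = 0.1
grad = torch.autograd.grad(surrogate, theta)[0]
phi_i = theta + alpha * grad    # adapted parameters phi_i; the outer step then applies
                                # theta <- theta - beta * grad_theta L_BC(phi_i, D_val_i)
```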
39
Experimental domains
• Pushing (full state): robotic pushing from full state observations
• Pushing (vision): the same pushing tasks from raw image observations
• Door opening
• Legged locomotion (Ant)
Videos: https://sites.google.com/berkeley.edu/guided-metapolicy-search 40
Setup
• Compared against prior meta-RL methods in terms of meta-training sample efficiency
• The experts are trained as a single contextual policy conditioned on the task context (e.g., the goal), using SAC
• In the plots: x-axis = meta-training samples, y-axis = post-adaptation performance
41
Results
• On Door Opening and Ant, GMPS reaches the performance of on-policy meta-RL with far fewer
meta-training samples
• On the pushing tasks, it also meta-learns successfully, including from vision
42
43
• GMPS: an off-policy meta-RL method
• Meta-training is split into two phases, task learning and meta-learning, and the meta-objective becomes
supervised imitation (behaviour cloning)
• This substantially reduces the experience required for meta-training
44
45
Closing discussion
• Meta-learning approaches roughly form two lineages
  • adapt with one-step (gradient) updates from a learned initialization (common in BAIR's work)
    • e.g., MAML [Finn+ 2017]
  • adapt by conditioning on inferred latent variables (common in DeepMind's work)
    • e.g., Neural Processes [Garnelo+ 2018], GQN [Eslami+ 2018]
• PEARL is closer to the latent-variable lineage, while GMPS stays in the gradient-based one
• See also the earlier DL輪読会 presentation "Meta-Learning Probabilistic Inference for Prediction"
  • https://www.slideshare.net/DeepLearningJP2016/dlmetalearning-probabilistic-inference-for-prediction-126167192
• Each lineage has its own pros and cons
46
Appendix
47
References
[Eslami+ 2018] Eslami, S. M. Ali, Danilo Jimenez Rezende, Frédéric Besse, Fabio Viola, Ari S. Morcos, Marta Garnelo, Avraham Ruderman,
Andrei A. Rusu, Ivo Danihelka, Karol Gregor, David P. Reichert, Lars Buesing, Theophane Weber, Oriol Vinyals, Dan Rosenbaum, Neil C.
Rabinowitz, Helen King, Chloe Hillier, Matthew M Botvinick, Daan Wierstra, Koray Kavukcuoglu and Demis Hassabis. “Neural scene
representation and rendering.” Science 360 (2018): 1204-1210. http://science.sciencemag.org/content/360/6394/1204
[Finn+ 2017] Chelsea Finn, Pieter Abbeel and Sergey Levine. “Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks,”
Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1126-1135, 2017. http://proceedings.mlr.press/v70/finn17a.html
[Garnelo+ 2018] Marta Garnelo, Jonathan Schwarz, Dan Rosenbaum, Fabio Viola and Danilo J. Rezende, S.M. Ali Eslami and Yee Whye
Teh. “Neural Processes”. https://arxiv.org/abs/1807.01622.
[Gupta+ 2018] Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel and Sergey Levine. “Meta-Reinforcement Learning of
Structured Exploration Strategies”. In Advances in Neural Information Processing Systems, 2018. https://nips.cc/Conferences/2018/Schedule?showEvent=12658
[Haarnoja+ 2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel and Sergey Levine. “Soft Actor-Critic: Off-Policy Maximum Entropy Deep
Reinforcement Learning with a Stochastic Actor”. Proceedings of the 35th International Conference on Machine Learning, PMLR
80:1861-1870, 2018. http://proceedings.mlr.press/v80/haarnoja18b.html
[Mendonca+ 2019] Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine and Chelsea Finn. “Guided Meta-Policy Search”. https://arxiv.org/abs/1904.00956
[Nagabandi+ 2018] Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S. Fearing, Pieter Abbeel, Sergey Levine and Chelsea Finn.
“Learning to Adapt in Dynamic, Real-World Environments Through Meta-Reinforcement Learning”. https://arxiv.org/abs/1803.11347
[Nichol+2018] Alex Nichol, Joshua Achiam and John Schulman. “On First-Order Meta-Learning Algorithms”. https://arxiv.org/abs/1803.02999
[Rakelly+ 2019] Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn and Sergey Levine. “Efficient Off-Policy Meta-Reinforcement
Learning via Probabilistic Context Variables”. https://arxiv.org/abs/1903.08254
48
