Guided policy search

Guide to Guided Policy Search
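A guide to GPS covering both the theory and a code implementation.
I wrote this to organize my own research notes, so many parts are rough :)
박재현 (ka2hyeon@gmail.com)
Intelligent Control Systems Laboratory, Seoul National University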

Introduction to GPS
- In reinforcement learning algorithms such as DQN and DDPG,
searching for a global policy directly takes too long.
- In contrast, solving an optimal control problem to find a local policy is fast.
→ Find local policies, then use them to train the global policy by supervised learning.
(This is also called supervised reinforcement learning;
a more developed form is named trajectory-centric reinforcement learning.)
- Hence the name 'Guided Policy Search'.

Introduction to GPS
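- Suppose we want to solve a model-based optimal control problem.
- We do not know the system model.
- Intuitively, repeating [1. learn the model, 2. solve the optimal control problem, 3. train a neural network on the optimal control result] seems like it should work.
- However, if the learned model is poor, the optimal control solution is garbage too. Acting on that solution produces garbage trajectories, and training the model on garbage trajectories makes the model garbage as well.
- This vicious cycle repeats, so 1. convergence suffers and 2. the variance becomes too large (the trajectory changes abruptly between iterations, which makes real applications difficult).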

Introduction to GPS
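- Sergey Levine solved this problem by imposing a KL-divergence constraint while solving the optimal control problem.
- Thanks to the KL-divergence constraint, the change in trajectory at each iteration is bounded, and so is the difference between the models fitted at consecutive iterations, which improves stability.
- This is essential when the model is a local model of the linear-regression type.
- The reason is that if the trajectory changes a lot from one iteration to the next, the local model is no longer valid.
- Note that without a local model, you cannot use fast optimal control solvers that rely on model linearization, such as iLQR.
• This KL-divergence constraint may look like a minor detail, but once you implement it on a real system it turns out to be really important.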

Who created it?
Sergey Levine
• Received his PhD for inventing GPS
• Postdoc under Pieter Abbeel -> currently assistant professor (2016~)
• Google Brain
• Research spans deep learning + robotics
• One of the leading figures in AI robotics today

Related papers
Each paper uses a slightly different variation.
The method has been developed over several years.
• Chebotar, Yevgen, et al. "Path integral guided policy search." Robotics and Automation (ICRA),
2017 IEEE International Conference on. IEEE, 2017.
• Sergey Levine*, Chelsea Finn*, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep
Visuomotor Policies. JMLR 2016.
• William Montgomery, Sergey Levine. Guided Policy Search as Approximate Mirror Descent. NIPS
2016.
• Marvin Zhang, Zoe McCarthy, Chelsea Finn, Sergey Levine, Pieter Abbeel. Learning Deep Neural
Network Policies with Continuous Memory States. ICRA 2016.
• Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, Pieter Abbeel. Deep Spatial
Autoencoders for Visuomotor Learning. ICRA 2016.
• Sergey Levine, Nolan Wagener, Pieter Abbeel. Learning Contact-Rich Manipulation Skills with
Guided Policy Search. ICRA 2015.
• Sergey Levine, Pieter Abbeel. Learning Neural Network Policies with Guided Policy Search under
Unknown Dynamics. NIPS 2014.
• Levine, Sergey, and Vladlen Koltun. "Guided policy search." Proceedings of the 30th International
Conference on Machine Learning (ICML-13). 2013.

Basic idea
Starting from a given global policy:
• 1. Use optimal control to obtain a locally optimal trajectory.
* Here the optimal control cost is the original system cost + the gap to the global policy.
• 2. Train the global policy on the locally optimal trajectory.
Step * is the heart of GPS.
• Without step *, the learned global policy is unstable (it is an imperfect imitation).
• With step *, learning is stable, like herding a fleeing rabbit into a basket.
• With step *, it has been proven mathematically, 'to some extent', that the learned policy is a global policy.
• In other words, GPS is a form of constrained optimization.

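The objective of every GPS variant
• $$\min_{\theta, p_1, \ldots, p_N} \sum_{i=1}^{N} \sum_{t=1}^{T} E_{p_i(x_t, u_t)}\left[ l(x_t, u_t) \right] \quad \text{such that} \quad p_i(u_t \mid x_t) = \pi_\theta(u_t \mid x_t) \quad \forall x_t, u_t, t, i$$
• $p_i$ : i-th optimal control trajectory
• $l(x_t, u_t)$ : system cost
• $\pi_\theta(u_t \mid x_t)$ : global policy
• As explained earlier, the system cost is optimized while the change in the policy is held under a constraint.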

GPS Implementation
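• The implementation focuses on one of the most advanced variants,
Mirror Descent GPS (MDGPS) by W. Montgomery (2016). Compared with the original GPS, it is:
- cleaner
- easier to implement
- better at converging
- better performing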

MDGPS - Overview
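MDGPS simplifies the complicated procedure of the original GPS into a C-step and an S-step.
In particular, the trajectories produced by the C-step are learned directly in the S-step.
(The original GPS runs one more optimization between the C-step and the S-step.)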

MDGPS - Overview
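• 1. Collect several trajectory roll-outs with the current control policy.
* off-line: from the current optimal control policy
* on-line: from the current neural network policy
2. [Model learning] Fit a model to the roll-out trajectories.
3. [C-step] Solve the optimal control problem using the learned model.
4. [S-step] Use the controller obtained from optimal control to compute the optimal inputs along the roll-out trajectories from step 1, then train the policy on them.
Repeat steps 1-4.
* The implementation here differs slightly from the previous slide.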

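C-step implementation
The C-step solves the optimal control problem using the learned model.

C-step implementation
- Problem to solve (the C-step as stated in Montgomery & Levine, 2016):
$$\min_{p_i} E_{p_i}\left[ \sum_{t=1}^{T} l(x_t, u_t) \right] \quad \text{such that} \quad D_{KL}(p_i \,\|\, \pi_\theta) \le \epsilon$$
- Lagrangian form:
$$L(p_i, \eta) = E_{p_i}\left[ \sum_{t=1}^{T} l(x_t, u_t) \right] + \eta \left( D_{KL}(p_i \,\|\, \pi_\theta) - \epsilon \right)$$
- When optimizing this Lagrangian form, the usual approach is to derive
the optimality conditions with respect to the variable $p_i$ and the dual variable $\eta$.
- MDGPS, however, solves it differently.
[Equations (1) and (2) of the original slide were images and are not preserved.]
- It is a typical optimization problem.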

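C-step implementation
- GPS does not need to satisfy the constraint strictly, because the goal is simply to keep the KL divergence between trajectories small.
- One algorithm that solves a constrained optimization problem roughly but quickly is the 'mirror descent method'.
*A. Beck and M. Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167–175, May 2003
- The mirror descent method rewrites the problem and then repeats: fix 𝜂 -> solve the optimization -> adjust 𝜂 -> solve the optimization.
- 𝜂 multiplies the KL term in the Lagrangian. If the optimization returns a solution whose constraint value (KL) is too large, increase 𝜂. The constraint then carries more weight, so the next optimization returns a solution with a smaller constraint value. In the opposite case, decrease 𝜂. Repeating this eventually finds a solution with the desired constraint value.

C-step implementation
- Fix 𝜂 -> solve the optimization -> adjust 𝜂 -> solve the optimization (repeat).
The rest of this part explains how to solve the 'optimization' step with iLQR.
- iLQR is a fast, linearization-based algorithm. Searching for iLQR turns up plenty of detailed material.

C-step implementation
- Solving the optimization with iLQR
- Compared with Differential Dynamic Programming (DDP),
iLQR drops the second-derivative terms of the dynamics in the Q-function expansion -> it is faster.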

C-step implementation - iLQR
1. Start from a given initial trajectory
2. Backward pass: update the controller
3. Forward pass: update the trajectory
4. Update 𝜂
5. Repeat steps 2-4
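The backward and forward passes (steps 2 and 3) can be sketched as below. This is a minimal sketch, not the authors' implementation: it assumes time-varying linear dynamics x' = F_t [x; u] + f_t and a quadratic cost 0.5 z' C_t z + c_t' z with z = [x; u]. The function names `lqr_backward` and `lqr_forward` and the array layout are illustrative assumptions, and the 𝜂 update (step 4) is left out.

```python
import numpy as np

def lqr_backward(F, f, C, c, dx):
    """Backward pass: returns gains K[t], offsets k[t] and Quu[t].

    F[t]: (dx, dx+du), f[t]: (dx,)        dynamics x' = F[t] @ [x; u] + f[t]
    C[t]: (dx+du, dx+du), c[t]: (dx+du,)  cost 0.5 z'C z + c'z with z = [x; u]
    """
    T = len(F)
    du = F[0].shape[1] - dx
    V = np.zeros((dx, dx))  # value-function Hessian at t+1
    v = np.zeros(dx)        # value-function gradient at t+1
    K = np.zeros((T, du, dx))
    k = np.zeros((T, du))
    Quu_all = np.zeros((T, du, du))
    for t in reversed(range(T)):
        # Q-function expansion using only first-order dynamics terms.
        Q = C[t] + F[t].T @ V @ F[t]
        q = c[t] + F[t].T @ (V @ f[t] + v)
        Qxx, Qxu = Q[:dx, :dx], Q[:dx, dx:]
        Qux, Quu = Q[dx:, :dx], Q[dx:, dx:]
        K[t] = -np.linalg.solve(Quu, Qux)
        k[t] = -np.linalg.solve(Quu, q[dx:])
        Quu_all[t] = Quu
        # Value function at time t under u = K x + k.
        V = Qxx + Qxu @ K[t]
        V = 0.5 * (V + V.T)
        v = q[:dx] + Qxu @ k[t]
    return K, k, Quu_all

def lqr_forward(x0, F, f, K, k):
    """Forward pass: roll the controller through the linear model."""
    x = np.array(x0, dtype=float)
    xs, us = [], []
    for t in range(len(F)):
        u = K[t] @ x + k[t]
        xs.append(x)
        us.append(u)
        x = F[t] @ np.concatenate([x, u]) + f[t]
    return np.array(xs), np.array(us)

# Toy problem (hypothetical): scalar system x' = x + u, cost 0.5 (x^2 + u^2).
T = 50
F = np.tile(np.array([[1.0, 1.0]]), (T, 1, 1))
f = np.zeros((T, 1))
C = np.tile(np.eye(2), (T, 1, 1))
c = np.zeros((T, 2))
K, k, Quu = lqr_backward(F, f, C, c, dx=1)
xs, us = lqr_forward([1.0], F, f, K, k)
```

`Quu` is returned because its inverse serves as the covariance of the local Gaussian controller in the KL computation.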

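C-step implementation - KL divergence
- The above is what the paper says, but that expression is hard to implement directly.
- To put a KL-divergence term into the cost, we need a term that computes the KL divergence from the trajectories.

C-step implementation - KL divergence
Gaussian trajectory assumption
$p(u_t \mid x_t) \sim N(p(u_t), Q_{uu}^{-1}(t))$ - locally optimal trajectory
$q(u_t \mid x_t) \sim N(\pi_\theta(x_t), V(x_t))$ - global policy trajectory
$\Delta u_t = p(u_t) - \pi_\theta(x_t)$
$$KL_t(p \,\|\, q) = \frac{1}{2}\left[ \mathrm{trace}\left(V^{-1} Q_{uu}^{-1}\right) + \Delta u_t^T V^{-1} \Delta u_t - \dim(u) + \log\frac{|V|}{|Q_{uu}^{-1}|} \right]$$
$$KL = \sum_t KL_t$$
With the formula above, implementing the KL divergence between two Gaussian trajectories is straightforward.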
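As an illustration of the Gaussian KL formula, here is a minimal NumPy sketch. The function names `gaussian_kl` and `trajectory_kl` are illustrative assumptions; `cov_p` plays the role of $Q_{uu}^{-1}$ and `cov_q` the role of $V$.

```python
import numpy as np

def gaussian_kl(mu_p, cov_p, mu_q, cov_q):
    """KL(p || q) between two multivariate Gaussians."""
    dim = mu_p.shape[0]
    diff = mu_p - mu_q
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))  # trace(V^-1 Quu^-1)
    maha_term = diff @ np.linalg.solve(cov_q, diff)       # du' V^-1 du
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term + maha_term - dim + logdet_q - logdet_p)

def trajectory_kl(mu_p, cov_p, mu_q, cov_q):
    """Sum of the per-time-step KL terms; all inputs are stacked over time."""
    return sum(
        gaussian_kl(mu_p[t], cov_p[t], mu_q[t], cov_q[t])
        for t in range(len(mu_p))
    )

# Hypothetical toy trajectories: T time steps, 2-dimensional action.
T, du = 5, 2
mu_p = np.zeros((T, du))
mu_q = np.ones((T, du))
cov_p = np.tile(np.eye(du), (T, 1, 1))
cov_q = np.tile(2.0 * np.eye(du), (T, 1, 1))
kl_total = trajectory_kl(mu_p, cov_p, mu_q, cov_q)
```

Using `solve` and `slogdet` instead of explicit inverses and determinants keeps the computation stable when the covariances are poorly conditioned.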

C-step implementation - adjusting 𝜂
Now let us see how to adjust 𝜂.
A bracket search is an efficient way to do it.
(The code the authors published on GitHub uses this method.)
Levine (2014) states that the search stops once KL is within 10% of 𝜖.
• 3. KL = KL(x_traj, u_traj)
• 4. if KL < 0.9𝜖 :
• 5. 𝜂_max ← 𝜂
• 6. 𝜂 ← MAX(0.1 ∗ 𝜂_max, sqrt(𝜂_max ∗ 𝜂_min))
• 7. elseif 0.9𝜖 ≤ KL ≤ 1.1𝜖 :
• 8. End
• 9. elseif 1.1𝜖 < KL :
• 10. 𝜂_min ← 𝜂
• 11. 𝜂 ← MIN(10 ∗ 𝜂_min, sqrt(𝜂_max ∗ 𝜂_min))
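The bracket search can be sketched as follows. This is a hedged sketch under assumptions: the geometric-mean step and the factor-of-10 limits follow the pattern of the publicly available GPS code as best recalled, and `toy_kl` is a hypothetical stand-in for "run iLQR with this 𝜂 and measure the KL". The names `adjust_eta` and `solve_eta` are illustrative.

```python
import math

def adjust_eta(eta, kl, eps, eta_min, eta_max):
    """One bracket-search update. Returns (eta, eta_min, eta_max, converged)."""
    if abs(kl - eps) <= 0.1 * eps:
        return eta, eta_min, eta_max, True
    if kl < eps:
        # Constraint is slack: eta was too large, so shrink the upper bracket.
        eta_max = eta
        eta = max(0.1 * eta_max, math.sqrt(eta_min * eta_max))
    else:
        # Constraint is violated: eta was too small, so raise the lower bracket.
        eta_min = eta
        eta = min(10.0 * eta_min, math.sqrt(eta_min * eta_max))
    return eta, eta_min, eta_max, False

def solve_eta(kl_of_eta, eps, eta=1.0, eta_min=1e-8, eta_max=1e16, max_iter=50):
    """Repeat: evaluate KL at eta, then adjust eta until KL is within 10% of eps."""
    kl = kl_of_eta(eta)
    for _ in range(max_iter):
        eta, eta_min, eta_max, done = adjust_eta(eta, kl, eps, eta_min, eta_max)
        if done:
            break
        kl = kl_of_eta(eta)
    return eta, kl

def toy_kl(eta):
    """Hypothetical monotone relation: a larger eta gives a smaller KL."""
    return 1.0 / eta

eta_star, kl_star = solve_eta(toy_kl, eps=0.3)
```

In a real C-step each call to `kl_of_eta` is one full iLQR solve, so keeping the number of bracket iterations small matters.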

S-step implementation
Train a neural network (or an equivalent parameterized global model) on the optimal control result.

S-step implementation (mean)
Train the neural network by supervised learning so that it minimizes the S-step loss.
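The loss formula on the original slide was not preserved, so the following is only a sketch under an assumption: a precision-weighted squared error between the policy mean and the local controller's action at the roll-out states, which is what a KL divergence between Gaussians with fixed covariances reduces to. The names `s_step_targets` and `weighted_policy_loss` are illustrative, and `Quu` is assumed to be the precision (inverse covariance) of the local controller.

```python
import numpy as np

def s_step_targets(xs, K, k):
    """Target actions u*_t = K_t x_t + k_t at the visited roll-out states.

    xs: (T, dx), K: (T, du, dx), k: (T, du)
    """
    return np.einsum('tij,tj->ti', K, xs) + k

def weighted_policy_loss(policy_mean, targets, Quu):
    """Mean over time of 0.5 * (pi(x_t) - u*_t)' Quu_t (pi(x_t) - u*_t)."""
    diff = policy_mean - targets
    per_step = 0.5 * np.einsum('ti,tij,tj->t', diff, Quu, diff)
    return per_step.mean()

# Hypothetical toy data: T = 3 steps, 2-D state, 2-D action.
xs = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
K = np.tile(-np.eye(2), (3, 1, 1))
k = np.zeros((3, 2))
Quu = np.tile(np.diag([2.0, 1.0]), (3, 1, 1))
targets = s_step_targets(xs, K, k)
loss_at_targets = weighted_policy_loss(targets, targets, Quu)
```

In practice `policy_mean` would be the neural network output at the same states (or observations), and this loss would be minimized with a standard gradient-based optimizer.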

S-step implementation (variance)
The variance can also be trained as a separate parameterized output, but it is not a very important term,
so it can simply be set to a constant - the approach used in Levine & Finn (2016), "End-to-End ..."

Adjusting 𝜖
The early paper (S. Levine (2013), "Guided policy search") chose epsilon heuristically, but
S. Levine (2015), "Learning contact-rich ...", determines it in a more principled way from the following quantities:
Levine, Sergey, Nolan Wagener, and Pieter Abbeel. "Learning contact-rich manipulation skills with guided
policy search." Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015.
$l_{k-1}^{k-1}$ : the cost under the previous dynamics and previous controller
$l_{k-1}^{k}$ : the cost under the previous dynamics and current controller
$l_{k}^{k}$ : the cost under the current dynamics and controller

Model learning implementation
The model is learned from the rolled-out trajectories.

[Figure: data layout. One iteration i (one round of policy update) contains K samples (usually about 5-10 are collected); each sample is one complete trajectory of T time steps.]

Linear model vs non-linear model
• Linear model learning
- linear regression
- Gaussian conditioning
Only a linear model can be learned, so a separate model must be built and trained for every time step.
• Non-linear model learning
- Gaussian Mixture Model
- Neural network
- ...
The data from every time step of every sample in every iteration must be pooled and used to train a single global model.

[Linear model learning] Linear regression model
Linear dynamic model to learn:
$$x' = f_x x + f_u u + f_c$$
Linear regression model:
$$x' = \begin{bmatrix} f_x & f_u & f_c \end{bmatrix} \begin{bmatrix} x \\ u \\ 1 \end{bmatrix}$$

Linear model - Linear regression model
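• Estimating the linear model at time step t of the i-th iteration:
1. Collect $x'$ and $[x, u]$ at time step t from the K samples.
2. Use linear regression to obtain the least-squares solution for $[f_x \; f_u \; f_c]$.
Advantage: this model preserves linearity best (iLQR solves well with it).
Disadvantage: if $[x \; u \; 1]$ has dimension n, at least n trajectories are needed
to solve for the least-squares solution.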
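The two steps above can be sketched with NumPy as follows. This is a minimal sketch on synthetic, noise-free data; the function name `fit_linear_dynamics` and the toy dimensions are illustrative assumptions.

```python
import numpy as np

def fit_linear_dynamics(x, u, x_next):
    """Least-squares fit of x' = f_x x + f_u u + f_c at one time step.

    x: (K, dx), u: (K, du), x_next: (K, dx), with K the number of samples.
    """
    K, dx = x.shape
    du = u.shape[1]
    # Regressor [x u 1] for every sample.
    A = np.hstack([x, u, np.ones((K, 1))])
    W, *_ = np.linalg.lstsq(A, x_next, rcond=None)  # shape (dx+du+1, dx)
    W = W.T
    return W[:, :dx], W[:, dx:dx + du], W[:, -1]

# Synthetic data from a known linear model (hypothetical values).
rng = np.random.default_rng(0)
K, dx, du = 20, 3, 2
true_fx = rng.normal(size=(dx, dx))
true_fu = rng.normal(size=(dx, du))
true_fc = rng.normal(size=dx)
x = rng.normal(size=(K, dx))
u = rng.normal(size=(K, du))
x_next = x @ true_fx.T + u @ true_fu.T + true_fc
f_x, f_u, f_c = fit_linear_dynamics(x, u, x_next)
```

With fewer than dx + du + 1 samples the problem is under-determined; `lstsq` then returns the minimum-norm solution rather than failing, which can hide the data shortage described above.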

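Linear model - Conditional Gaussian
• $$z = \begin{bmatrix} xu \\ x' \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_{xu} \\ \mu_{x'} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{xu,xu} & \Sigma_{xu,x'} \\ \Sigma_{x',xu} & \Sigma_{x',x'} \end{bmatrix}$$
• Assuming $z \sim N(\mu, \Sigma)$, i.e. z is Gaussian distributed, then
$$p(x' \mid xu) = N(\bar\mu, \bar\Sigma)$$
where $\bar\mu = \mu_{x'} + \Sigma_{x',xu}\Sigma_{xu,xu}^{-1}(xu - \mu_{xu})$
$\bar\Sigma = \Sigma_{x',x'} - \Sigma_{x',xu}\Sigma_{xu,xu}^{-1}\Sigma_{xu,x'}$
That is, the Gaussian distribution of x' can be predicted from the maximum-likelihood estimates of $\mu$ and $\Sigma$ (for a single Gaussian these are the sample mean and covariance, so no iterative EM is required).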

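Linear model – Conditional Gaussian
• Estimating the linear model at time step t of the i-th iteration:
1. Collect $x'$ and $[x, u]$ at time step t from the K samples.
2. Set $z = [x; u; x']$ and compute the mean and covariance of z.
$$\mu = \begin{bmatrix} \mu_{xu} \\ \mu_{x'} \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{xu,xu} & \Sigma_{xu,x'} \\ \Sigma_{x',xu} & \Sigma_{x',x'} \end{bmatrix}$$
$f_{xu} = \Sigma_{x',xu}\Sigma_{xu,xu}^{-1}$ → split into $f_x$ and $f_u$.
$f_c = \mu_{x'} - f_{xu}\mu_{xu}$
$F = \Sigma_{x',x'} - f_{xu}\Sigma_{xu,xu} f_{xu}^T$
→ $x' = f_x x + f_u u + f_c$ with covariance F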
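The conditioning recipe can be sketched with NumPy as below. The function name `fit_dynamics_by_conditioning` and the synthetic data are illustrative assumptions; the covariance uses the 1/K (maximum-likelihood) normalization.

```python
import numpy as np

def fit_dynamics_by_conditioning(x, u, x_next):
    """Fit x' = f_x x + f_u u + f_c with covariance F by Gaussian conditioning."""
    dx = x.shape[1]
    xu = np.hstack([x, u])
    d = xu.shape[1]
    z = np.hstack([xu, x_next])
    mu = z.mean(axis=0)
    sigma = np.cov(z, rowvar=False, bias=True)
    s_xu = sigma[:d, :d]      # Sigma_{xu,xu}
    s_xp_xu = sigma[d:, :d]   # Sigma_{x',xu}
    s_xp = sigma[d:, d:]      # Sigma_{x',x'}
    # f_xu = Sigma_{x',xu} Sigma_{xu,xu}^{-1}, computed without an explicit inverse.
    f_xu = np.linalg.solve(s_xu, s_xp_xu.T).T
    f_c = mu[d:] - f_xu @ mu[:d]
    F = s_xp - f_xu @ s_xu @ f_xu.T
    return f_xu[:, :dx], f_xu[:, dx:], f_c, F

# Synthetic noisy data from a hypothetical linear system.
rng = np.random.default_rng(0)
K, dx, du = 200, 3, 2
true_fx = rng.normal(size=(dx, dx))
true_fu = rng.normal(size=(dx, du))
true_fc = rng.normal(size=dx)
x = rng.normal(size=(K, dx))
u = rng.normal(size=(K, du))
x_next = x @ true_fx.T + u @ true_fu.T + true_fc + 0.01 * rng.normal(size=(K, dx))
f_x, f_u, f_c, F = fit_dynamics_by_conditioning(x, u, x_next)
```

`F` estimates the covariance of the residual noise, which a plain least-squares fit does not return directly.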

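Linear model – Conditional Gaussian
• The result computed by Gaussian conditioning and
the result computed as the least-squares solution of linear regression
are ultimately the same. The two are the same algorithm!
• If there are too few data points at time step t, linear regression breaks when computing the least-squares solution, and
Gaussian conditioning breaks when computing $\Sigma_{xu,xu}^{-1}$ (the same problem).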
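A short derivation makes the equivalence explicit. It assumes the least-squares fit includes the intercept $f_c$ and that $\mu$, $\Sigma$ are the sample mean and covariance of the same K data points; $\tilde{x}'_k$ and $\widetilde{xu}_k$ denote the mean-centered samples.

$$\min_{f_{xu}, f_c} \sum_{k=1}^{K} \left\| x'_k - f_{xu}\, xu_k - f_c \right\|^2$$

Setting the gradient with respect to $f_c$ to zero gives

$$f_c = \mu_{x'} - f_{xu}\, \mu_{xu}$$

Substituting this back and setting the gradient with respect to $f_{xu}$ to zero gives

$$\sum_{k} \left( \tilde{x}'_k - f_{xu}\, \widetilde{xu}_k \right) \widetilde{xu}_k^T = 0 \;\;\Rightarrow\;\; f_{xu} = \left( \sum_k \tilde{x}'_k \widetilde{xu}_k^T \right) \left( \sum_k \widetilde{xu}_k \widetilde{xu}_k^T \right)^{-1} = \Sigma_{x',xu}\, \Sigma_{xu,xu}^{-1}$$

The normalization constant of the covariance cancels, so both methods return the same $f_{xu}$ and $f_c$, and both need $\sum_k \widetilde{xu}_k \widetilde{xu}_k^T$ to be invertible.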

Global Model – Gaussian Mixture Model
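• An extension of Gaussian conditioning
• Instead of assuming $z \sim N(\mu, \Sigma)$, assume $z \sim \sum_i w_i N(\mu_i, \Sigma_i)$,
which also handles non-linear models.
• Feeding in training data learns $w_i, \mu_i, \Sigma_i$.
• Because it is a global model, the training data can simply be pooled
regardless of time step and regardless of iteration.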

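• Training on trajectory data gives a fit like the one in the figure (3 clusters, labelled 1, 2, 3).
• Three sets of $\{w_i, \mu_i, \Sigma_i\}$ come out. Considering only the Gaussian of a single cluster, the model is linear.
• So the points assigned to cluster 1 should have nearly the same linearized model. (In practice clusters 2 and 3 also contribute, with small weights.)
• Roughly speaking, this GMM linearizes the trajectory in about three different ways.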

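• GMM conditioning can be done as described here:
http://www.cs.columbia.edu/~jebara/htmlpapers/ARL/node51.html
- Unlike Gaussian conditioning, where the same linear model is obtained regardless of $xu$,
given xu we compute $P(xu \mid \mu_{xu}, \Sigma_{xu})$ for each cluster, and this acts as a weight that changes the distribution.
- Once xu is given and $P(xu \mid \mu_{xu}, \Sigma_{xu})$ is computed, y (= $x'$) follows a Gaussian Mixture Model,
so its mean and covariance can be determined.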
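GMM conditioning as described above can be sketched as follows. This is a minimal sketch assuming the mixture parameters are already fitted; the function name `gmm_condition`, the moment-matching of the conditional mixture to a single mean and covariance, and the toy two-cluster mixture are illustrative assumptions.

```python
import numpy as np

def gmm_condition(w, mu, sigma, xu):
    """Condition a joint GMM over z = [xu; x'] on a given xu.

    w: (M,), mu: (M, D), sigma: (M, D, D), xu: (d,)
    Returns the cluster weights r (M,) and the moment-matched mean and
    covariance of p(x' | xu).
    """
    d = xu.shape[0]
    M = w.shape[0]
    log_r = np.zeros(M)
    cond_mu, cond_sigma = [], []
    for i in range(M):
        s_aa = sigma[i, :d, :d]
        s_ba = sigma[i, d:, :d]
        s_bb = sigma[i, d:, d:]
        diff = xu - mu[i, :d]
        # log of w_i * N(xu | mu_i_xu, Sigma_i_xu)
        _, logdet = np.linalg.slogdet(s_aa)
        maha = diff @ np.linalg.solve(s_aa, diff)
        log_r[i] = np.log(w[i]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
        # Per-cluster Gaussian conditioning.
        gain = np.linalg.solve(s_aa, s_ba.T).T
        cond_mu.append(mu[i, d:] + gain @ diff)
        cond_sigma.append(s_bb - gain @ s_ba.T)
    r = np.exp(log_r - log_r.max())
    r /= r.sum()
    cond_mu = np.array(cond_mu)
    cond_sigma = np.array(cond_sigma)
    mean = r @ cond_mu
    dev = cond_mu - mean
    cov = np.einsum('i,ijk->jk', r, cond_sigma) + np.einsum('i,ij,ik->jk', r, dev, dev)
    return r, mean, cov

# Hypothetical two-cluster mixture over z = [xu; x'] with 1-D xu and 1-D x'.
w = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [10.0, 5.0]])
sigma = np.array([[[1.0, 0.5], [0.5, 1.0]],
                  [[1.0, -0.5], [-0.5, 1.0]]])
r, mean, cov = gmm_condition(w, mu, sigma, np.array([0.2]))
```

For a query close to the first cluster, `r` puts almost all weight on that cluster, so the result is effectively that cluster's linear model.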

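• Advantage of GMM
A GMM is conceptually a way of stitching linearized models together.
So it pairs extremely well with iLQR, which needs a linearized model.
• Disadvantage of GMM
Accuracy is lower: the model error is larger than with a neural network.
According to S. Levine (2015), "Learning contact-rich ...", a GMM alone struggles with strongly non-linear dynamics such as contact.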

GMM prior + Linear regression
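• GMM advantages – global model / linearized model
• GMM disadvantage – lower accuracy at a single point (time step t)
• Linear regression advantages – linearized model / high accuracy at a single point (time step t)
• Linear regression disadvantages – local model / poor data efficiency because only the data
at a single point is used
• Combining the two lets each cover the other's weaknesses!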

GMM prior + Linear regression
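Given the current K samples of $\{x_t, u_t, x_t'\}$:
1. Use $[x_t; u_t]$ with the GMM to compute $\bar\mu$, $\bar\Sigma$ of $x'$ (by GMM conditioning)
2. Compute $\mu$, $\Sigma$ of the K samples $\{[x_t; u_t; x_t']\}$
3. Build the normal-inverse-Wishart prior $(\Phi, \mu_0, n_0, m_0)$
4. Compute the local model
- The inverse-Wishart is nothing special: it is the conjugate prior for the mean and covariance
of a Gaussian, so the posterior stays in the same family
(see Wikipedia)
- $n_0$, $m_0$ are terms related to the number of data points, but
are usually set to 1 (Fu et al., 2016)
- $\mu_0 = \bar\mu$, $\Phi = n_0 \bar\Sigma$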
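Step 4 can be sketched as below. This is a hedged sketch, not a verified transcription: the posterior covariance formula is the normal-inverse-Wishart form as best recalled from Levine & Abbeel (2014) and the public GPS code, and it should be checked against the paper. The sketch also assumes the prior is defined over the joint vector z = [x; u; x'] and keeps the sample mean as the posterior mean. The function name `fit_with_niw_prior` and the identity prior in the demo are illustrative.

```python
import numpy as np

def fit_with_niw_prior(z, d_in, mu0, Phi, m, n0):
    """Linear-Gaussian fit of x' given [x; u] with an NIW prior on z.

    z: (N, D) samples of [x; u; x'] at one time step, d_in = dim([x; u]).
    mu0, Phi, m, n0: prior parameters (assumed to be defined over the joint z).
    """
    N = z.shape[0]
    mu_hat = z.mean(axis=0)
    diff = z - mu_hat
    sigma_hat = diff.T @ diff / N
    dmu = mu_hat - mu0
    # Assumed NIW posterior covariance; verify against the paper.
    sigma = (Phi + N * sigma_hat + (N * m) / (N + m) * np.outer(dmu, dmu)) / (N + n0)
    sigma = 0.5 * (sigma + sigma.T)
    # Condition the joint Gaussian on [x; u] to get the dynamics.
    s_aa = sigma[:d_in, :d_in]
    s_ba = sigma[d_in:, :d_in]
    f_xu = np.linalg.solve(s_aa, s_ba.T).T
    f_c = mu_hat[d_in:] - f_xu @ mu_hat[:d_in]
    F = sigma[d_in:, d_in:] - f_xu @ s_aa @ f_xu.T
    return f_xu, f_c, F

# Only 3 samples for a 5-dimensional [x; u]: plain regression would be
# under-determined, but the prior keeps the covariance invertible.
rng = np.random.default_rng(0)
d_in, d_out, N = 5, 3, 3
z = rng.normal(size=(N, d_in + d_out))
n0 = m = 1.0
mu0 = np.zeros(d_in + d_out)      # hypothetical prior mean
Phi = n0 * np.eye(d_in + d_out)   # hypothetical prior covariance term
f_xu, f_c, F = fit_with_niw_prior(z, d_in, mu0, Phi, m, n0)
```

With `Phi = 0`, `m = 0` and `n0 = 0` the function reduces to plain Gaussian conditioning, which is a convenient sanity check.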

NN + linear regression model learning
• The prior can also come from a neural network instead of a GMM
Fu, Justin, Sergey Levine, and Pieter Abbeel. "One-shot learning of manipulation skills with online
dynamics adaptation and neural network priors." Intelligent Robots and Systems (IROS), 2016 IEEE/RSJ
International Conference on. IEEE, 2016.
After training the neural network, compute its Jacobian to obtain the prior
-> but linearity is much worse than with a GMM
-> using a Contexture neural network appears to improve this
(see the paper)

Model learning summary
• Most papers applying Guided Policy Search
use GMM prior + linear regression... the two seem to pair well (papers below)
- Sergey Levine*, Chelsea Finn*, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. JMLR 2016.
- William Montgomery, Sergey Levine. Guided Policy Search as Approximate Mirror Descent. NIPS 2016.
- Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, Pieter Abbeel. Deep Spatial Autoencoders for Visuomotor
Learning. ICRA 2016.
- Levine, Sergey, Nolan Wagener, and Pieter Abbeel. "Learning contact-rich manipulation skills with guided policy
search." Robotics and Automation (ICRA), 2015 IEEE International Conference on. IEEE, 2015.
- Sergey Levine, Pieter Abbeel. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics. NIPS
2014.
• When the model is learned with a neural network or GPR prior, iLQR often fails to solve well (from experience);
linearity seems to be worse

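Extensibility of guided policy search

Images as policy input!
• In GPS, the state used when solving the optimal control problem and
the observation used in policy optimization do not have to be the same.
• So what if the optimal control state is made of the robot's physical variables, while
policy optimization trains on the images collected during the roll-outs?
Then images can be used as the policy input.
• Deep reinforcement learning with image input needs millions of roll-outs, whereas this
algorithm solves the optimal control problem on physical variables and can learn in a few dozen roll-outs.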

Images as policy input!
• Sergey Levine*, Chelsea Finn*, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep
Visuomotor Policies. JMLR 2016.
• The paper above does exactly this.

Neural network parameters as the output
• Unlike [model learning -> replaying the optimal control result],
the output is the parameters of a neural network trained on the
optimal control result.
• Therefore further optimization at the parameter level is possible, and the algorithm supports end-to-end learning!
Sergey Levine*, Chelsea Finn*, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep Visuomotor
Policies. JMLR 2016.
Singh, Avi, Larry Yang, and Sergey Levine. "GPLAC: Generalizing Vision-Based Robotic Skills using Weakly
Labeled Images." arXiv preprint arXiv:1708.02313 (2017).

Neural network parameters as the output
• (example)
Singh, Avi, Larry Yang, and Sergey Levine. "GPLAC: Generalizing Vision-Based Robotic Skills using
Weakly Labeled Images." arXiv preprint arXiv:1708.02313 (2017).
Because the learned policy is a neural network, end-to-end learning is possible.

• Chebotar, Yevgen, et al. "Path integral guided policy search." Robotics and Automation (ICRA), 2017 IEEE International
Conference on. IEEE, 2017.
• Singh, Avi, Larry Yang, and Sergey Levine. "GPLAC: Generalizing Vision-Based Robotic Skills using Weakly Labeled
Images." arXiv preprint arXiv:1708.02313 (2017).
• Sergey Levine*, Chelsea Finn*, Trevor Darrell, Pieter Abbeel. End-to-End Training of Deep Visuomotor Policies. JMLR
2016.
• William Montgomery, Sergey Levine. Guided Policy Search as Approximate Mirror Descent. NIPS 2016.
• Marvin Zhang, Zoe McCarthy, Chelsea Finn, Sergey Levine, Pieter Abbeel. Learning Deep Neural Network Policies with
Continuous Memory States. ICRA 2016.
• Chelsea Finn, Xin Yu Tan, Yan Duan, Trevor Darrell, Sergey Levine, Pieter Abbeel. Deep Spatial Autoencoders for
Visuomotor Learning. ICRA 2016.
• Sergey Levine, Nolan Wagener, Pieter Abbeel. Learning Contact-Rich Manipulation Skills with Guided Policy Search.
ICRA 2015.
• Sergey Levine, Pieter Abbeel. Learning Neural Network Policies with Guided Policy Search under Unknown Dynamics.
NIPS 2014.
• Levine, Sergey, and Vladlen Koltun. "Guided policy search." Proceedings of the 30th International Conference on
Machine Learning (ICML-13). 2013.
