
Learning To Run

We present our approach for the NIPS 2017 "Learning To Run" challenge. The goal of the challenge is to develop a controller able to run in a complex environment by training a model with Deep Reinforcement Learning methods.
We follow the approach of the Reason8 team (3rd place). We start from the algorithm that performed best on the task, DDPG, and implement and benchmark several improvements over vanilla DDPG, including parallel sampling, parameter noise, layer normalization and domain-specific changes. We were able to reproduce the results of the Reason8 team, obtaining a model able to run for more than 30 m.


  1. Learning To Run. Deep Learning Course. Emanuele Ghelfi, Leonardo Arcari, Emiliano Gagliardi (https://github.com/MultiBeerBandits/learning-to-run). March 31, 2019, Politecnico di Milano.
  2. Our Goal
  3. Our Goal. The goal of this project is to replicate the results of the Reason8 team in the NIPS 2017 Learning To Run competition¹. • Given a human musculoskeletal model and a physics-based simulation environment • Develop a controller that runs as fast as possible. ¹ https://www.crowdai.org/challenges/nips-2017-learning-to-run
  4. Background
  5. Reinforcement Learning. Reinforcement Learning (RL) deals with sequential decision-making problems. At each timestep the agent observes the world state $s$, selects an action $a$ according to its policy $\pi$, receives a reward $r$, and the environment transitions to a new state $s' \sim \mathcal{P}(\cdot \mid s, a)$. Goal: maximize the expected discounted sum of rewards, $J_\pi = \mathbb{E}\left[\sum_{t=0}^{H} \gamma^t r(s_t, a_t)\right]$.
  6. Deep Reinforcement Learning. The policy $\pi_\theta$ is encoded in a neural network with weights $\theta$: the agent samples $a \sim \pi_\theta(\cdot \mid s)$ and the environment moves to $s' \sim \mathcal{P}(\cdot \mid s, a)$. How? Gradient ascent over the policy parameters, $\theta' = \theta + \eta \nabla_\theta J_\pi$ (policy gradient theorem, spelled out below).
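The slide invokes the policy gradient theorem without stating it. For reference, the standard form of the gradient used by this ascent step (not shown on the slides) is:

```latex
% Standard policy gradient theorem for the objective J_\pi defined above:
% an expectation over trajectories generated by the current policy \pi_\theta.
\nabla_\theta J_\pi
  = \mathbb{E}_{\pi_\theta}\!\left[
      \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\,
      Q^{\pi_\theta}(s_t, a_t)
    \right]
```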
  7. Learning To Run
  8. Learning To Run. The state is $s \in \mathbb{R}^{34}$, the policy $\pi_\theta(s)$ outputs an action $a \in [0, 1]^{18}$, and the environment transitions to $s' \sim \mathcal{P}(\cdot \mid s, a)$. • The state space represents kinematic quantities of joints and links. • Actions represent muscle activations. • The reward is proportional to the speed of the body; a penalty is given when the pelvis height drops below a threshold, and the episode restarts. A minimal sketch of this interaction loop follows below.
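As a concrete picture of the interaction loop above, here is a minimal sketch. It assumes the gym-style RunEnv interface of the challenge's osim-rl package; constructor and reset arguments may differ between versions, and the random policy is only a placeholder for the DDPG actor.

```python
# Minimal interaction loop with the Learning To Run environment.
# Assumes the osim-rl RunEnv gym-style interface; check the installed
# version for the exact constructor/reset signatures.
import numpy as np
from osim.env import RunEnv

env = RunEnv(visualize=False)
state = env.reset()

total_reward, done = 0.0, False
while not done:
    # Placeholder policy: random muscle activations in [0, 1]^18.
    # In the project this is the DDPG actor pi_theta(s).
    action = np.random.uniform(0.0, 1.0, size=18)
    state, reward, done, info = env.step(action)
    total_reward += reward

print("episode return:", total_reward)
```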
  9. Deep Deterministic Policy Gradient - DDPG • State-of-the-art algorithm in Deep Reinforcement Learning. • Off-policy. • Actor-critic method. • Combines Deterministic Policy Gradient (DPG) and Deep Q-Network (DQN) in an effective way.
  10. Deep Deterministic Policy Gradient - DDPG Main characteristics of DDPG: • Deterministic actor $\pi(s): S \to A$. • Replay buffer to decorrelate the samples used for training. • Separate target networks with soft updates to improve convergence stability. The soft update is sketched below.
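The soft update mentioned above is the stabilization trick of DDPG: after every training step the target networks are moved a small step τ toward the learned networks instead of being copied outright. A minimal PyTorch sketch (the function name and τ value are illustrative, not taken from the slides):

```python
# Minimal sketch of the DDPG soft target update (Polyak averaging):
# target <- tau * source + (1 - tau) * target, applied after each
# training step to both the target actor and the target critic.
import torch

def soft_update(target_net: torch.nn.Module,
                source_net: torch.nn.Module,
                tau: float = 1e-3) -> None:
    with torch.no_grad():
        for t_param, s_param in zip(target_net.parameters(),
                                    source_net.parameters()):
            t_param.mul_(1.0 - tau).add_(tau * s_param)
```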
  11. DDPG Improvements We implemented several improvements over vanilla DDPG: • Parameter noise (with layer normalization) and action noise to improve exploration (sketched below). • State and action flip (data augmentation). • Relative positions (feature engineering).
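Of these, parameter noise is worth sketching: instead of (or in addition to) adding noise to the actions, a copy of the actor is perturbed directly in weight space and used to collect rollouts, which gives temporally consistent exploration. A minimal PyTorch-style sketch (names and the fixed noise scale are illustrative assumptions, not the project's code):

```python
# Sketch of parameter-noise exploration: perturb a copy of the actor's
# weights with Gaussian noise and use the noisy copy for data collection.
# The noise scale is fixed here for simplicity; in practice it is adapted.
import copy
import torch

def perturbed_actor(actor: torch.nn.Module,
                    stddev: float = 0.05) -> torch.nn.Module:
    noisy = copy.deepcopy(actor)
    with torch.no_grad():
        for param in noisy.parameters():
            param.add_(torch.randn_like(param) * stddev)
    return noisy
```

Layer normalization inside the actor keeps the effect of a given weight perturbation comparable across layers, which is why the two are usually combined; the noise scale is typically adapted so that the perturbed policy's actions stay within a target distance of the unperturbed ones.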
  12. DDPG Improvements [Diagram: parallel training loop. Sampling jobs are dispatched to sampling workers; once samples are ready they are stored in the replay buffer and a training step runs; an evaluation job is dispatched to testing workers and, when ready, its statistics are displayed; the loop repeats until the time budget expires.]
  13. DDPG Improvements [Diagram: detail of the worker pool. Each sampling/testing worker $i$ runs its own environment instance and a copy of the actor $\pi_{\theta_i}$, feeding states $s$ to the actor and returning transitions to the replay buffer dispatcher.] A rough sketch of this scheme follows below.
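The parallel sampling scheme in the two diagrams above can be sketched roughly as follows; the placeholder policy, environment dynamics and all names are illustrative and are not the project's actual implementation.

```python
# Rough sketch of the parallel sampling scheme: worker processes roll
# out a (placeholder) policy and push transitions to a shared queue;
# the main process drains the queue into a replay buffer, where the
# DDPG training step would consume them.
import multiprocessing as mp
import random
from collections import deque

def sampling_worker(worker_id, queue, n_steps):
    state = [0.0] * 34                                      # placeholder 34-dim state
    for _ in range(n_steps):
        action = [random.random() for _ in range(18)]       # placeholder policy
        next_state = [s + random.gauss(0.0, 0.01) for s in state]
        reward = random.random()                            # placeholder reward
        queue.put((state, action, reward, next_state))
        state = next_state

if __name__ == "__main__":
    n_workers, steps_per_worker = 4, 100
    queue = mp.Queue()
    workers = [mp.Process(target=sampling_worker, args=(i, queue, steps_per_worker))
               for i in range(n_workers)]
    for w in workers:
        w.start()

    replay_buffer = deque(maxlen=100_000)
    for _ in range(n_workers * steps_per_worker):
        replay_buffer.append(queue.get())    # a DDPG update would run here
    for w in workers:
        w.join()
    print("transitions collected:", len(replay_buffer))
```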
  14. Results
  15. Results - Thread number impact [Plot: distance (m) vs. training step (×10⁵), comparing runs with 20 sampling threads against 10 threads.]
  16. Results - Ablation study [Plot: distance (m) vs. training step (×10⁵) for the four configurations Flip/No Flip × parameter noise (PN)/no PN, with training times (h) reported for each run.]
  17. Thank you all!
  18. Backup slides
  19. Results - Full state vs Reduced State [Plot: distance (m) vs. training step (×10⁵), comparing the reduced state representation with the full state.]
  20. Actor-Critic networks [Diagram: the actor maps $s \in \mathbb{R}^{34}$ through two fully connected layers of 64 units with ELU activations and a sigmoid output to $a \in [0, 1]^{18}$; the critic takes $s \in \mathbb{R}^{34}$ and $a \in [0, 1]^{18}$ and maps them through layers of 64 and 32 units with Tanh activations to a single linear output.] A sketch of these networks follows below.
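A PyTorch sketch matching the layer sizes and activations shown on the slide; the class names and the choice of concatenating state and action at the critic's input are assumptions made for illustration.

```python
# Actor/critic architectures with the layer sizes from the slide:
# actor 34 -> 64 -> 64 -> 18 (ELU, ELU, sigmoid),
# critic (34 + 18) -> 64 -> 32 -> 1 (Tanh, Tanh, linear).
import torch
import torch.nn as nn

class Actor(nn.Module):
    def __init__(self, state_dim: int = 34, action_dim: int = 18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ELU(),
            nn.Linear(64, 64), nn.ELU(),
            nn.Linear(64, action_dim), nn.Sigmoid(),  # muscle activations in [0, 1]
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    def __init__(self, state_dim: int = 34, action_dim: int = 18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, 1),  # scalar Q-value, linear output
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```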
