This presentation discusses using reinforcement learning to teach modular robots locomotion. The key challenges are the high-dimensional state and action spaces, as well as the lack of domain knowledge. The presentation proposes using policy gradient reinforcement learning with finite differences to learn locomotion policies from raw sensor data. It suggests that incorporating domain knowledge through task manifolds and curriculum learning could help address the "curse of dimensionality" and speed up the learning process. The goals are to apply these techniques to learn locomotion, map tasks to policies, and develop a "robot school" curriculum.
3. Project Goals
• Combine deliberative and reactive algorithms
• Show stability and completeness
• Demonstrate multi-robot coverage on iCreate robots.
4. Coverage Problem
• Cover Entire Area
• Deliberative Algorithm plans next point to visit.
• Reactive Algorithm pushes robot to that point.
• Reactive Algorithm adds 2 constraints:
• Maintain Communication Distance
• Collision Avoidance
6. Demo for single vehicle
• Implemented on iCreate.
• 5 points to visit.
• Deliberative Algorithm selects point.
• Reactive Algorithm uses potential field to reach point.
• Point reached when within some minimum distance.
VIDEO
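In code, the two-layer scheme above might look like the following sketch, assuming a standard attractive/repulsive potential field; the gains, radii, waypoint list, and integration step are illustrative, not taken from the actual implementation.

```python
import numpy as np

def reactive_step(pos, goal, obstacles, k_att=1.0, k_rep=0.5,
                  influence=0.6, reach_tol=0.1):
    """One reactive update: descend a potential field toward `goal`.
    Returns (velocity_command, reached). Gains/radii are placeholders."""
    # Attractive term pulls the robot toward the goal.
    force = k_att * (goal - pos)
    # Repulsive terms push away from obstacles inside the influence radius.
    for obs in obstacles:
        diff = pos - obs
        d = np.linalg.norm(diff)
        if 0 < d < influence:
            force += k_rep * (1.0 / d - 1.0 / influence) * diff / d**3
    reached = np.linalg.norm(goal - pos) < reach_tol
    return force, reached

# Deliberative layer: visit 5 waypoints in a fixed order.
waypoints = [np.array(p) for p in [(0, 0), (1, 0), (1, 1), (0, 1), (0.5, 0.5)]]
pos = np.array([0.2, 0.8])
for goal in waypoints:            # deliberative: select next point
    while True:                   # reactive: push robot to that point
        v, reached = reactive_step(pos, goal, obstacles=[])
        pos = pos + 0.1 * v       # simple Euler integration step
        if reached:
            break
```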
7. Multi-robot Case
• 2 Robot Coverage
• Blue is free to move.
• Green must stay in communication range.
• MATLAB Simulation.
VIDEO
9. Positioning System
• Problems with Stargazer.
• Periods of no measurement
• Occasional Bad Measurements
• State Estimation (SPF)
• Combine Stargazer with Odometry
• Reject Bad Measurements
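A common way to implement the bad-measurement rejection is a chi-square gate on the filter innovation; this sketch assumes that approach (the function name, gate threshold, and interface are illustrative, not from the original code).

```python
import numpy as np

def accept_measurement(z, z_pred, S, gate=9.21):
    """Gate on the innovation: reject a Stargazer fix whose squared
    Mahalanobis distance exceeds `gate` (9.21 ~ 99% for 2 DOF).
    `z_pred` and innovation covariance `S` come from the filter's
    prediction step."""
    innovation = z - z_pred
    d2 = innovation @ np.linalg.solve(S, innovation)
    return d2 < gate
```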
10. SPF Explanation
• Sigma Point Filter uses Stargazer and Odometry measurements to predict robot position.
• Non-Gaussian Noise
• Implemented and Tested on robot platform.
• Performs very well in the presence of no measurements or bad measurements.
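For reference, a generic sigma-point prediction step looks roughly like the following (a standard unscented-transform sketch in the Julier–Uhlmann form, not the code used on the robot; `f` stands in for the odometry motion model).

```python
import numpy as np

def sigma_points(x, P, kappa=2.0):
    """Generate 2n+1 sigma points and weights for mean x, covariance P."""
    n = len(x)
    S = np.linalg.cholesky((n + kappa) * P)
    pts = [x] + [x + S[:, i] for i in range(n)] + [x - S[:, i] for i in range(n)]
    w = np.full(2 * n + 1, 1.0 / (2 * (n + kappa)))
    w[0] = kappa / (n + kappa)
    return np.array(pts), w

def unscented_predict(f, x, P, Q):
    """Propagate sigma points through motion model f (e.g. odometry)
    and recover the predicted mean and covariance."""
    pts, w = sigma_points(x, P)
    fx = np.array([f(p) for p in pts])
    x_pred = w @ fx
    P_pred = Q + sum(wi * np.outer(d, d) for wi, d in zip(w, fx - x_pred))
    return x_pred, P_pred
```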
12. Roomba Pac-Man
• Implemented 5 Robot Demo along with Jack Elston.
• Re-creation of Pac-Man Game.
• Demonstrates NetUAS system.
• Showcases most of the concepts from class.
25. Introduction
Robot State Machine
Gradients for “Grasping” the Object
Gradient for Moving the Object
Convergence Simulation Results
Continuing Work
26. Place a single beacon on an object and another at the object’s destination. Multiple robots cooperate to move the object.
Goals:
Minimal/No Robot Communication
Object has an Unknown Geometry
Use Gradients for Reactive Navigation
28. Each Robot Knows:
◦ Distance/Direction to Object
◦ Distance/Direction to Destination
◦ Distance/Direction to All Other Robots
◦ Bumper Sensor to Detect Collision
Robots Do Not Know
◦ Object Geometry
◦ Actions other Robots are taking
30. Related “Grasping” Work:
◦ Grasping with hand – Maximize torque [Liu et al]
◦ Cage objects for pushing [Fink et al]
◦ Tug Boats Manipulating Barge [Esposito]
◦ ALL require known geometry
My Hybrid Approach:
◦ Even distribution around object
◦ Alternate between Convergence and Repulsion Gradients
◦ Similar to Cow Herding example from class.
31. Pull towards object:
γ = ‖ri − robj‖
Avoid nearby robots:
β = ∏_{j=1}^{N} [ 1 − ((sign(dc − ‖ri − rj‖) + 1)/2) · ((1 + dc⁴)(‖ri − rj‖² − dc²)²) / (dc⁴((‖ri − rj‖² − dc²)² + 1)) ]
33. Repel from all robots:
β = ∏_{j=1}^{N} (‖ri − rj‖² − dr²)
Cost = 1 / (1 + β)^(1/κr)
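The cost terms can be written down directly; this sketch assumes the formulas as reconstructed above (the slides' equations were partly garbled), and the parameter values `d_r` and `kappa_r` are illustrative.

```python
import numpy as np

def grasp_cost(r_i, r_obj, robots, d_r=0.3, kappa_r=2.0):
    """gamma pulls robot i toward the object, beta repels it from the
    other robots, and the combined cost falls as beta grows."""
    gamma = np.linalg.norm(r_i - r_obj)              # pull towards object
    beta = np.prod([np.linalg.norm(r_i - r_j)**2 - d_r**2
                    for r_j in robots])              # repel from all robots
    cost = 1.0 / (1.0 + beta) ** (1.0 / kappa_r)
    return gamma, beta, cost

# Illustrative call: one robot at the origin, object at (1, 0).
g, b, c = grasp_cost(np.zeros(2), np.array([1.0, 0.0]),
                     [np.array([0.5, 0.5])])
```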
34.
35. Related Work
◦ Formations [Tanner and Kumar]
◦ Flocking [Lindhé et al]
◦ Pushing objects [Fink et al, Esposito]
◦ No catastrophic failure if out of position.
My Approach:
◦ Head towards destination in steps
◦ Keep close to object.
◦ Communicate “through” object
◦ Maintain orientation.
Assuming forklift on Robot can rotate 360º
36. Next Step Vector:
rγi = rideal,i + dm (rObjCenter − rObjDest) / ‖rObjCenter − rObjDest‖
Pull to destination:
γ1 = ‖ri − rγi‖
37. Valley Perpendicular to Travel Vector:
m = −(rObjCenter,x − rObjDest,x) / (rObjCenter,y − rObjDest,y + 0.0001)
γ2 = (m·ri,x − ri,y − m·rγ,x + rγ,y)² / (m² + 1)
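A direct transcription of these two terms; variable names mirror the slides, the epsilon guard is the 0.0001 shown above, and `d_m` is an illustrative step size.

```python
import numpy as np

def transport_gradients(r_i, r_ideal_i, obj_center, obj_dest,
                        d_m=0.2, eps=1e-4):
    """Next-step target offset along the travel vector, plus a 'valley'
    term penalizing squared distance from the line of travel."""
    travel = obj_center - obj_dest
    r_gamma = r_ideal_i + d_m * travel / np.linalg.norm(travel)
    gamma1 = np.linalg.norm(r_i - r_gamma)           # pull to destination
    # Valley: squared point-to-line distance, with slope m perpendicular
    # to the travel vector (eps guards against division by zero).
    m = -(obj_center[0] - obj_dest[0]) / (obj_center[1] - obj_dest[1] + eps)
    gamma2 = (m * r_i[0] - r_i[1] - m * r_gamma[0] + r_gamma[1])**2 / (m**2 + 1)
    return gamma1, gamma2
```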
44. A Young Modular Robot’s Guide to Locomotion
Ben Pearre
Computer Science
University of Colorado at Boulder, USA
December 6, 2009
45. Outline
Modular Robots
Learning
The Problem
The Policy Gradient
Domain Knowledge
Contributions
Going forward
Steering
Curriculum Development
Conclusion
46. Modular Robots
How to get these to move?
47. The Learning Problem
Given unknown sensations and actions, learn a task:
◮ Sensations s ∈ Rn
◮ State x ∈ Rd
◮ Action u ∈ Rp
◮ Reward r ∈ R
◮ Policy π(x, θ) = Pr(u|x, θ) : R^|θ| × R^|u|
Example policy:
u(x, θ) = θ0 + Σi θi (x − bi)ᵀ Di (x − bi) + N(0, σ)
What does that mean for locomotion?
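As a concrete reading of the example policy, here is a single-output sketch; the shapes and the additive Gaussian noise follow the formula above, while all names and dimensions are illustrative.

```python
import numpy as np

def policy(x, theta0, theta, b, D, sigma=0.1, rng=np.random.default_rng()):
    """Quadratic-feature policy: u = theta0 + sum_i theta_i
    (x - b_i)^T D_i (x - b_i) + N(0, sigma).
    theta: (k,), b: (k, d), D: (k, d, d)."""
    u = theta0
    for th_i, b_i, D_i in zip(theta, b, D):
        diff = x - b_i
        u += th_i * diff @ D_i @ diff    # one quadratic bump per center
    return u + rng.normal(0.0, sigma)    # exploration noise
```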
48. Policy Gradient Reinforcement Learning: Finite Difference
Vary θ:
◮ Measure performance J0 of π(θ)
◮ Measure performance J1...n of π(θ + ∆1...n θ)
◮ Solve regression, move θ along gradient.
gradient = (∆Θᵀ ∆Θ)⁻¹ ∆Θᵀ Ĵ
where ∆Θ = [∆θ1ᵀ; …; ∆θnᵀ] and Ĵ = [J1 − J0, …, Jn − J0]ᵀ
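This regression is a few lines of NumPy; the sketch below assumes `J` is any callable that runs one rollout and returns its score, and the perturbation count, scale, and toy objective are illustrative.

```python
import numpy as np

def fd_policy_gradient(J, theta, n_perturb=16, eps=0.05,
                       rng=np.random.default_rng()):
    """Finite-difference policy gradient: perturb theta, measure returns,
    and solve the least-squares regression g = (dT d)^-1 dT Jhat."""
    J0 = J(theta)
    dTheta = eps * rng.standard_normal((n_perturb, len(theta)))
    Jhat = np.array([J(theta + d) for d in dTheta]) - J0
    g, *_ = np.linalg.lstsq(dTheta, Jhat, rcond=None)
    return g

# Illustrative usage: ascend the gradient of a toy return function.
theta = np.zeros(8)
for _ in range(100):
    theta += 0.1 * fd_policy_gradient(lambda th: -np.sum((th - 1.0)**2),
                                      theta)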
49. Policy Gradient Reinforcement Learning: Likelihood Ratio
Vary u:
◮ Measure performance J(π(θ)) of π(θ) with noise. . .
◮ Compute log-probability of generated trajectory Pr(τ |θ)
Gradient = ( Σ_{k=0}^{H} ∇θ log πθ(uk|xk) ) ( Σ_{l=0}^{H} rl )
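A per-episode version of this estimator, in the sum-of-scores times sum-of-rewards form shown above; averaging over many episodes and subtracting a baseline are omitted for brevity.

```python
import numpy as np

def reinforce_gradient(grad_logp, rewards):
    """Likelihood-ratio gradient for one episode. `grad_logp` is an
    (H, |theta|) array of per-step score vectors grad log pi(u_k|x_k);
    `rewards` is the length-H reward sequence."""
    return np.sum(grad_logp, axis=0) * np.sum(rewards)
```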
50. Why is RL slow?
“Curse of Dimensionality”
◮ Exploration
◮ Learning rate
◮ Domain representation
◮ Policy representation
◮ Over- and under-actuation
◮ Domain knowledge
51. Domain Knowledge
Infinite space of policies to explore.
◮ RL is model-free. So what?
◮ Representation is bias.
◮ Bias search towards “good” solutions
◮ Learn all of physics. . . and apply it?
◮ Previous experience in this domain?
◮ Policy implemented by <programmer, agent> “autonomous”?
How would knowledge of this domain help?
52. Dimensionality Reduction
Task learning as domain-knowledge acquisition:
◮ Experience with a domain
◮ Skill at completing some task
◮ Skill at completing some set of tasks?
◮ Taskspace Manifold
53. Goals
1. Apply PGRL to a new domain.
2. Learn mapping from task manifold to policy manifold.
3. Robot school?
54. 1: Learning to locomote
◮ Sensors: Force feedback on servos? Or not.
◮ Policy: u ∈ R8 controls servos, ui = N(θi, σ)
◮ Reward: forward speed
◮ Domain knowledge: none
Demo?
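A rollout for this setup might look like the following sketch, with `step` standing in for the simulator or robot interface; the horizon and noise scale are illustrative.

```python
import numpy as np

def rollout(theta, step, horizon=100, sigma=0.1,
            rng=np.random.default_rng()):
    """Episode sketch: each of the 8 servo commands is drawn as
    u_i = N(theta_i, sigma), and the return is mean forward speed.
    `step(u) -> forward_speed` is a placeholder for the platform."""
    speeds = []
    for _ in range(horizon):
        u = rng.normal(theta, sigma)   # u in R^8, one command per servo
        speeds.append(step(u))
    return float(np.mean(speeds))
```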
55. 1: Learning to locomote
[Figure: “Learning to move” — top panel: policy parameters θ per servo (steer bow, steer stern, bow, port fwd, stbd fwd, port aft, stbd aft, stern) vs. time (s); bottom panel: effort and 10-step forward speed v vs. time (s).]
56. 2: Learning to get to a target
◮ Sensors: Bearing to goal.
◮ Policy: u ∈ R8 controls servos
◮ Policy parameters: θ ∈ R16
µi(x, θ) = θi · s  (1)
         = [ θi,0 θi,1 ] [ 1 φ ]ᵀ  (2)
ui = N(µi, σ)  (3)
∇θi log π(x, θ) = (1/σ²)(ui − θi · s) · s  (4)
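These four equations translate directly; in this sketch `theta` holds one (bias, gain) pair per servo and `phi` is the bearing to the goal (the noise scale is illustrative).

```python
import numpy as np

def steer_policy(phi, theta, sigma=0.1, rng=np.random.default_rng()):
    """Linear-Gaussian policy: s = [1, phi], theta has shape (8, 2),
    and the per-servo score vector is (1/sigma^2)(u_i - theta_i . s) s."""
    s = np.array([1.0, phi])             # bearing-to-goal feature
    mu = theta @ s                        # mean command per servo, eq. (1-2)
    u = rng.normal(mu, sigma)             # sampled commands, eq. (3)
    grad_log_pi = ((u - mu) / sigma**2)[:, None] * s[None, :]  # eq. (4)
    return u, grad_log_pi
```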
57. 2: Task space → policy space
◮ 16-DOF learning FAIL!
◮ Try simpler task:
◮ Learn to locomote with θ ∈ R16
◮ Try bootstrapping:
1. Learn to locomote with 8 DOF
2. Add new sensing and control DOF
◮ CHEATING! Why?
[Figure: “Time to complete task” — seconds vs. task.]
58. Curriculum development for manifold discovery?
◮ Étude in Locomotion
◮ Task-space manifold for locomotion: θ ∈ ξ · [ 0 0 1 −1 1 −1 1 1 ]ᵀ
◮ Stop exploring in task nullspace
◮ FAST!
◮ Étude in Steering
◮ Can task be completed on locomotion manifold?
◮ One possible approximate solution uses the bases
[ 0 0 1 −1 1 −1 1 1 ]ᵀ and [ 1 −1 0 0 0 0 0 0 ]ᵀ
◮ Can second basis be learned?
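One way to realize “stop exploring in the task nullspace” is to search over manifold coordinates and map back to policy parameters; this sketch uses the locomotion basis from the slide together with the candidate steering basis (the noise scale is illustrative).

```python
import numpy as np

# Locomotion basis from the slide, plus the candidate steering basis.
B = np.array([[0, 0, 1, -1, 1, -1, 1, 1],
              [1, -1, 0, 0, 0, 0, 0, 0]], dtype=float).T   # shape (8, 2)

def explore_on_manifold(xi, sigma=0.1, rng=np.random.default_rng()):
    """Perturb the 2-vector of manifold coordinates xi and map back to
    the 8 policy parameters, so exploration never leaves the manifold."""
    xi_new = xi + sigma * rng.standard_normal(xi.shape)
    return B @ xi_new                     # theta in R^8 on the manifold
```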
59. 3: How to teach a robot?
How to teach an animal?
1. Reward basic skills
2. Develop control along useful DOFs
3. Make skill more complex
4. A good solution NOW!
60. Conclusion
Exorcising the Curse of Dimensionality
◮ PGRL works for low-DOF problems.
◮ Task-space dimension < state-space dimension.
◮ Learn f: task-space manifold → policy-space manifold.