An introduction to cognitive robotics
EMJD ICE Summer School - 2013
Lucio Marcenaro – University of
Genova (ITALY)
Cognitive robotics?
• Robots with intelligent behavior
– Learn and reason
– Complex goals
– Complex world
• Robots ideal vehicles for developing and
testing cognitive:
– Learning
– Adaptation
– Classification
Cognitive robotics
• Traditional behavior modeling approaches
problematic and untenable.
• Perception, action and the notion of symbolic
representation to be addressed in cognitive
robotics.
• Cognitive robotics views animal cognition as a
starting point for the development of robotic
information processing.
Cognitive robotics
• “Immobile” Robots and Engineering
Operations
– Robust space probes, ubiquitous computing
• Robots That Navigate
– Hallway robots, Field robots, Underwater
explorers, stunt air vehicles
• Cooperating Robots
– Cooperative Space/Air/Land/Underwater vehicles,
distributed traffic networks, smart dust.
Some applications (1)
Some applications (2)
Other examples
Outline
• Lego Mindstorms
• Simple Line Follower
• Advanced Line Follower
• Learning to follow the line
• Conclusions
The NXT Unit – an embedded system
• 64K RAM, 256K Flash
• 32-bit ARM7 microcontroller
• 100 x 64 pixel LCD graphical
display
• Sound channel with 8-bit
resolution
• Bluetooth wireless
communications
• Stores multiple programs
– Programs selectable using buttons
The NXT unit
(Motor ports)
(Sensor ports)
Motors and Sensors
NXT Motors
• Built-in rotation sensors
NXT Rotation Sensor
• Built in to motors
• Measure degrees
or rotations
• Reads + and -
• Degrees: accuracy
+/- 1
• 1 rotation =
360 degrees
Viewing Sensors
• Connect sensor
• Turn on NXT
• Choose “View”
• Select sensor type
• Select port
NXT Sound Sensor
• Sound sensor can measure in dB and dBA
– dB: in detecting standard [unadjusted]
decibels, all sounds are measured with
equal sensitivity. Thus, these sounds may
include some that are too high or too low
for the human ear to hear.
– dBA: in detecting adjusted decibels, the
sensitivity of the sensor is adapted to the
sensitivity of the human ear. In other words,
these are the sounds that your ears are able
to hear.
• Sound Sensor readings on the NXT are
displayed in percent [%]. The lower the percent
the quieter the sound.
http://mindstorms.lego.com/Overview/Sound_Sensor.aspx
NXT Ultrasonic/Distance Sensor
• Measures
distance/proximity
• Range: 0-255 cm
• Precision: +/- 3cm
• Can report in
centimeters or
inches
http://mindstorms.lego.com/Overview/Ultrasonic_Sensor.aspx
NXT Non-standard sensors:
HiTechnic.com
• Compass
• Gyroscope
• Accelerometer/tilt sensor
• Color sensor
• IRSeeker
• Prototype board with A/D converter
for the I2C bus
LEGO Mindstorms for NXT
(NXT-G)
NXT-G graphical programming
language
Based on the LabVIEW programming language G
Program by drawing a flow chart
NXT-G PC program interface
Interface elements (callouts in the screenshot): Toolbar, Workspace, Configuration Panel, Help & Navigation, Controller, Palettes, Tutorials Web Portal, Sequence Beam
Issues of the standard firmware
• Only one data type
• Unreliable Bluetooth communication
• Limited multi-tasking
• Complex motor control
• Simplistic memory management
• Not suitable for large programs
• Not suitable for development of own tools or
blocks
Other programming languages and
environments
– Java leJOS
– Microsoft Robotics Studio
– RobotC
– NXC - Not eXactly C
– NXT Logo
– Lego NXT Open source firmware and software
development kit
leJOS
• A Java Virtual Machine for NXT
• Freely available
– http://lejos.sourceforge.net/
• Replaces the NXT-G firmware
• LeJOS plug-in is available for the Eclipse free
development environment
• Faster than NXT-G
Example leJOS Program
UltrasonicSensor sonar = new UltrasonicSensor(SensorPort.S4);
Motor.A.forward();
Motor.B.forward();
while (true) {
    if (sonar.getDistance() < 25) {
        Motor.A.forward();
        Motor.B.backward();
    } else {
        Motor.A.forward();
        Motor.B.forward();
    }
}
Event-driven Control in leJOS
• The Behavior interface
– boolean takeControl()
– void action()
– void suppress()
• Arbitrator class
– Constructor gets an array of Behavior objects
• takeControl() checked for highest index first
– start() method begins event loop
Event-driven example
class Go implements Behavior {
    private UltrasonicSensor sonar =
        new UltrasonicSensor(SensorPort.S4);
    public boolean takeControl() {
        return sonar.getDistance() > 25;
    }
Event-driven example
    public void action() {
        Motor.A.forward();
        Motor.B.forward();
    }
    public void suppress() {
        Motor.A.stop();
        Motor.B.stop();
    }
}
Event-driven example
class Spin implements Behavior {
    private UltrasonicSensor sonar =
        new UltrasonicSensor(SensorPort.S4);
    public boolean takeControl() {
        return sonar.getDistance() <= 25;
    }
Event-driven example
    public void action() {
        Motor.A.forward();
        Motor.B.backward();
    }
    public void suppress() {
        Motor.A.stop();
        Motor.B.stop();
    }
}
Event-driven example
public class FindFreespace {
    public static void main(String[] a) {
        Behavior[] b = new Behavior[] {new Go(), new Spin()};
        Arbitrator arb = new Arbitrator(b);
        arb.start();
    }
}
Simple Line Follower
• Use light-sensor as a switch
• If measured value > threshold: ON state (white
surface)
• If measured value < threshold: OFF state
(black surface)
Simple Line Follower
• Robot not traveling inside the line but along
the edge
• Turning left until an “OFF” to “ON” transition
is detected
• Turning right until an “ON” to “OFF” transition
is detected
Simple Line Follower
NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);
while (!Button.ESCAPE.isDown())
{
    int currentColor = cs.getLightValue();
    LCD.drawInt(currentColor, 5, 11, 3);
    if (currentColor < 30)
    {
        rightM.setPower(50);
        leftM.setPower(10);
    }
    else
    {
        rightM.setPower(10);
        leftM.setPower(50);
    }
}
Simple Line Follower
• DEMO
Advanced Line Follower
• Use light-sensor as an
Analog sensor
• Sensor value ranges between 0 and 100
• Takes the average light
detected over a small
area
Advanced Line Follower
• Subtract the current reading of the sensor
from what the sensor should be reading
– Use this value to directly control direction and
power of the wheels
• Multiply this value by a constant: how strongly should the wheels turn to correct the path?
• Add a base value so that the robot is always moving forward
Advanced Line Follower
NXTMotor rightM = new NXTMotor(MotorPort.A);
NXTMotor leftM = new NXTMotor(MotorPort.C);
int targetValue = 30;
int amplify = 7;
int targetPower = 50;
ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED);
rightM.setPower(targetPower);
leftM.setPower(targetPower);
while (!Button.ESCAPE.isDown())
{
    int currentColor = cs.getLightValue();
    int difference = currentColor - targetValue;
    int ampDiff = difference * amplify;
    int rightPower = ampDiff + targetPower;
    int leftPower = targetPower;
    rightM.setPower(rightPower);
    leftM.setPower(leftPower);
}
Advanced Line Follower
• DEMO
Learn how to follow
• Goal
– Make robots do what we want
– Minimize/eliminate programming
• Proposed Solution: Reinforcement Learning
– Specify desired behavior using rewards
– Express rewards in terms of sensor states
– Use machine learning to induce desired actions
• Target Platform
– Lego Mindstorms NXT
Example: Grid World
• A maze-like problem
– The agent lives in a grid
– Walls block the agent’s path
• Noisy movement: actions do not
always go as planned:
– 80% of the time, preferred action is
taken
(if there is no wall there)
– 10% of the time, North takes the agent
West; 10% East
– If there is a wall in the direction the
agent would have been taken, the agent
stays put
• The agent receives rewards each time
step
– Small “living” reward each step (can be
negative)
– Big rewards come at the end (good or
bad)
• Goal: maximize sum of rewards
Markov Decision Processes
• An MDP is defined by:
– A set of states s ∈ S
– A set of actions a ∈ A
– A transition function T(s,a,s’)
• Prob that a from s leads to s’
• i.e., P(s’ | s,a)
• Also called the model (or
dynamics)
– A reward function R(s, a, s’)
• Sometimes just R(s) or R(s’)
– A start state
– Maybe a terminal state
• MDPs are non-deterministic
search problems
– Reinforcement learning: MDPs
where we don’t know the
transition or reward functions
What is Markov about MDPs?
• “Markov” generally means that given the
present state, the future and the past are
independent
• For Markov decision processes, “Markov”
means:
Andrej Andreevič Markov
(1856-1922)
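The formula on the original slide is not present in the extracted text; in standard notation, the Markov property it refers to is (a reconstruction, not the author's exact wording):

P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t, S_{t-1} = s_{t-1}, A_{t-1} = a_{t-1}, \dots, S_0 = s_0) = P(S_{t+1} = s' \mid S_t = s_t, A_t = a_t)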
Solving MDPs: policies
• In deterministic single-agent search problems, want an
optimal plan, or sequence of actions, from start to a goal
• In an MDP, we want an optimal policy π*: S → A
– A policy π gives an action for each state
– An optimal policy maximizes expected utility if followed
– An explicit policy defines a reflex agent
Optimal policy when
R(s, a, s’) = -0.03 for all
non-terminals s
Example Optimal Policies
[Four gridworld panels with living rewards R(s) = -0.01, R(s) = -0.03, R(s) = -0.4 and R(s) = -2.0]
MDP Search Trees
• Each MDP state gives an expectimax-like search tree
[Diagram: from a state s, action branches a lead to q-states (s, a), which lead to successor states s’]
• s is a state, (s, a) is a q-state
• (s,a,s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
Utilities of Sequences
• In order to formalize
optimality of a policy,
need to understand
utilities of sequences of
rewards
• What preferences should
an agent have over
reward sequences?
• More or less?
– [1,2,2] or [2,3,4]
• Now or later?
– [1,0,0] or [0,0,1]
Discounting
• It’s reasonable to maximize the sum of
rewards
• It’s also reasonable to prefer rewards now to
rewards later
• One solution: values of rewards decay exponentially
Discounting
• Typically discount rewards by γ < 1 each time step
– Sooner rewards have higher
utility than later rewards
– Also helps the algorithms
converge
• Example: discount of 0.5:
– U([1,2,3])=1*1+0.5*2+0.25*3
– U([1,2,3])<U([3,2,1])
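The slide's example can be written out explicitly (a reconstruction, assuming the standard discounted-sum definition):

U([1,2,3]) = 1 + 0.5 \cdot 2 + 0.25 \cdot 3 = 2.75
U([3,2,1]) = 3 + 0.5 \cdot 2 + 0.25 \cdot 1 = 4.25
\Rightarrow U([1,2,3]) < U([3,2,1])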
Stationary Preferences
• Theorem: if we assume stationary preferences:
• Then: there are only two ways to define utilities
– Additive utility:
– Discounted utility:
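The two utility forms referenced above are omitted from the extracted slide text; their standard definitions are:

Additive utility:    U([r_0, r_1, r_2, \dots]) = r_0 + r_1 + r_2 + \dots
Discounted utility:  U([r_0, r_1, r_2, \dots]) = r_0 + \gamma r_1 + \gamma^2 r_2 + \dots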
Quiz: Discounting
• Given:
– Actions: East, West and Exit (available in exit states a, e)
– Transitions: deterministic
• Quiz 1: For =1, what is the optimal policy?
• Quiz 2: For =0.1, what is the optimal policy?
• Quiz 3: For which  are East and West equally good
when in state d?
[Figure: states a, b, c, d, e in a row; exiting at a gives reward 10, exiting at e gives reward 1]
Infinite Utilities?!
• Problem: infinite state sequences have infinite rewards
• Solutions:
– Finite horizon:
• Terminate episodes after a fixed T steps (e.g. life)
• Gives nonstationary policies (π depends on time left)
– Discounting: for 0 < γ < 1
• Smaller γ means smaller “horizon” – shorter term focus
• Absorbing state: guarantee that for every policy, a terminal
state will eventually be reached
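With discounting, the usual geometric-series bound makes the fix explicit (a standard fact, not shown in the extracted slide text): if rewards are bounded by R_max and 0 < γ < 1, then

\left| \sum_{t=0}^{\infty} \gamma^t r_t \right| \le \sum_{t=0}^{\infty} \gamma^t R_{\max} = \frac{R_{\max}}{1-\gamma}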
Recap: Defining MDPs
• Markov decision processes:
– States S
– Start state s0
– Actions A
– Transitions P(s’|s,a) (or T(s,a,s’))
– Rewards R(s,a,s’) (and discount γ)
• MDP quantities so far:
– Policy = Choice of action for each state
– Utility (or return) = sum of discounted rewards
Optimal Quantities
• Why? Optimal values define
optimal policies!
• Define the value (utility) of a
state s:
V*(s) = expected utility starting in s
and acting optimally
• Define the value (utility) of a
q-state (s,a):
Q*(s,a) = expected utility starting in
s, taking action a and thereafter
acting optimally
• Define the optimal policy:
π*(s) = optimal action from state s
Gridworld V*(s)
• Optimal value function V*(s)
Gridworld Q*(s,a)
• Optimal Q function Q*(s,a)
Values of States
• Fundamental operation: compute the value of
a state
– Expected utility under optimal action
– Average sum of (discounted) rewards
• Recursive definition of value
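The recursive definition referenced on this slide is the Bellman optimality equation; in standard form (reconstructed, not in the extracted text):

V^*(s) = \max_a Q^*(s,a)
Q^*(s,a) = \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]
V^*(s) = \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]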
Why Not Search Trees?
• We’re doing way too much work with
search trees
• Problem: States are repeated
– Idea: Only compute needed quantities once
• Problem: Tree goes on forever
– Idea: Do depth-limited computations, but with increasing depths until change is small
– Note: deep parts of the tree eventually don’t
matter if γ < 1
Time-limited Values
• Key idea: time-limited values
• Define Vk(s) to be the optimal value of s if the
game ends in k more time steps
– Equivalently, it’s what a depth-k search tree would
give from s
[Figure: gridworld value functions Vk for k = 0, 1, 2, 3, 4, 5, 6, 7, 100]
Value Iteration
• Problems with the recursive computation:
– Have to keep all the Vk*(s) around all the time
– Don’t know which depth k(s) to ask for when planning
• Solution: value iteration
– Calculate values for all states, bottom-up
– Keep increasing k until convergence
Value Iteration
• Idea:
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1:
– This is called a value update or Bellman update
– Repeat until convergence
• Complexity of each iteration: O(S²A)
• Theorem: will converge to unique optimal values
– Basic idea: approximations get refined towards optimal values
– Policy may converge long before values do
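The value update (Bellman update) referenced above, in its standard form (a reconstruction of the missing formula):

V_{i+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V_i(s') \right]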
Practice: Computing Actions
• Which action should we choose from state s:
– Given optimal values V?
– Given optimal q-values Q?
– Lesson: actions are easier to select from Q’s!
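The “lesson” can be made explicit with the standard expressions (not in the extracted slide text): from Q-values the action is a plain argmax, while from V-values a one-step look-ahead with the model is needed:

\pi^*(s) = \arg\max_a Q^*(s,a)
\pi^*(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^*(s') \right]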
Utilities for Fixed Policies
• Another basic operation: compute the
utility of a state s under a fixed (generally non-optimal) policy π
• Define the utility of a state s, under a fixed policy π:
Vπ(s) = expected total discounted rewards (return) starting in s and following π
• Recursive relation (one-step look-ahead
/ Bellman equation):
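The one-step look-ahead / Bellman equation for a fixed policy, omitted from the extracted text, in standard form:

V^{\pi}(s) = \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}(s') \right]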
Policy Evaluation
• How do we calculate the V’s for a fixed policy π?
• Idea one: modify Bellman updates
• Efficiency: O(S²) per iteration
• Idea two: without the maxes it’s just a linear system,
solve with Matlab (or whatever)
Policy Iteration
• Problem with value iteration:
– Considering all actions each iteration is slow: takes |A| times longer than
policy evaluation
– But policy doesn’t change each iteration, time wasted
• Alternative to value iteration:
– Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal
utilities!) until convergence (fast)
– Step 2: Policy improvement: update policy using one-step look-ahead with
resulting converged (but not optimal!) utilities (slow but infrequent)
– Repeat steps until policy converges
• This is policy iteration
– It’s still optimal!
– Can converge faster under some conditions
Policy Iteration
• Policy evaluation: with fixed current policy π, find values with
simplified Bellman updates:
– Iterate until values converge
• Policy improvement: with fixed utilities, find the best action
according to one-step look-ahead
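The two updates referenced on this slide, in standard notation (a reconstruction of the missing formulas):

Policy evaluation (no max, fixed π):
V^{\pi}_{i+1}(s) \leftarrow \sum_{s'} T(s,\pi(s),s') \left[ R(s,\pi(s),s') + \gamma V^{\pi}_i(s') \right]

Policy improvement (one-step look-ahead):
\pi_{new}(s) = \arg\max_a \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma V^{\pi}(s') \right]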
Comparison
• In value iteration:
– Every pass (or “backup”) updates both utilities (explicitly, based on
current utilities) and policy (possibly implicitly, based on current
policy)
• In policy iteration:
– Several passes to update utilities with frozen policy
– Occasional passes to update policies
• Hybrid approaches (asynchronous policy iteration):
– Any sequences of partial updates to either policy entries or utilities
will converge if every state is visited infinitely often
Reinforcement Learning
• Basic idea:
– Receive feedback in the form of rewards
– Agent’s utility is defined by the reward function
– Must learn to act so as to maximize expected rewards
– All learning is based on observed samples of outcomes
Reinforcement Learning
• Reinforcement learning:
– Still assume an MDP:
• A set of states s  S
• A set of actions (per state) A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
– Still looking for a policy (s)
– New twist: don’t know T or R
• I.e. don’t know which states are good or what the actions do
• Must actually try actions and states out to learn
Model-Based Learning
• Model-Based Idea:
– Learn the model empirically through experience
– Solve for values as if the learned model were correct
• Step 1: Learn empirical MDP model
– Count outcomes for each s,a
– Normalize to give estimate of T(s,a,s’)
– Discover R(s,a,s’) when we experience (s,a,s’)
• Step 2: Solve the learned MDP
– Iterative policy evaluation, for example
Example: Model-Based Learning
• Episodes:
[Gridworld figure with terminal rewards +100 and -100; γ = 1]
T(<3,3>, right, <4,3>) = 1 / 3
T(<2,3>, right, <3,3>) = 2 / 2
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
Model-Free Learning
• Want to compute an expectation weighted by P(x):
• Model-based: estimate P(x) from samples, compute expectation
• Model-free: estimate expectation directly from samples
• Why does this work? Because samples appear with the right frequencies!
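The expectation referred to above can be written with the standard sampling identity (reconstructed; the slide's formula is not in the extracted text):

E_{x \sim P}[f(x)] = \sum_x P(x) f(x) \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i), \quad x_i \sim P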
Example: Direct Estimation
• Episodes:
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
V(2,3) ~ (96 + -103) / 2 = -3.5
V(3,3) ~ (99 + 97 + -102) / 3 = 31.3
γ = 1, R = -1 (terminal rewards +100 and -100)
Sample-Based Policy Evaluation?
• Who needs T and R? Approximate the
expectation with samples (drawn from T!)
Almost! But we only
actually make progress
when we move to i+1.
Temporal-Difference Learning
• Big idea: learn from every experience!
– Update V(s) each time we experience (s,a,s’,r)
– Likely s’ will contribute updates more often
• Temporal difference learning
– Policy still fixed!
– Move values toward value of whatever successor
occurs: running average!
Sample of V(s):
Update to V(s):
Same update:
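The formulas for these three lines are omitted in the extracted text; in the usual TD(0) notation they are (a reconstruction):

Sample of V(s):  sample = R(s,\pi(s),s') + \gamma V^{\pi}(s')
Update to V(s):  V^{\pi}(s) \leftarrow (1-\alpha)\, V^{\pi}(s) + \alpha \cdot sample
Same update:     V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha \left( sample - V^{\pi}(s) \right)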
Exponential Moving Average
• Exponential moving average
– Makes recent samples more important
– Forgets about the past (distant past values were wrong anyway)
– Easy to compute from the running average
• Decreasing learning rate can give converging averages
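The exponential moving average referenced here has the standard recursive form (reconstructed):

\bar{x}_n = (1-\alpha)\,\bar{x}_{n-1} + \alpha\, x_n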
Example: TD Policy Evaluation
Take γ = 1, α = 0.5
(1,1) up -1
(1,2) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(3,3) right -1
(4,3) exit +100
(done)
(1,1) up -1
(1,2) up -1
(1,3) right -1
(2,3) right -1
(3,3) right -1
(3,2) up -1
(4,2) exit -100
(done)
Problems with TD Value Learning
• TD value learning is a model-free way to do
policy evaluation
• However, if we want to turn values into a
(new) policy, we’re sunk:
• Idea: learn Q-values directly
• Makes action selection model-free too!
Active Learning
• Full reinforcement learning
– You don’t know the transitions T(s,a,s’)
– You don’t know the rewards R(s,a,s’)
– You can choose any actions you like
– Goal: learn the optimal policy
– … what value iteration did!
• In this case:
– Learner makes choices!
– Fundamental tradeoff: exploration vs. exploitation
– This is NOT offline planning! You actually take actions in the world and
find out what happens…
Detour: Q-Value Iteration
• Value iteration: find successive approx optimal values
– Start with V0*(s) = 0, which we know is right (why?)
– Given Vi*, calculate the values for all states for depth i+1:
• But Q-values are more useful!
– Start with Q0*(s,a) = 0, which we know is right (why?)
– Given Qi*, calculate the q-values for all q-states for depth i+1:
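The q-value update referenced here, in standard form (a reconstruction of the missing formula):

Q_{i+1}(s,a) \leftarrow \sum_{s'} T(s,a,s') \left[ R(s,a,s') + \gamma \max_{a'} Q_i(s',a') \right]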
Q-Learning
• Q-Learning: sample-based Q-value iteration
• Learn Q*(s,a) values
– Receive a sample (s,a,s’,r)
– Consider your old estimate:
– Consider your new sample estimate:
– Incorporate the new estimate into a running average:
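The sample-based update referenced above, in standard notation (reconstructed; the slide's formulas are not in the extracted text):

sample = r + \gamma \max_{a'} Q(s',a')
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \cdot sample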
Q-Learning Properties
• Amazing result: Q-learning converges to optimal policy
– If you explore enough
– If you make the learning rate small enough
– … but not decrease it too quickly!
– Basically doesn’t matter how you select actions (!)
• Neat property: off-policy learning
– learn optimal policy without following it (some caveats)
Q-Learning
• Discrete sets of states and actions
– States form an N-dimensional array
• Unfolded into one dimension in practice
– Individual actions selected on each time step
• Q-values
– 2D array (indexed by state and action)
– Expected rewards for performing actions
Q-Learning
• Table of expected rewards (“Q-values”)
– Indexed by state and action
• Algorithm steps
– Calculate state index from sensor values
– Calculate the reward
– Update previous Q-value
– Select and perform an action
• Q(s,a) ← (1 - α) Q(s,a) + α (r + γ maxa’ Q(s’,a’))
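A minimal Java sketch of the tabular update and action selection described above; this is not the author's code, and the Q-table layout, the epsilon-greedy selection and the parameter values are assumptions for illustration:

import java.util.Random;

public class QTable {
    private final double[][] q;         // Q-values indexed by [state][action]
    private final double alpha = 0.5;   // learning rate (assumed value)
    private final double gamma = 0.9;   // discount factor (assumed value)
    private final double epsilon = 0.1; // exploration rate (assumed value)
    private final Random rnd = new Random();

    public QTable(int numStates, int numActions) {
        q = new double[numStates][numActions];
    }

    // Q(s,a) = (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))
    public void update(int s, int a, double r, int sPrime) {
        double best = q[sPrime][0];
        for (double v : q[sPrime]) best = Math.max(best, v);
        q[s][a] = (1 - alpha) * q[s][a] + alpha * (r + gamma * best);
    }

    // Epsilon-greedy: usually exploit the best known action, sometimes explore
    public int selectAction(int s) {
        if (rnd.nextDouble() < epsilon) return rnd.nextInt(q[s].length);
        int best = 0;
        for (int a = 1; a < q[s].length; a++) if (q[s][a] > q[s][best]) best = a;
        return best;
    }
}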
Q-Learning and Robots
• Certain sensors provide continuous values
– Sonar
– Motor encoders
• Q-Learning requires discrete inputs
– Group continuous values into discrete “buckets”
– [Mahadevan and Connell, 1992]
• Q-Learning produces discrete actions
– Forward
– Back-left/Back-right
Creating Discrete Inputs
• Basic approach
– Discretize continuous values into sets
– Combine each discretized tuple into a single index
• Another approach
– Self-Organizing Map
– Induces a discretization of continuous values
– [Touzet 1997] [Smith 2002]
Q-Learning Main Loop
• Select action
• Change motor speeds
• Inspect sensor values
– Calculate updated state
– Calculate reward
• Update Q values
• Set “old state” to be the updated state
Calculating the State (Motors)
• For each motor:
– 100% power
– 93.75% power
– 87.5% power
• Six motor states
Calculating the State (Sensors)
• No disparity: STRAIGHT
• Left/Right disparity
– 1-5: LEFT_1, RIGHT_1
– 6-12: LEFT_2, RIGHT_2
– 13+: LEFT_3, RIGHT_3
• Seven total sensor states
• 63 states overall
Calculating Reward
• No disparity => highest value
• Reward decreases with increasing disparity
Action Set for Line Follow
• MAINTAIN
– Both motors unchanged
• UP_LEFT, UP_RIGHT
– Accelerate motor by one motor state
• DOWN_LEFT, DOWN_RIGHT
– Decelerate motor by one motor state
• Five total actions
Q-learning line follower
Conclusions
• Lego Mindstorms NXT as a convenient platform for «cognitive robotics»
• Executing a task with «rules»
• Learning how to execute a task
– MDP
– Reinforcement learning
• Q-learning applied to Lego Mindstorms
Thank you!
• Questions?
Contenu connexe

En vedette

Ice summer school-5-7-2013
Ice summer school-5-7-2013Ice summer school-5-7-2013
Ice summer school-5-7-2013Jun Hu
 
Ice ss2013
Ice ss2013Ice ss2013
Ice ss2013Jun Hu
 
Printversion ice summer school 1 7-2013.key
Printversion ice summer school 1 7-2013.keyPrintversion ice summer school 1 7-2013.key
Printversion ice summer school 1 7-2013.keyJun Hu
 
Dice01 re life-ict-system-smartdiagn-pdw-27june2013
Dice01 re life-ict-system-smartdiagn-pdw-27june2013Dice01 re life-ict-system-smartdiagn-pdw-27june2013
Dice01 re life-ict-system-smartdiagn-pdw-27june2013Jun Hu
 
Engineering natural lighting experiences
Engineering natural lighting experiencesEngineering natural lighting experiences
Engineering natural lighting experiencesJun Hu
 
Sociale media voor fotografen: 4 basics en 10 quickwins
Sociale media voor fotografen: 4 basics en 10 quickwinsSociale media voor fotografen: 4 basics en 10 quickwins
Sociale media voor fotografen: 4 basics en 10 quickwinssimongryspeert
 
Mexicanos part dos
Mexicanos part dosMexicanos part dos
Mexicanos part dosysabelmedina
 
MEG Primary Injection Project
MEG Primary Injection ProjectMEG Primary Injection Project
MEG Primary Injection ProjectFrancesco Legname
 
CERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLA
CERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLACERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLA
CERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLAciberaulacso
 
Facebook voor bestuurders
Facebook voor bestuurdersFacebook voor bestuurders
Facebook voor bestuurderssimongryspeert
 
Semaforo Audiovisual
Semaforo AudiovisualSemaforo Audiovisual
Semaforo AudiovisualMurilo Santos
 

En vedette (14)

Ice summer school-5-7-2013
Ice summer school-5-7-2013Ice summer school-5-7-2013
Ice summer school-5-7-2013
 
Ice ss2013
Ice ss2013Ice ss2013
Ice ss2013
 
Printversion ice summer school 1 7-2013.key
Printversion ice summer school 1 7-2013.keyPrintversion ice summer school 1 7-2013.key
Printversion ice summer school 1 7-2013.key
 
Dice01 re life-ict-system-smartdiagn-pdw-27june2013
Dice01 re life-ict-system-smartdiagn-pdw-27june2013Dice01 re life-ict-system-smartdiagn-pdw-27june2013
Dice01 re life-ict-system-smartdiagn-pdw-27june2013
 
Engineering natural lighting experiences
Engineering natural lighting experiencesEngineering natural lighting experiences
Engineering natural lighting experiences
 
Week13
Week13Week13
Week13
 
Sociale media voor fotografen: 4 basics en 10 quickwins
Sociale media voor fotografen: 4 basics en 10 quickwinsSociale media voor fotografen: 4 basics en 10 quickwins
Sociale media voor fotografen: 4 basics en 10 quickwins
 
Mexicanos part dos
Mexicanos part dosMexicanos part dos
Mexicanos part dos
 
Week14
Week14Week14
Week14
 
MEG Primary Injection Project
MEG Primary Injection ProjectMEG Primary Injection Project
MEG Primary Injection Project
 
CERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLA
CERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLACERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLA
CERTAMEN DE NAVIDAD. CIBER@AULA NAVALAGAMELLA
 
Facebook voor bestuurders
Facebook voor bestuurdersFacebook voor bestuurders
Facebook voor bestuurders
 
Semaforo Audiovisual
Semaforo AudiovisualSemaforo Audiovisual
Semaforo Audiovisual
 
Vek.od.ua Лидерство Доленко
Vek.od.ua Лидерство ДоленкоVek.od.ua Лидерство Доленко
Vek.od.ua Лидерство Доленко
 

Similaire à Lucio marcenaro tue summer_school

Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...
Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...
Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...Aritra Sarkar
 
Autonomous robotics based on simple sensor inputs.
Autonomous robotics based on simplesensor inputs.Autonomous robotics based on simplesensor inputs.
Autonomous robotics based on simple sensor inputs. sathish sak
 
SERENE 2014 School: Daniel varro serene2014_school
SERENE 2014 School: Daniel varro serene2014_schoolSERENE 2014 School: Daniel varro serene2014_school
SERENE 2014 School: Daniel varro serene2014_schoolHenry Muccini
 
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENEWorkshop
 
final report-4
final report-4final report-4
final report-4Zhuo Li
 
final presentation from William, Amy and Alex
final presentation from William, Amy and Alexfinal presentation from William, Amy and Alex
final presentation from William, Amy and AlexZiwei Zhu
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
 
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)Benoit Combemale
 
Topology hiding Multipath Routing Protocol in MANET
Topology hiding Multipath Routing Protocol in MANETTopology hiding Multipath Routing Protocol in MANET
Topology hiding Multipath Routing Protocol in MANETAkshay Phalke
 
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsMegamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsEugenio Villar
 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...Dongmin Lee
 
HMM based Automatic Arabic Sign Language Translator using
HMM based Automatic Arabic Sign Language Translator usingHMM based Automatic Arabic Sign Language Translator using
HMM based Automatic Arabic Sign Language Translator usingعمر أمين
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science ChallengeMark Nichols, P.E.
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupDoug Needham
 
Spark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxSpark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxPablo Francisco Pérez Hidalgo
 

Similaire à Lucio marcenaro tue summer_school (20)

Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...
Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...
Computer-Vision based Centralized Multi-agent System on Matlab and Arduino Du...
 
Autonomous robotics based on simple sensor inputs.
Autonomous robotics based on simplesensor inputs.Autonomous robotics based on simplesensor inputs.
Autonomous robotics based on simple sensor inputs.
 
AI Robotics
AI RoboticsAI Robotics
AI Robotics
 
Smart Room Gesture Control
Smart Room Gesture ControlSmart Room Gesture Control
Smart Room Gesture Control
 
SERENE 2014 School: Daniel varro serene2014_school
SERENE 2014 School: Daniel varro serene2014_schoolSERENE 2014 School: Daniel varro serene2014_school
SERENE 2014 School: Daniel varro serene2014_school
 
SERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the CloudSERENE 2014 School: Incremental Model Queries over the Cloud
SERENE 2014 School: Incremental Model Queries over the Cloud
 
Dealing with the need for Infrastructural Support in Ambient Intelligence
Dealing with the need for Infrastructural Support in Ambient IntelligenceDealing with the need for Infrastructural Support in Ambient Intelligence
Dealing with the need for Infrastructural Support in Ambient Intelligence
 
Prestentation
PrestentationPrestentation
Prestentation
 
final report-4
final report-4final report-4
final report-4
 
final presentation from William, Amy and Alex
final presentation from William, Amy and Alexfinal presentation from William, Amy and Alex
final presentation from William, Amy and Alex
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
Efficient and Advanced Omniscient Debugging for xDSMLs (SLE 2015)
 
Topology hiding Multipath Routing Protocol in MANET
Topology hiding Multipath Routing Protocol in MANETTopology hiding Multipath Routing Protocol in MANET
Topology hiding Multipath Routing Protocol in MANET
 
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS SystemsMegamodeling of Complex, Distributed, Heterogeneous CPS Systems
Megamodeling of Complex, Distributed, Heterogeneous CPS Systems
 
Angular and Deep Learning
Angular and Deep LearningAngular and Deep Learning
Angular and Deep Learning
 
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
PRM-RL: Long-range Robotics Navigation Tasks by Combining Reinforcement Learn...
 
HMM based Automatic Arabic Sign Language Translator using
HMM based Automatic Arabic Sign Language Translator usingHMM based Automatic Arabic Sign Language Translator using
HMM based Automatic Arabic Sign Language Translator using
 
Cloudera Data Science Challenge
Cloudera Data Science ChallengeCloudera Data Science Challenge
Cloudera Data Science Challenge
 
Data Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup GroupData Science Challenge presentation given to the CinBITools Meetup Group
Data Science Challenge presentation given to the CinBITools Meetup Group
 
Spark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptxSpark streaming for the internet of flying things 20160510.pptx
Spark streaming for the internet of flying things 20160510.pptx
 

Plus de Jun Hu

IoT in the City
IoT in the CityIoT in the City
IoT in the CityJun Hu
 
201812 design research on social cyber-physical systems
201812 design research on social cyber-physical systems201812 design research on social cyber-physical systems
201812 design research on social cyber-physical systemsJun Hu
 
201811 csc-tue-ph d
201811 csc-tue-ph d201811 csc-tue-ph d
201811 csc-tue-ph dJun Hu
 
201812 tue id briefing short
201812 tue id briefing short201812 tue id briefing short
201812 tue id briefing shortJun Hu
 
Connectedness for enriching elderly care: Interactive Installation & System ...
Connectedness for enriching elderly care:  Interactive Installation & System ...Connectedness for enriching elderly care:  Interactive Installation & System ...
Connectedness for enriching elderly care: Interactive Installation & System ...Jun Hu
 
Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...
Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...
Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...Jun Hu
 
Internet of Things: Social Applications
Internet of Things: Social ApplicationsInternet of Things: Social Applications
Internet of Things: Social ApplicationsJun Hu
 
Social Things
Social ThingsSocial Things
Social ThingsJun Hu
 
Participatory Media Arts: A TU/e DESIS Lab project
Participatory Media Arts: A TU/e DESIS Lab projectParticipatory Media Arts: A TU/e DESIS Lab project
Participatory Media Arts: A TU/e DESIS Lab projectJun Hu
 
Interaction and Fusion
Interaction and FusionInteraction and Fusion
Interaction and FusionJun Hu
 
Traditional Dynamic Arts and Interaction Design
Traditional Dynamic Arts and Interaction DesignTraditional Dynamic Arts and Interaction Design
Traditional Dynamic Arts and Interaction DesignJun Hu
 
How to do PhD at TU/e
How to do PhD at TU/eHow to do PhD at TU/e
How to do PhD at TU/eJun Hu
 
Publishing design
Publishing designPublishing design
Publishing designJun Hu
 
Alan young presentation
Alan young presentationAlan young presentation
Alan young presentationJun Hu
 
Desform2013 grip pdf_optim
Desform2013 grip pdf_optimDesform2013 grip pdf_optim
Desform2013 grip pdf_optimJun Hu
 
Elements for Interaction Design in Public Spaces: Learning from Traditional D...
Elements for Interaction Design in Public Spaces: Learning from Traditional D...Elements for Interaction Design in Public Spaces: Learning from Traditional D...
Elements for Interaction Design in Public Spaces: Learning from Traditional D...Jun Hu
 
De s form_presentation
De s form_presentationDe s form_presentation
De s form_presentationJun Hu
 
De s form2013_wuxi_steffen.ppt
De s form2013_wuxi_steffen.pptDe s form2013_wuxi_steffen.ppt
De s form2013_wuxi_steffen.pptJun Hu
 
Presentation experio des form 23 09-2013
Presentation experio des form 23 09-2013Presentation experio des form 23 09-2013
Presentation experio des form 23 09-2013Jun Hu
 
Frohlich framing4
Frohlich framing4Frohlich framing4
Frohlich framing4Jun Hu
 

Plus de Jun Hu (20)

IoT in the City
IoT in the CityIoT in the City
IoT in the City
 
201812 design research on social cyber-physical systems
201812 design research on social cyber-physical systems201812 design research on social cyber-physical systems
201812 design research on social cyber-physical systems
 
201811 csc-tue-ph d
201811 csc-tue-ph d201811 csc-tue-ph d
201811 csc-tue-ph d
 
201812 tue id briefing short
201812 tue id briefing short201812 tue id briefing short
201812 tue id briefing short
 
Connectedness for enriching elderly care: Interactive Installation & System ...
Connectedness for enriching elderly care:  Interactive Installation & System ...Connectedness for enriching elderly care:  Interactive Installation & System ...
Connectedness for enriching elderly care: Interactive Installation & System ...
 
Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...
Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...
Closer to Nature: Interactive Systems for Seniors with Dementia in Long-term ...
 
Internet of Things: Social Applications
Internet of Things: Social ApplicationsInternet of Things: Social Applications
Internet of Things: Social Applications
 
Social Things
Social ThingsSocial Things
Social Things
 
Participatory Media Arts: A TU/e DESIS Lab project
Participatory Media Arts: A TU/e DESIS Lab projectParticipatory Media Arts: A TU/e DESIS Lab project
Participatory Media Arts: A TU/e DESIS Lab project
 
Interaction and Fusion
Interaction and FusionInteraction and Fusion
Interaction and Fusion
 
Traditional Dynamic Arts and Interaction Design
Traditional Dynamic Arts and Interaction DesignTraditional Dynamic Arts and Interaction Design
Traditional Dynamic Arts and Interaction Design
 
How to do PhD at TU/e
How to do PhD at TU/eHow to do PhD at TU/e
How to do PhD at TU/e
 
Publishing design
Publishing designPublishing design
Publishing design
 
Alan young presentation
Alan young presentationAlan young presentation
Alan young presentation
 
Desform2013 grip pdf_optim
Desform2013 grip pdf_optimDesform2013 grip pdf_optim
Desform2013 grip pdf_optim
 
Elements for Interaction Design in Public Spaces: Learning from Traditional D...
Elements for Interaction Design in Public Spaces: Learning from Traditional D...Elements for Interaction Design in Public Spaces: Learning from Traditional D...
Elements for Interaction Design in Public Spaces: Learning from Traditional D...
 
De s form_presentation
De s form_presentationDe s form_presentation
De s form_presentation
 
De s form2013_wuxi_steffen.ppt
De s form2013_wuxi_steffen.pptDe s form2013_wuxi_steffen.ppt
De s form2013_wuxi_steffen.ppt
 
Presentation experio des form 23 09-2013
Presentation experio des form 23 09-2013Presentation experio des form 23 09-2013
Presentation experio des form 23 09-2013
 
Frohlich framing4
Frohlich framing4Frohlich framing4
Frohlich framing4
 

Dernier

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 

Dernier (20)

Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 

Lucio marcenaro tue summer_school

  • 1. An introduction to cognitive robotics EMJD ICE Summer School - 2013 Lucio Marcenaro – University of Genova (ITALY)
  • 2. Cognitive robotics? • Robots with intelligent behavior – Learn and reason – Complex goals – Complex world • Robots ideal vehicles for developing and testing cognitive: – Learning – Adaptation – Classification
  • 3. Cognitive robotics • Traditional behavior modeling approaches problematic and untenable. • Perception, action and the notion of symbolic representation to be addressed in cognitive robotics. • Cognitive robotics views animal cognition as a starting point for the development of robotic information processing.
  • 4. Cognitive robotics • “Immobile” Robots and Engineering Operations – Robust space probes, ubiquitous computing • Robots That Navigate – Hallway robots, Field robots, Underwater explorers, stunt air vehicles • Cooperating Robots – Cooperative Space/Air/Land/Underwater vehicles, distributed traffic networks, smart dust.
  • 8. Outline • Lego Mindstorms • Simple Line Follower • Advanced Line Follower • Learning to follow the line • Conclusions
  • 9. The NXT Unit – an embedded system • 64K RAM, 256K Flash • 32-bit ARM7 microcontroller • 100 x 64 pixel LCD graphical display • Sound channel with 8-bit resolution • Bluetooth wireless communications • Stores multiple programs – Programs selectable using buttons
  • 10. The NXT unit (Motor ports) (Sensor ports)
  • 12. • Built-in rotation sensors NXT Motors
  • 13. NXT Rotation Sensor • Built in to motors • Measure degrees or rotations • Reads + and - • Degrees: accuracy +/- 1 • 1 rotation = 360 degrees
  • 14. Viewing Sensors • Connect sensor • Turn on NXT • Choose “View” • Select sensor type • Select port
  • 15. NXT Sound Sensor • Sound sensor can measure in dB and dBA – dB: in detecting standard [unadjusted] decibels, all sounds are measured with equal sensitivity. Thus, these sounds may include some that are too high or too low for the human ear to hear. – dBA: in detecting adjusted decibels, the sensitivity of the sensor is adapted to the sensitivity of the human ear. In other words, these are the sounds that your ears are able to hear. • Sound Sensor readings on the NXT are displayed in percent [%]. The lower the percent the quieter the sound. http://mindstorms.lego.com/Overview/Sound_Sensor.aspx
  • 16. NXT Ultrasonic/Distance Sensor • Measures distance/proximity • Range: 0-255 cm • Precision: +/- 3cm • Can report in centimeters or inches http://mindstorms.lego.com/Overview/Ultrasonic_Sensor.aspx
  • 17. 17 NXT Non-standard sensors: HiTechnic.com • Compass • Gyroscope • Accellerometer/tilt sensor, • Color sensor • IRSeeker • Prototype board with A/D converter for the I2C bus
  • 18. LEGO Mindstorms for NXT (NXT-G) NXT-G graphical programming language Based on the LabVIEW programming language G Program by drawing a flow chart
  • 19. NXT-G PC program interface Toolbar Workspace Configuration Panel Help & Navigation Controller Palettes Tutorials Web Portal Sequence Beam
  • 20. Issues of the standard firmware • Only one data type • Unreliable bluetooth communication • Limited multi-tasking • Complex motor control • Simplistic memory management • Not suitable for large programs • Not suitable for development of own tools or blocks
  • 21. Other programming languages and environments – Java leJOS – Microsoft Robotics Studio – RobotC – NXC - Not eXactly C – NXT Logo – Lego NXT Open source firmware and software development kit
  • 22. leJOS • A Java Virtual Machine for NXT • Freely available – http://lejos.sourceforge.net/ • Replaces the NXT-G firmware • LeJOS plug-in is available for the Eclipse free development environment • Faster than NXT-G
  • 23. Example leJOS Program sonar = new UltrasonicSensor(SensorPort.S4); Motor.A.forward(); Motor.B.forward(); while (true) { if (sonar.getDistance() < 25) { Motor.A.forward(); Motor.B.backward(); } else { Motor.A.forward(); Motor.B.forward(); } }
  • 24. Event-driven Control in leJOS • The Behavior interface – boolean takeControl() – void action() – void suppress() • Arbitrator class – Constructor gets an array of Behavior objects • takeControl() checked for highest index first – start() method begins event loop
  • 25. Event-driven example class Go implements Behavior { private Ultrasonic sonar = new Ultrasonic(SensorPort.S4); public boolean takeControl() { return sonar.getDistance() > 25; }
  • 26. Event-driven example public void action() { Motor.A.forward(); Motor.B.forward(); } public void suppress() { Motor.A.stop(); Motor.B.stop(); } }
  • 27. Event-driven example class Spin implements Behavior { private Ultrasonic sonar = new Ultrasonic(SensorPort.S4); public boolean takeControl() { return sonar.getDistance() <= 25; }
  • 28. Event-driven example public void action() { Motor.A.forward(); Motor.B.backward(); } public void suppress() { Motor.A.stop(); Motor.B.stop(); } }
  • 29. Event-driven example public class FindFreespace { public static void main(String[] a) { Behavior[] b = new Behavior[] {new Go(), new Spin()}; Arbitrator arb = new Arbitrator(b); arb.start(); } }
  • 30. Simple Line Follower • Use light-sensor as a switch • If measured value > threshold: ON state (white surface) • If measured value < threshold: OFF state (black surface)
  • 31. Simple Line Follower • Robot not traveling inside the line but along the edge • Turning left until an “OFF” to “ON” transition is detected • Turning right until an “ON” to “OFF” transition is detected
  • 32. Simple Line Follower NXTMotor rightM = new NXTMotor(MotorPort.A); NXTMotor leftM = new NXTMotor(MotorPort.C); ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED); while (!Button.ESCAPE.isDown()) { int currentColor = cs.getLightValue(); LCD.drawInt(currentColor, 5, 11, 3); if (currentColor < 30) { rightM.setPower(50); leftM.setPower(10); } else { rightM.setPower(10); leftM.setPower(50); } }
  • 34. Advanced Line Follower • Use light-sensor as an Analog sensor • Sensor ranges btween 0 – 100 • Takes the average light detected over a small area
  • 35. Advanced Line Follower • Subtract the current reading of the sensor from what the sensor should be reading – Use this value to directly control direction and power of the wheels • Multiply this value for a constant: how strongly the wheels should turn to correct its path? • Add a value to be sure that the robot is always moving forward
  • 36. Advanced Line Follower NXTMotor rightM = new NXTMotor(MotorPort.A); NXTMotor leftM = new NXTMotor(MotorPort.C); int targetValue = 30; int amplify = 7; int targetPower = 50; ColorSensor cs = new ColorSensor(SensorPort.S2, Color.RED); rightM.setPower(targetPower); leftM.setPower(targetPower); while (!Button.ESCAPE.isDown()) { int currentColor = cs.getLightValue(); int difference = currentColor - targetValue; int ampDiff = difference * amplify; int rightPower = ampDiff + targetPower; int leftPower = targetPower; rightM.setPower(rightPower); leftM.setPower(leftPower); }
  • 38. Learn how to follow • Goal – Make robots do what we want – Minimize/eliminate programming • Proposed Solution: Reinforcement Learning – Specify desired behavior using rewards – Express rewards in terms of sensor states – Use machine learning to induce desired actions • Target Platform – Lego Mindstorms NXT
  • 39. Example: Grid World • A maze-like problem – The agent lives in a grid – Walls block the agent’s path • Noisy movement: actions do not always go as planned: – 80% of the time, preferred action is taken (if there is no wall there) – 10% of the time, North takes the agent West; 10% East – If there is a wall in the direction the agent would have been taken, the agent stays put • The agent receives rewards each time step – Small “living” reward each step (can be negative) – Big rewards come at the end (good or bad) • Goal: maximize sum of rewards
  • 40. Markov Decision Processes • An MDP is defined by: – A set of states s  S – A set of actions a  A – A transition function T(s,a,s’) • Prob that a from s leads to s’ • i.e., P(s’ | s,a) • Also called the model (or dynamics) – A reward function R(s, a, s’) • Sometimes just R(s) or R(s’) – A start state – Maybe a terminal state • MDPs are non-deterministic search problems – Reinforcement learning: MDPs where we don’t know the transition or reward functions
• 41. What is Markov about MDPs? • “Markov” generally means that given the present state, the future and the past are independent • For Markov decision processes, “Markov” means that the outcome of an action depends only on the current state and action: P(St+1 = s’ | St = st, At = at, St-1 = st-1, …, S0 = s0) = P(St+1 = s’ | St = st, At = at) • Andrej Andreevič Markov (1856-1922)
• 42. Solving MDPs: policies • In deterministic single-agent search problems, want an optimal plan, or sequence of actions, from start to a goal • In an MDP, we want an optimal policy π*: S → A – A policy π gives an action for each state – An optimal policy maximizes expected utility if followed – An explicit policy defines a reflex agent • Optimal policy when R(s, a, s’) = -0.03 for all non-terminals s
• 43. Example Optimal Policies • Optimal policies shown for R(s) = -2.0, R(s) = -0.4, R(s) = -0.03 and R(s) = -0.01
• 44. MDP Search Trees • Each MDP state gives an expectimax-like search tree • s is a state, (s, a) is a q-state, and (s,a,s’) is called a transition, with T(s,a,s’) = P(s’|s,a) and reward R(s,a,s’)
  • 45. Utilities of Sequences • In order to formalize optimality of a policy, need to understand utilities of sequences of rewards • What preferences should an agent have over reward sequences? • More or less? – [1,2,2] or [2,3,4] • Now or later? – [1,0,0] or [0,0,1]
• 46. Discounting • It’s reasonable to maximize the sum of rewards • It’s also reasonable to prefer rewards now to rewards later • One solution: values of rewards decay exponentially
• 47. Discounting • Typically discount rewards by γ < 1 each time step – Sooner rewards have higher utility than later rewards – Also helps the algorithms converge • Example: discount of 0.5: – U([1,2,3]) = 1*1 + 0.5*2 + 0.25*3 – U([1,2,3]) < U([3,2,1])
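Completing the arithmetic of the discount-0.5 example above (a worked check, not on the original slide):
    U([1,2,3]) = 1 + 0.5·2 + 0.25·3 = 2.75
    U([3,2,1]) = 3 + 0.5·2 + 0.25·1 = 4.25
so U([1,2,3]) < U([3,2,1]): with discounting, it is better to receive the large reward first.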
• 48. Stationary Preferences • Theorem: if we assume stationary preferences, i.e. [a1, a2, …] ≻ [b1, b2, …] ⇔ [r, a1, a2, …] ≻ [r, b1, b2, …] • Then there are only two ways to define utilities – Additive utility: U([r0, r1, r2, …]) = r0 + r1 + r2 + … – Discounted utility: U([r0, r1, r2, …]) = r0 + γ r1 + γ² r2 + …
• 49. Quiz: Discounting • Given: – States a, b, c, d, e in a row – Actions: East, West and Exit (Exit available only in the end states a and e) – Exiting from a yields reward 10, exiting from e yields reward 1 – Transitions: deterministic • Quiz 1: For γ = 1, what is the optimal policy? • Quiz 2: For γ = 0.1, what is the optimal policy? • Quiz 3: For which γ are East and West equally good when in state d?
• 50. Infinite Utilities?! • Problem: infinite state sequences have infinite rewards • Solutions: – Finite horizon: • Terminate episodes after a fixed T steps (e.g. life) • Gives nonstationary policies (π depends on the time left) – Discounting: use 0 < γ < 1; the discounted sum is then bounded by Rmax / (1 − γ) • Smaller γ means smaller “horizon” – shorter term focus – Absorbing state: guarantee that for every policy, a terminal state will eventually be reached
• 51. Recap: Defining MDPs • Markov decision processes: – States S – Start state s0 – Actions A – Transitions P(s’|s,a) (or T(s,a,s’)) – Rewards R(s,a,s’) (and discount γ) • MDP quantities so far: – Policy = choice of action for each state – Utility (or return) = sum of discounted rewards
• 52. Optimal Quantities • Why? Optimal values define optimal policies! • Define the value (utility) of a state s: V*(s) = expected utility starting in s and acting optimally • Define the value (utility) of a q-state (s,a): Q*(s,a) = expected utility starting in s, taking action a and thereafter acting optimally • Define the optimal policy: π*(s) = optimal action from state s
  • 53. Gridworld V*(s) • Optimal value function V*(s)
  • 54. Gridworld Q*(s,a) • Optimal Q function Q*(s,a)
• 55. Values of States • Fundamental operation: compute the value of a state – Expected utility under optimal action – Average sum of (discounted) rewards • Recursive definition of value (written out below)
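The recursive definition referred to above is the standard Bellman optimality equation; written out here in LaTeX, since the slide showed it only as a diagram:
    V^*(s) = \max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V^*(s')\bigr]
    Q^*(s,a) = \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma \max_{a'} Q^*(s',a')\bigr]
    V^*(s) = \max_a Q^*(s,a)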
• 56. Why Not Search Trees? • We’re doing way too much work with search trees • Problem: states are repeated – Idea: only compute needed quantities once • Problem: the tree goes on forever – Idea: do depth-limited computations, but with increasing depths until the change is small – Note: deep parts of the tree eventually don’t matter if γ < 1
  • 57. Time-limited Values • Key idea: time-limited values • Define Vk(s) to be the optimal value of s if the game ends in k more time steps – Equivalently, it’s what a depth-k search tree would give from s
  • 58.–66. Value iteration on the gridworld: snapshots of the time-limited values Vk for k = 0, 1, 2, 3, 4, 5, 6, 7 and k = 100
• 67. Value Iteration • Problems with the recursive computation: – Have to keep all the Vk*(s) around all the time – Don’t know which depth k(s) to ask for when planning • Solution: value iteration – Calculate values for all states, bottom-up – Keep increasing k until convergence
• 68. Value Iteration • Idea: – Start with V0*(s) = 0 for all s, which we know is right (why?) – Given Vi*, calculate the values for all states for depth i+1: Vi+1(s) ← max_a Σ_s’ T(s,a,s’) [R(s,a,s’) + γ Vi(s’)] – This is called a value update or Bellman update – Repeat until convergence • Complexity of each iteration: O(S²A) • Theorem: will converge to unique optimal values – Basic idea: approximations get refined towards optimal values – Policy may converge long before values do
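A minimal tabular value-iteration sketch in Java; the array layout, state/action counts and the convergence threshold are illustrative assumptions, not taken from the slides:
  /** Tabular value iteration: T[s][a][s2] = P(s2|s,a), R[s][a][s2] = reward. */
  public class ValueIteration {
      public static double[] solve(double[][][] T, double[][][] R,
                                   double gamma, double epsilon) {
          int nS = T.length, nA = T[0].length;
          double[] V = new double[nS];          // V_0(s) = 0 for all s
          double delta;
          do {
              delta = 0;
              double[] Vnew = new double[nS];
              for (int s = 0; s < nS; s++) {
                  double best = Double.NEGATIVE_INFINITY;
                  for (int a = 0; a < nA; a++) {
                      double q = 0;             // one Bellman backup for (s,a)
                      for (int s2 = 0; s2 < nS; s2++) {
                          q += T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]);
                      }
                      best = Math.max(best, q);
                  }
                  Vnew[s] = best;
                  delta = Math.max(delta, Math.abs(Vnew[s] - V[s]));
              }
              V = Vnew;
          } while (delta > epsilon);            // repeat until convergence
          return V;
      }
  }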
• 69. Practice: Computing Actions • Which action should we choose from state s? – Given optimal values V: do a one-step look-ahead, π*(s) = argmax_a Σ_s’ T(s,a,s’) [R(s,a,s’) + γ V*(s’)] – Given optimal q-values Q: simply π*(s) = argmax_a Q*(s,a) – Lesson: actions are easier to select from Q’s!
• 70. Utilities for Fixed Policies • Another basic operation: compute the utility of a state s under a fixed (generally non-optimal) policy π • Define the utility of a state s under a fixed policy π: Vπ(s) = expected total discounted rewards (return) starting in s and following π • Recursive relation (one-step look-ahead / Bellman equation): Vπ(s) = Σ_s’ T(s,π(s),s’) [R(s,π(s),s’) + γ Vπ(s’)]
• 71. Policy Evaluation • How do we calculate the Vπ’s for a fixed policy π? • Idea one: modify the Bellman updates – Vπk+1(s) ← Σ_s’ T(s,π(s),s’) [R(s,π(s),s’) + γ Vπk(s’)] – Efficiency: O(S²) per iteration • Idea two: without the maxes it’s just a linear system, solve with Matlab (or whatever)
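“Idea two” written out (a standard identity, not shown on the slide): with T^\pi the |S| \times |S| transition matrix under \pi and R^\pi the expected one-step reward vector, the fixed-policy Bellman equation is linear,
    V^\pi = R^\pi + \gamma T^\pi V^\pi \quad\Longrightarrow\quad V^\pi = (I - \gamma T^\pi)^{-1} R^\pi
so it can be solved in one shot instead of by iteration.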
  • 72. Policy Iteration • Problem with value iteration: – Considering all actions each iteration is slow: takes |A| times longer than policy evaluation – But policy doesn’t change each iteration, time wasted • Alternative to value iteration: – Step 1: Policy evaluation: calculate utilities for a fixed policy (not optimal utilities!) until convergence (fast) – Step 2: Policy improvement: update policy using one-step look-ahead with resulting converged (but not optimal!) utilities (slow but infrequent) – Repeat steps until policy converges • This is policy iteration – It’s still optimal! – Can converge faster under some conditions
• 73. Policy Iteration • Policy evaluation: with the current policy π fixed, find the values with simplified Bellman updates (no max over actions) – Iterate until the values converge • Policy improvement: with the utilities fixed, find the best action according to a one-step look-ahead (see the equations below)
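The two steps in standard notation (reconstructed; the slide showed them as images):
    \text{Evaluation: } V^{\pi_i}_{k+1}(s) \leftarrow \sum_{s'} T(s,\pi_i(s),s')\,\bigl[R(s,\pi_i(s),s') + \gamma V^{\pi_i}_{k}(s')\bigr]
    \text{Improvement: } \pi_{i+1}(s) = \arg\max_a \sum_{s'} T(s,a,s')\,\bigl[R(s,a,s') + \gamma V^{\pi_i}(s')\bigr]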
  • 74. Comparison • In value iteration: – Every pass (or “backup”) updates both utilities (explicitly, based on current utilities) and policy (possibly implicitly, based on current policy) • In policy iteration: – Several passes to update utilities with frozen policy – Occasional passes to update policies • Hybrid approaches (asynchronous policy iteration): – Any sequences of partial updates to either policy entries or utilities will converge if every state is visited infinitely often
  • 75. Reinforcement Learning • Basic idea: – Receive feedback in the form of rewards – Agent’s utility is defined by the reward function – Must learn to act so as to maximize expected rewards – All learning is based on observed samples of outcomes
• 76. Reinforcement Learning • Reinforcement learning: – Still assume an MDP: • A set of states s ∈ S • A set of actions (per state) A • A model T(s,a,s’) • A reward function R(s,a,s’) – Still looking for a policy π(s) – New twist: don’t know T or R • I.e. don’t know which states are good or what the actions do • Must actually try actions and states out to learn
• 77. Model-Based Learning • Model-based idea: – Learn the model empirically through experience – Solve for values as if the learned model were correct • Step 1: Learn the empirical MDP model – Count outcomes for each (s,a) – Normalize to give an estimate of T(s,a,s’) – Discover R(s,a,s’) when we experience (s,a,s’) • Step 2: Solve the learned MDP – Iterative policy evaluation, for example
• 78. Example: Model-Based Learning • Episodes (γ = 1): – Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done) – Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done) • Estimated model: T(<3,3>, right, <4,3>) = 1 / 3, T(<2,3>, right, <3,3>) = 2 / 2
  • 79. Model-Free Learning • Want to compute an expectation weighted by P(x): • Model-based: estimate P(x) from samples, compute expectation • Model-free: estimate expectation directly from samples • Why does this work? Because samples appear with the right frequencies!
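The expectation in question, written out (standard formulas; the slide showed them only graphically):
    E_P[f(x)] = \sum_x P(x)\, f(x)
    \text{Model-based: } \hat{P}(x) = \frac{\mathrm{count}(x)}{N}, \qquad E_P[f(x)] \approx \sum_x \hat{P}(x)\, f(x)
    \text{Model-free: } E_P[f(x)] \approx \frac{1}{N}\sum_{i=1}^{N} f(x_i), \qquad x_i \sim P(x)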
• 80. Example: Direct Estimation • Episodes (γ = 1, R = -1 per step): – Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done) – Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done) • Direct estimates: V(2,3) ≈ (96 + -103) / 2 = -3.5, V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3
• 81. Sample-Based Policy Evaluation? • Who needs T and R? Approximate the expectation with samples (drawn from T!) • Almost! But we only actually make progress when we move to i+1.
• 82. Temporal-Difference Learning • Big idea: learn from every experience! – Update V(s) each time we experience (s,a,s’,r) – Likely s’ will contribute updates more often • Temporal difference learning – Policy still fixed! – Move values toward the value of whatever successor occurs: running average! – Sample of V(s): sample = r + γ Vπ(s’) – Update to V(s): Vπ(s) ← (1 - α) Vπ(s) + α · sample – Same update: Vπ(s) ← Vπ(s) + α (sample - Vπ(s))
  • 83. Exponential Moving Average • Exponential moving average – Makes recent samples more important – Forgets about the past (distant past values were wrong anyway) – Easy to compute from the running average • Decreasing learning rate can give converging averages
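The exponential moving average, written out (the standard recursion the slide refers to):
    \bar{x}_n = \alpha\, x_n + (1-\alpha)\,\bar{x}_{n-1} = \alpha\, x_n + \alpha(1-\alpha)\, x_{n-1} + \alpha(1-\alpha)^2\, x_{n-2} + \dots
so recent samples carry exponentially more weight than older ones; letting \alpha decrease over time gives a converging average.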
• 84. Example: TD Policy Evaluation • Take γ = 1, α = 0.5 • Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100 (done) • Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100 (done)
• 85. Problems with TD Value Learning • TD value learning is a model-free way to do policy evaluation • However, if we want to turn values into a (new) policy, we’re sunk: picking the best action needs a one-step look-ahead, π(s) = argmax_a Σ_s’ T(s,a,s’) [R(s,a,s’) + γ V(s’)], which requires T and R • Idea: learn Q-values directly • Makes action selection model-free too!
  • 86. Active Learning • Full reinforcement learning – You don’t know the transitions T(s,a,s’) – You don’t know the rewards R(s,a,s’) – You can choose any actions you like – Goal: learn the optimal policy – … what value iteration did! • In this case: – Learner makes choices! – Fundamental tradeoff: exploration vs. exploitation – This is NOT offline planning! You actually take actions in the world and find out what happens…
• 87. Detour: Q-Value Iteration • Value iteration: find successive approximations to the optimal values – Start with V0*(s) = 0, which we know is right (why?) – Given Vi*, calculate the values for all states for depth i+1: Vi+1(s) ← max_a Σ_s’ T(s,a,s’) [R(s,a,s’) + γ Vi(s’)] • But Q-values are more useful! – Start with Q0*(s,a) = 0, which we know is right (why?) – Given Qi*, calculate the q-values for all q-states for depth i+1: Qi+1(s,a) ← Σ_s’ T(s,a,s’) [R(s,a,s’) + γ max_a’ Qi(s’,a’)]
• 88. Q-Learning • Q-Learning: sample-based Q-value iteration • Learn Q*(s,a) values – Receive a sample (s,a,s’,r) – Consider your old estimate: Q(s,a) – Consider your new sample estimate: sample = r + γ max_a’ Q(s’,a’) – Incorporate the new estimate into a running average: Q(s,a) ← (1 - α) Q(s,a) + α · sample
  • 89. Q-Learning Properties • Amazing result: Q-learning converges to optimal policy – If you explore enough – If you make the learning rate small enough – … but not decrease it too quickly! – Basically doesn’t matter how you select actions (!) • Neat property: off-policy learning – learn optimal policy without following it (some caveats)
  • 90. Q-Learning • Discrete sets of states and actions – States form an N-dimensional array • Unfolded into one dimension in practice – Individual actions selected on each time step • Q-values – 2D array (indexed by state and action) – Expected rewards for performing actions
• 91. Q-Learning • Table of expected rewards (“Q-values”) – Indexed by state and action • Algorithm steps – Calculate the state index from the sensor values – Calculate the reward – Update the previous Q-value – Select and perform an action • Q(s,a) ← (1 - α) Q(s,a) + α (r + γ max_a’ Q(s’,a’))
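A compact sketch of that update on a Q-table of plain Java arrays; ALPHA and GAMMA are illustrative values, not taken from the slides:
  /** One Q-learning update on a table indexed by [state][action]. */
  public class QTable {
      static final double ALPHA = 0.2;   // learning rate (assumed)
      static final double GAMMA = 0.9;   // discount factor (assumed)

      static void update(double[][] q, int s, int a, double r, int sNext) {
          double best = q[sNext][0];
          for (int a2 = 1; a2 < q[sNext].length; a2++) {
              best = Math.max(best, q[sNext][a2]);   // max_a' Q(s',a')
          }
          q[s][a] = (1 - ALPHA) * q[s][a] + ALPHA * (r + GAMMA * best);
      }
  }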
• 92. Q-Learning and Robots • Certain sensors provide continuous values • Sonar • Motor encoders • Q-Learning requires discrete inputs • Group continuous values into discrete “buckets” • [Mahadevan and Connell, 1992] • Q-Learning produces discrete actions • Forward • Back-left/Back-right
  • 93. Creating Discrete Inputs • Basic approach – Discretize continuous values into sets – Combine each discretized tuple into a single index • Another approach – Self-Organizing Map – Induces a discretization of continuous values – [Touzet 1997] [Smith 2002]
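A sketch of the basic approach: discretize each continuous reading, then combine the buckets into a single state index. The bucket boundaries and the choice of sensors here are hypothetical, for illustration only:
  /** Combine independently discretized readings into one state index. */
  public class StateEncoder {
      // Hypothetical bucket boundaries for a sonar reading in cm.
      static int sonarBucket(int cm) {
          if (cm < 10) return 0;
          if (cm < 25) return 1;
          if (cm < 50) return 2;
          return 3;                        // 4 sonar buckets
      }

      // Hypothetical buckets for a light reading in percent.
      static int lightBucket(int percent) {
          if (percent < 30) return 0;      // dark
          if (percent < 60) return 1;      // edge
          return 2;                        // bright: 3 light buckets
      }

      /** Tuple (sonarBucket, lightBucket) -> single index 0..11. */
      static int stateIndex(int cm, int percent) {
          return sonarBucket(cm) * 3 + lightBucket(percent);
      }
  }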
  • 94. Q-Learning Main Loop • Select action • Change motor speeds • Inspect sensor values – Calculate updated state – Calculate reward • Update Q values • Set “old state” to be the updated state
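One way this loop could look on the NXT with leJOS, reusing the motor and sensor classes from the earlier line-follower code. The state, reward and action-selection helpers (getState, getReward, chooseAction), the learning constants and the import paths follow common leJOS NXJ usage but are assumptions, not the code behind these slides:
  import java.util.Random;
  import lejos.nxt.Button;
  import lejos.nxt.ColorSensor;
  import lejos.nxt.MotorPort;
  import lejos.nxt.NXTMotor;
  import lejos.nxt.SensorPort;
  import lejos.robotics.Color;

  public class QLearningLoop {
      static final double ALPHA = 0.2, GAMMA = 0.9, EPSILON = 0.1;  // assumed values
      static final int NUM_STATES = 63, NUM_ACTIONS = 5;            // counts from the slides

      public static void main(String[] args) {
          NXTMotor rightM = new NXTMotor(MotorPort.A);
          NXTMotor leftM  = new NXTMotor(MotorPort.C);
          ColorSensor cs  = new ColorSensor(SensorPort.S2, Color.RED);
          double[][] q = new double[NUM_STATES][NUM_ACTIONS];
          Random rnd = new Random();
          int leftPower = 50, rightPower = 50;                      // assumed base power
          leftM.forward();
          rightM.forward();
          int oldState = getState(cs);

          while (!Button.ESCAPE.isDown()) {
              // 1. Select an action (epsilon-greedy) and change the motor speeds
              int action = chooseAction(q, oldState, rnd);
              if (action == 1) leftPower  = Math.min(100, leftPower + 10);   // UP_LEFT
              if (action == 2) rightPower = Math.min(100, rightPower + 10);  // UP_RIGHT
              if (action == 3) leftPower  = Math.max(0, leftPower - 10);     // DOWN_LEFT
              if (action == 4) rightPower = Math.max(0, rightPower - 10);    // DOWN_RIGHT
              leftM.setPower(leftPower);                                     // action 0 = MAINTAIN
              rightM.setPower(rightPower);

              // 2. Inspect sensor values: updated state and reward
              int newState = getState(cs);
              double reward = getReward(cs);

              // 3. Update the Q value of the previous (state, action) pair
              double best = q[newState][0];
              for (int a = 1; a < NUM_ACTIONS; a++) best = Math.max(best, q[newState][a]);
              q[oldState][action] = (1 - ALPHA) * q[oldState][action]
                                  + ALPHA * (reward + GAMMA * best);

              // 4. The old state becomes the updated state
              oldState = newState;
          }
      }

      // Hypothetical state: bucket the light reading (0-100) into NUM_STATES indices.
      static int getState(ColorSensor cs) {
          return Math.min(NUM_STATES - 1, cs.getLightValue() * NUM_STATES / 101);
      }

      // Hypothetical reward: highest when the reading is near the assumed edge value (30).
      static double getReward(ColorSensor cs) {
          return -Math.abs(cs.getLightValue() - 30);
      }

      // Epsilon-greedy action selection over the Q table.
      static int chooseAction(double[][] q, int s, Random rnd) {
          if (rnd.nextDouble() < EPSILON) return rnd.nextInt(NUM_ACTIONS);
          int best = 0;
          for (int a = 1; a < NUM_ACTIONS; a++) if (q[s][a] > q[s][best]) best = a;
          return best;
      }
  }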
  • 95. Calculating the State (Motors) • For each motor: – 100% power – 93.75% power – 87.5% power • Six motor states
  • 96. Calculating the State (Sensors) • No disparity: STRAIGHT • Left/Right disparity – 1-5: LEFT_1, RIGHT_1 – 6-12: LEFT_2, RIGHT_2 – 13+: LEFT_3, RIGHT_3 • Seven total sensor states • 63 states overall
  • 97. Calculating Reward • No disparity => highest value • Reward decreases with increasing disparity
  • 98. Action Set for Line Follow • MAINTAIN – Both motors unchanged • UP_LEFT, UP_RIGHT – Accelerate motor by one motor state • DOWN_LEFT, DOWN_RIGHT – Decelerate motor by one motor state • Five total actions
• 100. Conclusions • Lego Mindstorms NXT as a convenient platform for “cognitive robotics” • Executing a task with “rules” • Learning how to execute a task – MDP – Reinforcement learning • Q-learning applied to Lego Mindstorms