Lecture21

Introduction to Machine
Learning
Lecture 21
Reinforcement Learning

Albert Orriols i Puig
http://www.albertorriols.net
aorriols@salle.url.edu

Artificial Intelligence – Machine Learning
Enginyeria i Arquitectura La Salle
Universitat Ramon Llull

Recap of Lectures 5-18
Supervised learning
Data classification
Labeled data
Build a model that
covers all the space

Unsupervised learning
Clustering
Unlabeled data
Group similar objects

Association rule analysis
Unlabeled data
Get the most frequent/important associations

Genetic Fuzzy Systems
Slide 2

Today’s Agenda

Introduction
Some examples before going farther

Slide 3

Introduction
What does reinforcement learning aim at?
Learning from interaction (with environment)

Goal-directed learning

[Figure: interaction loop between Agent and Environment, labeled with State, Action, and GOAL]

Learning what to do and its effect
Trial-and-error search and delayed reward
Slide 4

Introduction

Learn reactive behaviors
Behaviors as a mapping between perceptions and actions
The agent has to exploit what it already knows in order to
obtain reward, but it also has to explore in order to make
better action selections in the future.
Dilemma: neither exploitation nor exploration can be
pursued exclusively without failing at the task.
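
One common way to balance the two sides of this dilemma is epsilon-greedy action selection. The lecture does not name a specific strategy, so the sketch below is only an illustration; the function name `epsilon_greedy` and the value estimates are hypothetical.

```python
import random


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick an action from a dict {action: estimated value}.

    With probability epsilon the agent explores (uniformly random action);
    otherwise it exploits (action with the highest current estimate).
    """
    actions = list(action_values)
    if rng.random() < epsilon:
        return rng.choice(actions)                  # explore
    return max(actions, key=action_values.get)      # exploit


# Hypothetical value estimates for three actions available in some state.
estimates = {"search": 1.2, "wait": 0.4, "recharge": 0.0}
greedy_choice = epsilon_greedy(estimates, epsilon=0.0)    # pure exploitation
random_choice = epsilon_greedy(estimates, epsilon=1.0)    # pure exploration
```

With `epsilon=0.0` the agent never tries anything new, and with `epsilon=1.0` it never uses what it has learned; intermediate values trade one off against the other.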

Slide 5

How Can We Learn It?
1. Look-up tables

Perception   Action
State 1      Action 1
State 2      Action 2
State 3      Action 3
…            …

2. Neural networks
3. Rules
4. Finite automata
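
As a minimal sketch of the first representation, a look-up table can be stored as a dictionary from perceptions to actions. The state and action names mirror the placeholders in the table above; the helper `act` and its `default` argument are assumptions made for illustration.

```python
# Hypothetical perceptions and actions, mirroring the State/Action table.
policy_table = {
    "state_1": "action_1",
    "state_2": "action_2",
    "state_3": "action_3",
}


def act(perception, table, default=None):
    """Return the action stored for a perception, or `default` if it is unseen."""
    return table.get(perception, default)


chosen = act("state_2", policy_table)
```

A table needs one entry per perception, which is why the other representations (neural networks, rules, finite automata) become attractive when the number of states is large.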

Slide 6


Slide 7

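Reward function
$r: S \to \mathbb{R}$
or
$r: S \times A \to \mathbb{R}$

[Figure: agent-environment loop. The Agent receives state $s_t$ and reward $r_t$ from the Environment and sends action $a_t$ to it.]

Agent and environment interact at discrete time steps t = 0, 1, 2, …

The agent
observes the state at step t: $s_t \in S$
produces an action at step t: $a_t \in A(s_t)$
gets the resulting reward: $r_{t+1} \in \mathbb{R}$
moves to the next state $s_{t+1}$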
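
The interaction steps listed above can be sketched as a simple loop. The `Corridor` environment, its rewards, and the fixed "always right" policy are assumptions chosen only to keep the example small and deterministic; they do not come from the lecture.

```python
class Corridor:
    """Toy environment (illustrative assumption): cells 0..n-1, goal at the right end."""

    def __init__(self, n_cells=4):
        self.n_cells = n_cells
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def actions(self, state):
        """A(s): the actions available in state s."""
        return ["left", "right"]

    def step(self, action):
        """Apply a_t and return (s_{t+1}, r_{t+1}, done)."""
        move = 1 if action == "right" else -1
        self.state = min(max(self.state + move, 0), self.n_cells - 1)
        done = self.state == self.n_cells - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=100):
    """Observe s_t, choose a_t, receive r_{t+1}, move to s_{t+1}; repeat."""
    trace = []
    state = env.reset()
    for _ in range(max_steps):
        action = policy(state, env.actions(state))
        next_state, reward, done = env.step(action)
        trace.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return trace


trace = run_episode(Corridor(), policy=lambda state, actions: "right")
```

Each tuple in `trace` is one time step of the form (s_t, a_t, r_{t+1}, s_{t+1}).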

Slide 8

[Figure: agent-environment loop with state $s_t$, action $a_t$, and reward $r_t$]

Trace of a trial

$\ldots \; s_t \xrightarrow{a_t} r_{t+1},\, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2},\, s_{t+2} \xrightarrow{a_{t+2}} r_{t+3},\, s_{t+3} \xrightarrow{a_{t+3}} \ldots$

Agent goal:
Maximize the total amount of reward it receives

That means maximizing not only the immediate reward
but the cumulative reward in the long run
Slide 9

Example of RL
Example: Recycling robot
State
charge level of battery

Actions
look for cans, wait for can, go recharge

Reward
positive for finding cans, negative for running out of battery

Slide 10

More precisely…
Restricting to Markov Decision Processes (MDPs)
Finite set of situations (states)
Finite set of actions
Transition probabilities

Reward probabilities

This means that
The agent needs to have complete information about the world
State $s_{t+1}$ depends only on state $s_t$ and action $a_t$
Slide 11

Recycling Robot Example

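[Figure: transition graph of the recycling robot with nodes High and Low. Each edge is labeled "probability, reward":]
High, search → High: $\alpha$, $R^{search}$
High, search → Low: $1 - \alpha$, $R^{search}$
High, wait → High: 1, $R^{wait}$
Low, search → Low: $\beta$, $R^{search}$
Low, search → High: $1 - \beta$, $-3$
Low, wait → Low: 1, $R^{wait}$
Low, recharge → High: 1, 0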

Slide 12

Recycling Robot Example
$S = \{\text{high}, \text{low}\}$
$A(\text{high}) = \{\text{wait}, \text{search}\}$
$A(\text{low}) = \{\text{wait}, \text{search}, \text{recharge}\}$

$R^{search}$: expected number of cans while searching
$R^{wait}$: expected number of cans while waiting
$R^{search} > R^{wait}$
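
A hedged sketch of how this MDP could be written down as data. It assumes the usual reading of the transition diagram: searching with a low battery depletes it with probability 1 − β, in which case the robot is rescued, recharged, and penalized with −3. The numeric values passed at the end are arbitrary placeholders, and the function names are hypothetical.

```python
def recycling_robot(alpha, beta, r_search, r_wait):
    """Return {(state, action): [(probability, next_state, reward), ...]}."""
    return {
        ("high", "search"): [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
        ("high", "wait"): [(1.0, "high", r_wait)],
        ("low", "search"): [(beta, "low", r_search), (1 - beta, "high", -3.0)],
        ("low", "wait"): [(1.0, "low", r_wait)],
        ("low", "recharge"): [(1.0, "high", 0.0)],
    }


def available_actions(mdp, state):
    """A(s): actions defined for a state, in alphabetical order."""
    return sorted(a for (s, a) in mdp if s == state)


def expected_reward(mdp, state, action):
    """Expected immediate reward of taking `action` in `state`."""
    return sum(p * r for p, _, r in mdp[(state, action)])


# Placeholder parameters; note that r_search > r_wait, as required above.
mdp = recycling_robot(alpha=0.9, beta=0.6, r_search=2.0, r_wait=1.0)
```

With these placeholder numbers, searching on a low battery has an expected immediate reward of about zero, because the −3 penalty offsets the cans collected.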

Slide 13

Breaking the Markovian Property
Possible problems that do not satisfy the MDP assumptions
When actions and states are not finite
Solution: discretize the sets of actions and states
When transition probabilities do not depend only on the current state
Possible solution: represent states as structures built up
over time from sequences of sensations
This is a POMDP (partially observable MDP)
Use POMDP algorithms to solve these problems
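
The discretization solution mentioned above can be sketched with uniform binning. The value ranges and the number of bins are illustrative assumptions, not values from the lecture, and `discretize_state` is a hypothetical helper for a state made of a position and a speed.

```python
def discretize(value, low, high, n_bins):
    """Map a continuous value in [low, high] to a bin index 0..n_bins-1."""
    if n_bins < 1 or high <= low:
        raise ValueError("need n_bins >= 1 and high > low")
    clipped = min(max(value, low), high)            # out-of-range values go to the edge bins
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)                   # value == high falls in the last bin


def discretize_state(position, speed, n_bins=10):
    """Turn a continuous (position, speed) pair into a pair of bin indices."""
    # Ranges are illustrative assumptions.
    return (discretize(position, -1.0, 1.0, n_bins),
            discretize(speed, -0.1, 0.1, n_bins))
```

Finer bins describe the state more precisely but enlarge the finite state set the agent has to learn over.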

Slide 14

Elements of Reinforcement Learning

Slide 15

Elements of RL

Policy: what to do
Reward: what’s good
Value: What’s good because it predicts reward
Model: What follows what

Slide 16

Components of an RL Agent
Policy (behavior)
Mapping from states to actions
$\pi^*: S \to A$
Reward
Local reward at time step t: $r_t$
Model
Probability of transition from state s to s’ by executing action a:
T(s,a,s’)
And
The transition probabilities depend only on these parameters
The model is not known by the agent
Slide 17

Value functions
$V^\pi(s)$: long-term reward estimate from state s following policy π
$Q^\pi(s,a)$: long-term reward estimate from state s, executing
action a and then following policy π
A simple example
A maze

Note that the agent does not know its own position. It can only
perceive what is in the surrounding states
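
To make the two definitions concrete, the sketch below computes them on a tiny deterministic corridor rather than on the maze from the slide, whose layout is not recoverable here. The corridor, the reward of 1 at the goal, the fixed policy, and the discount factor `GAMMA` (discounting is introduced later in the lecture) are all assumptions.

```python
GOAL = 4        # rightmost cell of a 5-cell corridor (illustrative assumption)
GAMMA = 0.9     # discount factor, assumed


def step(state, action):
    """Deterministic move; reward 1 on entering the goal, 0 otherwise."""
    next_state = min(max(state + (1 if action == "right" else -1), 0), GOAL)
    reward = 1.0 if next_state == GOAL and state != GOAL else 0.0
    return next_state, reward


def always_right(state):
    """A fixed policy pi: move right in every state."""
    return "right"


def v(state, policy, horizon=50):
    """V^pi(s): discounted reward collected from s when following the policy."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        if state == GOAL:
            break
        state, reward = step(state, policy(state))
        total += discount * reward
        discount *= GAMMA
    return total


def q(state, action, policy, horizon=50):
    """Q^pi(s, a): take action a first, then follow the policy."""
    if state == GOAL:
        return 0.0
    next_state, reward = step(state, action)
    return reward + GAMMA * v(next_state, policy, horizon - 1)
```

In this toy setting `q(2, "right", always_right)` equals `v(2, always_right)`, while `q(2, "left", always_right)` is smaller because the first step moves away from the goal.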
Slide 18

Pursuing the goal: Maximize long term reward

Slide 20

Goals and Rewards
OK, but I need to maximize my long-term reward. How do I
get the long-term reward?
Long-term reward is defined in terms of the goal of the agent
The agent receives the local reward at each time step

How?
Intuitive idea: Sum all the rewards obtained so far

Problem: The sum can grow without bound in non-ending tasks

Slide 21

Goals and Rewards
How can we deal with non-ending tasks?
Weighted addition of local rewards

The γ parameter (0 < γ < 1) is the discounting factor

$\ldots \; s_t \xrightarrow{a_t} r_{t+1},\, s_{t+1} \xrightarrow{a_{t+1}} r_{t+2},\, s_{t+2} \xrightarrow{a_{t+2}} r_{t+3},\, s_{t+3} \xrightarrow{a_{t+3}} \ldots$

Note the bias toward immediate rewards
If you want to reduce it, set γ close to 1
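
The weighted sum is presumably the standard discounted return, $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \ldots$; the formula itself is not recoverable from the slide, so treat that as an assumption. A minimal sketch of how it could be computed, with a hypothetical function name and made-up reward sequences:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**k * rewards[k], where rewards[0] plays the role of r_{t+1}."""
    total = 0.0
    # Work backwards: R = r + gamma * (return of the remaining rewards).
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A long run of identical rewards stands in for a non-ending task.
rewards = [1.0] * 1000
small_gamma = discounted_return(rewards, 0.5)     # dominated by the first few rewards
large_gamma = discounted_return(rewards, 0.99)    # later rewards still count
```

For bounded rewards and γ < 1 the sum stays bounded (at most the largest reward divided by 1 − γ), which is what makes it usable in non-ending tasks. A smaller γ weights the near future more heavily.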
Slide 22

Some examples

Slide 23

Pole balancing
Balance the pole
The cart can move forward
and backward
Avoid failure:
the pole falling beyond
a certain critical angle
the cart hitting the end of the track

Reward
-1 upon failure
-ak, for k steps before failure

Slide 24

Mountain Car Problem
Objective
Get to the top of the hill as
quickly as possible

State definition:
Car position and speed

Actions
Forward, reverse, none

Reward
-1 for each step in which the car is not at the top of the hill
Total: -(number of steps before reaching the top of the hill)
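
A rough sketch of one step of this task and of the reward accounting. The lecture gives no equations, so the dynamics and constants below follow the commonly used textbook formulation of the mountain car problem and should be read as assumptions. The policy at the end (push in the direction of motion) is a simple heuristic used only to drive the example.

```python
import math

MIN_POS, MAX_POS = -1.2, 0.5    # track limits (assumed); the hilltop goal is at MAX_POS
MAX_SPEED = 0.07                # speed limit (assumed)


def mountain_car_step(position, speed, action):
    """One step. action: +1 forward, -1 reverse, 0 none.

    Returns (position, speed, reward, done).
    """
    speed += 0.001 * action - 0.0025 * math.cos(3 * position)
    speed = min(max(speed, -MAX_SPEED), MAX_SPEED)
    position = min(max(position + speed, MIN_POS), MAX_POS)
    if position == MIN_POS:
        speed = 0.0             # the car stops at the left wall
    done = position >= MAX_POS
    return position, speed, -1.0, done      # every step taken costs -1


def run(policy, position=-0.5, speed=0.0, max_steps=1000):
    """Run one episode and return (number of steps, total reward)."""
    total, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = policy(position, speed)
        position, speed, reward, done = mountain_car_step(position, speed, action)
        total += reward
        steps += 1
    return steps, total


steps, total = run(lambda position, speed: 1 if speed >= 0 else -1)
```

Because every step costs −1, the total reward of an episode is minus the number of steps taken, so maximizing reward is the same as reaching the top as quickly as possible.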
Slide 25

Next Class

How to learn the policies

Slide 26
