Università degli Studi di Torino
Dipartimento di Informatica
Title: Reinforcement Learning and Application to Path Planning Problems
Topic: Machine Learning
Bachelor's Degree Thesis Presentation in Computer Science
1. Reinforcement Learning and Application
to Path Planning Problems
Advisor: Cristina Baroglio
Candidate: Luca Marignati
12/07/2019
Bachelor's Thesis in Computer Science
Torino
2. Outline
• Context
• RL problem
• TD method: Q-Learning and Sarsa
• Software
• Tests
• Conclusions
• Future developments
5. Elements of the RL problem
• Policy: π ∶ S → A
  • Find the optimal policy π*
• Reward function: R(S, A) → reward
• Value function
• Model
  • Optional
  • Model-free approach
• Other elements
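As a rough illustration (not taken from the thesis software), these elements could be represented in JavaScript roughly as follows; the state encoding and all names are assumptions made for this sketch.

```javascript
// Illustrative JavaScript representation of the RL elements (all names assumed).
const actions = ["up", "down", "left", "right"];

// Policy π: S -> A, here a lookup table from an encoded state to an action.
const policy = {};                    // e.g. policy["0,3"] = "up"
function act(state) {
  return policy[state] ?? actions[0]; // fall back to a default action
}

// Reward function R(S, A) -> reward; a constant step cost in this sketch.
function reward(state, action) {
  return -1;
}

// Value function: expected long-term return of each state, filled in by learning.
const values = {};                    // e.g. values["0,3"] = estimated return

// Model: optional. A model-free approach learns from experience alone and
// never builds a transition function that predicts the next state.
```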
6. Temporal difference method
• Methods guided by two time instants: instant t and instant t + 1
• Model-free: learn directly from experience
• Bootstrapping: step-by-step incremental approach
• Off-policy/On-policy method: Q-Learning/Sarsa
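A minimal sketch of the temporal-difference idea, here TD(0) for state values: the estimate at instant t is updated from the reward and the bootstrapped estimate at instant t + 1. The table `V` and the constants `alpha` and `gamma` are illustrative assumptions, not part of the thesis code.

```javascript
// TD(0) update: a minimal sketch with an assumed value table V,
// learning rate alpha and discount factor gamma.
const V = {};                       // state -> estimated value
const alpha = 0.1;                  // learning rate
const gamma = 0.9;                  // discount factor

function tdUpdate(stateT, rewardT1, stateT1) {
  const vT  = V[stateT]  ?? 0;
  const vT1 = V[stateT1] ?? 0;      // bootstrapping: reuse the current estimate at t+1
  V[stateT] = vT + alpha * (rewardT1 + gamma * vT1 - vT);
}
```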
8. Based on Q(s,a)
• Similar to the value function, but focused on the state-action pair
• Q(s,a) is the quality value, the analogue of a state's utility value
• Describes the gain or loss obtained by performing action a in state s
• Total long-term reward (environment knowledge)
• Bellman equation: Q(s,a) = R(s,a) + γ max_a' Q(s',a')
9. Q-Learning: off-policy
Initialize Q(s,a) arbitrarily
Repeat (for each episode)
    Initialize St
    Repeat (for each step of episode)
        Choose at from St using policy derived from Q (e.g. ε-greedy)
        Take action at, observe R, St+1
        Update Q-value: Q(St, at) ← Q(St, at) + α [R + γ max_a Q(St+1, a) − Q(St, at)]
        St = St+1
    until St is terminal
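For concreteness, a compact JavaScript sketch of this loop follows. It is illustrative only, not the thesis software: the environment object `env` with `reset()` and `step(state, action) -> { reward, nextState, done }`, the Q-table helper and the ε-greedy function are all assumptions.

```javascript
// Q-Learning loop: a sketch under an assumed environment interface
// env.reset() -> initial state, env.step(s, a) -> { reward, nextState, done }.

// Assumed Q-table helper: quality values default to 0.
const qTable = new Map();
const q = {
  get: (s, a) => qTable.get(JSON.stringify([s, a])) ?? 0,
  set: (s, a, v) => qTable.set(JSON.stringify([s, a]), v),
};

// ε-greedy behaviour policy derived from the Q-table.
function epsilonGreedy(state, actions, epsilon) {
  if (Math.random() < epsilon) {
    return actions[Math.floor(Math.random() * actions.length)];  // explore
  }
  return actions.reduce((best, a) =>
    q.get(state, a) > q.get(state, best) ? a : best);            // exploit
}

function qLearningEpisode(env, actions, alpha = 0.1, gamma = 0.9, epsilon = 0.1) {
  let state = env.reset();                                // Initialize St
  let done = false;
  while (!done) {
    const action = epsilonGreedy(state, actions, epsilon); // Choose at from St (ε-greedy)
    const step = env.step(state, action);                  // Take at, observe R and St+1
    // Off-policy target: greedy value of the next state.
    const bestNext = Math.max(...actions.map(a => q.get(step.nextState, a)));
    const target = step.reward + gamma * (step.done ? 0 : bestNext);
    q.set(state, action, q.get(state, action) + alpha * (target - q.get(state, action)));
    state = step.nextState;                                 // St = St+1
    done = step.done;
  }
}
```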
10. Sarsa: on-policy
Initialize Q(s,a) arbitrarily
Repeat (for each episode)
    Initialize St
    Choose at from St using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode)
        Take action at, observe R, St+1
        Choose at+1 from St+1 using policy derived from Q (e.g. ε-greedy)
        Update Q-value: Q(St, at) ← Q(St, at) + α [R + γ Q(St+1, at+1) − Q(St, at)]
        St = St+1; at = at+1
    until St is terminal
Similar structure to Q-Learning; only the update rule changes
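The same sketch with the Sarsa update, reusing the assumed `env`, Q-table and ε-greedy helper from the previous block: being on-policy, the action actually chosen for the next step enters the target, rather than the greedy one.

```javascript
// Sarsa loop: on-policy variant of the previous sketch (same assumed helpers).
function sarsaEpisode(env, actions, alpha = 0.1, gamma = 0.9, epsilon = 0.1) {
  let state = env.reset();                                   // Initialize St
  let action = epsilonGreedy(state, actions, epsilon);       // Choose at from St (ε-greedy)
  let done = false;
  while (!done) {
    const step = env.step(state, action);                    // Take at, observe R and St+1
    const nextAction = epsilonGreedy(step.nextState, actions, epsilon); // Choose at+1 (ε-greedy)
    // On-policy target: value of the action that will actually be taken next.
    const target = step.reward + gamma * (step.done ? 0 : q.get(step.nextState, nextAction));
    q.set(state, action, q.get(state, action) + alpha * (target - q.get(state, action)));
    state = step.nextState; action = nextAction;             // St = St+1; at = at+1
    done = step.done;
  }
}
```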
11. Different approaches for the value update rule
(1) Q-Learning update (off-policy): Q(st, at) ← Q(st, at) + α [rt+1 + γ max_a Q(st+1, a) − Q(st, at)]
• Action at: current policy (e.g. ε-greedy policy)
• Action at+1: greedy policy starting from the state st+1
(2) Sarsa update (on-policy): Q(st, at) ← Q(st, at) + α [rt+1 + γ Q(st+1, at+1) − Q(st, at)]
• Action at: current policy (e.g. ε-greedy policy)
• Action at+1: current policy (e.g. ε-greedy policy)
14. Problem description
1. Single-agent system
2. Variants of the environment (Grid 12x4 / 10x10)
3. Finite states and actions
• Finite states (48/100)
• Limited number of actions {up, down, right, left}
4. Target: reach the goal state
5. Episodic task
6. Reward function (see the sketch below)
• −1 for non-terminal states (neutral states)
• −100 for defeat states (The Cliff)
• +100 for the Goal State
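A small sketch of how the 12x4 cliff-style grid and this reward function could be encoded. The grid layout, the goal and cliff positions and the helper names are assumptions for illustration, not the configuration used in the thesis software.

```javascript
// Sketch of a 12x4 cliff-style grid environment (positions are illustrative).
const COLS = 12, ROWS = 4;
const goal = { x: 11, y: 3 };
const cliff = Array.from({ length: 10 }, (_, i) => ({ x: i + 1, y: 3 })); // bottom row between start and goal

function isCliff(s) { return cliff.some(c => c.x === s.x && c.y === s.y); }
function isGoal(s)  { return s.x === goal.x && s.y === goal.y; }

// Reward function from the slide: -1 neutral, -100 cliff, +100 goal.
function rewardOf(state) {
  if (isGoal(state))  return +100;
  if (isCliff(state)) return -100;
  return -1;
}

// Moving off the grid keeps the agent in place (a common assumption).
function move(state, action) {
  const d = { up: [0, -1], down: [0, 1], left: [-1, 0], right: [1, 0] }[action];
  const x = Math.min(COLS - 1, Math.max(0, state.x + d[0]));
  const y = Math.min(ROWS - 1, Math.max(0, state.y + d[1]));
  return { x, y };
}
```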
17. Section 2
VISUALIZATION OF THE ENVIRONMENT
BUTTONS
1) Start/Stop/Accelerate learning
2) Set a Goal State
3) Set a Defeat State
4) Modal for choosing positions
18. Section 3
INFORMATION ON RESULTS
1) Average reward
2) Average moves
PERFORMANCE OF THE ALGORITHM
1) Chart.js
2) Verification of learning
3) Convergence to the optimal path
Q-VALUES FOR STATES
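For illustration only (not the thesis code), a minimal Chart.js line chart of the average reward per episode is one way such a convergence check can be visualised; the canvas id and the `avgRewards` array are assumptions.

```javascript
// Minimal Chart.js sketch: plot the average reward per episode to check convergence.
// Assumes a <canvas id="rewardChart"> element and an avgRewards array filled during learning.
const avgRewards = [];                       // one entry pushed at the end of each episode

const chart = new Chart(document.getElementById("rewardChart"), {
  type: "line",
  data: {
    labels: avgRewards.map((_, i) => i + 1), // episode number
    datasets: [{ label: "Average reward", data: avgRewards }]
  }
});

// After each episode: avgRewards.push(totalReward / moves); chart.update();
```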
27. Conclusions
1. Can an agent learn without having examples of correct behavior?
   Difference with Supervised Learning
2. Study of methods for Reinforcement Learning and understanding of the
   basic principles that characterize them (notions of agent, environment, MDP, ...)
3. Focus on the study of TD methods (Sarsa and Q-Learning)
4. Analysis of a practical problem: Path Planning
5. JavaScript software: the agent is able to adapt to any type of
   environment provided as input in order to reach the set goal
6. Different nature of the Sarsa and Q-Learning algorithms
28. Conclusions
Sarsa:
• Safe path
• Prudent policy
• Not suitable for complex environments
• Optimizes the agent's performance
• When mistakes are expensive, it keeps the risk away
Q-Learning:
• Fast path
• Risky attitude
• Suitable for any type of environment
• Used to train agents in simulated environments
• Errors do not involve large losses
Both are model-free: adapting to changes in the environment is expensive (TD property)
29. Future developments
• Application of RL to a real-world problem
• Partially Observable Markov Decision Processes (POMDP)
• Model-based algorithms
• Better learning policy (e.g. Softmax)
• Replace the Q-table with Artificial Neural Networks
  (e.g. the chess game state space ≈ 10^120)
• Continuous tasks (not episodic)
• Multi-agent systems (opponent agent)