Università degli Studi di Torino
Dipartimento di Informatica
Title: Reinforcement Learning and Application to Path Planning Problems
Topic: Machine Learning
Bachelor's Degree Thesis Presentation in Computer Science
1. Reinforcement Learning and Application
to Path Planning Problems
Advisor: Cristina Baroglio
Candidate: Luca Marignati
12/07/2019
Bachelor's Thesis in Computer Science
Torino
2. Outline
• Context
• RL problem
• TD method: Q-Learning and Sarsa
• Software
• Tests
• Conclusions
• Future developments
5. Elements of the RL problem
• Policy: π ∶ S → A
  • Find the optimal policy π*
• Reward function: R(S, A) → reward
• Value function
• Model
  • Optional
  • Model-free approach
• Other elements
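As a rough illustration (not taken from the thesis software), these elements could be represented in JavaScript roughly as follows; the state encoding and all names are assumptions made for this sketch.

```javascript
// Illustrative JavaScript representation of the RL elements (all names assumed).
const actions = ["up", "down", "left", "right"];

// Policy π: S -> A, here a lookup table from an encoded state to an action.
const policy = {};                    // e.g. policy["0,3"] = "up"
function act(state) {
  return policy[state] ?? actions[0]; // fall back to a default action
}

// Reward function R(S, A) -> reward; a constant step cost in this sketch.
function reward(state, action) {
  return -1;
}

// Value function: expected long-term return of each state, filled in by learning.
const values = {};                    // e.g. values["0,3"] = estimated return

// Model: optional. A model-free approach learns from experience alone and
// never builds a transition function that predicts the next state.
```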
6. Temporal difference method
• Methods guided by two time instants: instant t and instant t + 1
• Model-free: learn directly from experience
• Bootstrapping: step-by-step incremental approach
• Off-policy/On-policy method: Q-Learning/Sarsa
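A minimal sketch of the temporal-difference idea, here TD(0) for state values: the estimate at instant t is updated from the reward and the bootstrapped estimate at instant t + 1. The table `V` and the constants `alpha` and `gamma` are illustrative assumptions, not part of the thesis code.

```javascript
// TD(0) update: a minimal sketch with an assumed value table V,
// learning rate alpha and discount factor gamma.
const V = {};                       // state -> estimated value
const alpha = 0.1;                  // learning rate
const gamma = 0.9;                  // discount factor

function tdUpdate(stateT, rewardT1, stateT1) {
  const vT  = V[stateT]  ?? 0;
  const vT1 = V[stateT1] ?? 0;      // bootstrapping: reuse the current estimate at t+1
  V[stateT] = vT + alpha * (rewardT1 + gamma * vT1 - vT);
}
```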
8. Based on Q(s,a)
• Similar to the value function, but focused on the state-action pair
• Q(s,a) is the quality value, the analogue of a state's utility value
• Describes the gain or loss obtained by performing action a in state s
• Total long-term reward (environment knowledge)
• Bellman equation: Q(s,a) = R(s,a) + γ max_a' Q(s',a')
9. Q-Learning: off-policy
Initialize Q(s,a) arbitrarily
Repeat (for each episode)
    Initialize St
    Repeat (for each step of episode)
        Choose at from St using policy derived from Q (e.g. ε-greedy)
        Take action at, observe R, St+1
        Update Q-value: Q(St, at) ← Q(St, at) + α [R + γ max_a Q(St+1, a) − Q(St, at)]
        St = St+1
    until St is terminal
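For concreteness, a compact JavaScript sketch of this loop follows. It is illustrative only, not the thesis software: the environment object `env` with `reset()` and `step(state, action) -> { reward, nextState, done }`, the Q-table helper and the ε-greedy function are all assumptions.

```javascript
// Q-Learning loop: a sketch under an assumed environment interface
// env.reset() -> initial state, env.step(s, a) -> { reward, nextState, done }.

// Assumed Q-table helper: quality values default to 0.
const qTable = new Map();
const q = {
  get: (s, a) => qTable.get(JSON.stringify([s, a])) ?? 0,
  set: (s, a, v) => qTable.set(JSON.stringify([s, a]), v),
};

// ε-greedy behaviour policy derived from the Q-table.
function epsilonGreedy(state, actions, epsilon) {
  if (Math.random() < epsilon) {
    return actions[Math.floor(Math.random() * actions.length)];  // explore
  }
  return actions.reduce((best, a) =>
    q.get(state, a) > q.get(state, best) ? a : best);            // exploit
}

function qLearningEpisode(env, actions, alpha = 0.1, gamma = 0.9, epsilon = 0.1) {
  let state = env.reset();                                // Initialize St
  let done = false;
  while (!done) {
    const action = epsilonGreedy(state, actions, epsilon); // Choose at from St (ε-greedy)
    const step = env.step(state, action);                  // Take at, observe R and St+1
    // Off-policy target: greedy value of the next state.
    const bestNext = Math.max(...actions.map(a => q.get(step.nextState, a)));
    const target = step.reward + gamma * (step.done ? 0 : bestNext);
    q.set(state, action, q.get(state, action) + alpha * (target - q.get(state, action)));
    state = step.nextState;                                 // St = St+1
    done = step.done;
  }
}
```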
10. Sarsa: on-policy
Initialize Q(s,a) arbitrarily
Repeat (for each episode)
    Initialize St
    Choose at from St using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode)
        Take action at, observe R, St+1
        Choose at+1 from St+1 using policy derived from Q (e.g. ε-greedy)
        Update Q-value: Q(St, at) ← Q(St, at) + α [R + γ Q(St+1, at+1) − Q(St, at)]
        St = St+1; at = at+1
    until St is terminal
Similar structure to Q-Learning; only the update rule changes
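The same sketch with the Sarsa update, reusing the assumed `env`, Q-table and ε-greedy helper from the previous block: being on-policy, the action actually chosen for the next step enters the target, rather than the greedy one.

```javascript
// Sarsa loop: on-policy variant of the previous sketch (same assumed helpers).
function sarsaEpisode(env, actions, alpha = 0.1, gamma = 0.9, epsilon = 0.1) {
  let state = env.reset();                                   // Initialize St
  let action = epsilonGreedy(state, actions, epsilon);       // Choose at from St (ε-greedy)
  let done = false;
  while (!done) {
    const step = env.step(state, action);                    // Take at, observe R and St+1
    const nextAction = epsilonGreedy(step.nextState, actions, epsilon); // Choose at+1 (ε-greedy)
    // On-policy target: value of the action that will actually be taken next.
    const target = step.reward + gamma * (step.done ? 0 : q.get(step.nextState, nextAction));
    q.set(state, action, q.get(state, action) + alpha * (target - q.get(state, action)));
    state = step.nextState; action = nextAction;             // St = St+1; at = at+1
    done = step.done;
  }
}
```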
11. Different approaches for the value update rule
(1) Q-Learning update (off-policy): Q(st, at) ← Q(st, at) + α [rt+1 + γ max_a Q(st+1, a) − Q(st, at)]
• Action at: current policy (e.g. ε-greedy policy)
• Action at+1: greedy policy starting from the state st+1
(2) Sarsa update (on-policy): Q(st, at) ← Q(st, at) + α [rt+1 + γ Q(st+1, at+1) − Q(st, at)]
• Action at: current policy (e.g. ε-greedy policy)
• Action at+1: current policy (e.g. ε-greedy policy)
14. Problem description
1. Single-agent system
2. Variants of the environment (Grid 12x4 / 10x10)
3. Finite states and actions
• Finite states (48/100)
• Limited number of actions {up, down, right, left}
4. Target: reach the goal state
5. Episodic task
6. Reward function (see the sketch below)
• −1 for non-terminal states (neutral states)
• −100 for defeat states (The Cliff)
• +100 for the Goal State
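A small sketch of how the 12x4 cliff-style grid and this reward function could be encoded. The grid layout, the goal and cliff positions and the helper names are assumptions for illustration, not the configuration used in the thesis software.

```javascript
// Sketch of a 12x4 cliff-style grid environment (positions are illustrative).
const COLS = 12, ROWS = 4;
const goal = { x: 11, y: 3 };
const cliff = Array.from({ length: 10 }, (_, i) => ({ x: i + 1, y: 3 })); // bottom row between start and goal

function isCliff(s) { return cliff.some(c => c.x === s.x && c.y === s.y); }
function isGoal(s)  { return s.x === goal.x && s.y === goal.y; }

// Reward function from the slide: -1 neutral, -100 cliff, +100 goal.
function rewardOf(state) {
  if (isGoal(state))  return +100;
  if (isCliff(state)) return -100;
  return -1;
}

// Moving off the grid keeps the agent in place (a common assumption).
function move(state, action) {
  const d = { up: [0, -1], down: [0, 1], left: [-1, 0], right: [1, 0] }[action];
  const x = Math.min(COLS - 1, Math.max(0, state.x + d[0]));
  const y = Math.min(ROWS - 1, Math.max(0, state.y + d[1]));
  return { x, y };
}
```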
17. Section 2
VISUALIZATION OF THE ENVIRONMENT
BUTTONS
1) Start/Stop/Accelerate learning
2) Set a Goal State
3) Set a Defeat State
4) Modal for choosing positions
18. Section 3
INFORMATION ON RESULTS
1) Average reward
2) Average moves
PERFORMANCE OF THE ALGORITHM
1) Chart.js
2) Verification of learning
3) Convergence to the optimal path
Q-VALUES FOR STATES
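For illustration only (not the thesis code), a minimal Chart.js line chart of the average reward per episode is one way such a convergence check can be visualised; the canvas id and the `avgRewards` array are assumptions.

```javascript
// Minimal Chart.js sketch: plot the average reward per episode to check convergence.
// Assumes a <canvas id="rewardChart"> element and an avgRewards array filled during learning.
const avgRewards = [];                       // one entry pushed at the end of each episode

const chart = new Chart(document.getElementById("rewardChart"), {
  type: "line",
  data: {
    labels: avgRewards.map((_, i) => i + 1), // episode number
    datasets: [{ label: "Average reward", data: avgRewards }]
  }
});

// After each episode: avgRewards.push(totalReward / moves); chart.update();
```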
27. Conclusions
1. Can an agent learn without having examples of correct behavior?
   Difference with Supervised Learning
2. Study of methods for Reinforcement Learning and understanding of the
   basic principles that characterize them (notions of agent, environment, MDP, ...)
3. Focus on the study of TD methods (Sarsa and Q-Learning)
4. Analysis of a practical problem: Path Planning
5. JavaScript software: the agent is able to adapt to any type of
   environment provided as input in order to reach the set goal
6. Different nature of the Sarsa and Q-Learning algorithms
28. Conclusions
Sarsa:
• Safe path
• Prudent policy
• Not suitable for complex environments
• Optimizes the agent's performance
• When mistakes are expensive, it keeps the risk away
Q-Learning:
• Fast path
• Risky attitude
• Suitable for any type of environment
• Used to train agents in simulated environments
• Errors do not involve large losses
Both are model-free: adapting to changes in the environment is expensive (TD property)
29. Future developments
• Application of RL to a real-world problem
• Partially Observable Markov Decision Processes (POMDP)
• Model-based algorithms
• Better learning policy (e.g. Softmax)
• Replace the Q-table with Artificial Neural Networks
  (e.g. the chess game state space ≈ 10^120)
• Continuous tasks (not episodic)
• Multi-agent systems (opponent agent)