Reinforcement Learning and Its Application to Path Planning Problems
Supervisor: Cristina Baroglio   Candidate: Luca Marignati
12/07/2019
Bachelor's Thesis in Computer Science
Torino
Outline
• Context
• RL problem
• TD method
• Q-Learning and Sarsa
• Software
• Tests
• Conclusions
• Future developments
2
Context: Machine Learning Paradigms
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
3
Actors: Agent and Environment
4
Policy
• π: S → A
• Find the optimal policy π*
Reward function
• R: (S, A) → reward
Value function
Model
• Optional
• Model-free approach
Other elements
5
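A compact way to tie these elements together is the standard RL/MDP notation; the formulas below are the usual textbook forms, added here as an illustration rather than taken from the slides.

```latex
% Standard notation (assumed, not reproduced from the slides):
% a policy maps states to actions, the reward function scores state-action pairs.
\pi : S \to A, \qquad R : S \times A \to \mathbb{R}
% Value function: expected discounted return when following \pi from state s
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \,\middle|\, s_t = s\right]
```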
Temporal difference methods
• Guided by two time instants: instant t and instant t + 1
• Model-free: learn directly from experience
• Bootstrapping: step-by-step incremental approach
• Off-policy / on-policy methods: Q-Learning / Sarsa
6
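Both the "two time instants" and the bootstrapping idea can be read off the tabular TD(0) value update, shown here in its standard form as an illustration (α is the learning rate, γ the discount factor): the update looks only at instants t and t + 1 and bootstraps on the current estimate of V(St+1).

```latex
V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]
```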
Algorithms: Q-Learning and Sarsa
7
Based on Q(s,a)
• Similar to V(s), but focused on the state-action pair
• From the value of a state's utility to a quality value
• Describes the gain or loss obtained by performing action a in state s
• Total long-term reward (environment knowledge)
• Bellman equation (standard form reproduced below)
8
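The Bellman optimality equation for Q, written in its standard form (the slide shows it as an image, so this is the textbook version rather than a verbatim copy):

```latex
Q^{*}(s,a) = \mathbb{E}\!\left[\, r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a') \,\middle|\, s_t = s,\ a_t = a \,\right]
```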
Q-Learning: off-policy
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize St
    Repeat (for each step of episode):
        Choose at from St using policy derived from Q (e.g. ε-greedy)
        Take action at, observe R, St+1
        Update Q-Value
        St = St+1
9
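A minimal JavaScript sketch of one Q-Learning step, assuming a qTable object keyed by position (as in the implementation choices later); epsilonGreedy and step are hypothetical helpers, not the thesis software's actual API.

```javascript
// Greedy value of a state: best Q(s, a) over the actions available there
// (null marks actions that are not available in that cell).
function maxQ(qTable, state) {
  const values = Object.values(qTable[state]).filter(v => v !== null);
  return Math.max(...values);
}

// One Q-Learning step: behave ε-greedily, but bootstrap on the greedy
// value of the next state (off-policy target).
function qLearningStep(qTable, state, alpha, gamma, epsilon) {
  const action = epsilonGreedy(qTable, state, epsilon);   // behaviour policy
  const { reward, nextState } = step(state, action);      // act in the environment
  const target = reward + gamma * maxQ(qTable, nextState);
  qTable[state][action] += alpha * (target - qTable[state][action]);
  return nextState;
}
```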
Sarsa: on-policy
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize St
    Choose at from St using policy derived from Q (e.g. ε-greedy)
    Repeat (for each step of episode):
        Take action at, observe R, St+1
        Choose at+1 from St+1 using policy derived from Q (e.g. ε-greedy)
        Update Q-Value
        St = St+1; at = at+1
Similar structure to Q-Learning; only the update rule changes.
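For comparison, a sketch of one Sarsa step under the same assumptions as the Q-Learning example above; the only difference is that the target uses the Q-value of the action actually selected for the next state.

```javascript
// One Sarsa step: the next action is chosen by the same ε-greedy policy
// and its Q-value (not the maximum) is used in the target (on-policy).
function sarsaStep(qTable, state, action, alpha, gamma, epsilon) {
  const { reward, nextState } = step(state, action);
  const nextAction = epsilonGreedy(qTable, nextState, epsilon);
  const target = reward + gamma * qTable[nextState][nextAction];
  qTable[state][action] += alpha * (target - qTable[state][action]);
  return { nextState, nextAction };
}
```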
Different approaches for the value update rule (both written out below)
*(1) Q-Learning  off-policy feature
• Action at  current policy (e.g. ε-greedy policy)
• Action at+1  greedy policy starting from the state st+1
*(2) Sarsa  on-policy feature
• Action at  current policy (e.g. ε-greedy policy)
• Action at+1  current policy (e.g. ε-greedy policy)
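The two update rules, in their standard textbook form:

```latex
% (1) Q-Learning (off-policy): the target uses the greedy action in s_{t+1}
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]
% (2) Sarsa (on-policy): the target uses the action a_{t+1} actually chosen in s_{t+1}
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]
```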
Practical problem: Path planning
12
Tools
• Languages
• JavaScript/JQuery
• HTML5 (Canvas and APIs)
• Libraries
• Bootstrap  Responsive layout
• Chart.js  Algorithm performance
• FontAwesome  Icon management
13
Problem description
1. Single-agent system
2. Variants of environment (Grid 12x4/10x10)
3. Finite states and actions
• Finite states (48/100)
• Limited number of actions  {up, down, right, left}
4. Target  reach the goal state
5. Episodic task
6. Reward function (a sketch follows this list)
• −1  non-terminal states (neutral states)
• −100  defeat states (The Cliff)
• +100  Goal State
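A reward function matching the specification above could look like the following; isGoal and isDefeat are hypothetical predicates over grid states, not identifiers from the thesis software.

```javascript
// Reward specification: -1 for neutral states, -100 for the cliff, +100 for the goal.
function reward(state) {
  if (isGoal(state)) return 100;     // Goal State (terminal)
  if (isDefeat(state)) return -100;  // defeat states: The Cliff (terminal)
  return -1;                         // any non-terminal (neutral) state
}
```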
DEMO
WEB  https://www.reinforcementlearning.it
LOCAL  http://rl/
15
Section 1: CONFIGURATION
1) Set parameters
2) Choose algorithm
3) Number of victories
4) Number of defeats
Section 2: VISUALIZATION OF THE ENVIRONMENT
BUTTONS
1) Start/Stop/Accelerate learning
2) Set a Goal State
3) Set a Defeat State
4) Modal for choosing positions
Section 3: RESULTS INFORMATION
1) Average reward
2) Average moves
ALGORITHM PERFORMANCE
1) Chart.js
2) Verification of learning
3) Convergence to the optimal path
Q-VALUES FOR STATES
Environment configuration
19
Key Value
goalstate x: 690, y: 210
deathstate_1 x: 150, y: 270
deathstate_2 x: 210, y: 210
deathstate_3 x: 270, y: 420
deathstate_4 x: 330, y: 210
deathstate_5 x: 390, y: 530
… …
startstate x: 30, y: 210
Object representation: Key-Value Structure
Terminal State
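The key-value structure above maps directly onto a plain JavaScript object; a minimal sketch with the coordinates from the table (property names are assumed, not copied from the software):

```javascript
// Environment configuration as a key-value object (coordinates in canvas pixels).
const environment = {
  startstate:   { x: 30,  y: 210 },
  goalstate:    { x: 690, y: 210 },  // terminal state
  deathstate_1: { x: 150, y: 270 },
  deathstate_2: { x: 210, y: 210 },
  deathstate_3: { x: 270, y: 420 },
  // ...remaining death states omitted
};
```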
Implementation choices (1)
• Tabular description
 Limited state space
 Cells: Q(s,a)-values (initialized to 0  no knowledge)
 Key-Value Structure (sketch after the table)
Pos. Up Down Right Left
3030 - 0 0 -
3090 0 0 0 -
9030 - 0 0 0
9090 0 0 0 0
15030 - 0 0 0
15090 0 0 0 0
… … … … …
… … … … …
… … … … …
630150 0 0 0 0
630210 0 - 0 0
690150 0 0 - 0
690210 0 - - 0
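As an illustration of the table above, the Q-table can be sketched as a nested key-value object whose keys concatenate a cell's x and y coordinates (e.g. "3030" for x: 30, y: 30); null stands in for the "-" entries, i.e. actions unavailable at that cell. This shows the structure only, not the software's exact code.

```javascript
// Q-table keyed by position ("x" + "y"), one Q-value per action, all 0 at start.
const qTable = {
  "3030":   { up: null, down: 0,    right: 0,    left: null },
  "3090":   { up: 0,    down: 0,    right: 0,    left: null },
  "690210": { up: 0,    down: null, right: null, left: 0    },
  // ...one entry per grid cell
};
```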
Implementation choices (2)
How are actions chosen?
• ε-greedy policy (e.g. ε = 0.1)
• Trade-off between exploration and exploitation (sketch below)
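A minimal ε-greedy selection sketch over the Q-table structure shown earlier (the default ε = 0.1 matches the slide; the function name is the same hypothetical helper used in the Q-Learning and Sarsa sketches):

```javascript
// With probability ε pick a random legal action (exploration),
// otherwise pick the action with the highest Q-value (exploitation).
function epsilonGreedy(qTable, state, epsilon = 0.1) {
  const actions = Object.keys(qTable[state])
    .filter(a => qTable[state][a] !== null);  // legal actions only
  if (Math.random() < epsilon) {
    return actions[Math.floor(Math.random() * actions.length)];
  }
  return actions.reduce((best, a) =>
    qTable[state][a] > qTable[state][best] ? a : best);
}
```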
Tests
• #Test1: Grid 12x4
• #Test2: Grid 10x10 – simple environment
• #Test3: Grid 10x10 – complex environment
• #Test4: Grid 10x10 – dynamic environment
Common features of all tests: Input environments  Algorithm choice  Trial-and-error  Convergence to the optimal path

Step 1: Input environments
• Different grid environment configurations (12x4 / 10x10)
• Different degrees of difficulty
Step 2: Algorithm choice
Step 3: Trial-and-error learning
Step 4: Convergence to the optimal path
Conclusions
1. Can an agent learn without having examples of correct behavior? 
Difference with Supervised Learning
2. Study of methods for Reinforcement Learning and understanding of the
basic principles that characterize them (notions of agent, environment,
MDP, ...)
3. Focused on the study of TD methods (Sarsa and Q-Learning)
4. Analysis of a practical problem: Path Planning
5. JavaScript software  the agent adapts to any environment provided as
input in order to reach the set objective
6. Different nature of the Sarsa and Q-Learning algorithms
Conclusions: Sarsa vs. Q-Learning

Sarsa                                      Q-Learning
Safe path                                  Fast (shortest) path
Prudent policy                             Risky attitude
Not suitable for complex environments      Suitable for any type of environment
Optimize the agent's performance           Train agents in simulated environments
Expensive mistakes  keep the risk away    Errors do not involve large losses

Model-free  adapting to changes in the environment is expensive (TD property)
Future developments
• Real-world RL problems
• Partially Observable Markov Decision Processes (POMDP)
• Model-based algorithms
• Better learning policy (e.g. Soft-max)
• Replace the Q-table with Artificial Neural Networks
(e.g. chess  state space ≈ 10^120)
• Continuous tasks (not episodic)
• Multi-agent systems (opponent agent)
Questions?
Supervisor: Cristina Baroglio   Candidate: Luca Marignati
12/07/2019
Thank you for your attention!
Torino