Introduction to Reinforcement Learning
- Utkarsh Garg
How do we learn to do Stuff?
• When a living organism is exposed to a specific stimulus (or situation), the behaviour it performs there can be strengthened, so that the next time it is exposed to the same stimulus it is prompted to execute the learned behaviour.
• The organism’s behaviour is controlled by detectable changes in the environment, i.e. something external that influences an activity. For example, our bodies can detect touch, sound, vision, etc.
• The organism’s brain uses reinforcement or punishment to modify the likelihood of a behaviour. This involves voluntary behaviour, as the following example of animal behaviour illustrates:
• A dog can be trained to jump higher when rewarded with dog treats, meaning its behaviour was reinforced by treats to perform specific actions.
With the advances in robotic arm manipulation, Google DeepMind beating a professional Go player with AlphaGo, and more recently the OpenAI team beating professional Dota 2 players, the field of reinforcement learning has exploded in recent years.
Before we look at how these systems accomplished feats like the above, let’s first learn about the building blocks of reinforcement learning.
Let’s learn to crawl before we run!
 A maze-like problem
 The agent lives in a grid
 Walls block the agent’s path
 Noisy movement: actions do not always go as planned (see the transition sketch after this slide)
 80% of the time, the action North takes the agent North
(if there is no wall there)
 10% of the time, North takes the agent West; 10% East
 If there is a wall in the direction the agent would have moved, the agent stays put
 The agent receives rewards each time step
 Small “living” reward each step (can be negative)
 Big rewards come at the end (good or bad)
 Goal: maximize sum of rewards
Grid World
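To make the noisy movement concrete, here is a minimal Python sketch of this transition model; the 80/10/10 split is from the slide, while the grid size and wall set are parameters you supply for illustration:

```python
import random

# Moves are (row, col) deltas; North decreases the row index.
MOVES = {"N": (-1, 0), "S": (1, 0), "E": (0, 1), "W": (0, -1)}
LEFT_OF = {"N": "W", "W": "S", "S": "E", "E": "N"}   # veer left of the intent
RIGHT_OF = {v: k for k, v in LEFT_OF.items()}        # veer right of the intent

def step(state, action, walls, rows, cols):
    """Apply an action: 80% as intended, 10% veer left, 10% veer right."""
    r = random.random()
    actual = action if r < 0.8 else (LEFT_OF[action] if r < 0.9 else RIGHT_OF[action])
    dr, dc = MOVES[actual]
    nxt = (state[0] + dr, state[1] + dc)
    # If the move would hit a wall or leave the grid, the agent stays put.
    if nxt in walls or not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
        return state
    return nxt

# Example: from (2, 0), North usually reaches (1, 0); 10% of the time the agent
# veers East to (2, 1), and 10% of the time it veers West off the grid and stays put.
print(step((2, 0), "N", walls={(1, 1)}, rows=3, cols=4))
```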
Deterministic Grid World vs. Stochastic Grid World
• Need to travel from point A to point B
• Each segment is labelled with a travel time in minutes; A to C, for example, takes 4 mins
• The shortest path in this problem is ACDEGHB
• This is a deterministic problem
• Let’s say we introduce some traffic, with some probability on each segment
• There is a 25% chance it will take 10 mins and a 75% chance it will take 3 mins to reach point C from point A. Assume similar probabilities for the other segments
• Now, if we run the simulation multiple times, the shortest-time path will differ from iteration to iteration because of the randomness the traffic introduces into the system. Such a process is called a stochastic process (simulated in the sketch below)
• Finding the shortest-time route is no longer straightforward. In the real world we may not even know these probabilities. Our goal now is to find the path that is most probably the shortest
Another Example
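One way to get a feel for the stochastic route problem is to simulate it. The sketch below estimates a route's expected travel time under this traffic model; the 25%/10-min vs. 75%/3-min numbers for A to C are from the slide, while the second segment's numbers are illustrative assumptions:

```python
import random

def sample_segment_time(p_slow, t_slow, t_fast):
    """Sample one traversal of a segment with random traffic."""
    return t_slow if random.random() < p_slow else t_fast

def estimate_route_time(segments, n_runs=10_000):
    """Monte Carlo estimate of a route's expected travel time.

    segments: list of (p_slow, t_slow, t_fast) tuples, one per segment.
    """
    total = 0.0
    for _ in range(n_runs):
        total += sum(sample_segment_time(*seg) for seg in segments)
    return total / n_runs

# A -> C: 25% chance of 10 mins, 75% chance of 3 mins (expected 4.75 mins).
# The second segment's numbers are made up for illustration.
route = [(0.25, 10, 3), (0.4, 8, 2)]
print(estimate_route_time(route))
```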
Reinforcement
Learning
• Reinforcement learning (RL) is an area of machine learning
concerned with how software agents ought to take actions in
an environment so as to maximize some notion of cumulative
reward.
A simple example of the above system:
 Imagine a baby is given a TV remote control at your home (environment)
 The baby (agent) will first observe the TV and its state (whether it’s on/off, what channel it’s on, etc.)
 Then the curious baby will take certain actions, like hitting the remote control (action), and observe how the TV responds (next state)
 As a non-responding TV is dull, the baby dislikes it (receiving a negative reward) and will take fewer actions that lead to such a result (updating the policy), and vice versa
 The baby will repeat the process until he/she finds a policy (what to do under different circumstances) that he/she is happy with (maximizing the total (discounted) rewards); this loop is sketched below
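The baby example is exactly the agent-environment loop that every RL system, including the Breakout demo that follows, runs. A minimal sketch, assuming an env/agent interface with these methods (not code from the slides):

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of the agent-environment loop described above."""
    state = env.reset()                      # observe the initial state
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)            # act according to the current policy
        next_state, reward, done = env.step(action)      # environment responds
        agent.update(state, action, reward, next_state)  # update the policy
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward                      # the quantity the agent maximizes
```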
BREAKOUT
Reward and Policy
• The reward structure of our system depends on how and what we want our system to learn
R(s) = -2.0, R(s) = -0.4, R(s) = -0.03, R(s) = -0.01
(four Grid World panels, each with a different per-step living reward R(s))
• We do not only want the system to greedily grab the highest reward available right now; we also want it to consider future rewards.
Why?
It leads to better strategies!
• Therefore, we want to:
• Maximize the sum of rewards
• Prefer rewards now over rewards later, since we are dealing with a stochastic process and we never know whether the action we take will actually lead to the target state with the reward
Calculating Rewards
In the picture on the left,
• the two paths are policies
• Each circle is a state and each diamond a reward
• The agent needs to decide the optimal path (or policy) so
that it maximizes its total reward
• If this were a deterministic process, both paths would lead to an equal sum of rewards
• But since we are dealing with a stochastic process, we cannot count on reaching the 4th circle, as the policy may not take us to the max reward
One way to model this is to exponentially decay future
rewards:
𝛾 (gamma) is the decay factor. Therefore, the reward equation becomes:
Total discounted reward = r₁ + 𝛾·r₂ + 𝛾²·r₃ + 𝛾³·r₄ + 𝛾⁴·r₅ + …
The above equation gives us a quantitative basis to say that the agent prefers path 1, as its total discounted reward is higher than that of the second path.
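As a quick worked example, the discounted sum is easy to compute. The sketch below compares two hypothetical reward sequences with 𝛾 = 0.9 (the sequences are illustrative, not the exact numbers from the figure):

```python
def discounted_return(rewards, gamma=0.9):
    """Total discounted reward: r1 + gamma*r2 + gamma^2*r3 + ..."""
    return sum(r * gamma**t for t, r in enumerate(rewards))

# Two hypothetical paths with the same raw rewards, ordered differently.
path_1 = [2, 2, 1, 0]   # rewards arrive early
path_2 = [0, 1, 2, 2]   # same rewards arrive late
print(discounted_return(path_1))  # ~4.61
print(discounted_return(path_2))  # ~3.98 -> the early-reward path is preferred
```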
Done with basics.
Let’s go Deeper
Q - Learning
What is Q?
• Q-value: Q(s, a) is the total discounted reward the agent gets when it takes action a and then follows the optimal path afterwards (that is why we have a max over all actions in the update below):
Q(s, a) ← Q(s, a) + α [ r + 𝛾 · max_a′ Q(s′, a′) − Q(s, a) ]
• And Q*(s, a) is this value for the best action at state s.
• By having this value for all combinations of states and actions, the agent can act optimally by simply picking, in each state, the action with the highest Q-value: this lookup structure is the Q table.
Q table
Reward | Value
1 step | -0.04
Power | +0.5
Mines | -10
End | +1 or -1
𝛾 = 0.9
Learned Q Values
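The learned Q values above come from repeatedly applying the update on the previous slide. A minimal tabular sketch with the slide's 𝛾 = 0.9; the environment interface, the learning rate α = 0.1, and the ε-greedy exploration loop are assumptions for illustration:

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning; env is assumed to expose reset(), step(), actions."""
    Q = defaultdict(float)                   # (state, action) -> estimated value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Epsilon-greedy: mostly exploit, sometimes explore (next slide).
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Move Q(s,a) toward r + gamma * max_a' Q(s',a'); zero past the end.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * best_next * (not done)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```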
Exploration Vs Exploitation
• There is an important concept of the exploration and
exploitation trade off in reinforcement learning.
• Exploration is all about finding more information about
an environment, whereas exploitation is exploiting
already known information to maximize the rewards.
• Real-life example: say you go to the same restaurant (which you like) every day. You are basically exploiting. If, on the other hand, you search for a new restaurant every time before going to any one of them, that is exploration. Exploration is very important in the search for future rewards, which might be higher than the near rewards, i.e. you may find a new restaurant even better than the one you kept exploiting.
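A standard way to manage this trade-off is an ε-greedy rule with a decaying ε: explore a lot early, exploit more as the value estimates improve. A sketch (the decay schedule and the toy restaurant Q-values are illustrative assumptions):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick the highest-valued one (exploit)."""
    if random.random() < epsilon:
        return random.choice(list(q_values))
    return max(q_values, key=q_values.get)

# Decay exploration over time: early visits try new restaurants,
# later visits mostly return to the best-known one.
q = {"usual_restaurant": 1.2, "new_restaurant": 0.8}
epsilon, min_epsilon, decay = 1.0, 0.05, 0.995
for visit in range(1000):
    choice = epsilon_greedy(q, epsilon)
    epsilon = max(min_epsilon, epsilon * decay)
```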
Generalization across States
• Basic Q-Learning keeps a table of all q-values
• In realistic situations, we cannot possibly learn about every
single state!
• Too many states to visit them all in training
• Too many states to hold the q-tables in memory
• Instead, we want to generalize:
• Learn about some small number of training states from
experience
• Generalize that experience to new, similar situations
• This is a fundamental idea in machine learning, and we’ll see it
over and over again
State space
• Discretized vertical distance from lower pipe
• Discretized horizontal distance from next pair of pipes
• Life: Dead or Living
Actions
• Click
• Do nothing
Rewards
• +1 if Flappy Bird still alive
• -1000 if Flappy Bird is dead
• 6-7 hours of Q-learning
Generalization Example 1
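For reference, this is roughly what such a discretized state could look like in code; the bucket size and attribute names are assumptions for illustration, not the actual Flappy Bird bot:

```python
def flappy_state(bird_y, pipe_x, pipe_gap_y, alive, bucket=10):
    """Discretize the observation into a small, hashable Q-table key.

    Vertical and horizontal distances are binned into `bucket`-pixel buckets,
    so nearby situations share a single Q-table entry.
    """
    dy = (pipe_gap_y - bird_y) // bucket    # discretized vertical distance
    dx = pipe_x // bucket                   # discretized horizontal distance
    return (dx, dy, alive)

# Two nearby observations collapse to the same state (and Q-table row):
print(flappy_state(203, 147, 250, True))   # (14, 4, True)
print(flappy_state(207, 141, 254, True))   # (14, 4, True)
```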
Let’s say we discover
through experience
that this state is bad:
In naïve q-learning,
we know nothing
about this state:
Or even this one!
Generalization Example 2
• Solution: describe a state using a vector of features
(properties)
• Features are functions from states to real numbers (often
0/1) that capture important properties of the state
• Example features:
• Distance to closest ghost
• Distance to closest dot
• Number of ghosts
• 1 / (dist to dot)²
• Is Pacman in a tunnel? (0/1)
• …… etc.
• Is it the exact state on this slide?
• Can also describe a q-state (s, a) with features (e.g. action
moves closer to food)
• Now, instead of a Q table, we have these features, on which we can train any supervised learning algorithm to learn the Q values and hence the right actions
Feature Based Representation
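A common concrete instance of this idea is a linear model: approximate Q(s, a) as a weighted sum of features and learn the weights instead of a table. A sketch with made-up Pacman-style features; the feature functions, the state.preview helper, and the weights dict are hypothetical:

```python
def features(state, action):
    """Feature vector f(s, a); `state.preview(action)` is a hypothetical
    helper returning the state the action would lead to."""
    nxt = state.preview(action)
    return {
        "dist_to_closest_ghost": nxt.dist_to_closest_ghost,
        "inv_dist_to_dot_sq": 1.0 / max(nxt.dist_to_closest_dot, 1) ** 2,
        "in_tunnel": float(nxt.in_tunnel),
    }

def q_value(weights, state, action):
    """Q(s, a) ~ sum_i w_i * f_i(s, a)."""
    return sum(weights[k] * v for k, v in features(state, action).items())

def td_update(weights, state, action, reward, next_q_max, alpha=0.01, gamma=0.9):
    """Nudge each weight by alpha * TD-error * feature value."""
    error = (reward + gamma * next_q_max) - q_value(weights, state, action)
    for k, v in features(state, action).items():
        weights[k] += alpha * error * v
```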
Generalization Example 3 (play video)
4 Actions available:
• The avg angle of the
blades
• Difference in angle
between front and back
• Difference in angle
between left and right
• Angle for the tail rotor
Task:
Learn to hover
States:
• Data from various sensors
Note! The most efficient policy it
found was to fly inverted!
Going even
Deeper…
Deep Q Networks (DQN)
Alpha Go
• In 2016, the initial version, AlphaGo Lee, beat 18-time world champion Lee Sedol.
• Just a year later came AlphaGo Zero, which, unlike its predecessor, was trained without any data from real human games.
• It learned only by playing against itself. The 2016 version was defeated 100-0 by AlphaGo Zero.
• Go has shown us that AI has started to move beyond what humans can tell it to do.
• This was shown when AlphaGo played move 37: to humans, even the world champion, it seemed a bad move, but it turned out to be a game-changing move that led to AlphaGo’s victory.
Arch link: https://applied-data.science/static/main/res/alpha_go_zero_cheat_sheet.png
Alpha Go Training Graph
Self Driving Cars
Supervised-learning-based self-driving car (with simulator)
https://www.youtube.com/watch?v=EaY5QiZwSP4&t=1111s
The reinforcement learning way to do this!
https://wayve.ai/blog/learning-to-drive-in-a-day-with-reinforcement-learning
Landing Spacex Rockets
https://www.youtube.com/watch?v=4_igzo4qNmQ
Thank You