Reinforcement Learning
Dr. S. Nagaraju
Overview
1 About Reinforcement Learning
2 The Reinforcement Learning Problem
Rewards
Environment
State
3 Inside an RL Agent
4 Problems within Reinforcement Learning & MDP
Many Facets of Reinforcement Learning
Figure: Many Facets of Reinforcement Learning.
Branches of Machine Learning
Figure: Branches of Machine Learning.
What is Reinforcement Learning?
Science of sequential decision making: it deals with agents that must sense
and act upon their environment.
An agent learns to behave in an environment by performing actions
and seeing the results of previous actions.
For each good action, the agent gets positive feedback, and for each
bad action, the agent gets negative feedback.
In RL, the agent learns automatically from this feedback, without any
labelled data, unlike supervised learning.
Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine
learning paradigms?
No supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
Agent’s actions affect the subsequent data it receives
The Reinforcement Learning Components
Rewards
A reward Rt is a scalar feedback signal
Indicates how well the agent is doing at time-step t
The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected
cumulative reward
Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following desired trajectory
-ve reward for crashing
Defeat the world champion
+/-ve reward for winning/losing a game
Examples of Rewards
Control a power station
+ve reward for producing power
-ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
-ve reward for falling over
Agent and Environment
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot
Emits scalar reward Rt
t increments at env. step
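This interaction loop can be written down directly. Below is a minimal Python sketch; the env object with reset()/step() and the agent with act()/observe() are hypothetical placeholders (Gym-style in spirit), not anything defined in these slides.

    # Minimal agent-environment loop (sketch; env and agent are hypothetical objects)
    def run_episode(env, agent, max_steps=1000):
        observation = env.reset()                        # initial observation O_1
        total_reward = 0.0
        for t in range(max_steps):
            action = agent.act(observation)              # agent executes A_t
            observation, reward, done = env.step(action) # env emits O_{t+1}, R_{t+1}
            agent.observe(observation, reward)           # agent receives feedback
            total_reward += reward
            if done:
                break
        return total_reward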
History and State
The history is the sequence of actions, observations, rewards
Ht = A1, O1, R1, . . . , At, Ot, Rt
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the concise information used to determine what
happens next
Formally, state is a function of the history:
St = f (Ht)
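As a toy illustration of St = f(Ht) (my own example, not from the slides), one simple choice of f keeps only the most recent observation as the state:

    # history is a list of (action, observation, reward) tuples, i.e. H_t
    def state_from_history(history):
        _, last_observation, _ = history[-1]   # naive f: state = latest observation
        return last_observation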
Environment State
The environment state S^e_t is the environment's data representation,
i.e. whatever data the environment uses to pick the next
observation/reward.
The environment can be real-world or simulated.
Agent State
The agent state S^a_t is the agent's internal representation,
i.e. whatever information the agent uses to pick the next action,
i.e. it is the information used by reinforcement learning algorithms.
It can be a function of history:
S^a_t = f(Ht)
Information State
An information state (a.k.a. Markov state) contains all useful
information from the history.
Definition
A state St is Markov if and only if
P[St+1|St] = P[St+1|S1, . . . , St]
“The future is independent of the past given the present”
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
The environment state S^e_t is Markov
Fully Observable Environments
Full observability: agent directly observes the environment
St = S^a_t = S^e_t
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
Partially Observable Environments
Partial observability: agent indirectly observes the environment:
A robot with camera vision isn't told its absolute location
Now agent state ≠ environment state; formally, this is a
partially observable Markov decision process (POMDP)
Major Components of the RL Agent
RL agent may include one or more of these components:
Policy: the agent's behaviour, i.e. what actions it performs
Value function: prediction of future reward
Model: agent’s view of the environment
Policy
A policy is the agent’s behaviour
It is a map from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a|St = s]
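The two forms of policy can be sketched in a few lines of Python. This is an illustrative example with made-up states and actions, not part of the slides:

    import random

    # Deterministic policy: a = pi(s), a plain state-to-action lookup
    deterministic_policy = {"s1": "left", "s2": "right"}

    def act_deterministic(state):
        return deterministic_policy[state]

    # Stochastic policy: pi(a|s) = P[A_t = a | S_t = s], a distribution over actions
    stochastic_policy = {
        "s1": {"left": 0.8, "right": 0.2},
        "s2": {"left": 0.1, "right": 0.9},
    }

    def act_stochastic(state):
        actions, probs = zip(*stochastic_policy[state].items())
        return random.choices(actions, weights=probs, k=1)[0]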
Value Function
A value function is a prediction of future reward
It is used to evaluate the goodness/badness of states and/or actions,
and therefore to select between actions
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + . . . | St = s]
Model
A model is the agent's view of the environment
P predicts the next state (transition model)
R predicts the next (immediate) reward (reward model)
P^a_ss' = P[St+1 = s' | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
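For small problems the model can be stored as plain lookup tables. The following Python sketch is illustrative only (made-up states, actions and numbers, not from the slides):

    # Tabular model: P^a_ss' as nested dicts and R^a_s as a dict
    transition_model = {
        ("s1", "go"): {"s1": 0.1, "s2": 0.9},
        ("s2", "go"): {"s2": 1.0},
    }
    reward_model = {
        ("s1", "go"): -1.0,
        ("s2", "go"): 0.0,
    }

    def expected_next_value(state, action, value):
        # One-step lookahead with the model: sum over s' of P[s'|s,a] * value[s']
        return sum(p * value[s2] for s2, p in transition_model[(state, action)].items())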
Maze Example
Rewards: -1 per
time-step
Actions: N, E, S, W
States: Agent’s location
Maze Example Policy
Arrows represent policy π(s) for each state s
Maze Example: Value Function
Numbers represent the value vπ(s) of each state s (minus the number of
steps the agent is away from the goal)
Maze Example: Model
Grid layout represents the transition model P^a_ss'
Numbers represent the immediate reward R^a_s for each state s
(same for all actions a)
Categorizing RL agents (1)
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
Categorizing RL agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
Reinforcement Learning Agent Taxonomy
Figure: Reinforcement Learning Agent Taxonomy.
Learning and Planning
Two fundamental problems in sequential decision making
Reinforcement Learning:
The environment model is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any
interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering,
thought, search
Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
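A common way to trade off the two is ε-greedy action selection: exploit the action with the highest estimated value most of the time, and explore a random action with a small probability ε. The sketch below is illustrative Python (the q_values table is hypothetical), not something defined in the slides:

    import random

    def epsilon_greedy(q_values, state, actions, epsilon=0.1):
        # With probability epsilon, explore a random action; otherwise exploit
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))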
Exploration and Exploitation Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
Markov Decision Process (MDP)
Overview
Introduction
Markov Process
Markov Reward Process
Markov Decision Process
Introduction to MDP
A Markov decision process is a mathematical framework for describing the
reinforcement learning environment,
where the environment is fully observable
Almost all RL problems can be formalised as MDPs
Partially observable problems can be converted into MDPs
Markov Property
“The future is independent of the past given the present”
Definition
A state St is Markov if and only if
P[St+1|St] = P[St+1|S1, . . . , St]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
State Transition Matrix
For a Markov state s and successor state s', the state
transition probability is defined by
Pss' = P[St+1 = s' | St = s]
The state transition matrix P gives the transition probabilities from
all states s to all successor states s',

          [ P11 ... P1n ]
    P =   [  .  ...  .  ]
          [ Pn1 ... Pnn ]

where each row of the matrix sums to 1.
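Concretely, a transition matrix is just a square array whose rows are probability distributions. A small NumPy sketch with made-up numbers (not the Student chain of the later slides):

    import numpy as np

    # P[i, j] = P[S_{t+1} = j | S_t = i] for a 3-state chain
    P = np.array([
        [0.5, 0.5, 0.0],
        [0.0, 0.8, 0.2],
        [0.0, 0.0, 1.0],   # state 2 is absorbing
    ])
    assert np.allclose(P.sum(axis=1), 1.0)   # every row sums to 1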
Markov Process
A Markov process is a memoryless random process, i.e. a sequence of
states S1, S2, ... with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple <S, P>
S is a (finite) set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
Example: Student Markov Chain/Process
Example: Student Markov Process Episodes
Sample episodes for the Student Markov
Chain, starting from S1 = C1:
S1, S2, ..., ST
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
Example: Student Markov Chain Transition Matrix
P =
            C1    C2    C3    Pass  Pub   FB    Sleep
    C1            0.5                      0.5
    C2                  0.8                      0.2
    C3                        0.6   0.4
    Pass                                         1.0
    Pub     0.2   0.4   0.4
    FB      0.1                            0.9
    Sleep                                        1.0
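The same transition probabilities can be written as a dictionary and used to sample episodes like those on the earlier episodes slide. A minimal Python sketch (probabilities taken from the matrix above; Sleep treated as terminal):

    import random

    student_chain = {
        "C1":   {"C2": 0.5, "FB": 0.5},
        "C2":   {"C3": 0.8, "Sleep": 0.2},
        "C3":   {"Pass": 0.6, "Pub": 0.4},
        "Pass": {"Sleep": 1.0},
        "Pub":  {"C1": 0.2, "C2": 0.4, "C3": 0.4},
        "FB":   {"C1": 0.1, "FB": 0.9},
    }

    def sample_episode(start="C1"):
        state, episode = start, [start]
        while state != "Sleep":
            successors, probs = zip(*student_chain[state].items())
            state = random.choices(successors, weights=probs, k=1)[0]
            episode.append(state)
        return episode   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']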
Markov Reward Process
A Markov reward process is a Markov chain with rewards.
Definition
A Markov Reward Process is a tuple <S, P, R, γ>
S is a (finite) set of states
P is a state transition probability matrix,
Pss' = P[St+1 = s' | St = s]
R is a reward function, Rs = E[Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
Example: Student MRP
Returns
Definition
The return Gt is the total discounted reward from time-step t.
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γ^k R
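Computing a return from a list of sampled rewards is a one-liner. An illustrative Python sketch (the example rewards are those of the first Student MRP episode shown a few slides below):

    def discounted_return(rewards, gamma):
        # rewards = [R_{t+1}, R_{t+2}, ...];  G_t = sum_k gamma^k * R_{t+k+1}
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([-2, -2, -2, 10], 0.5))   # -2.25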
Why discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to use
Avoids infinite returns in cyclic Markov processes
If the reward is financial, immediate rewards may earn more
interest than delayed rewards
Human behaviour shows preference for immediate reward
Value Function
The value function v(s) gives the long-term value of state s
Definition
The state value function v(s) of an MRP is the expected return
starting from state s
v(s) = E[Gt |St = s]
Example: Student MRP Returns
Sample returns for the Student MRP,
starting from S1 = C1 with γ = 1/2:
G1 = R2 + γR3 + ... + γ^(T−2) RT
C1 C2 C3 Pass Sleep:
G1 = −2 − 2×(1/2) − 2×(1/4) + 10×(1/8) = −2.25
C1 FB FB C1 C2 Sleep:
G1 = −2 − 1×(1/2) − 1×(1/4) − 2×(1/8) − 2×(1/16) = −3.125
C1 C2 C3 Pub C2 C3 Pass Sleep:
G1 = −2 − 2×(1/2) − 2×(1/4) + 1×(1/8) − 2×(1/16) + ··· = −3.41
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep:
G1 = −2 − 1×(1/2) − 1×(1/4) − 2×(1/8) − 2×(1/16) + ··· = −3.20
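These returns can be checked numerically. The sketch below lists the per-step rewards implied by the Student MRP (−2 for each class, −1 for Facebook, +1 for Pub, +10 for Pass) and recomputes each G1 with γ = 1/2; it is an illustrative verification, not part of the slides:

    gamma = 0.5
    episodes = {
        "C1 C2 C3 Pass Sleep":           [-2, -2, -2, 10],
        "C1 FB FB C1 C2 Sleep":          [-2, -1, -1, -2, -2],
        "C1 C2 C3 Pub C2 C3 Pass Sleep": [-2, -2, -2, 1, -2, -2, 10],
        "C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep":
            [-2, -1, -1, -2, -2, -2, 1, -2, -1, -1, -1, -2, -2, -2, 1, -2],
    }
    for name, rewards in episodes.items():
        g1 = sum(gamma ** k * r for k, r in enumerate(rewards))
        print(f"{name}: G1 = {g1:.3f}")   # -2.250, -3.125, -3.406, -3.196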
Example: State-Value Function for Student MRP(1)
Example: State-Value Function for Student MRP(2)
Example: State-Value Function for Student MRP(3)
Bellman Equation for MRPs
The value function can be decomposed into two parts:
immediate reward Rt+1
discounted value of the successor state, γv(St+1)
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γv(St+1) | St = s]
Example: Bellman Equation for Student MRP
Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γPv
where v is a column vector with one entry per state:

    [ v(1) ]   [ R1 ]       [ P11 ... P1n ] [ v(1) ]
    [  .   ] = [  .  ] + γ  [  .  ...  .  ] [  .   ]
    [ v(n) ]   [ Rn ]       [ Pn1 ... Pnn ] [ v(n) ]
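Because this is a linear system, small MRPs can be solved for v directly: v = (I − γP)⁻¹ R. An illustrative NumPy sketch with made-up 2-state numbers (not the Student MRP):

    import numpy as np

    gamma = 0.9
    P = np.array([[0.7, 0.3],
                  [0.4, 0.6]])        # state transition matrix
    R = np.array([1.0, -1.0])         # expected immediate reward per state

    # Solve (I - gamma * P) v = R rather than forming the inverse explicitly
    v = np.linalg.solve(np.eye(2) - gamma * P, R)
    print(v)

The direct solution costs roughly O(n³) in the number of states, so for large problems iterative methods are used instead.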
Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with
decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple <S, A, P, R, γ>
S is a (finite) set of states
A is a finite set of actions
P is a state transition probability matrix,
P^a_ss' = P[St+1 = s' | St = s, At = a]
R is a reward function, R^a_s = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]
Example: Student MDP
Value Function
Definition
The state-value function vπ(s) of an MDP is the expected return
starting from state s, and then following policy π
vπ(s) = Eπ[Gt |St = s]
Definition
The action-value function qπ(s,a) is the expected return starting
from state s, taking action a, and then following policy π
qπ(s,a) = Eπ[Gt |St = s,At = a]
Example: State-Value Function for Student MDP
Bellman Expectation Equation
The state-value function can again be decomposed into immediate
reward plus discounted value of successor state,
vπ(s) = Eπ[Rt+1 + γvπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ(s,a) = Eπ[Rt+1 + γqπ(St+1, At+1)|St = s,At = a]
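Applied repeatedly as an update rule, the Bellman expectation equation gives iterative policy evaluation: replace vπ(s) with the expected immediate reward plus the discounted value of the successor states under π. The tabular Python sketch below is illustrative; the policy, transitions and rewards tables are hypothetical placeholders, not from the slides:

    def evaluate_policy(states, policy, transitions, rewards, gamma=0.9, sweeps=100):
        # policy[s][a] = pi(a|s); transitions[(s, a)] = {s': P^a_ss'}; rewards[(s, a)] = R^a_s
        v = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for s in states:
                v[s] = sum(
                    pi_a * (rewards[(s, a)]
                            + gamma * sum(p * v[s2]
                                          for s2, p in transitions[(s, a)].items()))
                    for a, pi_a in policy[s].items()
                )
        return v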
Example: Bellman Expectation Equation for Student MDP
Conclusion
Reinforcement Learning is one of the most interesting and useful
areas of Machine Learning.
In RL, the agent explores the environment without any human
intervention; it is one of the core learning paradigms used in
Artificial Intelligence.
A key difficulty with RL algorithms is that factors such as delayed
feedback can slow down learning.