This document provides an overview of reinforcement learning. It discusses key concepts like rewards, environment, state, and the reinforcement learning agent. The agent learns through trial-and-error interactions with its environment by trying different actions and receiving feedback in the form of rewards. The goal is to learn a policy that maximizes long-term rewards by balancing exploration of new actions with exploitation of known rewarding actions. The document also covers reinforcement learning problems, components of the reinforcement learning agent, and different categories of reinforcement learning methods.
Reinforcement learning is the training of machine learning models to make a sequence of decisions. The agent learns to achieve a goal in an uncertain, potentially complex environment. In reinforcement learning, an artificial intelligence faces a game-like situation.
At the forefront of artificial intelligence is reinforcement learning (RL), a potent paradigm for teaching intelligent agents to make sequential decisions in complicated environments. The purpose of this article is to present a thorough analysis of reinforcement learning, including its foundational ideas, essential elements, practical uses, and most recent developments.
Understanding Reinforcement Learning
Reinforcement learning is the subfield of machine learning in which an agent learns to make decisions by interacting with its environment. Unlike supervised learning, where a model is trained on labeled data, and unsupervised learning, where an algorithm finds patterns in unlabeled data, RL involves learning through trial and error. Based on its actions, the agent receives feedback in the form of rewards or penalties, which helps it gradually learn the best courses of action.
Key Components of Reinforcement Learning
Agent
The fundamental component of reinforcement learning is the agent, which is the entity in charge of making choices in a particular environment. This could be any system intended to interact with and impact its environment, such as a robot or an algorithm that plays games.
Environment
The environment is the external system or context in which the agent operates. It provides the setting in which the agent acts and from which the agent receives feedback in the form of rewards or penalties.
State
The state captures pertinent data that the agent uses to make decisions, representing the environment as it is at the moment. States play a critical role in dictating the agent's next moves and the results that follow.
Action
Actions are the choices an agent can make in a particular state. The set of feasible actions defines the agent's decision space, and the agent must select the best action given its current understanding.
Reward
Rewards provide the feedback mechanism in reinforcement learning. They quantify the immediate gain or cost of taking an action in a given state. The agent's aim is to learn a policy that maximizes cumulative reward over time.
The Reinforcement Learning Process
Reinforcement learning is best understood as a cyclical process. In this process, the agent interacts with its surroundings and modifies its behavior in response to feedback.
Exploration and Exploitation
The agent faces a fundamental trade-off between exploration and exploitation. In exploration, the agent tries different actions to learn how they affect the environment; in exploitation, it chooses the actions that, given its current knowledge, are expected to yield the highest cumulative reward.
Policy
A key idea in reinforcement learning is the policy, which is the behavior or strategy the agent uses to choose which actions to perform in which states.
Reinforcement learning (RL) is the area of machine learning concerned with how intelligent agents ought to take actions in an environment in order to maximize cumulative reward. It is one of the three basic machine learning paradigms, alongside supervised learning and unsupervised learning.
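To make the trade-off between exploration and exploitation described above concrete, here is a minimal Python sketch of epsilon-greedy action selection on a toy bandit-style problem. All names and numbers (epsilon, the four actions, the simulated rewards) are illustrative assumptions, not part of the original text.

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon, explore by picking a random action;
    # otherwise exploit the action with the highest estimated value.
    if random.random() < epsilon:
        return random.randrange(len(q_values))                        # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])       # exploit

q = [0.0] * 4        # estimated value of each of 4 actions
counts = [0] * 4
for step in range(1000):
    a = epsilon_greedy(q)
    # Toy environment (an assumption): action 2 pays best on average.
    reward = random.gauss(1.0 if a == 2 else 0.0, 1.0)
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]                               # running-mean update

Over many steps the estimates q converge toward the true average rewards, while the occasional random action keeps the agent from locking onto an early, possibly suboptimal, favourite.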
2. Overview
1 About Reinforcement Learning
2 The Reinforcement Learning Problem
Rewards
Environment
State
3 Inside an RL Agent
4 Problems within Reinforcement Learning & MDPs
3. Many Facets of Reinforcement Learning
Figure: Many Facets of Reinforcement Learning.
4. Branches of Machine Learning
Figure: Branches of Machine Learning.
5. What is Reinforcement Learning?
Reinforcement learning is the science of sequential decision making: it deals with agents that must sense and act upon their environment.
An agent learns to behave in an environment by performing actions and seeing the results of previous actions.
For each good action the agent gets positive feedback, and for each bad action the agent gets negative feedback.
Unlike supervised learning, in RL the agent learns automatically from this feedback, without any labelled data.
6. Characteristics of Reinforcement Learning
What makes reinforcement learning different from other machine learning paradigms?
There is no supervisor, only a reward signal
Feedback is delayed, not instantaneous
Time really matters (sequential, non-i.i.d. data)
The agent’s actions affect the subsequent data it receives
7. The Reinforcement Learning Components
Rewards
A reward Rt is a scalar feedback signal
It indicates how well the agent is doing at time-step t
The agent’s job is to maximise cumulative reward
Reinforcement learning is based on the reward hypothesis
Definition (Reward Hypothesis)
All goals can be described by the maximisation of expected cumulative reward
8. Examples of Rewards
Fly stunt manoeuvres in a helicopter
+ve reward for following the desired trajectory
-ve reward for crashing
Defeat the world champion
+ve/-ve reward for winning/losing a game
9. Examples of Rewards
Control a power station
+ve reward for producing power
-ve reward for exceeding safety thresholds
Make a humanoid robot walk
+ve reward for forward motion
-ve reward for falling over
10. Agent and Environment
At each step t the agent:
Executes action At
Receives observation Ot
Receives scalar reward Rt
The environment:
Receives action At
Emits observation Ot
Emits scalar reward Rt
t increments at each environment step
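This interaction protocol maps directly onto a simple loop. The sketch below is a minimal Python illustration; the env and agent objects and their method names (reset, step, act, learn) are assumptions loosely following a common Gym-style convention, not an interface defined in these slides.

# Minimal agent-environment loop, mirroring the slide:
# at each step t the agent executes At, then receives Ot+1 and Rt+1.
def run_episode(env, agent, max_steps=1000):
    observation = env.reset()                         # initial observation O1
    total_reward = 0.0
    for t in range(max_steps):
        action = agent.act(observation)               # agent executes action At
        observation, reward, done = env.step(action)  # env emits Ot+1 and Rt+1
        agent.learn(observation, reward)              # agent updates from feedback
        total_reward += reward
        if done:                                      # episode terminated
            break
    return total_reward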
11. History and State
The history is the sequence of actions, observations, and rewards:
Ht = A1, O1, R1, ..., At, Ot, Rt
What happens next depends on the history:
The agent selects actions
The environment selects observations/rewards
State is the concise information used to determine what happens next
Formally, state is a function of the history:
St = f(Ht)
12. Environment State
The environment state S^e_t is the environment’s data representation
i.e. whatever data the environment uses to pick the next observation/reward
The environment can be real-world or simulated
13. Agent State
The agent state S^a_t is the agent’s internal representation
i.e. whatever information the agent uses to pick the next action
i.e. it is the information used by reinforcement learning algorithms
It can be a function of the history:
S^a_t = f(Ht)
14. Information State
An information state (a.k.a. Markov state) contains all useful information from the history.
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, ..., St]
“The future is independent of the past given the present”
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
The environment state S^e_t is Markov
15. Fully Observable Environments
Full observability: the agent directly observes the environment state
St = S^a_t = S^e_t
Agent state = environment state = information state
Formally, this is a Markov decision process (MDP)
16. Partially Observable Environments
Partial observability: the agent indirectly observes the environment:
A robot with camera vision isn’t told its absolute location
Now agent state ≠ environment state; formally this is a partially observable Markov decision process (POMDP)
17. Major Components of an RL Agent
An RL agent may include one or more of these components:
Policy: the agent’s behaviour function, i.e. which actions it performs
Value function: a prediction of future reward
Model: the agent’s representation of the environment
18. Policy
A policy is the agent’s behaviour
It is a map from state to action
Deterministic policy: a = π(s)
Stochastic policy: π(a|s) = P[At = a | St = s]
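For small finite problems, both kinds of policy can be stored as plain tables. A minimal Python sketch, where the state and action names are invented for illustration:

import random

# Deterministic policy: a = π(s), a plain state -> action map.
pi_det = {"s1": "N", "s2": "E"}
action = pi_det["s1"]

# Stochastic policy: π(a|s) = P[At = a | St = s], one distribution per state.
pi_stoch = {"s1": {"N": 0.7, "E": 0.3},
            "s2": {"N": 0.1, "E": 0.9}}

def sample_action(pi, s):
    # Draw an action from the state's action distribution.
    actions, probs = zip(*pi[s].items())
    return random.choices(actions, weights=probs, k=1)[0]

action = sample_action(pi_stoch, "s1")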
19. Value Function
The value function is a prediction of future reward
It is used to evaluate the goodness/badness of states and/or actions,
and therefore to select between actions
vπ(s) = Eπ[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
20. Model
A model is the agent’s view of the environment
P predicts the next state (transition model)
R predicts the next (immediate) reward (reward model)
P^a_ss′ = P[St+1 = s′ | St = s, At = a]
R^a_s = E[Rt+1 | St = s, At = a]
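A tabular model can be stored as two lookup tables matching these definitions. A minimal sketch with assumed states, actions, and numbers:

# Transition model P^a_ss' and reward model R^a_s as lookup tables.
P = {("s1", "a"): {"s1": 0.2, "s2": 0.8},
     ("s2", "a"): {"s2": 1.0}}
R = {("s1", "a"): -1.0,
     ("s2", "a"):  0.0}

def expected_next_value(s, a, v, gamma=0.9):
    # One-step lookahead: R^a_s + γ Σ_s' P^a_ss' v(s').
    return R[(s, a)] + gamma * sum(p * v[s2] for s2, p in P[(s, a)].items())

v = {"s1": 0.0, "s2": 0.0}
print(expected_next_value("s1", "a", v))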
21. Maze Example
Rewards: -1 per time-step
Actions: N, E, S, W
States: agent’s location
22. Maze Example: Policy
Arrows represent the policy π(s) for each state s
23. Maze Example: Value Function
Numbers represent the value vπ(s) of each state s (the number of steps the agent is away from the goal)
24. Maze Example: Model
Grid layout represents the transition model P^a_ss′
Numbers represent the immediate reward R^a_s for each state s (same for all actions a)
25. Categorizing RL Agents (1)
Value Based
No Policy (Implicit)
Value Function
Policy Based
Policy
No Value Function
Actor Critic
Policy
Value Function
26. Categorizing RL Agents (2)
Model Free
Policy and/or Value Function
No Model
Model Based
Policy and/or Value Function
Model
27. Reinforcement Learning Agent Taxonomy
Figure: Reinforcement Learning Agent Taxonomy.
28. Learning and Planning
Two fundamental problems in sequential decision making:
Reinforcement Learning:
The environment model is initially unknown
The agent interacts with the environment
The agent improves its policy
Planning:
A model of the environment is known
The agent performs computations with its model (without any interaction)
The agent improves its policy
a.k.a. deliberation, reasoning, introspection, pondering, thought, search
29. Exploration and Exploitation
Reinforcement learning is like trial-and-error learning
The agent should discover a good policy
From its experiences of the environment
Without losing too much reward along the way
Exploration finds more information about the environment
Exploitation exploits known information to maximise reward
It is usually important to explore as well as exploit
30. Exploration and Exploitation Examples
Restaurant Selection
Exploitation: Go to your favourite restaurant
Exploration: Try a new restaurant
Oil Drilling
Exploitation: Drill at the best known location
Exploration: Drill at a new location
Game Playing
Exploitation: Play the move you believe is best
Exploration: Play an experimental move
Online Banner Advertisements
Exploitation: Show the most successful advert
Exploration: Show a different advert
31. Prediction and Control
Prediction: evaluate the future
Given a policy
Control: optimise the future
Find the best policy
34. Introduction to MDPs
A Markov decision process (MDP) is a mathematical framework for reinforcement learning in which the environment is fully observable
Almost all RL problems can be formalised as MDPs
Partially observable problems can be converted into MDPs
35. Markov Property
“The future is independent of the past given the present”
Definition
A state St is Markov if and only if
P[St+1 | St] = P[St+1 | S1, ..., St]
The state captures all relevant information from the history
Once the state is known, the history may be thrown away
i.e. the state is a sufficient statistic of the future
36. State Transition Matrix
For a Markov state s and successor state s′, the state transition probability is defined by
Pss′ = P[St+1 = s′ | St = s]
The state transition matrix P defines transition probabilities from all states s to all successor states s′:
    [ P11 ... P1n ]
P = [  .  ...  .  ]
    [ Pn1 ... Pnn ]
where each row of the matrix sums to 1.
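In code, P is simply a row-stochastic matrix. A small numpy sketch (with made-up probabilities) that checks the row-sum property and propagates a state distribution:

import numpy as np

# Row i holds P[St+1 = j | St = i]; each row must sum to 1.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])          # state 3 is absorbing

assert np.allclose(P.sum(axis=1), 1.0)   # each row sums to 1

# Distribution over states after 3 steps, starting from state 1:
mu0 = np.array([1.0, 0.0, 0.0])
mu3 = mu0 @ np.linalg.matrix_power(P, 3)
print(mu3)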
37. Markov Process
A Markov process is a memoryless random process, i.e. a sequence of states S1, S2, ... with the Markov property.
Definition
A Markov Process (or Markov Chain) is a tuple ⟨S, P⟩
S is a (finite) set of states
P is a state transition probability matrix,
Pss′ = P[St+1 = s′ | St = s]
38. Example: Student Markov Chain/Process
Figure: Student Markov Chain.
39. Example: Student Markov Process Episodes
Sample episodes for the Student Markov Chain starting from S1 = C1:
S1, S2, ..., ST
C1 C2 C3 Pass Sleep
C1 FB FB C1 C2 Sleep
C1 C2 C3 Pub C2 C3 Pass Sleep
C1 FB FB C1 C2 C3 Pub C1 FB FB FB C1 C2 C3 Pub C2 Sleep
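Episodes like these are sampled by repeatedly drawing the next state from the current state's row of P. In the sketch below, the state names come from the slide, but the transition probabilities are assumed for illustration, since the figure with the actual numbers is not reproduced in this text:

import random

# Assumed transition probabilities; the slide's figure defines the real ones.
P = {"C1":   {"C2": 0.5, "FB": 0.5},
     "C2":   {"C3": 0.8, "Sleep": 0.2},
     "C3":   {"Pass": 0.6, "Pub": 0.4},
     "Pass": {"Sleep": 1.0},
     "Pub":  {"C1": 0.2, "C2": 0.4, "C3": 0.4},
     "FB":   {"FB": 0.9, "C1": 0.1}}

def sample_episode(start="C1", terminal="Sleep"):
    # Walk the chain until the terminal state is reached.
    s, episode = start, [start]
    while s != terminal:
        nxt, probs = zip(*P[s].items())
        s = random.choices(nxt, weights=probs, k=1)[0]
        episode.append(s)
    return episode

print(sample_episode())   # e.g. ['C1', 'C2', 'C3', 'Pass', 'Sleep']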
41. Markov Reward Process
A Markov reward process is a Markov chain with rewards.
Definition
A Markov Reward Process is a tuple ⟨S, P, R, γ⟩
S is a (finite) set of states
P is a state transition probability matrix, Pss′ = P[St+1 = s′ | St = s]
R is a reward function, Rs = E[Rt+1 | St = s]
γ is a discount factor, γ ∈ [0, 1]
43. Returns
Definition
The return Gt is the total discounted reward from time-step t:
Gt = Rt+1 + γRt+2 + γ²Rt+3 + ... = Σ_{k=0}^{∞} γ^k Rt+k+1
The discount γ ∈ [0, 1] is the present value of future rewards
The value of receiving reward R after k + 1 time-steps is γ^k R
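Computing Gt from a sampled reward sequence is a one-liner. A minimal sketch with toy numbers:

def discounted_return(rewards, gamma=0.9):
    # Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + ...
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards observed after time-step t (toy numbers):
print(discounted_return([-2, -2, -2, 10], gamma=0.9))
# -2 - 1.8 - 1.62 + 7.29 = 1.87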
44. Why Discount?
Most Markov reward and decision processes are discounted. Why?
Mathematically convenient to use
Avoids infinite returns in cyclic Markov processes
If the reward is financial, immediate rewards may earn more interest than delayed rewards
Human behaviour shows a preference for immediate reward
45. Value Function
The value function v(s) gives the long-term value of state s
Definition
The state value function v(s) of an MRP is the expected return starting from state s:
v(s) = E[Gt | St = s]
47. Example: State-Value Function for Student MRP (1)
48. Example: State-Value Function for Student MRP (2)
49. Example: State-Value Function for Student MRP (3)
50. Bellman Equation for MRPs
The value function can be decomposed into two parts:
the immediate reward Rt+1
the discounted value of the successor state, γ v(St+1)
v(s) = E[Gt | St = s]
     = E[Rt+1 + γRt+2 + γ²Rt+3 + ... | St = s]
     = E[Rt+1 + γ(Rt+2 + γRt+3 + ...) | St = s]
     = E[Rt+1 + γ v(St+1) | St = s]
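The Bellman equation defines a fixed point, so v can be computed by iterating the backup v ← R + γPv until it stops changing. A sketch on an assumed toy 3-state MRP:

import numpy as np

# Toy MRP (numbers assumed for illustration): 3 states, state 3 absorbing.
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Iterate the Bellman backup v <- R + γ P v until convergence.
v = np.zeros(3)
for _ in range(1000):
    v_new = R + gamma * P @ v
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)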
51. Example: Bellman Equation for Student MRP
52. Bellman Equation in Matrix Form
The Bellman equation can be expressed concisely using matrices,
v = R + γPv
where v is a column vector with one entry per state:
[ v(1) ]   [ R1 ]       [ P11 ... P1n ] [ v(1) ]
[  ... ] = [ ... ] + γ  [ ... ... ... ] [  ... ]
[ v(n) ]   [ Rn ]       [ Pn1 ... Pnn ] [ v(n) ]
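Because this equation is linear, it can also be solved directly: rearranging v = R + γPv gives (I − γP)v = R, so v = (I − γP)⁻¹R. A sketch using the same assumed toy MRP as above:

import numpy as np

P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([-1.0, -2.0, 0.0])
gamma = 0.9

# Solve (I - γP) v = R as a linear system rather than inverting explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)   # matches the fixed-point iteration result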
53. Markov Decision Process
A Markov decision process (MDP) is a Markov reward process with decisions. It is an environment in which all states are Markov.
Definition
A Markov Decision Process is a tuple ⟨S, A, P, R, γ⟩
S is a (finite) set of states
A is a finite set of actions
P is a state transition probability matrix, P^a_ss′ = P[St+1 = s′ | St = s, At = a]
R is a reward function, R^a_s = E[Rt+1 | St = s, At = a]
γ is a discount factor, γ ∈ [0, 1]
54. Example: Student MDP
55. Value Function
Definition
The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π:
vπ(s) = Eπ[Gt | St = s]
Definition
The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π:
qπ(s, a) = Eπ[Gt | St = s, At = a]
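Since vπ(s) is an expectation over returns, it can be estimated by averaging sampled returns that start from s. A minimal Monte Carlo sketch; the sample_episode_rewards helper is hypothetical and stands in for whatever simulator runs policy π from state s:

def mc_state_value(s, pi, sample_episode_rewards, gamma=0.9, n_episodes=1000):
    # Estimate vπ(s) = Eπ[Gt | St = s] by averaging sampled returns.
    total = 0.0
    for _ in range(n_episodes):
        rewards = sample_episode_rewards(s, pi)   # hypothetical sampler
        total += sum((gamma ** k) * r for k, r in enumerate(rewards))
    return total / n_episodes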
56. Example: State-Value Function for Student MDP
57. Bellman Expectation Equation
The state-value function can again be decomposed into immediate reward plus discounted value of the successor state,
vπ(s) = Eπ[Rt+1 + γ vπ(St+1) | St = s]
The action-value function can similarly be decomposed,
qπ(s, a) = Eπ[Rt+1 + γ qπ(St+1, At+1) | St = s, At = a]
58. Example: Bellman Expectation Equation for Student MDP
59. Conclusion
Reinforcement learning is one of the most interesting and useful parts of machine learning.
In RL, the agent explores the environment without any human intervention, and it is one of the core learning paradigms used in artificial intelligence.
A main practical issue with RL algorithms is that factors such as delayed feedback can slow down learning.