Introduction to Reinforcement Learning
Edward Balaban

January 17, 2014
Objectives

Introduce Reinforcement Learning and its applications
Overview Markov Decision Processes, Value Iteration, Policy Iteration, and Q-learning
Overview Partially Observable Markov Decision Processes and methods to solve them
Illustrate the above concepts with some examples
What is Reinforcement Learning?

Supervised learning: learn a model from training data that maps inputs to outputs, use it to generate outputs from future inputs
Unsupervised learning: recognize patterns in input data
Reinforcement learning (RL): provide the learning agent with a reward function and let it figure out the best strategy for obtaining large rewards

RL has been used in such diverse applications as:
Business strategy planning
Aircraft control
Optimal routing (data packets, vehicles, etc.)
Robot motion control

Some of the material in these slides is borrowed from Andrew Ng and Wheeler Ruml lectures on reinforcement learning.
How do we model for RL?

Modeling frameworks with increasing levels of uncertainty:
State space models: no uncertainty
Markov Decision Processes (MDPs): uncertainty in action effects
Partially Observable Markov Decision Processes (POMDPs): uncertainty in action effects and current state

Other modeling frameworks exist, e.g. Predictive State Representations:
Generalizations of POMDPs that were shown to have both a greater representational capacity than POMDPs and yield representations that are at least as compact (Singh et al., 2004 and Even-Dar et al., 2005)
Represent the state of a dynamic system by tracking occurrence probabilities of a set of future events (tests), conditioned on past events (histories)
Rely solely on observable quantities (unlike POMDPs)
Markov Decision Process (MDP)

States:                    S = {s_1, ..., s_|S|}
Actions:                   A = {a_1, ..., a_|A|}
Transition probabilities:  T(s, a, s') = Pr(s' | s, a)
Rewards:                   R : S → ℝ
Policy:                    π : S → A, Π is the set of all policies

Value function:
V^π(s) = E[R(s_0) + γR(s_1) + γ²R(s_2) + ... | s_0 = s, π]

Bellman Equation:
V^π(s) = R(s) + γ Σ_{s'∈S} T(s, π(s), s') V^π(s')

Optimal Value function:
V*(s) = max_π V^π(s)
Markov Decision Process (MDP), continued

Bellman Equation for the optimal value function:
V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} T(s, a, s') V*(s')

Policy:
π*(s) = arg max_{a∈A} γ Σ_{s'∈S} T(s, a, s') V*(s')

V*(s) = V^{π*}(s) ≥ V^π(s)
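To make the notation above concrete, here is a minimal sketch (not from the original slides) of a small tabular MDP held in NumPy arrays, together with the greedy-policy extraction defined above. The 3-state, 2-action numbers are invented purely for illustration.

    import numpy as np

    # Hypothetical 3-state, 2-action MDP, for illustration only.
    n_states, n_actions, gamma = 3, 2, 0.9
    R = np.array([0.0, 0.0, 1.0])                   # R(s)
    T = np.zeros((n_states, n_actions, n_states))   # T[s, a, s'] = Pr(s' | s, a)
    T[0, 0] = [0.9, 0.1, 0.0]
    T[0, 1] = [0.1, 0.9, 0.0]
    T[1, 0] = [0.9, 0.1, 0.0]
    T[1, 1] = [0.0, 0.1, 0.9]
    T[2, :] = [0.0, 0.0, 1.0]                       # state 2 is absorbing

    def greedy_policy(V):
        """pi(s) = argmax_a  gamma * sum_s' T(s, a, s') V(s')."""
        Q = gamma * (T @ V)        # shape (n_states, n_actions)
        return Q.argmax(axis=1)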
Learning an MDP

Usually S, A, and γ are known.

T(s, a, s') = (# times took action a in state s and got to s') / (# times took action a in state s)

Similarly, if R is unknown, we can also pick our estimate of the expected immediate reward R(s) in state s to be the average reward observed in that state.
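As an illustration (not from the slides), a count-based sketch of these estimates, assuming experience is available as (s, a, s') index triples and (s, r) reward samples; the uniform fallback for never-tried (s, a) pairs is one common convention.

    import numpy as np

    def estimate_model(transitions, rewards, n_states, n_actions):
        """transitions: list of (s, a, s_next) index triples observed in the MDP.
        rewards: list of (s, r) pairs of observed immediate rewards.
        Returns count-based estimates of T(s, a, s') and R(s)."""
        counts = np.zeros((n_states, n_actions, n_states))
        for s, a, s_next in transitions:
            counts[s, a, s_next] += 1
        totals = counts.sum(axis=2, keepdims=True)
        # T(s, a, s') = #(s, a -> s') / #(s, a); uniform where (s, a) was never tried
        T_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / n_states)
        R_hat = np.zeros(n_states)
        for s in range(n_states):
            samples = [r for (si, r) in rewards if si == s]
            R_hat[s] = np.mean(samples) if samples else 0.0
        return T_hat, R_hat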
Solving an MDP: Value Iteration

∀s ∈ S, V(s) ← 0
Repeat until convergence:
∀s ∈ S, V(s) ← R(s) + max_{a∈A} γ Σ_{s'∈S} T(s, a, s') V(s')
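A minimal NumPy sketch of the update above (not from the slides); it assumes the transition probabilities are stored as a (|S|, |A|, |S|) array T and the rewards as a vector R.

    import numpy as np

    def value_iteration(T, R, gamma, tol=1e-8):
        """Synchronous value iteration, as written on the slide."""
        V = np.zeros(len(R))                          # for all s, V(s) <- 0
        while True:
            V_new = R + gamma * (T @ V).max(axis=1)   # Bellman backup
            if np.max(np.abs(V_new - V)) < tol:       # repeat until convergence
                return V_new
            V = V_new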
Convergence

From the definition of the Bellman operator:

||B(V1) − B(V2)||∞ = max_{s∈S} | R(s) + γ max_{a∈A} Σ_{s'∈S} P_sa(s')V1(s') − R(s) − γ max_{a∈A} Σ_{s'∈S} P_sa(s')V2(s') |    (1)

                   = γ · max_{s∈S} | max_{a∈A} Σ_{s'∈S} P_sa(s')V1(s') − max_{a∈A} Σ_{s'∈S} P_sa(s')V2(s') |    (2)

To go further, we need to understand whether the two maximization operations over the set of actions for V1 and V2 can be combined. To do that, let's use the following definitions:

f1(a) = Σ_{s'∈S} P_sa(s')V1(s')    (3)
f2(a) = Σ_{s'∈S} P_sa(s')V2(s')    (4)
Convergence, continued

In order to, for the moment, get rid of the max operators, let's also define a1* as the action that maximizes f1 and a2* as the action that maximizes f2. Then

| max_{a∈A} Σ_{s'∈S} P_sa(s')V1(s') − max_{a∈A} Σ_{s'∈S} P_sa(s')V2(s') | can be written as |f1(a1*) − f2(a2*)|.

Since f1(a1*) ≥ f1(a2*) and f2(a2*) ≥ f2(a1*) (by virtue of a1* and a2* maximizing f1 and f2, respectively), we can "unpack" the absolute value operator as follows:

f1(a1*) − f2(a2*) ≤ f1(a1*) − f2(a1*)    (5)
f2(a2*) − f1(a1*) ≤ f2(a2*) − f1(a2*)    (6)

Then it is also true that

f1(a1*) − f2(a2*) ≤ |f1(a1*) − f2(a1*)|    (7)
f2(a2*) − f1(a1*) ≤ |f2(a2*) − f1(a2*)|    (8)

And, finally, since each right-hand side is bounded by the maximum over all a, it should also be true that

f1(a1*) − f2(a2*) ≤ max_a |f1(a) − f2(a)|    (9)
f2(a2*) − f1(a1*) ≤ max_a |f2(a) − f1(a)|    (10)

Therefore we can conclude that

|max_a f1(a) − max_a f2(a)| ≤ max_a |f1(a) − f2(a)|    (11)
Convergence, continued

Then Equation 2 can be rewritten as an inequality:

||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} | Σ_{s'∈S} P_sa(s')V1(s') − Σ_{s'∈S} P_sa(s')V2(s') |    (12)

Simplifying further, we get:

||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} | Σ_{s'∈S} P_sa(s') (V1(s') − V2(s')) |    (13)

By using the triangle inequality and the fact that P_sa(s') ≥ 0, we can rewrite the above expression as

||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} Σ_{s'∈S} P_sa(s') |V1(s') − V2(s')|    (14)

Σ_{s'∈S} P_sa(s') |V1(s') − V2(s')| can be seen as the expectation of |V1(s') − V2(s')|. It is, therefore, no greater than the maximum value that |V1(s') − V2(s')| can take. Thus the above inequality can be written as:

||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} max_{s'∈S} |V1(s') − V2(s')|    (15)
Convergence, continued

The remaining expression on the right can only be maximized with respect to s', so we can simplify to

||B(V1) − B(V2)||∞ ≤ γ · max_{s'∈S} |V1(s') − V2(s')|    (16)

What we have on the right-hand side now is the definition of the infinity norm, therefore we finally obtain:

||B(V1) − B(V2)||∞ ≤ γ ||V1 − V2||∞    (17)

We'll prove that the Bellman operator has at most one fixed point by contradiction. Let's assume that there are two distinct fixed points, V1 and V2. Since B(V1) = V1 and B(V2) = V2, the inequality obtained above becomes

||V1 − V2||∞ ≤ γ ||V1 − V2||∞    (18)
(1 − γ) ||V1 − V2||∞ ≤ 0    (19)

Since γ ∈ [0, 1), then 1 − γ > 0. The infinity norm of any variable is non-negative, so the only way for the above expression to be true is if ||V1 − V2||∞ = 0, and, consequently, if V1 = V2. Therefore we have proved that the Bellman operator has at most one fixed point.
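As a quick numerical illustration (not a proof, and not from the slides), the contraction inequality (17) can be checked on a randomly generated MDP:

    import numpy as np

    def bellman_operator(V, T, R, gamma):
        """B(V)(s) = R(s) + gamma * max_a sum_s' P_sa(s') V(s')."""
        return R + gamma * (T @ V).max(axis=1)

    rng = np.random.default_rng(0)
    n_s, n_a, gamma = 4, 3, 0.9
    T = rng.random((n_s, n_a, n_s))
    T /= T.sum(axis=2, keepdims=True)          # normalize to valid distributions
    R = rng.normal(size=n_s)
    V1, V2 = rng.normal(size=n_s), rng.normal(size=n_s)
    lhs = np.max(np.abs(bellman_operator(V1, T, R, gamma) - bellman_operator(V2, T, R, gamma)))
    rhs = gamma * np.max(np.abs(V1 - V2))
    assert lhs <= rhs + 1e-12   # ||B(V1) - B(V2)||_inf <= gamma * ||V1 - V2||_inf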
Using an MDP with value iteration

Repeat:
Execute π in the MDP for some number of trials
Using the accumulated experience in the MDP, update estimates for T(s, a, s') (and R, if applicable)
Apply value iteration to get a new estimated value function V
Update π to be the greedy policy with respect to V
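A sketch of this loop, assuming a hypothetical env object with reset() and step(a) returning state indices and rewards, and reusing the estimate_model and value_iteration sketches shown earlier; the ε-greedy exploration is an added assumption, not stated on the slide.

    import numpy as np

    def model_based_rl(env, n_states, n_actions, gamma=0.9,
                       n_iters=50, n_trials=20, horizon=100):
        """Hypothetical env: reset() -> s, step(a) -> (s_next, r, done).
        Reuses estimate_model() and value_iteration() from the earlier sketches."""
        transitions, rewards = [], []
        pi = np.zeros(n_states, dtype=int)               # arbitrary initial policy
        for _ in range(n_iters):
            for _ in range(n_trials):                    # execute pi for some trials
                s = env.reset()
                for _ in range(horizon):
                    # mostly follow pi, with occasional random (epsilon-greedy) actions
                    a = pi[s] if np.random.rand() > 0.1 else np.random.randint(n_actions)
                    s_next, r, done = env.step(a)
                    transitions.append((s, a, s_next))
                    rewards.append((s, r))
                    s = s_next
                    if done:
                        break
            T_hat, R_hat = estimate_model(transitions, rewards, n_states, n_actions)
            V = value_iteration(T_hat, R_hat, gamma)     # new estimated value function
            pi = (gamma * (T_hat @ V)).argmax(axis=1)    # greedy policy w.r.t. V
        return pi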
Solving an MDP: Policy Iteration

Initialize π randomly.
Repeat until convergence:
V ← V^π
∀s ∈ S, π(s) = arg max_{a∈A} Σ_{s'∈S} T(s, a, s') V(s')

V ← V^π can be done efficiently by solving Bellman's equations as a system of linear equations.
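A sketch of policy iteration (not from the slides), with the exact policy evaluation done, as noted above, by solving the linear system (I − γ T_π) V = R.

    import numpy as np

    def policy_iteration(T, R, gamma):
        """T: (S, A, S') transition array, R: (S,) reward vector."""
        n_states = len(R)
        pi = np.random.randint(T.shape[1], size=n_states)   # initialize pi randomly
        while True:
            T_pi = T[np.arange(n_states), pi]                # (S, S') transitions under pi
            V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)   # V <- V^pi
            pi_new = (gamma * (T @ V)).argmax(axis=1)        # greedy improvement step
            if np.array_equal(pi_new, pi):
                return pi, V
            pi = pi_new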
Solving (and learning) an MDP: Q-learning

Model-free reinforcement learning.

V(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V(s')

Think of Q-learning as a regression!

Explore states: in state s, took action a, got reward r, ended up in state s' — (s, a, s', r).

Q(s, a) ← Q(s, a) + α(error)
Q(s, a) ← Q(s, a) + α(sensed − predicted)
Q(s, a) ← Q(s, a) + α([r + γ max_{a'} Q(s', a')] − Q(s, a))

Stochastic update with step size α.
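A minimal tabular Q-learning sketch (not from the slides), again assuming a hypothetical env with reset() and step(a).

    import numpy as np

    def q_learning(env, n_states, n_actions, gamma=0.9, alpha=0.1,
                   epsilon=0.1, n_episodes=500, horizon=200):
        """Tabular Q-learning. Hypothetical env: reset() -> s, step(a) -> (s_next, r, done)."""
        Q = np.zeros((n_states, n_actions))
        for _ in range(n_episodes):
            s = env.reset()
            for _ in range(horizon):
                # epsilon-greedy exploration
                a = np.random.randint(n_actions) if np.random.rand() < epsilon else Q[s].argmax()
                s_next, r, done = env.step(a)
                target = r + gamma * Q[s_next].max()      # "sensed" value
                Q[s, a] += alpha * (target - Q[s, a])     # stochastic update, step size alpha
                s = s_next
                if done:
                    break
        return Q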
Continuous State MDP

A more realistic form of MDP
Needs a simulator

Solving continuous-state MDPs:
LQR
Fitted Value Iteration
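As an illustration of the second approach, here is a rough sketch of fitted value iteration with a linear value-function approximation V(s) ≈ θᵀφ(s). The simulate, reward, sample_states, and features callables are all assumptions made for the example, standing in for a problem-specific simulator and feature map.

    import numpy as np

    def fitted_value_iteration(simulate, reward, sample_states, actions, features,
                               gamma=0.95, n_samples=200, k=10, n_iters=50):
        """simulate(s, a) -> next state drawn from the simulator (hypothetical),
        reward(s) -> R(s), sample_states(n) -> list of sampled states,
        features(s) -> feature vector phi(s). V is approximated as theta . phi(s)."""
        states = sample_states(n_samples)
        Phi = np.array([features(s) for s in states])
        theta = np.zeros(Phi.shape[1])
        for _ in range(n_iters):
            y = []
            for s in states:
                q = []
                for a in actions:
                    # Monte Carlo estimate of E[V(s') | s, a] using k simulator samples
                    v_next = np.mean([features(simulate(s, a)) @ theta for _ in range(k)])
                    q.append(reward(s) + gamma * v_next)
                y.append(max(q))                          # Bellman backup target
            theta, *_ = np.linalg.lstsq(Phi, np.array(y), rcond=None)
        return theta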
An example MDP - the inverted pendulum

A thin pole is connected via a free hinge to a cart
The cart can move laterally on a smooth table surface
Failure occurs if:
the angle of the pole deviates by more than a certain amount from the vertical position
the cart's position goes out of bounds
The objective is to develop a controller to balance the pole
The only actions the controller can take are to accelerate the cart either left or right
The algorithm cannot use any knowledge of the dynamics of the underlying system
Baby pendulum: results

[figure: results plot]
Partially Observable Markov Decision Process (POMDP)

States:                     S = {s_1, ..., s_|S|}
Actions:                    A = {a_1, ..., a_|A|}
Transition probabilities:   T(s, a, s') = Pr(s' | s, a)
Observations:               Z = {z_1, ..., z_|Z|}
Observation probabilities:  O(z, a, s') = Pr(z | s', a)
Belief state (agent):       b = {b(s_1), ..., b(s_|S|)} : S → [0, 1]^|S|,  Σ_{i=1}^{|S|} b(s_i) = 1
Belief space:               B - the set of all belief states (infinite)
Initial belief:             b_0
Rewards:                    R : S → ℝ
Policy:                     π : B → A, Π is the set of all policies
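The belief update itself is not written out on this slide; the sketch below is a standard discrete-POMDP update, added here as an assumption consistent with the definitions above.

    import numpy as np

    def belief_update(b, a, z, T, O):
        """b: (S,) current belief, a: action index, z: observation index,
        T: (S, A, S') transition probabilities, O: (S', A, Z) with O[s', a, z] = Pr(z | s', a).
        Returns b'(s') proportional to Pr(z | s', a) * sum_s T(s, a, s') b(s)."""
        predicted = b @ T[:, a, :]          # sum_s b(s) T(s, a, s'), shape (S',)
        b_new = O[:, a, z] * predicted
        return b_new / b_new.sum()          # normalize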
Solving a POMDP

Solving a realistic POMDP exactly is often computationally intractable.

Approximate method families:
Point-based methods
Monte Carlo methods
Generalization methods
Example: Prognostic Decision Making

System Degradation
All aerospace systems experience degradation
Degradation can be use- or time-dependent
The operating environment is often a significant factor
[image: JAXA Hayabusa]

Faults
Degradation can accelerate if a fault occurs
In a complex, multi-component system a fault can have cascading effects
In case of a fault, a quick mitigation decision is often required
[image: United Flight 232]

System Health Management (SHM)
Recent designs, e.g. S-92, have more SHM capabilities (in fault detection and diagnosis)
Still, maintenance is predominantly done based on fixed schedules
In-flight emergencies are handled through skill and ingenuity of the crew and ground control
[image: Sikorsky S-92]
How Can We Do Better?

In recent years progress has been made in using physics modeling and computational methods for:
Fault detection,
Fault magnitude estimation,
Degradation trajectory prediction (prognostics).

Research on how to utilize prognostic health information is in the very early stages, however.

Prognostic Decision Making (PDM)
The process of selecting system actions informed by predictions of the future system health state
PDM can help with the following, for example:
Component life extension,
Fault mitigation,
Mission replanning,
Crew decision support in emergencies,
Condition-based maintenance,
Asset allocation.
System

Described as a continuous-state, continuous-action POMDP:

State space:           S ⊆ R^n
Action space:          A ⊆ R^m
Observations:          Z ⊆ R^p
Transition function:   T(s, a, s') = pdf(s' | s, a) : S × A × S → [0, ∞)
Observation function:  O(z', a, s') = pdf(z' | s', a) : S × A × Z → [0, ∞)
Belief state:          b(s) = pdf(s)
Belief space:          B - the set of all belief states
Initial belief:        b_0
Belief update:         b'_{a,z}(s') ∝ O(z', a, s') ∫_S T(s, a, s') b(s) ds
Policy:                π(a, b) = pdf(a | b) : A × B → [0, ∞), Π is the set of all policies
Costs:                 C = {c_1(s, a), ..., c_|C|(s, a)} : S × A → R^|C|
Rewards:               R(s, r) = pdf(r | s) : S × R → [0, ∞)
Objectives:            F = {f_1(s), ..., f_|F|(s)} : S → R^|F|
Constraints:           G = {g_1(s), ..., g_|G|(s)} : S → B^|G|
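Since a belief over a continuous state space generally cannot be tracked exactly, one common approximation of the belief update above is a particle filter (a particle filter is also mentioned in the case study later in these slides). The sketch below is an assumption added for illustration; the transition_sample and obs_likelihood callables are hypothetical stand-ins for the system's transition and observation models.

    import numpy as np

    def particle_belief_update(particles, a, z, transition_sample, obs_likelihood, rng):
        """particles: (N, n) array of state samples representing b(s).
        transition_sample(s, a, rng) -> s' drawn from pdf(s' | s, a)   (hypothetical)
        obs_likelihood(z, s_next, a) -> pdf(z | s', a)                 (hypothetical)"""
        propagated = np.array([transition_sample(s, a, rng) for s in particles])
        weights = np.array([obs_likelihood(z, s_next, a) for s_next in propagated])
        weights /= weights.sum()
        # resample with replacement to get an unweighted particle set for b'_{a,z}
        idx = rng.choice(len(propagated), size=len(propagated), p=weights)
        return propagated[idx]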
System Degradation

Let H = {h_1, ..., h_|H|} be the vector of system health parameters incorporated into the state

Fault: G_fault ⊆ G defines significant deviations from the expected nominal behavior. A fault occurs if ∃i : g_i(s) = true, g_i ∈ G_fault.
Failure: G_failure ⊆ G defines states where the system loses functional capability with respect to a health parameter h ∈ H
System failure F : S → B, a boolean function indicating when the entire system is effectively non-functional (F is defined via the G_failure set)
End of Life (EoL): t_EoL, the time at which F(s) = true
Remaining Useful Life: RUL = t_EoL − t

[figure: a health parameter h decaying over time, crossing the fault threshold at t_fault and the failure threshold at EoL]
States

Nominal (green), fault (yellow), and failure (red) states defined using the G_fault and G_failure constraints
Diagnostics

The process of determining the current belief state, b_t
Prognostics

The process of determining, at time t, the belief state b_(t+Δt), given the current policy π
Decision Making

The process of finding (or approximating) π*, such that
π* = arg max_{π∈Π} J^π(b_t)
Case Study: UAV Mission Replanning

Given:
An initial mission route (not necessarily optimized) which includes waypoint parameter constraints (e.g. on airspeed or bank angle)
Each waypoint is associated with a payoff value
A healthy vehicle is able to complete the entire route within the energy and component health constraints
Transition costs between a pair of waypoints are history-dependent
A fault occurs that makes it impossible to complete the mission before the End of Life (EoL)

Find:
A policy π that maximizes mission payoff and extends the remaining useful life
Reasoning Architecture

[diagram: Decision Maker, Vehicle Simulation (including prognostic models), Diagnoser, and Vehicle, exchanging the input route and parameter constraints, candidate routes, health and energy cost estimates, observations, and the initial and current fault sets]

Particle Filter is currently used as the decision-making algorithm
Decision Maker picks ordered waypoint subsets and parameter values for candidate routes and proposed routes
The vehicle simulation is 6DOF, with prognostic models for battery and motor temperatures, as well as the battery state of charge
The fault mode currently implemented is increased motor friction
The fault leads to increased current consumption and motor/battery overheating
Mission Replanning Simulation

[video/figure: mission replanning simulation]
Resources

Sutton and Barto book: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/
Intro to POMDPs: http://cs.brown.edu/research/ai/pomdp/tutorial/index.html
Stanford Autonomous Helicopter project: http://heli.stanford.edu
NASA Vehicle Health Management (Intelligent Systems Division): http://ti.arc.nasa.gov/tech/dash/pcoe/publications/
E. Balaban and J. J. Alonso, "A Modeling Framework for Prognostic Decision Making and its Application to UAV Mission Planning", in proceedings of the Annual Conference of the Prognostics and Health Management Society, 2013, pp. 1-12: https://c3.nasa.gov/dashlink/resources/881/
Thank you!

Contenu connexe

Tendances

An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAnirban Santara
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement LearningNatan Katz
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-LearningLyft
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningDongHyun Kwak
 
Markov decision process
Markov decision processMarkov decision process
Markov decision processHamed Abdi
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraData Science Milan
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learningAndres Mendez-Vazquez
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement LearningSalem-Kabbani
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learningSubrat Panda, PhD
 
Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningDongmin Lee
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Dongmin Lee
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningKhaled Saleh
 
Koh_Liang_ICML2017
Koh_Liang_ICML2017Koh_Liang_ICML2017
Koh_Liang_ICML2017Masa Kato
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNINGpradiprahul
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningNAVER Engineering
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning Melaku Eneayehu
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginnersgokulprasath06
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsPierre de Lacaze
 

Tendances (20)

Generalized Reinforcement Learning
Generalized Reinforcement LearningGeneralized Reinforcement Learning
Generalized Reinforcement Learning
 
An Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGIAn Introduction to Reinforcement Learning - The Doors to AGI
An Introduction to Reinforcement Learning - The Doors to AGI
 
Reinfrocement Learning
Reinfrocement LearningReinfrocement Learning
Reinfrocement Learning
 
Distributed Deep Q-Learning
Distributed Deep Q-LearningDistributed Deep Q-Learning
Distributed Deep Q-Learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Markov decision process
Markov decision processMarkov decision process
Markov decision process
 
Reinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del PraReinforcement Learning Overview | Marco Del Pra
Reinforcement Learning Overview | Marco Del Pra
 
25 introduction reinforcement_learning
25 introduction reinforcement_learning25 introduction reinforcement_learning
25 introduction reinforcement_learning
 
Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
An introduction to reinforcement learning
An introduction to reinforcement learningAn introduction to reinforcement learning
An introduction to reinforcement learning
 
Exploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement LearningExploration Strategies in Reinforcement Learning
Exploration Strategies in Reinforcement Learning
 
Deep Q-Learning
Deep Q-LearningDeep Q-Learning
Deep Q-Learning
 
Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)Maximum Entropy Reinforcement Learning (Stochastic Control)
Maximum Entropy Reinforcement Learning (Stochastic Control)
 
Intro to Deep Reinforcement Learning
Intro to Deep Reinforcement LearningIntro to Deep Reinforcement Learning
Intro to Deep Reinforcement Learning
 
Koh_Liang_ICML2017
Koh_Liang_ICML2017Koh_Liang_ICML2017
Koh_Liang_ICML2017
 
REINFORCEMENT LEARNING
REINFORCEMENT LEARNINGREINFORCEMENT LEARNING
REINFORCEMENT LEARNING
 
Introduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement LearningIntroduction of Deep Reinforcement Learning
Introduction of Deep Reinforcement Learning
 
Reinforcement Learning Q-Learning
Reinforcement Learning   Q-Learning Reinforcement Learning   Q-Learning
Reinforcement Learning Q-Learning
 
Reinforcement Learning Guide For Beginners
Reinforcement Learning Guide For BeginnersReinforcement Learning Guide For Beginners
Reinforcement Learning Guide For Beginners
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural Nets
 

En vedette

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learningbutest
 
Reinforcement learning conductrics-superweek2017
Reinforcement learning conductrics-superweek2017Reinforcement learning conductrics-superweek2017
Reinforcement learning conductrics-superweek2017Matt Gershoff
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learning[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learningDeep Learning JP
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningLior Rokach
 
Language Understanding for Text-based Games using Deep Reinforcement Learning
Language Understanding for Text-based Games using Deep Reinforcement LearningLanguage Understanding for Text-based Games using Deep Reinforcement Learning
Language Understanding for Text-based Games using Deep Reinforcement LearningMark Chang
 
Data import and widening in Google Analytics
Data import and widening in Google AnalyticsData import and widening in Google Analytics
Data import and widening in Google AnalyticsZorin Radovancevic
 
Predictive Analytics Broken Down
Predictive Analytics Broken DownPredictive Analytics Broken Down
Predictive Analytics Broken DownMatt Gershoff
 
Turning Analysis into Action with APIs - Superweek2017
Turning Analysis into Action with APIs - Superweek2017Turning Analysis into Action with APIs - Superweek2017
Turning Analysis into Action with APIs - Superweek2017Mark Edmondson
 
Predictive Conversion Modeling - Lifting Web Analytics to the next level
Predictive Conversion Modeling - Lifting Web Analytics to the next levelPredictive Conversion Modeling - Lifting Web Analytics to the next level
Predictive Conversion Modeling - Lifting Web Analytics to the next levelPetri Mertanen
 
Damion Brown - The Missing Automation Layer - Superweek 2017
Damion Brown - The Missing Automation Layer - Superweek 2017Damion Brown - The Missing Automation Layer - Superweek 2017
Damion Brown - The Missing Automation Layer - Superweek 2017Damion Brown
 
Effects of Reinforcement in the Classroom
Effects of Reinforcement in the ClassroomEffects of Reinforcement in the Classroom
Effects of Reinforcement in the ClassroomAMaciocia
 
Should Digital Analysts Become More Data Science-y?
Should Digital Analysts Become More Data Science-y?Should Digital Analysts Become More Data Science-y?
Should Digital Analysts Become More Data Science-y?Tim Wilson
 
Using Reinforcement in the Classroom
Using Reinforcement in the ClassroomUsing Reinforcement in the Classroom
Using Reinforcement in the Classroomsworaac
 
Radical Analytics, Superweek Hungary, January 2017
Radical Analytics, Superweek Hungary, January 2017Radical Analytics, Superweek Hungary, January 2017
Radical Analytics, Superweek Hungary, January 2017Stéphane Hamel
 
Reinforcement & Punishment
Reinforcement & PunishmentReinforcement & Punishment
Reinforcement & Punishmentcaseylashaek
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningRyo Iwaki
 
Quantum Information Technology
Quantum Information TechnologyQuantum Information Technology
Quantum Information TechnologyFenny Thakrar
 
Deep Reinforcement Learning An Introduction
Deep Reinforcement Learning An IntroductionDeep Reinforcement Learning An Introduction
Deep Reinforcement Learning An IntroductionXiaohu ZHU
 

En vedette (20)

Reinforcement Learning
Reinforcement LearningReinforcement Learning
Reinforcement Learning
 
Reinforcement learning conductrics-superweek2017
Reinforcement learning conductrics-superweek2017Reinforcement learning conductrics-superweek2017
Reinforcement learning conductrics-superweek2017
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learning[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learning
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
PyData NYC 2015
PyData NYC 2015PyData NYC 2015
PyData NYC 2015
 
Language Understanding for Text-based Games using Deep Reinforcement Learning
Language Understanding for Text-based Games using Deep Reinforcement LearningLanguage Understanding for Text-based Games using Deep Reinforcement Learning
Language Understanding for Text-based Games using Deep Reinforcement Learning
 
Data import and widening in Google Analytics
Data import and widening in Google AnalyticsData import and widening in Google Analytics
Data import and widening in Google Analytics
 
Predictive Analytics Broken Down
Predictive Analytics Broken DownPredictive Analytics Broken Down
Predictive Analytics Broken Down
 
Turning Analysis into Action with APIs - Superweek2017
Turning Analysis into Action with APIs - Superweek2017Turning Analysis into Action with APIs - Superweek2017
Turning Analysis into Action with APIs - Superweek2017
 
Predictive Conversion Modeling - Lifting Web Analytics to the next level
Predictive Conversion Modeling - Lifting Web Analytics to the next levelPredictive Conversion Modeling - Lifting Web Analytics to the next level
Predictive Conversion Modeling - Lifting Web Analytics to the next level
 
Damion Brown - The Missing Automation Layer - Superweek 2017
Damion Brown - The Missing Automation Layer - Superweek 2017Damion Brown - The Missing Automation Layer - Superweek 2017
Damion Brown - The Missing Automation Layer - Superweek 2017
 
Effects of Reinforcement in the Classroom
Effects of Reinforcement in the ClassroomEffects of Reinforcement in the Classroom
Effects of Reinforcement in the Classroom
 
Should Digital Analysts Become More Data Science-y?
Should Digital Analysts Become More Data Science-y?Should Digital Analysts Become More Data Science-y?
Should Digital Analysts Become More Data Science-y?
 
Using Reinforcement in the Classroom
Using Reinforcement in the ClassroomUsing Reinforcement in the Classroom
Using Reinforcement in the Classroom
 
Radical Analytics, Superweek Hungary, January 2017
Radical Analytics, Superweek Hungary, January 2017Radical Analytics, Superweek Hungary, January 2017
Radical Analytics, Superweek Hungary, January 2017
 
Reinforcement & Punishment
Reinforcement & PunishmentReinforcement & Punishment
Reinforcement & Punishment
 
safe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learningsafe and efficient off policy reinforcement learning
safe and efficient off policy reinforcement learning
 
Quantum Information Technology
Quantum Information TechnologyQuantum Information Technology
Quantum Information Technology
 
Deep Reinforcement Learning An Introduction
Deep Reinforcement Learning An IntroductionDeep Reinforcement Learning An Introduction
Deep Reinforcement Learning An Introduction
 

Similaire à Introduction to Reinforcement Learning

Summery of Robust and Effective Metric Learning Using Capped Trace Norm
Summery of  Robust and Effective Metric Learning Using Capped Trace NormSummery of  Robust and Effective Metric Learning Using Capped Trace Norm
Summery of Robust and Effective Metric Learning Using Capped Trace Normssuser42f2881
 
Jsai final final final
Jsai final final finalJsai final final final
Jsai final final finaldinesh malla
 
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemTMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemIosif Itkin
 
Spark summit talk, july 2014 powered by reveal
Spark summit talk, july 2014 powered by revealSpark summit talk, july 2014 powered by reveal
Spark summit talk, july 2014 powered by revealDebasish Das
 
shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014Shuyang Li
 
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...Yuko Kuroki (黒木祐子)
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3arogozhnikov
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeTwo Sigma
 
Hands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonHands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonChun-Ming Chang
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TDAshwin Rao
 
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfCold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfPo-Chuan Chen
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12VuTran231
 
STA003_WK4_L.pptx
STA003_WK4_L.pptxSTA003_WK4_L.pptx
STA003_WK4_L.pptxMAmir23
 

Similaire à Introduction to Reinforcement Learning (20)

Summery of Robust and Effective Metric Learning Using Capped Trace Norm
Summery of  Robust and Effective Metric Learning Using Capped Trace NormSummery of  Robust and Effective Metric Learning Using Capped Trace Norm
Summery of Robust and Effective Metric Learning Using Capped Trace Norm
 
Naive Bayes Presentation
Naive Bayes PresentationNaive Bayes Presentation
Naive Bayes Presentation
 
Jsai final final final
Jsai final final finalJsai final final final
Jsai final final final
 
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light SystemTMPA-2015: Implementing the MetaVCG Approach in the C-light System
TMPA-2015: Implementing the MetaVCG Approach in the C-light System
 
Spark summit talk, july 2014 powered by reveal
Spark summit talk, july 2014 powered by revealSpark summit talk, july 2014 powered by reveal
Spark summit talk, july 2014 powered by reveal
 
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
MUMS: Transition & SPUQ Workshop - Practical Bayesian Optimization for Urban ...
 
shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014shuyangli_summerpresentation08082014
shuyangli_summerpresentation08082014
 
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
[AAAI2021] Combinatorial Pure Exploration with Full-bandit or Partial Linear ...
 
MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3MLHEP 2015: Introductory Lecture #3
MLHEP 2015: Introductory Lecture #3
 
Rademacher Averages: Theory and Practice
Rademacher Averages: Theory and PracticeRademacher Averages: Theory and Practice
Rademacher Averages: Theory and Practice
 
Mc td
Mc tdMc td
Mc td
 
Least Squares method
Least Squares methodLeast Squares method
Least Squares method
 
Dp
DpDp
Dp
 
Hands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in PythonHands-on Tutorial of Machine Learning in Python
Hands-on Tutorial of Machine Learning in Python
 
Value Function Geometry and Gradient TD
Value Function Geometry and Gradient TDValue Function Geometry and Gradient TD
Value Function Geometry and Gradient TD
 
RL intro
RL introRL intro
RL intro
 
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdfCold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
Cold_Start_Reinforcement_Learning_with_Softmax_Policy_Gradient.pdf
 
Cs229 notes12
Cs229 notes12Cs229 notes12
Cs229 notes12
 
Query optimisation
Query optimisationQuery optimisation
Query optimisation
 
STA003_WK4_L.pptx
STA003_WK4_L.pptxSTA003_WK4_L.pptx
STA003_WK4_L.pptx
 

Dernier

The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...Aggregage
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIUdaiappa Ramachandran
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfAnna Loughnan Colquhoun
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfDianaGray10
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfinfogdgmi
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.francesco barbera
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemAsko Soukka
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceMartin Humpolec
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 

Dernier (20)

The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
The Data Metaverse: Unpacking the Roles, Use Cases, and Tech Trends in Data a...
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdf
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
RAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AIRAG Patterns and Vector Search in Generative AI
RAG Patterns and Vector Search in Generative AI
 
Spring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdfSpring24-Release Overview - Wellingtion User Group-1.pdf
Spring24-Release Overview - Wellingtion User Group-1.pdf
 
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdfUiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
UiPath Solutions Management Preview - Northern CA Chapter - March 22.pdf
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Videogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdfVideogame localization & technology_ how to enhance the power of translation.pdf
Videogame localization & technology_ how to enhance the power of translation.pdf
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.Digital magic. A small project for controlling smart light bulbs.
Digital magic. A small project for controlling smart light bulbs.
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
Bird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystemBird eye's view on Camunda open source ecosystem
Bird eye's view on Camunda open source ecosystem
 
Things you didn't know you can use in your Salesforce
Things you didn't know you can use in your SalesforceThings you didn't know you can use in your Salesforce
Things you didn't know you can use in your Salesforce
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 

Introduction to Reinforcement Learning

  • 1. Introduction to Reinforcement Learning Edward Balaban January 17, 2014
  • 3. Objectives Introduction to Reinforcement Learning Edward Balaban Preliminaries MDP POMDP Introduce Reinforcement Learning and its applications Resources Overview Markov Decision Processes, Value Iteration, Policy Iteration, and Q-learning Overview Partially Observable Markov Decision Processes and methods to solve them Illustrate the above concepts with some examples 1 / 28
  • 4. What is Reinforcement Learning? Supervised learning: learn a model from training data that maps inputs to outputs, use it to generate outputs from future inputs Unsupervised learning: recognize patterns in input data Introduction to Reinforcement Learning Edward Balaban Preliminaries MDP POMDP Resources Reinforcement learning (RL): provide the learning agent with a reward function and let it figure out the best strategy for obtaining large rewards Some of the material in these slides is borrowed from Andrew Ng and Wheeler Ruml lectures on reinforcement learning 2 / 28
  • 5. What is Reinforcement Learning? Supervised learning: learn a model from training data that maps inputs to outputs, use it to generate outputs from future inputs Unsupervised learning: recognize patterns in input data Introduction to Reinforcement Learning Edward Balaban Preliminaries MDP POMDP Resources Reinforcement learning (RL): provide the learning agent with a reward function and let it figure out the best strategy for obtaining large rewards RL has been used in such diverse applications as: Business strategy planning Aircraft control Optimal routing (data packets, vehicles, etc) Robot motion control Some of the material in these slides is borrowed from Andrew Ng and Wheeler Ruml lectures on reinforcement learning 2 / 28
  • 6. How do we model for RL? Modeling frameworks with increasing levels of uncertainty: Introduction to Reinforcement Learning Edward Balaban State space models: no uncertainty Preliminaries Markov Decision Processes (MDPs): uncertainty in action effects POMDP MDP Resources Partially Observable Markov Decision Processes (POMDPs): uncertainty in action effects and current state 3 / 28
  • 7. How do we model for RL? Modeling frameworks with increasing levels of uncertainty: Introduction to Reinforcement Learning Edward Balaban State space models: no uncertainty Preliminaries Markov Decision Processes (MDPs): uncertainty in action effects POMDP MDP Resources Partially Observable Markov Decision Processes (POMDPs): uncertainty in action effects and current state Other modeling frameworks exist, e.g. Predictive State Representation: Generalizations of POMDPs that were shown to have both a greater representational capacity than POMDPs and yield representations that are at least as compact (Singh et al, 2004 and Even-Dar et al, 2005) Represent the state of a dynamic system by tracking occurrence probabilities of a set of future events (tests), conditioned on past events (histories) Rely solely on observable quantities (unlike POMDPs) 3 / 28
  • 9. Markov Decision Process (MDP) Introduction to Reinforcement Learning States: S = {s1 , ..., s|S| } Actions: A = {a1 , ..., a|A| } Preliminaries Transition probabilities: T (s, a, s ) = Pr (s |s, a) MDP Rewards: R:S→R Policy: π : S → A, Π is the set of all policies Edward Balaban Learning Solving Continuous state MDP Example POMDP Resources 4 / 28
  • 10. Markov Decision Process (MDP) Introduction to Reinforcement Learning States: S = {s1 , ..., s|S| } Actions: A = {a1 , ..., a|A| } Preliminaries Transition probabilities: T (s, a, s ) = Pr (s |s, a) MDP Rewards: R:S→R Policy: π : S → A, Π is the set of all policies Edward Balaban Learning Solving Continuous state MDP Example POMDP Value function: Resources V π (s) = E [R(s0 ) + γR(s1 ) + γ 2 R(s2 ) + . . . |s0 = s, π] 4 / 28
• 11. Markov Decision Process (MDP)
States: S = {s1, ..., s|S|}
Actions: A = {a1, ..., a|A|}
Transition probabilities: T(s, a, s') = Pr(s' | s, a)
Rewards: R : S → ℝ
Policy: π : S → A; Π is the set of all policies
Value function:
V^π(s) = E[ R(s0) + γ R(s1) + γ² R(s2) + ... | s0 = s, π ]
Bellman equation:
V^π(s) = R(s) + γ Σ_{s'∈S} T(s, π(s), s') V^π(s')
Optimal value function:
V*(s) = max_π V^π(s)
• 12. Markov Decision Process (MDP), continued
Bellman equation for the optimal value function:
V*(s) = R(s) + max_{a∈A} γ Σ_{s'∈S} T(s, a, s') V*(s')
Optimal policy:
π*(s) = argmax_{a∈A} γ Σ_{s'∈S} T(s, a, s') V*(s')
V*(s) = V^{π*}(s) ≥ V^π(s) for any policy π
• 14. Learning an MDP
Usually S, A, and γ are known. The transition probabilities can then be estimated from experience:
T(s, a, s') = (# times action a was taken in state s and led to s') / (# times action a was taken in state s)
Similarly, if R is unknown, we can take the estimate of the expected immediate reward R(s) in state s to be the average reward observed in that state.
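A minimal sketch of these counting estimates in Python (the function and variable names, and the uniform fallback for unvisited state-action pairs, are illustrative assumptions rather than anything specified in the deck):

```python
from collections import defaultdict

def estimate_model(experience, states):
    """Estimate T(s, a, s') and R(s) from a list of (s, a, s', r) tuples."""
    sa_counts = defaultdict(int)        # times action a was taken in state s
    sas_counts = defaultdict(int)       # times (s, a) led to s'
    reward_sums = defaultdict(float)    # total reward observed in each state
    state_visits = defaultdict(int)     # number of visits to each state

    for s, a, s_next, r in experience:
        sa_counts[(s, a)] += 1
        sas_counts[(s, a, s_next)] += 1
        reward_sums[s] += r
        state_visits[s] += 1

    def T(s, a, s_next):
        if sa_counts[(s, a)] == 0:
            return 1.0 / len(states)    # unvisited pair: fall back to uniform
        return sas_counts[(s, a, s_next)] / sa_counts[(s, a)]

    def R(s):
        return reward_sums[s] / state_visits[s] if state_visits[s] else 0.0

    return T, R
```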
• 15. Solving an MDP: Value Iteration
∀s ∈ S, V(s) ← 0
Repeat until convergence:
  ∀s ∈ S, V(s) ← R(s) + max_{a∈A} γ Σ_{s'∈S} T(s, a, s') V(s')
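The update above maps directly onto a short tabular implementation; a sketch reusing the hypothetical T and R callables from the previous sketch (the convergence tolerance tol is an illustrative choice):

```python
def value_iteration(states, actions, T, R, gamma, tol=1e-6):
    """Tabular value iteration: repeatedly apply the Bellman backup."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = R(s) + gamma * max(
                sum(T(s, a, s2) * V[s2] for s2 in states) for a in actions
            )
            delta = max(delta, abs(backup - V[s]))
            V[s] = backup
        if delta < tol:                 # stop once the largest change is small
            return V
```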
• 16. Convergence
From the definition of the Bellman operator:
||B(V1) − B(V2)||∞ = max_{s∈S} | R(s) + γ max_{a∈A} Σ_{s'∈S} P_sa(s') V1(s') − R(s) − γ max_{a∈A} Σ_{s'∈S} P_sa(s') V2(s') |   (1)
= γ · max_{s∈S} | max_{a∈A} Σ_{s'∈S} P_sa(s') V1(s') − max_{a∈A} Σ_{s'∈S} P_sa(s') V2(s') |   (2)
To go further, we need to understand whether the two maximization operations over the set of actions for V1 and V2 can be combined. To do that, let's use the following definitions:
f1(a) = Σ_{s'∈S} P_sa(s') V1(s')   (3)
f2(a) = Σ_{s'∈S} P_sa(s') V2(s')   (4)
• 17. Convergence, continued
To get rid of the max operators for the moment, let's also define a1* as the action that maximizes f1 and a2* as the action that maximizes f2. Then
| max_{a∈A} Σ_{s'∈S} P_sa(s') V1(s') − max_{a∈A} Σ_{s'∈S} P_sa(s') V2(s') |
can be written as |f1(a1*) − f2(a2*)|.
Since f1(a1*) ≥ f1(a2*) and f2(a2*) ≥ f2(a1*) (by virtue of a1* and a2* maximizing f1 and f2, respectively), we can "unpack" the absolute value operator as follows:
f1(a1*) − f2(a2*) ≤ f1(a1*) − f2(a1*)   (5)
f2(a2*) − f1(a1*) ≤ f2(a2*) − f1(a2*)   (6)
Then it is also true that
f1(a1*) − f2(a2*) ≤ |f1(a1*) − f2(a1*)|   (7)
f2(a2*) − f1(a1*) ≤ |f2(a2*) − f1(a2*)|   (8)
And, since each right-hand side is bounded by the maximum over all a,
f1(a1*) − f2(a2*) ≤ max_a |f1(a) − f2(a)|   (9)
f2(a2*) − f1(a1*) ≤ max_a |f2(a) − f1(a)|   (10)
Therefore we can conclude that
| max_a f1(a) − max_a f2(a) | ≤ max_a |f1(a) − f2(a)|   (11)
• 18. Convergence, continued
Then Equation 2 can be rewritten as an inequality:
||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} | Σ_{s'∈S} P_sa(s') V1(s') − Σ_{s'∈S} P_sa(s') V2(s') |   (12)
Simplifying further, we get:
||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} | Σ_{s'∈S} P_sa(s') (V1(s') − V2(s')) |   (13)
By using the triangle inequality and the fact that P_sa(s') ≥ 0, we can rewrite the above expression as
||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} Σ_{s'∈S} P_sa(s') |V1(s') − V2(s')|   (14)
Σ_{s'∈S} P_sa(s') |V1(s') − V2(s')| can be seen as the expectation of |V1(s') − V2(s')|. It is, therefore, no greater than the maximum value that |V1(s') − V2(s')| can take. Thus the above inequality can be written as:
||B(V1) − B(V2)||∞ ≤ γ · max_{s∈S} max_{a∈A} max_{s'∈S} |V1(s') − V2(s')|   (15)
• 19. Convergence, continued
The remaining expression on the right can only be maximized with respect to s', so we can simplify to
||B(V1) − B(V2)||∞ ≤ γ · max_{s'∈S} |V1(s') − V2(s')|   (16)
What we have on the right-hand side now is the definition of the infinity norm, therefore we finally obtain:
||B(V1) − B(V2)||∞ ≤ γ ||V1 − V2||∞   (17)
We'll prove that the Bellman operator has at most one fixed point by contradiction. Let's assume that there are two distinct fixed points, V1 and V2. Since B(V1) = V1 and B(V2) = V2, the inequality obtained above becomes
||V1 − V2||∞ ≤ γ ||V1 − V2||∞   (18)
(1 − γ) ||V1 − V2||∞ ≤ 0   (19)
Since γ ∈ [0, 1), we have 1 − γ > 0. The infinity norm of any vector is non-negative, so the only way for the above expression to hold is if ||V1 − V2||∞ = 0 and, consequently, V1 = V2. This contradicts the assumption that the fixed points are distinct, so the Bellman operator has at most one fixed point.
• 20. Using an MDP with value iteration
Repeat:
  Execute π in the MDP for some number of trials.
  Using the accumulated experience in the MDP, update the estimates for T(s, a, s') (and R, if applicable).
  Apply value iteration to get a new estimated value function V.
  Update π to be the greedy policy with respect to V.
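The last step, extracting the greedy policy from V, could look like the sketch below; it reuses the hypothetical T callable from the estimation sketch above, and the helper name is illustrative:

```python
def greedy_policy(V, states, actions, T):
    """Return the policy that acts greedily with respect to a value function V."""
    return {
        s: max(actions, key=lambda a: sum(T(s, a, s2) * V[s2] for s2 in states))
        for s in states
    }
```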
• 21. Solving an MDP: Policy Iteration
Initialize π randomly. Repeat until convergence:
  V ← V^π
  ∀s ∈ S, π(s) = argmax_{a∈A} Σ_{s'∈S} T(s, a, s') V(s')
The policy-evaluation step V ← V^π can be done efficiently by solving Bellman's equations as a system of linear equations.
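A sketch of policy iteration with the policy-evaluation step done as a linear solve (V^π = R + γ T_π V^π, i.e. (I − γ T_π) V^π = R); the state-indexing details are illustrative assumptions:

```python
import numpy as np

def policy_iteration(states, actions, T, R, gamma):
    """Policy iteration with exact policy evaluation via a linear solve."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    r = np.array([R(s) for s in states])
    pi = {s: actions[0] for s in states}                 # arbitrary initial policy

    while True:
        # Policy evaluation: solve (I - gamma * T_pi) V = R for V
        T_pi = np.array([[T(s, pi[s], s2) for s2 in states] for s in states])
        V = np.linalg.solve(np.eye(n) - gamma * T_pi, r)

        # Policy improvement: act greedily with respect to V
        new_pi = {
            s: max(actions, key=lambda a: sum(T(s, a, s2) * V[idx[s2]] for s2 in states))
            for s in states
        }
        if new_pi == pi:                                 # converged: policy is stable
            return pi, V
        pi = new_pi
```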
• 27. Solving (and learning) an MDP: Q-learning
Model-free reinforcement learning.
V(s) = R(s) + γ max_a Σ_{s'} T(s, a, s') V(s')
Think of Q-learning as a regression!
Explore states: in state s, took action a, got reward r, ended up in state s' (the tuple (s, a, s', r)).
Q(s, a) ← Q(s, a) + α(error)
Q(s, a) ← Q(s, a) + α(sensed − predicted)
Q(s, a) ← Q(s, a) + α([r + γ max_{a'} Q(s', a')] − Q(s, a))
A stochastic update with step size α.
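A tabular sketch of this update with ε-greedy exploration; the env interface (reset() → s, step(s, a) → (s', r, done)) and the hyperparameter values are illustrative assumptions, not something defined in the deck:

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)                                   # Q[(s, a)], initialised to 0

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])

            s_next, r, done = env.step(s, a)

            # Q(s,a) <- Q(s,a) + alpha * ([r + gamma * max_a' Q(s',a')] - Q(s,a))
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q
```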
• 30. Continuous State MDP
A more realistic form of MDP.
Needs a simulator.
Solving continuous-state MDPs:
  LQR
  Fitted Value Iteration
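As one concrete instance of the second approach, here is a sketch of fitted value iteration with a linear value-function approximator; simulate(s, a), phi(s), and the sample sizes are assumptions made for illustration, not something the deck specifies:

```python
import numpy as np

def fitted_value_iteration(sample_states, actions, simulate, R, phi, gamma,
                           n_next=10, n_iters=50):
    """Fitted value iteration: approximate V(s) ~ theta^T phi(s) by regression."""
    Phi = np.array([phi(s) for s in sample_states])        # m x d feature matrix
    theta = np.zeros(Phi.shape[1])

    for _ in range(n_iters):
        targets = []
        for s in sample_states:
            # Estimate q(s, a) by Monte Carlo over simulated successor states
            q = [
                R(s) + gamma * np.mean([phi(simulate(s, a)) @ theta
                                        for _ in range(n_next)])
                for a in actions
            ]
            targets.append(max(q))                         # Bellman backup target
        # Regress the backed-up values onto the features
        theta, *_ = np.linalg.lstsq(Phi, np.array(targets), rcond=None)
    return theta
```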
• 31. An example MDP - the inverted pendulum
A thin pole is connected via a free hinge to a cart.
The cart can move laterally on a smooth table surface.
Failure occurs if:
  the angle of the pole deviates by more than a certain amount from the vertical position, or
  the cart's position goes out of bounds.
The objective is to develop a controller to balance the pole.
The only actions the controller can take are to accelerate the cart either left or right.
The algorithm cannot use any knowledge of the dynamics of the underlying system.
• 32. Baby pendulum: results
[Figure: results plot; only axis tick values survive in the transcript, so the axis labels are not recoverable.]
• 34. Partially Observable Markov Decision Process (POMDP)
States: S = {s1, ..., s|S|}
Actions: A = {a1, ..., a|A|}
Transition probabilities: T(s, a, s') = Pr(s' | s, a)
Observations: Z = {z1, ..., z|Z|}
Observation probabilities: O(z, a, s') = Pr(z | s', a)
Belief state (agent): b = {b(s1), ..., b(s|S|)} : S → [0, 1]^|S|, with Σ_{i=1}^{|S|} b(si) = 1
Belief space: B, the set of all belief states (infinite)
Initial belief: b0
Rewards: R : S → ℝ
Policy: π : B → A; Π is the set of all policies
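After taking action a and observing z, the belief is updated by a Bayes filter, b'(s') ∝ O(z, a, s') Σ_s T(s, a, s') b(s). A minimal sketch of the discrete update, using the T and O definitions above (the function name and the zero-probability guard are my own illustrative choices):

```python
def belief_update(b, a, z, states, T, O):
    """Exact discrete POMDP belief update: b'(s') ∝ O(z, a, s') * sum_s T(s, a, s') * b(s)."""
    b_new = {}
    for s_next in states:
        predicted = sum(T(s, a, s_next) * b[s] for s in states)    # prediction step
        b_new[s_next] = O(z, a, s_next) * predicted                 # measurement update
    norm = sum(b_new.values())
    if norm == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    return {s: p / norm for s, p in b_new.items()}
```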
• 35. Solving a POMDP
Solving a realistic POMDP exactly is often computationally intractable.
Approximate method families:
  Point-based methods
  Monte Carlo methods
  Generalization methods
• 38. Example: Prognostic Decision Making
System Degradation
  All aerospace systems experience degradation.
  Degradation can be use- or time-dependent.
  The operating environment is often a significant factor. (Image: JAXA Hayabusa)
Faults
  Degradation can accelerate if a fault occurs.
  In a complex, multi-component system a fault can have cascading effects.
  In case of a fault, a quick mitigation decision is often required. (Image: United Flight 232)
System Health Management (SHM)
  Recent designs, e.g. the Sikorsky S-92, have more SHM capabilities (in fault detection and diagnosis).
  Still, maintenance is predominantly done on fixed schedules.
  In-flight emergencies are handled through the skill and ingenuity of the crew and ground control.
• 41. How Can We Do Better?
In recent years progress has been made in using physics modeling and computational methods for:
  fault detection,
  fault magnitude estimation,
  degradation trajectory prediction (prognostics).
Research on how to utilize prognostic health information is still in its very early stages, however.
Prognostic Decision Making (PDM)
  The process of selecting system actions informed by predictions of the future system health state.
  PDM can help with, for example: component life extension, fault mitigation, mission replanning, crew decision support in emergencies, condition-based maintenance, and asset allocation.
• 42. System
Described as a continuous-state, continuous-action POMDP:
State space: S ⊆ ℝ^n
Action space: A ⊆ ℝ^m
Observations: Z ⊆ ℝ^p
Transition function: T(s, a, s') = pdf(s' | s, a) : S × A × S → [0, ∞)
Observation function: O(z', a, s') = pdf(z' | s', a) : S × A × Z → [0, ∞)
Belief state: b(s) = pdf(s)
Belief space: B, the set of all belief states
Initial belief: b0
Belief update: b_a^{z'}(s') ∝ O(z', a, s') ∫_S T(s, a, s') b(s) ds
Policy: π(a, b) = pdf(a | b) : A × B → [0, ∞); Π is the set of all policies
Costs: C = {c1(s, a), ..., c|C|(s, a)} : S × A → ℝ^|C|
Rewards: R(s, r) = pdf(r | s) : S × ℝ → [0, ∞)
Objectives: F = {f1(s), ..., f|F|(s)} : S → ℝ^|F|
Constraints: G = {g1(s), ..., g|G|(s)} : S → 𝔹^|G|
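With a continuous state space the belief-update integral generally has no closed form, so in practice it is approximated, for example with a particle filter (the reasoning-architecture slide later in the deck mentions one). A sketch under assumed interfaces: sample_transition(s, a) draws s' from the transition pdf and obs_likelihood(z', a, s') evaluates O(z', a, s'); neither name comes from the deck.

```python
import numpy as np

def particle_filter_update(particles, a, z, sample_transition, obs_likelihood, rng=None):
    """Approximate continuous-state belief update with a bootstrap particle filter."""
    rng = rng if rng is not None else np.random.default_rng()

    # Prediction: propagate each particle through the stochastic transition model
    propagated = np.array([sample_transition(s, a) for s in particles])

    # Correction: weight each propagated particle by the observation likelihood
    weights = np.array([obs_likelihood(z, a, s_next) for s_next in propagated])
    weights = weights / weights.sum()

    # Resampling: draw an unweighted particle set representing the new belief
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]
```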
• 43. System Degradation
Let H = {h1, ..., h|H|} be the vector of system health parameters incorporated into the state.
Fault: G_fault ⊆ G defines significant deviations from the expected nominal behavior. A fault occurs if ∃i such that gi(s) = true, gi ∈ G_fault.
Failure: G_failure ⊆ G defines states where the system loses functional capability with respect to a health parameter h ∈ H.
System failure: F : S → 𝔹, a boolean function indicating when the entire system is effectively non-functional (F is defined via the G_failure set).
End of Life (EoL): t_EoL such that F(s) = true.
Remaining Useful Life: RUL = t_EoL − t.
[Figure: a health parameter h plotted against time t, crossing the fault threshold at t_fault and the failure threshold at EoL.]
• 44. States
Nominal (green), fault (yellow), and failure (red) states are defined using the G_fault and G_failure constraints.
• 45. Diagnostics
The process of determining the current belief state, b_t.
• 46. Prognostics
The process of determining, at time t, the belief state b_{t+Δt}, given the current policy π.
• 47. Decision Making
The process of finding (or approximating) π*, such that π* = argmax_{π∈Π} J^π(b_t).
• 48. Case Study: UAV Mission Replanning
Given:
  An initial mission route (not necessarily optimized), which includes waypoint parameter constraints (e.g. on airspeed or bank angle).
  Each waypoint is associated with a payoff value.
  A healthy vehicle is able to complete the entire route within the energy and component health constraints.
  Transition costs between a pair of waypoints are history-dependent.
  A fault occurs that makes it impossible to complete the mission before the End of Life (EoL).
Find:
  A policy π that maximizes mission payoff and extends the remaining useful life.
• 49. Reasoning Architecture
[Diagram: the Decision Maker exchanges candidate routes and health/energy cost estimates with the Vehicle Simulation (including prognostic models); the Diagnoser supplies the initial and current fault sets from vehicle observations; the Decision Maker sends proposed routes to the Vehicle, subject to the input route and parameter constraints.]
A Particle Filter is currently used as the decision-making algorithm.
The Decision Maker picks ordered waypoint subsets and parameter values for candidate and proposed routes.
The vehicle simulation is 6DOF, with prognostic models for battery and motor temperatures, as well as the battery state of charge.
The fault mode currently implemented is increased motor friction.
The fault leads to increased current consumption and motor/battery overheating.
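For intuition only, the Decision Maker / Vehicle Simulation loop in the diagram could be sketched as the search below. This is not the particle-filter decision maker the deck describes, and every name in it (simulate_route, payoff, within_constraints) is a hypothetical placeholder:

```python
def evaluate_candidate_routes(candidates, simulate_route, payoff, within_constraints):
    """Pick the best candidate route by simulating each one (illustrative only)."""
    best_route, best_score = None, float("-inf")
    for route in candidates:
        prediction = simulate_route(route)        # health and energy cost estimates
        if not within_constraints(prediction):    # e.g. failure predicted before mission end
            continue
        score = payoff(route, prediction)         # mission payoff (and RUL extension)
        if score > best_score:
            best_route, best_score = route, score
    return best_route
```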
• 50. Mission Replanning Simulation
[Figure: mission replanning simulation; no further content in the transcript.]
• 52. Resources
Sutton and Barto book: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/
Intro to POMDPs: http://cs.brown.edu/research/ai/pomdp/tutorial/index.html
Stanford Autonomous Helicopter project: http://heli.stanford.edu
NASA Vehicle Health Management (Intelligent Systems Division): http://ti.arc.nasa.gov/tech/dash/pcoe/publications/
E. Balaban and J. J. Alonso, "A Modeling Framework for Prognostic Decision Making and its Application to UAV Mission Planning", in Proceedings of the Annual Conference of the Prognostics and Health Management Society, 2013, pp. 1-12: https://c3.nasa.gov/dashlink/resources/881/