5. What is Reinforcement Learning?
Supervised learning: learn a model from training data
that maps inputs to outputs, use it to generate outputs
from future inputs
Unsupervised learning: recognize patterns in input data
Reinforcement learning (RL): provide the learning agent
with a reward function and let it figure out the best
strategy for obtaining large rewards
RL has been used in such diverse applications as:
Business strategy planning
Aircraft control
Optimal routing (data packets, vehicles, etc.)
Robot motion control
Some of the material in these slides is borrowed from Andrew Ng's and Wheeler Ruml's lectures on reinforcement learning.
7. How do we model for RL?
Modeling frameworks with increasing levels of uncertainty:
State space models:
no uncertainty
Markov Decision Processes (MDPs):
uncertainty in action effects
Partially Observable Markov Decision Processes (POMDPs):
uncertainty in action effects and current state
Other modeling frameworks exist, e.g. Predictive State Representations (PSRs):
Generalizations of POMDPs that have been shown to have both a
greater representational capacity than POMDPs and to yield
representations that are at least as compact (Singh et al., 2004;
Even-Dar et al., 2005)
Represent the state of a dynamic system by tracking occurrence
probabilities of a set of future events (tests), conditioned on past
events (histories)
Rely solely on observable quantities (unlike POMDPs)
11. Markov Decision Process (MDP)
States: $S = \{s_1, \ldots, s_{|S|}\}$
Actions: $A = \{a_1, \ldots, a_{|A|}\}$
Transition probabilities: $T(s, a, s') = \Pr(s' \mid s, a)$
Rewards: $R : S \to \mathbb{R}$
Policy: $\pi : S \to A$; $\Pi$ is the set of all policies
Value function:
$$V^\pi(s) = \mathbb{E}\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi \right]$$

Bellman Equation:
$$V^\pi(s) = R(s) + \gamma \sum_{s' \in S} T(s, \pi(s), s')\, V^\pi(s')$$

Optimal value function:
$$V^*(s) = \max_{\pi} V^\pi(s)$$
12. Markov Decision Process (MDP), continued
Bellman Equation for the optimal value function:
$$V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s')$$
Policy:
$$\pi^*(s) = \arg\max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V^*(s')$$

$$V^*(s) = V^{\pi^*}(s) \ge V^\pi(s) \quad \forall \pi \in \Pi$$
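To ground these definitions, here is a minimal tabular-MDP sketch in Python (the toy numbers and helper names are illustrative assumptions, not from the slides), including greedy-policy extraction per the $\pi^*$ formula above:

```python
import numpy as np

# A tabular MDP: T[s, a, s'] holds Pr(s'|s, a), R[s] the state reward.
# The 2-state, 2-action numbers are made up purely for illustration.
gamma = 0.9
T = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])   # T[s, a, s']
R = np.array([0.0, 1.0])

def greedy_policy(T, V):
    """pi(s) = argmax_a sum_s' T(s, a, s') V(s')."""
    return (T @ V).argmax(axis=1)          # (T @ V)[s, a] = sum_s' T[s,a,s'] V[s']

print(greedy_policy(T, np.array([0.0, 1.0])))   # best action in each state
```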
14. Learning an MDP
Usually S, A, and γ are known.
$$\hat{T}(s, a, s') = \frac{\#\ \text{times took action } a \text{ in state } s \text{ and got to } s'}{\#\ \text{times took action } a \text{ in state } s}$$
Similarly, if R is unknown, can also pick our estimate of the
expected immediate reward R(s) in state s to be the average
reward observed in that state.
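A sketch of these maximum-likelihood estimates from logged $\langle s, a, s', r \rangle$ experience (the function names are illustrative assumptions):

```python
from collections import defaultdict

# Count-based estimates of T and R from observed transitions.
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s']
reward_sum, reward_cnt = defaultdict(float), defaultdict(int)

def record(s, a, s_next, r):
    counts[(s, a)][s_next] += 1
    reward_sum[s] += r
    reward_cnt[s] += 1

def T_hat(s, a, s_next):
    """# times action a in state s led to s' / # times a was taken in s."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

def R_hat(s):
    """Average reward observed in state s."""
    return reward_sum[s] / reward_cnt[s] if reward_cnt[s] else 0.0

record(0, 1, 1, 1.0); record(0, 1, 0, 0.0)
print(T_hat(0, 1, 1), R_hat(0))   # 0.5 0.5
```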
15. Solving an MDP: Value Iteration
$\forall s \in S,\ V(s) \leftarrow 0$

Repeat until convergence:
$$\forall s \in S,\ V(s) \leftarrow R(s) + \max_{a \in A} \gamma \sum_{s' \in S} T(s, a, s')\, V(s')$$
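A direct sketch of this update in Python, reusing the T[s, a, s'] array convention from the earlier sketch (the toy MDP is again an illustrative assumption):

```python
import numpy as np

def value_iteration(T, R, gamma, tol=1e-8):
    """V(s) <- R(s) + max_a gamma * sum_s' T(s, a, s') V(s')."""
    V = np.zeros(T.shape[0])                 # for all s, V(s) <- 0
    while True:
        V_new = R + gamma * (T @ V).max(axis=1)
        if np.abs(V_new - V).max() < tol:    # converged (infinity norm)
            return V_new
        V = V_new

T = np.array([[[0.8, 0.2], [0.1, 0.9]],     # toy T[s, a, s']
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([0.0, 1.0])
print(value_iteration(T, R, gamma=0.9))
```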
16. Convergence
From the definition of the Bellman operator:

$$\|B(V_1) - B(V_2)\|_\infty = \max_{s \in S} \left| R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V_1(s') - R(s) - \gamma \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V_2(s') \right| \tag{1}$$

$$= \gamma \cdot \max_{s \in S} \left| \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V_1(s') - \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V_2(s') \right| \tag{2}$$

To go further, we need to understand whether the two maximization operations over the set of actions for $V_1$ and $V_2$ can be combined. To do that, let's use the following definitions:

$$f_1(a) = \sum_{s' \in S} P_{sa}(s') V_1(s') \tag{3}$$

$$f_2(a) = \sum_{s' \in S} P_{sa}(s') V_2(s') \tag{4}$$
17. Convergence, continued
In order to, for the moment, get rid of the max operators, let's also define $a_1^*$ as the action that maximizes $f_1$ and $a_2^*$ as the action that maximizes $f_2$. Then $\left| \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V_1(s') - \max_{a \in A} \sum_{s' \in S} P_{sa}(s') V_2(s') \right|$ can be written as $|f_1(a_1^*) - f_2(a_2^*)|$.

Since $f_1(a_1^*) \ge f_1(a_2^*)$ and $f_2(a_2^*) \ge f_2(a_1^*)$ (by virtue of $a_1^*$ and $a_2^*$ maximizing $f_1$ and $f_2$, respectively), we can "unpack" the absolute value operator as follows:

$$f_1(a_1^*) - f_2(a_2^*) \le f_1(a_1^*) - f_2(a_1^*) \tag{5}$$

$$f_2(a_2^*) - f_1(a_1^*) \le f_2(a_2^*) - f_1(a_2^*) \tag{6}$$

Then it is also true that

$$f_1(a_1^*) - f_2(a_2^*) \le |f_1(a_1^*) - f_2(a_1^*)| \tag{7}$$

$$f_2(a_2^*) - f_1(a_1^*) \le |f_2(a_2^*) - f_1(a_2^*)| \tag{8}$$

And, finally, it should also be true for all $a$ that

$$f_1(a_1^*) - f_2(a_2^*) \le \max_a |f_1(a) - f_2(a)| \tag{9}$$

$$f_2(a_2^*) - f_1(a_1^*) \le \max_a |f_2(a) - f_1(a)| \tag{10}$$

Therefore we can conclude that

$$\left| \max_a f_1(a) - \max_a f_2(a) \right| \le \max_a |f_1(a) - f_2(a)| \tag{11}$$
18. Convergence, continued
Then Equation 2 can be rewritten as an inequality:

$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \left| \sum_{s' \in S} P_{sa}(s') V_1(s') - \sum_{s' \in S} P_{sa}(s') V_2(s') \right| \tag{12}$$

Simplifying further, we get:

$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \left| \sum_{s' \in S} P_{sa}(s') \left( V_1(s') - V_2(s') \right) \right| \tag{13}$$

By using the triangle inequality and the fact that $P_{sa}(s') \ge 0$, we can rewrite the above expression as

$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \sum_{s' \in S} P_{sa}(s') \left| V_1(s') - V_2(s') \right| \tag{14}$$

$\sum_{s' \in S} P_{sa}(s') \left| V_1(s') - V_2(s') \right|$ can be seen as the expectation of $\left| V_1(s') - V_2(s') \right|$. It is, therefore, no greater than the maximum value that $\left| V_1(s') - V_2(s') \right|$ can take. Thus the above inequality can be written as:

$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s \in S} \max_{a \in A} \max_{s' \in S} \left| V_1(s') - V_2(s') \right| \tag{15}$$
19. Convergence, continued
The remaining expression on the right can only be maximized with respect to $s'$, so we can simplify to

$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \cdot \max_{s' \in S} \left| V_1(s') - V_2(s') \right| \tag{16}$$

What we have on the right-hand side now is the definition of the infinity norm, therefore we finally obtain:

$$\|B(V_1) - B(V_2)\|_\infty \le \gamma \|V_1 - V_2\|_\infty \tag{17}$$

We'll prove that the Bellman operator has at most one fixed point by contradiction. Let's assume that there are two distinct fixed points, $V_1$ and $V_2$. Since $B(V_1) = V_1$ and $B(V_2) = V_2$, the inequality obtained above becomes

$$\|V_1 - V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty \tag{18}$$

$$(1 - \gamma)\|V_1 - V_2\|_\infty \le 0 \tag{19}$$

Since $\gamma \in [0, 1)$, we have $1 - \gamma > 0$. An infinity norm is non-negative, so the only way for the above expression to hold is if $\|V_1 - V_2\|_\infty = 0$ and, consequently, $V_1 = V_2$. Therefore we have proved that the Bellman operator has at most one fixed point.
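The contraction property in Equation 17 is easy to spot-check numerically; a sketch on a random MDP (all sizes and names are arbitrary illustrative assumptions):

```python
import numpy as np

# Numeric spot-check of ||B(V1) - B(V2)||_inf <= gamma * ||V1 - V2||_inf.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.random((n_states, n_actions, n_states))   # P[s, a, s']
P /= P.sum(axis=2, keepdims=True)                 # rows sum to 1
R = rng.random(n_states)

def bellman(V):
    """B(V)(s) = R(s) + gamma * max_a sum_s' P[s, a, s'] V(s')."""
    return R + gamma * (P @ V).max(axis=1)

V1, V2 = rng.random(n_states), rng.random(n_states)
lhs = np.abs(bellman(V1) - bellman(V2)).max()     # ||B(V1) - B(V2)||_inf
rhs = gamma * np.abs(V1 - V2).max()               # gamma * ||V1 - V2||_inf
print(lhs <= rhs + 1e-12)                         # always True
```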
20. Using an MDP with value iteration
Repeat:
Execute π in the MDP for some number of trials.
Using the accumulated experience in the MDP, update estimates for $T(s, a, s')$ (and $R$, if applicable).
Apply value iteration to get a new estimated value function $V$.
Update π to be the greedy policy with respect to $V$.
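A sketch of this loop, assuming the value_iteration and greedy_policy helpers from the earlier sketches are in scope, plus a hypothetical simulator step(s, a) -> (s', r); the epsilon-greedy exploration is an added assumption, since the slide does not specify how trials are generated:

```python
import numpy as np

def model_based_rl(step, n_states, n_actions, gamma=0.9,
                   n_rounds=50, trials_per_round=20, horizon=30, epsilon=0.1):
    rng = np.random.default_rng(0)
    N = np.zeros((n_states, n_actions, n_states))    # transition counts
    R_sum, R_cnt = np.zeros(n_states), np.zeros(n_states)
    pi = rng.integers(n_actions, size=n_states)
    for _ in range(n_rounds):
        # Execute pi in the MDP for some number of trials.
        for _ in range(trials_per_round):
            s = 0
            for _ in range(horizon):
                a = rng.integers(n_actions) if rng.random() < epsilon else pi[s]
                s_next, r = step(s, a)
                N[s, a, s_next] += 1
                R_sum[s] += r; R_cnt[s] += 1
                s = s_next
        # Update the estimates of T and R from the accumulated counts
        # (unvisited (s, a) rows stay all-zero in this crude sketch).
        T_hat = N / np.maximum(N.sum(axis=2, keepdims=True), 1)
        R_hat = R_sum / np.maximum(R_cnt, 1)
        # Value iteration on the learned model, then the greedy policy.
        V = value_iteration(T_hat, R_hat, gamma)
        pi = greedy_policy(T_hat, V)
    return pi
```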
21. Solving an MDP: Policy Iteration
Initialize π randomly.
Repeat until convergence:
$$V \leftarrow V^\pi$$
$$\forall s \in S,\ \pi(s) \leftarrow \arg\max_{a \in A} \sum_{s' \in S} T(s, a, s')\, V(s')$$

$V \leftarrow V^\pi$ can be done efficiently by solving Bellman's equations as a system of linear equations.
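A sketch of policy iteration with the linear-system policy evaluation the slide mentions: for fixed π, $V^\pi$ solves $(I - \gamma T_\pi)V = R$ (the toy MDP is the same illustrative assumption as before):

```python
import numpy as np

def policy_iteration(T, R, gamma):
    """Policy evaluation by linear solve, then greedy improvement."""
    n_states = T.shape[0]
    rng = np.random.default_rng(0)
    pi = rng.integers(T.shape[1], size=n_states)   # initialize pi randomly
    while True:
        T_pi = T[np.arange(n_states), pi]          # T_pi[s, s'] = T(s, pi(s), s')
        V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R)
        pi_new = (T @ V).argmax(axis=1)            # greedy improvement step
        if np.array_equal(pi_new, pi):             # converged: policy is stable
            return pi, V
        pi = pi_new

T = np.array([[[0.8, 0.2], [0.1, 0.9]],           # toy T[s, a, s']
              [[0.5, 0.5], [0.9, 0.1]]])
R = np.array([0.0, 1.0])
print(policy_iteration(T, R, gamma=0.9))
```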
27. Solving (and learning) an MDP: Q-learning
Model-free reinforcement learning.

$$V(s) = R(s) + \gamma \max_{a} \sum_{s'} T(s, a, s')\, V(s')$$

Think of Q-learning as a regression!

Explore states: in state $s$, took action $a$, got reward $r$, ended up in state $s'$ ($\langle s, a, s', r \rangle$).

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,(\text{error})$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha\,(\text{sensed} - \text{predicted})$$

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left(\left[r + \gamma \max_{a'} Q(s', a')\right] - Q(s, a)\right)$$

Stochastic update with step size α.
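A tabular Q-learning sketch of this update; the step(s, a) -> (s', r) interface and the epsilon-greedy exploration scheme are illustrative assumptions, not from the slide:

```python
import numpy as np

def q_learning(step, n_states, n_actions, alpha=0.1, gamma=0.9,
               epsilon=0.1, n_steps=10_000):
    """Q(s,a) <- Q(s,a) + alpha * ([r + gamma * max_a' Q(s',a')] - Q(s,a))."""
    rng = np.random.default_rng(0)
    Q = np.zeros((n_states, n_actions))
    s = 0
    for _ in range(n_steps):
        # Explore: in state s take action a, get reward r, end up in s'.
        a = rng.integers(n_actions) if rng.random() < epsilon else Q[s].argmax()
        s_next, r = step(s, a)
        # sensed - predicted: the temporal-difference error.
        error = (r + gamma * Q[s_next].max()) - Q[s, a]
        Q[s, a] += alpha * error          # stochastic update, step size alpha
        s = s_next
    return Q
```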
30. Continuous State MDP
A more realistic form of MDP
Needs a simulator
Solving continuous-state MDPs:
LQR
Fitted Value Iteration
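A rough sketch of fitted value iteration: sample states, back up Bellman targets using simulator draws, and refit a function approximator to the targets. The simulate(s, a) -> (s', r) interface and the linear feature regression are assumptions for illustration, not a prescribed design:

```python
import numpy as np

def fitted_value_iteration(simulate, sample_states, actions,
                           gamma=0.9, n_iters=50, n_next=10):
    """Fit V(s) ~ phi(s) . theta to sampled Bellman backup targets."""
    phi = lambda s: np.concatenate(([1.0], np.atleast_1d(s)))  # features
    theta = np.zeros(len(phi(sample_states[0])))
    V = lambda s: phi(s) @ theta
    for _ in range(n_iters):
        # For each sampled state, estimate max_a E[r + gamma * V(s')]
        # with n_next simulator draws per action, then refit V by
        # least squares on the resulting targets.
        y = np.array([max(np.mean([r + gamma * V(s2)
                                   for s2, r in (simulate(s, a)
                                                 for _ in range(n_next))])
                          for a in actions)
                      for s in sample_states])
        X = np.array([phi(s) for s in sample_states])
        theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return V   # approximate value function
```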
31. An example MDP - the inverted pendulum
A thin pole is connected via a free hinge to a cart
The cart can move laterally on a smooth table surface
Failure occurs if:
the angle of the pole deviates by more than a certain amount
from the vertical position
the cart’s position goes out of bounds
The objective is to develop a controller to balance the pole
The only actions the controller can take are to accelerate the cart left or right
The algorithm cannot use any knowledge of the dynamics of the
underlying system
34. Partially Observable Markov Decision Process (POMDP)
States: $S = \{s_1, \ldots, s_{|S|}\}$
Actions: $A = \{a_1, \ldots, a_{|A|}\}$
Transition probabilities: $T(s, a, s') = \Pr(s' \mid s, a)$
Observations: $Z = \{z_1, \ldots, z_{|Z|}\}$
Observation probabilities: $O(z', a, s') = \Pr(z' \mid s', a)$
Belief state (agent): $b = \{b(s_1), \ldots, b(s_{|S|})\} : S \to [0, 1]^{|S|}$, with $\sum_{i=1}^{|S|} b(s_i) = 1$
Belief space: $B$, the set of all belief states (infinite)
Initial belief: $b_0$
Rewards: $R : S \to \mathbb{R}$
Policy: $\pi : B \to A$; $\Pi$ is the set of all policies
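The belief state is maintained by Bayesian filtering. A standard discrete belief-update sketch, $b'(s') \propto O(z', a, s') \sum_s T(s, a, s')\, b(s)$ (the continuous version appears on a later slide; the toy numbers here are made up):

```python
import numpy as np

def belief_update(b, a, z, T, O):
    """T[s, a, s'] = Pr(s'|s, a); O[z, a, s'] = Pr(z|s', a)."""
    predicted = b @ T[:, a, :]          # sum_s T(s, a, s') b(s)
    b_new = O[z, a, :] * predicted      # reweight by observation likelihood
    return b_new / b_new.sum()          # normalize so the belief sums to 1

T = np.array([[[0.9, 0.1], [0.2, 0.8]],   # toy T[s, a, s']
              [[0.4, 0.6], [0.7, 0.3]]])
O = np.array([[[0.8, 0.3], [0.8, 0.3]],   # toy O[z, a, s']
              [[0.2, 0.7], [0.2, 0.7]]])
print(belief_update(np.array([0.5, 0.5]), a=0, z=1, T=T, O=O))
```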
35. Solving a POMDP
Solving a realistic POMDP exactly is often computationally
intractable.
Approximate method families:
Point-based methods
Monte Carlo methods
Generalization methods
38. Example: Prognostic Decision Making
System Degradation
All aerospace systems experience degradation
Degradation can be use- or time-dependent
The operating environment is often a significant factor
[Image: JAXA Hayabusa]

Faults
Degradation can accelerate if a fault occurs
In a complex, multi-component system a fault can have cascading effects
In case of a fault, a quick mitigation decision is often required
[Image: United Flight 232]

System Health Management (SHM)
Recent designs, e.g. the Sikorsky S-92, have more SHM capabilities (in fault detection and diagnosis)
Still, maintenance is predominantly done based on fixed schedules
In-flight emergencies are handled through the skill and ingenuity of the crew and ground control
[Image: Sikorsky S-92]
41. How Can We Do Better?
In recent years progress has been made in using physics modeling and computational methods for:
Fault detection,
Fault magnitude estimation,
Degradation trajectory prediction (prognostics).

Research on how to utilize prognostic health information is in the very early stages, however.

Prognostic Decision Making (PDM)
The process of selecting system actions informed by predictions of the future system health state
PDM can help with, for example:
Component life extension,
Fault mitigation,
Mission replanning,
Crew decision support in emergencies,
Condition-based maintenance,
Asset allocation.
42. System
Described as a continuous-state, continuous-action POMDP:
State space: $S \subseteq \mathbb{R}^n$
Action space: $A \subseteq \mathbb{R}^m$
Observations: $Z \subseteq \mathbb{R}^p$
Transition function: $T(s, a, s') = \mathrm{pdf}(s' \mid s, a) : S \times A \times S \to [0, \infty)$
Observation function: $O(z', a, s') = \mathrm{pdf}(z' \mid s', a) : S \times A \times Z \to [0, \infty)$
Belief state: $b(s) = \mathrm{pdf}(s)$
Belief space: $B$, the set of all belief states
Initial belief: $b_0$
Belief update: $b_a^z(s') \propto O(z', a, s') \int_S T(s, a, s')\, b(s)\, ds$
Policy: $\pi(a, b) = \mathrm{pdf}(a \mid b) : A \times B \to [0, \infty)$; $\Pi$ is the set of all policies
Costs: $C = \{c_1(s, a), \ldots, c_{|C|}(s, a)\} : S \times A \to \mathbb{R}^{|C|}$
Rewards: $R(s, r) = \mathrm{pdf}(r \mid s) : S \times \mathbb{R} \to [0, \infty)$
Objectives: $F = \{f_1(s), \ldots, f_{|F|}(s)\} : S \to \mathbb{R}^{|F|}$
Constraints: $G = \{g_1(s), \ldots, g_{|G|}(s)\} : S \to \mathbb{B}^{|G|}$
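For continuous states, the integral in the belief update is typically approximated; a particle-filter sketch, where transition_sample and obs_pdf are hypothetical stand-ins for the system's T and O:

```python
import numpy as np

# Particle approximation of b_a^z(s') ∝ O(z', a, s') ∫ T(s, a, s') b(s) ds,
# with the belief represented by weighted particles.
def particle_belief_update(particles, weights, a, z,
                           transition_sample, obs_pdf, rng):
    # Propagate each particle through the transition model; the integral
    # over b(s) is approximated by the particle set itself.
    new_particles = np.array([transition_sample(s, a, rng) for s in particles])
    # Reweight by the observation likelihood O(z', a, s').
    new_weights = weights * np.array([obs_pdf(z, a, s2) for s2 in new_particles])
    new_weights /= new_weights.sum()    # the ∝ becomes a normalization
    return new_particles, new_weights
```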
43. System Degradation
Let $H = \{h_1, \ldots, h_{|H|}\}$ be the vector of system health parameters incorporated into the state.

Fault: $G_{fault} \subseteq G$ defines significant deviations from the expected nominal behavior. A fault occurs if $\exists i : g_i(s) = \text{true},\ g_i \in G_{fault}$.

Failure: $G_{failure} \subseteq G$ defines states where the system loses functional capability with respect to a health parameter $h \in H$.

System failure: $F : S \to \mathbb{B}$, a boolean function indicating when the entire system is effectively non-functional ($F$ is defined via the $G_{failure}$ set).

End of Life (EoL): $t_{EoL}$, the time at which $F(s) = \text{true}$.

Remaining Useful Life: $RUL = t_{EoL} - t$
[Figure: health parameter $h$ (from 1 down to 0) vs. time $t$, crossing the fault threshold at $t_{fault}$ and the failure threshold at EoL]
47. Decision Making

The process of finding (or approximating) $\pi^*$, such that
$$\pi^* = \arg\max_{\pi \in \Pi} J^\pi(b_t)$$
48. Case Study: UAV Mission Replanning
Given:
An initial mission route (not necessarily optimized) which includes
waypoint parameter constraints (e.g. on airspeed or bank angle).
Each waypoint is associated with a payoff value
A healthy vehicle is able to complete the entire route within the energy
and component health constraints
Transition costs between a pair of waypoints are history-dependent
A fault occurs that makes it impossible to complete the mission before
the End of Life (EoL)
Find:
A policy π that maximizes mission payoff and extends the remaining useful
life
49. Reasoning Architecture
[Diagram: reasoning architecture linking the Decision Maker, Vehicle Simulation (including prognostic models), Diagnoser, and Vehicle; labeled flows include the input route and parameter constraints, candidate routes, health and energy cost estimates, observations, and the initial/current fault sets]
A Particle Filter is currently used as the decision-making algorithm.
The Decision Maker picks ordered waypoint subsets and parameter values for candidate and proposed routes.
The vehicle simulation is 6DOF, with prognostic models for battery and motor temperatures, as well as the battery state of charge.
The fault mode currently implemented is increased motor friction.
The fault leads to increased current consumption and motor/battery overheating.
50. Mission Replanning Simulation
52. Resources
Sutton and Barto book: http://webdocs.cs.ualberta.ca/~sutton/book/ebook/

Intro to POMDPs: http://cs.brown.edu/research/ai/pomdp/tutorial/index.html

Stanford Autonomous Helicopter project: http://heli.stanford.edu

NASA Vehicle Health Management (Intelligent Systems Division): http://ti.arc.nasa.gov/tech/dash/pcoe/publications/

E. Balaban and J. J. Alonso, "A Modeling Framework for Prognostic Decision Making and its Application to UAV Mission Planning", in Proceedings of the Annual Conference of the Prognostics and Health Management Society, 2013, pp. 1-12: https://c3.nasa.gov/dashlink/resources/881/