Zap Q-Learning
Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms
Center for Systems and Control (CSC@USC)
and Ming Hsieh Institute for Electrical Engineering
February 21, 2018
Adithya M. Devraj and Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida
Zap Q-Learning
Outline
1 Stochastic Approximation
2 Fastest Stochastic Approximation
3 Reinforcement Learning
4 Zap Q-Learning
5 Conclusions & Future Work
6 References
Stochastic Approximation
Stochastic Approximation: Basic Algorithm
What is Stochastic Approximation?
A simple goal: Find the solution θ∗ to
    ¯f(θ∗) := E[f(θ, W)]|θ=θ∗ = 0
What makes this hard?
1 The function f and the distribution of the random vector W may not be known
  – we may only know something about the structure of the problem
2 Even if everything is known, computation of the expectation may be expensive.
  For root finding, we may need to compute the expectation for many values of θ
3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αn f(θ(n), W(n))
The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017]
Stochastic Approximation: ODE Method
Algorithm and Convergence Analysis
Algorithm:
    θ(n + 1) = θ(n) + αn f(θ(n), W(n))
Goal:
    ¯f(θ∗) := E[f(θ, W)]|θ=θ∗ = 0
Interpretation: θ∗ ≡ stationary point of the ODE
    d/dt ϑ(t) = ¯f(ϑ(t))
Analysis: Stability of the ODE ⊕ (see Borkar’s monograph) =⇒ lim_{n→∞} θ(n) = θ∗
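To make the recursion concrete, here is a minimal sketch (not from the slides) of the basic SA recursion on a toy root-finding problem; the function f(θ, W) = −θ + W, the noise model, and the step-size choice αn = 1/(n + 1) are illustrative assumptions.

```python
import numpy as np

def stochastic_approximation(f_sample, theta0, n_iter=100_000, seed=0):
    """Basic SA recursion: theta(n+1) = theta(n) + alpha_n * f(theta(n), W(n)).

    f_sample(theta, rng) returns one noisy observation of f(theta, W).
    Step size alpha_n = 1/(n+1): sum(alpha_n) = inf, sum(alpha_n^2) < inf.
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    for n in range(n_iter):
        alpha = 1.0 / (n + 1)
        theta = theta + alpha * f_sample(theta, rng)
    return theta

# Toy example: fbar(theta) = E[-theta + W] = theta_star - theta, so the root is theta_star.
theta_star = 2.0
f_sample = lambda theta, rng: -theta + rng.normal(loc=theta_star, scale=1.0)
print(stochastic_approximation(f_sample, theta0=0.0))  # close to 2.0
```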
Stochastic Approximation: SA Example
Example: Monte-Carlo Estimation
Estimate the mean η = E[c(X)], where X is a random variable:
    η = ∫ c(x) fX(x) dx
SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ]
Algorithm: θ(n) = (1/n) Σ_{i=1}^{n} c(X(i))
    =⇒ (n + 1)θ(n + 1) = Σ_{i=1}^{n+1} c(X(i)) = nθ(n) + c(X(n + 1))
    =⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)]
SA Recursion: θ(n + 1) = θ(n) + αn f(θ(n), X(n + 1)), with αn = 1/(n + 1):
    Σ αn = ∞,   Σ α²n < ∞
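A quick numerical check (illustrative, not from the slides) that the SA recursion with αn = 1/(n + 1) reproduces the running average exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
samples = rng.exponential(scale=3.0, size=10_000)   # observations c(X(i)); mean eta = 3

theta = 0.0
for n, c in enumerate(samples):
    # SA recursion: theta(n+1) = theta(n) + (1/(n+1)) * [c(X(n+1)) - theta(n)]
    theta += (c - theta) / (n + 1)

print(theta, samples.mean())   # identical up to floating-point rounding
```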
Fastest Stochastic Approximation
Fastest Stochastic Approximation: Algorithm Performance
Performance Criteria
Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗:
1 Finite-n bound:
    P{‖˜θ(n)‖ ≥ ε} ≤ exp(−I(ε, n)),   I(ε, n) = O(nε²)
2 Asymptotic covariance:
    Σ = lim_{n→∞} n E[˜θ(n)˜θ(n)ᵀ],   √n ˜θ(n) ≈ N(0, Σ)
Fastest Stochastic Approximation: Algorithm Performance
Asymptotic Covariance
    Σ = lim_{n→∞} Σn = lim_{n→∞} n E[˜θ(n)˜θ(n)ᵀ],   √n ˜θ(n) ≈ N(0, Σ)
SA recursion for the covariance:
    Σn+1 ≈ Σn + (1/n)[(A + ½I)Σn + Σn(A + ½I)ᵀ + Σ∆],   A = (d/dθ) ¯f(θ∗)
Conclusions
1 If Re λ(A) ≥ −½ for some eigenvalue, then Σ is (typically) infinite
2 If Re λ(A) < −½ for every eigenvalue, then Σ = lim_{n→∞} Σn is the unique solution to the Lyapunov equation:
    0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆
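Once A and the noise covariance Σ∆ are available, the Lyapunov equation can be solved directly; here is a small sketch using SciPy's continuous Lyapunov solver, with an illustrative (assumed) A and Σ∆.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Illustrative inputs: a linearization A with Re(lambda) < -1/2, and a noise covariance.
A = np.array([[-2.0, 0.3],
              [0.0, -1.0]])
Sigma_Delta = np.array([[1.0, 0.2],
                        [0.2, 0.5]])

# Lyapunov equation: (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta = 0
M = A + 0.5 * np.eye(2)
Sigma = solve_continuous_lyapunov(M, -Sigma_Delta)   # solves M X + X M^T = -Sigma_Delta
print(Sigma)
```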
Fastest Stochastic Approximation: Stochastic Newton-Raphson
Optimal Asymptotic Covariance
Introduce a d × d matrix gain sequence {Gn}:
    θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n))
Assume it converges, and linearize:
    ˜θ(n + 1) ≈ ˜θ(n) + (1/(n + 1)) G [A˜θ(n) + ∆(n + 1)],   A = (d/dθ) ¯f(θ∗)
If G = G∗ := −A⁻¹, then
  the recursion resembles the Monte-Carlo estimate,
  the recursion resembles Newton-Raphson,
  and it is optimal: Σ∗ = G∗Σ∆G∗ᵀ ≤ ΣG for any other G
Polyak-Ruppert averaging is also optimal, but the first two properties are missing.
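Applying the Lyapunov characterization of the previous slide to the gain-G recursion (so A is replaced by GA and Σ∆ by GΣ∆Gᵀ, a standard identification and an assumption of this sketch), ΣG can be compared with the optimal Σ∗ = G∗Σ∆G∗ᵀ numerically; the matrices below are illustrative.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-2.0, 0.3],
              [0.0, -1.0]])
Sigma_Delta = np.eye(2)

def asymptotic_cov(G):
    """Limiting covariance for theta(n+1) = theta(n) + G/(n+1) f(theta(n), X(n)):
    solve (GA + I/2) Sigma + Sigma (GA + I/2)^T + G Sigma_Delta G^T = 0."""
    M = G @ A + 0.5 * np.eye(2)
    return solve_continuous_lyapunov(M, -(G @ Sigma_Delta @ G.T))

G_star = -np.linalg.inv(A)
Sigma_star = G_star @ Sigma_Delta @ G_star.T          # equals asymptotic_cov(G_star)
Sigma_g = asymptotic_cov(2.0 * np.eye(2))             # a fixed scalar gain g = 2

# Sigma_star <= Sigma_g in the positive-semidefinite sense:
print(np.linalg.eigvalsh(Sigma_g - Sigma_star))       # all eigenvalues >= 0
```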
Fastest Stochastic Approximation: Stochastic Newton-Raphson
Optimal Variance
Example: return to Monte-Carlo
    θ(n + 1) = θ(n) + (g/(n + 1)) [−θ(n) + X(n + 1)]
Normalization for analysis, with ∆(n) = X(n) − E[X(n)]:
    ˜θ(n + 1) = ˜θ(n) + (g/(n + 1)) [−˜θ(n) + ∆(n + 1)]
Example: X(n) = W²(n), W ∼ N(0, 1)
Asymptotic variance as a function of g:
    Σ = (σ²∆/2) · g²/(g − 1/2)   (infinite for g ≤ 1/2, minimized at g = 1 where Σ = σ²∆)
[Figure: asymptotic variance Σ as a function of g, 0 ≤ g ≤ 5]
[Figure: SA estimates of E[W²], W ∼ N(0, 1), for several gains g between 0.1 and 20]
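A simulation sketch (trial counts and horizon are illustrative assumptions) that checks the formula Σ = (σ²∆/2) g²/(g − 1/2) for this example, where σ²∆ = Var(W²) = 2:

```python
import numpy as np

rng = np.random.default_rng(0)
n_iter, n_trials = 20_000, 1_000
sigma2_Delta = 2.0   # Var(W^2) for W ~ N(0, 1)

for g in [1.0, 3.0]:
    theta = np.zeros(n_trials)                       # one SA run per trial, vectorized
    for n in range(n_iter):
        X = rng.standard_normal(n_trials) ** 2       # X(n+1) = W^2(n+1)
        theta += g / (n + 1) * (-theta + X)          # SA recursion with step g/(n+1)
    empirical = n_iter * np.mean((theta - 1.0) ** 2) # n * E[theta_tilde(n)^2]
    predicted = 0.5 * sigma2_Delta * g**2 / (g - 0.5)
    print(f"g = {g}: empirical ~ {empirical:.2f}, predicted = {predicted:.2f}")
```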
Fastest Stochastic Approximation: Stochastic Newton-Raphson
Optimal Asymptotic Covariance and Zap-SNR
Zap-SNR (designed to emulate deterministic Newton-Raphson)
Requires Ân ≈ A(θn) := (d/dθ) ¯f(θn)
    θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), X(n))
    Ân = Ân−1 + γn(An − Ân−1),   An = (d/dθ) f(θ(n), X(n))
Ân ≈ A(θn) requires high gain: γn/αn → ∞ as n → ∞
Always: αn = 1/n. Numerics that follow: γn = (1/n)^ρ, ρ ∈ (0.5, 1)
ODE for Zap-SNR:
    (d/dt) xt = (−A(xt))⁻¹ ¯f(xt),   A(x) = (d/dx) ¯f(x)
Not necessarily stable (just like deterministic Newton-Raphson)
General conditions for convergence are open
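A minimal sketch (not from the slides) of the two-time-scale Zap-SNR recursion for a setting where both f(θ, W) and its derivative can be sampled; the quadratic test problem, noise levels, and initialization are illustrative assumptions.

```python
import numpy as np

def zap_snr(f_sample, A_sample, theta0, n_iter=50_000, rho=0.85, seed=0):
    """Zap-SNR sketch: theta(n+1) = theta(n) + alpha_n * (-A_hat_n)^{-1} f(theta(n), X(n)),
    with A_hat_n = A_hat_{n-1} + gamma_n (A_n - A_hat_{n-1}) and gamma_n/alpha_n -> infinity."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    A_hat = -np.eye(d)                       # crude initialization (an assumption)
    for n in range(1, n_iter + 1):
        alpha, gamma = 1.0 / n, (1.0 / n) ** rho
        A_n = A_sample(theta, rng)           # noisy sample of (d/dtheta) f(theta, X)
        A_hat = A_hat + gamma * (A_n - A_hat)
        theta = theta + alpha * np.linalg.solve(-A_hat, f_sample(theta, rng))
    return theta

# Illustrative test problem: fbar(theta) = b - M*theta, so the root is theta* = M^{-1} b.
M = np.array([[2.0, 0.5], [0.0, 1.0]])
b = np.array([1.0, -1.0])
f_sample = lambda th, rng: b - M @ th + 0.1 * rng.standard_normal(2)
A_sample = lambda th, rng: -M + 0.1 * rng.standard_normal((2, 2))
print(zap_snr(f_sample, A_sample, np.zeros(2)), np.linalg.solve(M, b))
```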
Reinforcement Learning
and Stochastic Approximation
Reinforcement Learning: RL & SA
SA and RL Design
Functional equations in Stochastic Control are always of the form
    0 = E[F(h∗, Φ(n + 1)) | Φ(0), . . . , Φ(n)],   h∗ = ?
with Φ(n) = (state, action)
Galerkin relaxation:
    0 = E[F(hθ∗, Φ(n + 1)) ζn],   θ∗ = ?
Necessary Ingredients:
  Parameterized family {hθ : θ ∈ R^d}
  Adapted, d-dimensional stochastic process {ζn}
Examples are TD- and Q-Learning
These algorithms are thus special cases of stochastic approximation (as we all know)
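To make the Galerkin form concrete, here is a minimal TD(0)-style instance (illustrative, not from the slides): hθ(x) = θᵀψ(x), ζn = ψ(X(n)), and F is the temporal-difference term for evaluating a fixed policy with discounted cost. The Markov chain, cost vector, and step size are assumptions of the sketch.

```python
import numpy as np

def td0(sample_transition, psi, d, beta=0.95, n_iter=200_000, seed=0):
    """Galerkin/SA recursion for 0 = E[F(h_theta, Phi(n+1)) zeta_n], with
    F = c(X(n)) + beta*h(X(n+1)) - h(X(n)),  h_theta(x) = theta @ psi(x),  zeta_n = psi(X(n))."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    x = 0
    for n in range(1, n_iter + 1):
        x_next, cost = sample_transition(x, rng)
        td = cost + beta * theta @ psi(x_next) - theta @ psi(x)
        theta += (1.0 / n) * td * psi(x)          # zeta_n = psi(X(n))
        x = x_next
    return theta

# Illustrative 3-state Markov chain under a fixed policy (all numbers assumed).
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
c = np.array([1.0, 0.0, 2.0])
sample_transition = lambda x, rng: (rng.choice(3, p=P[x]), c[x])
psi = lambda x: np.eye(3)[x]                      # complete (tabular) basis
print(td0(sample_transition, psi, d=3))           # approximates the discounted value function
```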
Reinforcement Learning: MDP Theory
Stochastic Optimal Control
MDP Model
  X is a stationary controlled Markov chain, with input U
  For all states x and sets A,
      P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A)
  c : X × U → R is a cost function
  β < 1 is a discount factor
Value function:
    h∗(x) = min_U Σ_{n=0}^{∞} βⁿ E[c(X(n), U(n)) | X(0) = x]
Bellman equation:
    h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]}
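For a finite MDP, the Bellman equation can be solved by iterating the minimization on the right-hand side; a small value-iteration sketch with an illustrative (assumed) model:

```python
import numpy as np

# Illustrative finite MDP: 3 states, 2 actions; P[u] is the transition matrix under action u.
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],    # action u = 0
              [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]]])   # action u = 1
c = np.array([[1.0, 2.0], [0.0, 0.5], [3.0, 1.0]])                    # cost c(x, u)
beta = 0.9

h = np.zeros(3)
for _ in range(1000):
    # Bellman operator: h(x) <- min_u { c(x,u) + beta * sum_x' P_u(x,x') h(x') }
    h_new = np.min(c + beta * np.einsum('uxy,y->xu', P, h), axis=1)
    if np.max(np.abs(h_new - h)) < 1e-10:
        break
    h = h_new
print(h)   # approximates the value function h*
```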
Reinforcement Learning: Q-Learning
Q-function
Trick to swap expectation and minimum
Bellman equation:
    h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]}
Q-function:
    Q∗(x, u) := c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]
    h∗(x) = min_u Q∗(x, u)
Another Bellman equation:
    Q∗(x, u) = c(x, u) + βE[Q∗(X(n + 1)) | X(n) = x, U(n) = u]
    where Q∗(x) := min_u Q∗(x, u)
Reinforcement Learning: Q-Learning
Q-Learning and Galerkin Relaxation
Dynamic programming
Find the function Q∗ that solves
    E[c(X(n), U(n)) + βQ∗(X(n + 1)) − Q∗(X(n), U(n)) | Fn] = 0
That is,
    0 = E[F(Q∗, Φ(n + 1)) | Φ(0), . . . , Φ(n)],   with Φ(n + 1) = (X(n + 1), X(n), U(n))
Q-Learning
Find θ∗ that solves
    E[(c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))) ζn] = 0
The family {Qθ} and eligibility vectors {ζn} are part of algorithm design.
Reinforcement Learning: Q-Learning
Watkins’ Q-learning
Big Question: Can we Zap Q-Learning?
Find θ∗ that solves
    E[(c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))) ζn] = 0
Watkins’ algorithm is Stochastic Approximation
The family {Qθ} and eligibility vectors {ζn} in this design:
  Linearly parameterized family of functions: Qθ(x, u) = θᵀψ(x, u)
  ζn ≡ ψ(Xn, Un), with ψn(x, u) = 1{x = xn, u = un} (complete basis)
Asymptotic covariance is typically infinite
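A minimal tabular sketch of Watkins’ Q-learning with the complete basis (so the parameter is just a Q-table) and step size αn = g/n; the toy MDP, exploration policy, and gain g are illustrative assumptions, not the example from the slides.

```python
import numpy as np

# Illustrative 3-state, 2-action MDP; complete basis, so Q_theta(x, u) = Q[x, u].
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],
              [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]]])
c = np.array([[1.0, 2.0], [0.0, 0.5], [3.0, 1.0]])
beta, g = 0.9, 50.0            # discount factor; scalar gain in alpha_n = g/n (assumed)

rng = np.random.default_rng(0)
Q = np.zeros((3, 2))
x = 0
for n in range(1, 200_001):
    u = rng.integers(2)                        # randomized exploration policy
    x_next = rng.choice(3, p=P[u, x])
    # The temporal difference multiplies the basis vector psi(x, u), which for the
    # complete basis simply selects the (x, u) entry of the table.
    td = c[x, u] + beta * Q[x_next].min() - Q[x, u]
    Q[x, u] += (g / n) * td
    x = x_next
print(Q)    # rough approximation of Q*
```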
[Figure: Bellman error vs. n for Watkins’ Q-learning, Speedy Q-learning, Polyak-Ruppert averaging, and Zap Q-learning]
Zap Q-Learning
Asymptotic Covariance of Watkins’ Q-Learning
Improvements are needed!
[Figure: six-node network used in the example, and histogram of the parameter estimates θn(15) after n = 10⁶ iterations, shown relative to θ∗(15) ≈ 486.6]
Example from Devraj & M 2017
Zap Q-Learning: Watkins’ algorithm
Zap Q-learning
Zap Q-Learning ≡ Zap-SNR for Q-Learning
    0 = ¯f(θ) = E[f(θ, W(n))] := E[ζn (c(X(n), U(n)) + βQθ(X(n + 1)) − Qθ(X(n), U(n)))]
A(θ) = (d/dθ) ¯f(θ); at points of differentiability:
    A(θ) = E[ζn (βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n)))ᵀ]
    φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u)
Algorithm:
    θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), Φ(n)),   Ân = Ân−1 + γn(An − Ân−1)
    An := (d/dθ) f(θn, Φ(n)) = ζn (βψ(X(n + 1), φθn(X(n + 1))) − ψ(X(n), U(n)))ᵀ
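A minimal tabular sketch of the Zap Q-learning recursion above, with the complete basis so that ψ(x, u) is an indicator vector; the toy MDP, exploration policy, initialization of Ân, and the step-size offset in γn (used to avoid a singular first update) are illustrative assumptions.

```python
import numpy as np

# Illustrative 3-state, 2-action MDP with complete basis: d = 6, psi(x, u) = e_{(x,u)}.
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],
              [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]]])
c = np.array([[1.0, 2.0], [0.0, 0.5], [3.0, 1.0]])
beta, rho = 0.9, 0.85
nS, nA = 3, 2
d = nS * nA
psi = lambda x, u: np.eye(d)[x * nA + u]            # indicator basis vector

rng = np.random.default_rng(0)
theta = np.zeros(d)
A_hat = -np.eye(d)                                  # crude initialization (assumption)
x = 0
for n in range(1, 100_001):
    alpha = 1.0 / n
    gamma = (1.0 / (n + 1)) ** rho                  # gamma_n / alpha_n -> infinity
    u = rng.integers(nA)                            # randomized exploration policy
    x_next = rng.choice(nS, p=P[u, x])
    Q_next = theta.reshape(nS, nA)[x_next]
    u_opt = int(Q_next.argmin())                    # phi_theta(X(n+1)) = argmin_u Q_theta
    zeta = psi(x, u)
    # f(theta, Phi) = zeta * (c + beta * min_u Q_theta(x', u) - Q_theta(x, u))
    f_n = zeta * (c[x, u] + beta * Q_next.min() - theta @ psi(x, u))
    # A_n = zeta (beta * psi(x', u_opt) - psi(x, u))^T, averaged into A_hat at rate gamma_n
    A_n = np.outer(zeta, beta * psi(x_next, u_opt) - psi(x, u))
    A_hat = A_hat + gamma * (A_n - A_hat)
    theta = theta + alpha * np.linalg.solve(-A_hat, f_n)
    x = x_next
print(theta.reshape(nS, nA))                        # approximates Q*(x, u)
```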
Zap Q-Learning: Watkins’ algorithm
Zap Q-Learning ≡ Zap-SNR for Q-Learning
ODE Analysis: change of variables q = Q∗(ς)
The functional Q∗ maps cost functions to Q-functions:
    q(x, u) = ς(x, u) + β Σ_{x′} Pu(x, x′) min_{u′} q(x′, u′)
ODE for Zap-Q:
    qt = Q∗(ςt),   (d/dt) ςt = −ςt + c
⇒ convergence, optimal covariance, ...
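A sketch of the map ς ↦ Q∗(ς), computed by fixed-point iteration on its defining equation (a β-contraction); the MDP is the same illustrative model assumed in the earlier sketches, and with ς = c the result is the optimal Q-function.

```python
import numpy as np

def Q_map(sigma, P, beta, n_iter=2000):
    """Compute q = Q*(sigma): q(x,u) = sigma(x,u) + beta * sum_x' P_u(x,x') min_u' q(x',u')."""
    nA, nS, _ = P.shape
    q = np.zeros((nS, nA))
    for _ in range(n_iter):
        h = q.min(axis=1)                                 # min_u' q(x', u')
        q = sigma + beta * np.einsum('uxy,y->xu', P, h)   # fixed-point iteration
    return q

# Illustrative MDP (assumed numbers); with sigma = c this returns the optimal Q-function.
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.0, 0.1, 0.9]],
              [[0.1, 0.9, 0.0], [0.0, 0.2, 0.8], [0.5, 0.0, 0.5]]])
c = np.array([[1.0, 2.0], [0.0, 0.5], [3.0, 1.0]])
print(Q_map(c, P, beta=0.9))
```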
Zap Q-Learning: Watkins’ algorithm
Example: Stochastic Shortest Path
[Figure: six-node network for the stochastic-shortest-path example]
Convergence with Zap gain γn = n^−0.85
Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n
Optimal scalar gain is approximately αn = 1500/n
[Figure: Bellman error vs. n (up to 10⁶) for Watkins’ Q-learning with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn = n^−0.85 and with γn = αn]
Convergence of Zap-Q Learning. Discount factor: β = 0.99
Zap Q-Learning: Watkins’ algorithm
Optimize Walk to Cafe
Convergence with Zap gain γn = n^−0.85
[Figure: theoretical vs. empirical densities of Wn = √n ˜θn (entries #10 and #18), at n = 10⁴ and n = 10⁶, from 1000 independent trials]
CLT gives a good prediction of finite-n performance
Discount factor: β = 0.99
Zap Q-Learning: Watkins’ algorithm
Optimize Walk to Cafe
Local Convergence: θ(0) initialized in a neighborhood of θ∗
Algorithms compared: Watkins’ Q-learning with scalar gains g = 500, 1500, 5000; Speedy Q-learning; Polyak-Ruppert averaging; Zap-Q with γn ≡ αn^0.85 and with γn ≡ αn
[Figure: Bellman error vs. n, histograms of the Bellman error at n = 10⁶, and 2σ confidence intervals for the Q-learning algorithms]
Zap Q-Learning: Optimal stopping
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
Every eigenvalue λ of A is real and satisfies λ > −1/2 =⇒ asymptotic covariance is infinite
The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details)
[Figure: eigenvalues λi(A), and eigenvalues λi(GA) for the finance example]
The favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2
Zap Q-Learning: Optimal stopping
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
[Figure: theoretical vs. empirical densities of Wn = √n ˜θn (entries #1 and #7) at n = 2 × 10⁶, from 1000 trials, for Zap-Q and G-Q]
Zap Q-Learning: Optimal stopping
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
Histograms of the average reward obtained using the different algorithms:
[Figure: histograms of the average reward at n = 2 × 10⁴, 2 × 10⁵, and 2 × 10⁶, comparing G-Q(0) with g = 100 and g = 200 against Zap-Q with ρ = 0.8, 0.85, and 1.0]
Conclusions & Future Work
Conclusions
  Reinforcement Learning is not just cursed by dimension, but also by variance
  We need better design tools to improve performance
  The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance.
  Example: g∗ = 1500 was chosen based on asymptotic covariance
Future work:
  Q-learning with function approximation
    Obtain conditions for a stable algorithm in a general setting
  Optimal stopping time problems
  Adaptive optimization of algorithm parameters
  Finite-time analysis
Conclusions & Future Work
Thank you!
References
[Covers: Control Techniques for Complex Networks (S. Meyn, Cambridge University Press, 2007) and Markov Chains and Stochastic Stability (S. P. Meyn and R. L. Tweedie, Cambridge University Press, 2nd ed., 2009)]
This lecture
A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural
Information Processing Systems (NIPS). Dec. 2017.
A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available
on ArXiv. Jul. 2017.
Selected References I
[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
[2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK,
2008.
[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
Selected References II
[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages
98–107, 1990.
[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
Selected References III
[13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
Selected References IV
[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.

Contenu connexe

Tendances

Tendances (20)

パターン認識と機械学習 §6.2 カーネル関数の構成
パターン認識と機械学習 §6.2 カーネル関数の構成パターン認識と機械学習 §6.2 カーネル関数の構成
パターン認識と機械学習 §6.2 カーネル関数の構成
 
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
[DL輪読会]Recent Advances in Autoencoder-Based Representation Learning
 
[DL輪読会]Deep Reinforcement Learning that Matters
[DL輪読会]Deep Reinforcement Learning that Matters[DL輪読会]Deep Reinforcement Learning that Matters
[DL輪読会]Deep Reinforcement Learning that Matters
 
Noisy Labels と戦う深層学習
Noisy Labels と戦う深層学習Noisy Labels と戦う深層学習
Noisy Labels と戦う深層学習
 
文献紹介:YOLO series:v1-v5, X, F, and YOWO
文献紹介:YOLO series:v1-v5, X, F, and YOWO文献紹介:YOLO series:v1-v5, X, F, and YOWO
文献紹介:YOLO series:v1-v5, X, F, and YOWO
 
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
【DL輪読会】Trajectory Prediction with Latent Belief Energy-Based Model
 
[DL輪読会]Hindsight Experience Replay
[DL輪読会]Hindsight Experience Replay[DL輪読会]Hindsight Experience Replay
[DL輪読会]Hindsight Experience Replay
 
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
[DL輪読会]“SimPLe”,“Improved Dynamics Model”,“PlaNet” 近年のVAEベース系列モデルの進展とそのモデルベース...
 
[DL輪読会] Adversarial Skill Chaining for Long-Horizon Robot Manipulation via T...
[DL輪読会] Adversarial Skill Chaining for Long-Horizon Robot Manipulation via  T...[DL輪読会] Adversarial Skill Chaining for Long-Horizon Robot Manipulation via  T...
[DL輪読会] Adversarial Skill Chaining for Long-Horizon Robot Manipulation via T...
 
[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System
[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System
[DL輪読会]BADGR: An Autonomous Self-Supervised Learning-Based Navigation System
 
[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking
[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking
[DL輪読会]Autonomous Reinforcement Learning: Formalism and Benchmarking
 
【DL輪読会】Transformers are Sample Efficient World Models
【DL輪読会】Transformers are Sample Efficient World Models【DL輪読会】Transformers are Sample Efficient World Models
【DL輪読会】Transformers are Sample Efficient World Models
 
MASTERING ATARI WITH DISCRETE WORLD MODELS (DreamerV2)
MASTERING ATARI WITH DISCRETE WORLD MODELS (DreamerV2)MASTERING ATARI WITH DISCRETE WORLD MODELS (DreamerV2)
MASTERING ATARI WITH DISCRETE WORLD MODELS (DreamerV2)
 
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
【DL輪読会】AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
 
機械学習による統計的実験計画(ベイズ最適化を中心に)
機械学習による統計的実験計画(ベイズ最適化を中心に)機械学習による統計的実験計画(ベイズ最適化を中心に)
機械学習による統計的実験計画(ベイズ最適化を中心に)
 
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
【DL輪読会】Data-Efficient Reinforcement Learning with Self-Predictive Representat...
 
[DL輪読会]Active Domain Randomization
[DL輪読会]Active Domain Randomization[DL輪読会]Active Domain Randomization
[DL輪読会]Active Domain Randomization
 
ブラックボックス最適化とその応用
ブラックボックス最適化とその応用ブラックボックス最適化とその応用
ブラックボックス最適化とその応用
 
[DLHacks 実装] DeepPose: Human Pose Estimation via Deep Neural Networks
[DLHacks 実装] DeepPose: Human Pose Estimation via Deep Neural Networks[DLHacks 実装] DeepPose: Human Pose Estimation via Deep Neural Networks
[DLHacks 実装] DeepPose: Human Pose Estimation via Deep Neural Networks
 
Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018Introduction to Chainer 11 may,2018
Introduction to Chainer 11 may,2018
 

Similaire à Introducing Zap Q-Learning

DissertationSlides169
DissertationSlides169DissertationSlides169
DissertationSlides169
Ryan White
 
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
Per Kristian Lehre
 
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
PK Lehre
 
GradStudentSeminarSept30
GradStudentSeminarSept30GradStudentSeminarSept30
GradStudentSeminarSept30
Ryan White
 

Similaire à Introducing Zap Q-Learning (20)

Zap Q-Learning - ISMP 2018
Zap Q-Learning - ISMP 2018Zap Q-Learning - ISMP 2018
Zap Q-Learning - ISMP 2018
 
DissertationSlides169
DissertationSlides169DissertationSlides169
DissertationSlides169
 
DeepLearn2022 2. Variance Matters
DeepLearn2022  2. Variance MattersDeepLearn2022  2. Variance Matters
DeepLearn2022 2. Variance Matters
 
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
2018 MUMS Fall Course - Statistical Representation of Model Input (EDITED) - ...
 
stochastic processes assignment help
stochastic processes assignment helpstochastic processes assignment help
stochastic processes assignment help
 
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast AlgorithmsReinforcement Learning: Hidden Theory and New Super-Fast Algorithms
Reinforcement Learning: Hidden Theory and New Super-Fast Algorithms
 
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution AlgorithmsSimplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution Algorithms
 
Simplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution AlgorithmsSimplified Runtime Analysis of Estimation of Distribution Algorithms
Simplified Runtime Analysis of Estimation of Distribution Algorithms
 
Convergence of ABC methods
Convergence of ABC methodsConvergence of ABC methods
Convergence of ABC methods
 
Scattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysisScattering theory analogues of several classical estimates in Fourier analysis
Scattering theory analogues of several classical estimates in Fourier analysis
 
A Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cubeA Szemeredi-type theorem for subsets of the unit cube
A Szemeredi-type theorem for subsets of the unit cube
 
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
 
Runtime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary AlgorithmsRuntime Analysis of Population-based Evolutionary Algorithms
Runtime Analysis of Population-based Evolutionary Algorithms
 
Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3Random Matrix Theory and Machine Learning - Part 3
Random Matrix Theory and Machine Learning - Part 3
 
Delayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithmsDelayed acceptance for Metropolis-Hastings algorithms
Delayed acceptance for Metropolis-Hastings algorithms
 
GradStudentSeminarSept30
GradStudentSeminarSept30GradStudentSeminarSept30
GradStudentSeminarSept30
 
Randomized algorithms ver 1.0
Randomized algorithms ver 1.0Randomized algorithms ver 1.0
Randomized algorithms ver 1.0
 
Nested sampling
Nested samplingNested sampling
Nested sampling
 
QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...
QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...
QMC: Operator Splitting Workshop, Using Sequences of Iterates in Inertial Met...
 
Gracheva Inessa - Fast Global Image Denoising Algorithm on the Basis of Nonst...
Gracheva Inessa - Fast Global Image Denoising Algorithm on the Basis of Nonst...Gracheva Inessa - Fast Global Image Denoising Algorithm on the Basis of Nonst...
Gracheva Inessa - Fast Global Image Denoising Algorithm on the Basis of Nonst...
 

Plus de Sean Meyn

Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Sean Meyn
 
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
Sean Meyn
 
Distributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power GridDistributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power Grid
Sean Meyn
 
2012 Tutorial: Markets for Differentiated Electric Power Products
2012 Tutorial:  Markets for Differentiated Electric Power Products2012 Tutorial:  Markets for Differentiated Electric Power Products
2012 Tutorial: Markets for Differentiated Electric Power Products
Sean Meyn
 
The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010
Sean Meyn
 

Plus de Sean Meyn (20)

Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
Quasi-Stochastic Approximation: Algorithm Design Principles with Applications...
 
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
DeepLearn2022 1. Goals & AlgorithmDesign.pdfDeepLearn2022 1. Goals & AlgorithmDesign.pdf
DeepLearn2022 1. Goals & AlgorithmDesign.pdf
 
DeepLearn2022 3. TD and Q Learning
DeepLearn2022 3. TD and Q LearningDeepLearn2022 3. TD and Q Learning
DeepLearn2022 3. TD and Q Learning
 
Smart Grid Tutorial - January 2019
Smart Grid Tutorial - January 2019Smart Grid Tutorial - January 2019
Smart Grid Tutorial - January 2019
 
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
State Space Collapse in Resource Allocation for Demand Dispatch - May 2019
 
Irrational Agents and the Power Grid
Irrational Agents and the Power GridIrrational Agents and the Power Grid
Irrational Agents and the Power Grid
 
State estimation and Mean-Field Control with application to demand dispatch
State estimation and Mean-Field Control with application to demand dispatchState estimation and Mean-Field Control with application to demand dispatch
State estimation and Mean-Field Control with application to demand dispatch
 
Demand-Side Flexibility for Reliable Ancillary Services
Demand-Side Flexibility for Reliable Ancillary ServicesDemand-Side Flexibility for Reliable Ancillary Services
Demand-Side Flexibility for Reliable Ancillary Services
 
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
Spectral Decomposition of Demand-Side Flexibility for Reliable Ancillary Serv...
 
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
Demand-Side Flexibility for Reliable Ancillary Services in a Smart Grid: Elim...
 
Why Do We Ignore Risk in Power Economics?
Why Do We Ignore Risk in Power Economics?Why Do We Ignore Risk in Power Economics?
Why Do We Ignore Risk in Power Economics?
 
Distributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power GridDistributed Randomized Control for Ancillary Service to the Power Grid
Distributed Randomized Control for Ancillary Service to the Power Grid
 
Ancillary service to the grid from deferrable loads: the case for intelligent...
Ancillary service to the grid from deferrable loads: the case for intelligent...Ancillary service to the grid from deferrable loads: the case for intelligent...
Ancillary service to the grid from deferrable loads: the case for intelligent...
 
2012 Tutorial: Markets for Differentiated Electric Power Products
2012 Tutorial:  Markets for Differentiated Electric Power Products2012 Tutorial:  Markets for Differentiated Electric Power Products
2012 Tutorial: Markets for Differentiated Electric Power Products
 
Control Techniques for Complex Systems
Control Techniques for Complex SystemsControl Techniques for Complex Systems
Control Techniques for Complex Systems
 
Tutorial for Energy Systems Week - Cambridge 2010
Tutorial for Energy Systems Week - Cambridge 2010Tutorial for Energy Systems Week - Cambridge 2010
Tutorial for Energy Systems Week - Cambridge 2010
 
Panel Lecture for Energy Systems Week
Panel Lecture for Energy Systems WeekPanel Lecture for Energy Systems Week
Panel Lecture for Energy Systems Week
 
The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010The Value of Volatile Resources... Caltech, May 6 2010
The Value of Volatile Resources... Caltech, May 6 2010
 
Approximate dynamic programming using fluid and diffusion approximations with...
Approximate dynamic programming using fluid and diffusion approximations with...Approximate dynamic programming using fluid and diffusion approximations with...
Approximate dynamic programming using fluid and diffusion approximations with...
 
Anomaly Detection Using Projective Markov Models
Anomaly Detection Using Projective Markov ModelsAnomaly Detection Using Projective Markov Models
Anomaly Detection Using Projective Markov Models
 

Dernier

怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
Health
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
wsppdmt
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
HyderabadDolls
 

Dernier (20)

Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book nowVadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
+97470301568>>weed for sale in qatar ,weed for sale in dubai,weed for sale in...
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Ranking and Scoring Exercises for Research
Ranking and Scoring Exercises for ResearchRanking and Scoring Exercises for Research
Ranking and Scoring Exercises for Research
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
如何办理英国诺森比亚大学毕业证(NU毕业证书)成绩单原件一模一样
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
Sonagachi * best call girls in Kolkata | ₹,9500 Pay Cash 8005736733 Free Home...
 

Introducing Zap Q-Learning

  • 1. Zap Q-Learning Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms Center for Systems and Control (CSC@USC) and Ming Hsieh Institute for Electrical Engineering February 21, 2018 Adithya M. Devraj Sean P. Meyn Department of Electrical and Computer Engineering — University of Florida
  • 2. Zap Q-Learning Outline 1 Stochastic Approximation 2 Fastest Stochastic Approximation 3 Reinforcement Learning 4 Zap Q-Learning 5 Conclusions & Future Work 6 References
  • 4. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 1 / 31
  • 5. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 / 31
  • 6. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 1 / 31
  • 7. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 1 / 31
  • 8. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017] 1 / 31
  • 9. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) 2 / 31
  • 10. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) Analysis: Stability of the ODE ⊕ (See Borkar’s monograph) =⇒ lim n→∞ θ(n) = θ∗ 2 / 31
  • 11. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable: η = c(x) fX(x) dx 3 / 31
  • 12. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) 3 / 31
  • 13. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) =⇒ (n + 1)θ(n + 1) = n+1 i=1 c(X(i)) = nθ(n) + c(X(n + 1)) =⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)] SA Recursion: θ(n + 1) = θ(n) + αnf(θ(n), X(n + 1)) αn = ∞, α2 n < ∞ 3 / 31
  • 15. Fastest Stochastic Approximation Algorithm Performance Performance Criteria Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗: 1 Finite-n bound: P{ ˜θ(n) ≥ ε} ≤ exp(−I(ε, n)) , I(ε, n) = O(nε2 ) 2 Asymptotic covariance: Σ = lim n→∞ nE ˜θ(n)˜θ(n)T , √ n˜θ(n) ≈ N(0, Σ) 4 / 31
  • 16. Fastest Stochastic Approximation Algorithm Performance Asymptotic Covariance Σ = lim n→∞ Σn = lim n→∞ nE[θ̃(n)θ̃(n)ᵀ], √n θ̃(n) ≈ N(0, Σ) SA recursion for covariance: Σn+1 ≈ Σn + (1/n)[(A + ½I)Σn + Σn(A + ½I)ᵀ + Σ∆], where A = d/dθ f̄(θ∗) Conclusions: 1 If Re λ(A) ≥ −½ for some eigenvalue, then Σ is (typically) infinite 2 If Re λ(A) < −½ for all eigenvalues, then Σ = limn→∞ Σn is the unique solution to the Lyapunov equation: 0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆ 5 / 31
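A sketch of how the limiting covariance can be computed in practice: solve the Lyapunov equation above with a Kronecker-product linear solve. The matrices A and Σ∆ below are illustrative, not from the talk.

```python
# Sketch: compute the asymptotic covariance by solving the Lyapunov equation
# 0 = (A + I/2) S + S (A + I/2)^T + Sigma_Delta, via a Kronecker-product solve.
# A and Sigma_Delta are illustrative (all eigenvalues of A have real part < -1/2).
import numpy as np

A = np.array([[-2.0, 0.3],
              [0.0, -1.5]])
Sigma_Delta = np.eye(2)

d = A.shape[0]
M = A + 0.5 * np.eye(d)
# vec(M S + S M^T) = (kron(I, M) + kron(M, I)) vec(S)
K = np.kron(np.eye(d), M) + np.kron(M, np.eye(d))
Sigma = np.linalg.solve(K, -Sigma_Delta.reshape(-1)).reshape(d, d)

assert np.allclose(M @ Sigma + Sigma @ M.T + Sigma_Delta, 0.0)  # Lyapunov equation holds
print(Sigma)
```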
  • 17. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n)) 6 / 31
  • 18. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n)) Assume it converges, and linearize: θ̃(n + 1) ≈ θ̃(n) + (1/(n + 1)) G[Aθ̃(n) + ∆(n + 1)], A = d/dθ f̄(θ∗). 6 / 31
  • 19. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n)) Assume it converges, and linearize: θ̃(n + 1) ≈ θ̃(n) + (1/(n + 1)) G[Aθ̃(n) + ∆(n + 1)], A = d/dθ f̄(θ∗). If G = G∗ := −A⁻¹ then: Resembles the Monte-Carlo estimate; Resembles Newton-Raphson; It is optimal: Σ∗ = G∗Σ∆G∗ᵀ ≤ ΣG for any other G 6 / 31
  • 20. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n)) Assume it converges, and linearize: θ̃(n + 1) ≈ θ̃(n) + (1/(n + 1)) G[Aθ̃(n) + ∆(n + 1)], A = d/dθ f̄(θ∗). If G = G∗ := −A⁻¹ then: Resembles the Monte-Carlo estimate; Resembles Newton-Raphson; It is optimal: Σ∗ = G∗Σ∆G∗ᵀ ≤ ΣG for any other G Polyak-Ruppert averaging is also optimal, but the first two properties are lost. 6 / 31
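A numerical illustration of the optimality claim, under the standard working assumption (consistent with slide 16, with A replaced by GA and noise covariance GΣ∆Gᵀ) that for matrix gain G the asymptotic covariance ΣG solves 0 = (GA + ½I)Σ + Σ(GA + ½I)ᵀ + GΣ∆Gᵀ. The matrices below are again illustrative.

```python
# Sketch: compare asymptotic covariances for different matrix gains G in
# theta(n+1) = theta(n) + (1/(n+1)) G f(theta(n), X(n)).  Working assumption:
# Sigma_G solves 0 = (GA + I/2) S + S (GA + I/2)^T + G Sigma_Delta G^T, and
# G* = -A^{-1} gives Sigma* = G* Sigma_Delta G*^T.  A, Sigma_Delta illustrative.
import numpy as np

def lyap(M, Q):
    # solve M S + S M^T + Q = 0 via a Kronecker-product linear solve
    d = M.shape[0]
    K = np.kron(np.eye(d), M) + np.kron(M, np.eye(d))
    return np.linalg.solve(K, -Q.reshape(-1)).reshape(d, d)

A = np.array([[-2.0, 0.3], [0.0, -1.5]])
Sigma_Delta = np.eye(2)

def Sigma_for(G):
    return lyap(G @ A + 0.5 * np.eye(2), G @ Sigma_Delta @ G.T)

G_star = -np.linalg.inv(A)
Sigma_star = Sigma_for(G_star)
assert np.allclose(Sigma_star, G_star @ Sigma_Delta @ G_star.T)

for g in (1.0, 2.0, 5.0):                     # scalar gains g*I for comparison
    print(g, np.trace(Sigma_for(g * np.eye(2))), ">=", np.trace(Sigma_star))
```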
  • 21. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + (g/(n + 1))[−θ(n) + X(n + 1)] 7 / 31
  • 22. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + (g/(n + 1))[−θ(n) + X(n + 1)] ∆(n) = X(n) − E[X(n)] 7 / 31
  • 23. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]. Normalization for analysis: θ̃(n + 1) = θ̃(n) + (g/(n + 1))[−θ̃(n) + ∆(n + 1)] 7 / 31
  • 24. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]. Normalization for analysis: θ̃(n + 1) = θ̃(n) + (g/(n + 1))[−θ̃(n) + ∆(n + 1)] Example: X(n) = W²(n), W ∼ N(0, 1) 7 / 31
  • 25. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]. Normalization for analysis: θ̃(n + 1) = θ̃(n) + (g/(n + 1))[−θ̃(n) + ∆(n + 1)] Example: X(n) = W²(n), W ∼ N(0, 1) [Plot: asymptotic variance as a function of g, Σ = σ²∆ g²/(2g − 1)] 7 / 31
  • 26. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]. Normalization for analysis: θ̃(n + 1) = θ̃(n) + (g/(n + 1))[−θ̃(n) + ∆(n + 1)] Example: X(n) = W²(n), W ∼ N(0, 1) [Plot: SA estimates of E[W²], W ∼ N(0, 1), for gains g between 0.1 and 20] 7 / 31
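A simulation sketch of this scalar-gain example: for X = W², σ²∆ = Var(W²) = 2, so the formula above predicts Σ(g) = 2g²/(2g − 1) for g > 1/2. The horizon and trial counts below are our own choices; the empirical value of n·E[θ̃(n)²] should be close to the prediction.

```python
# Sketch: empirical check of Sigma(g) = sigma_Delta^2 g^2 / (2g - 1) for
# theta(n+1) = theta(n) + (g/(n+1)) (-theta(n) + X(n+1)), X = W^2, W ~ N(0,1),
# so theta* = 1 and sigma_Delta^2 = Var(W^2) = 2.  Horizon/trial counts are ours.
import numpy as np

rng = np.random.default_rng(2)
N, trials = 10_000, 2_000

for g in (1.0, 3.0, 10.0):
    theta = np.zeros(trials)                     # one recursion per trial, run in parallel
    for n in range(N):
        X = rng.standard_normal(trials) ** 2
        theta += (g / (n + 1)) * (-theta + X)
    empirical = N * np.mean((theta - 1.0) ** 2)  # estimate of n E[(theta(n) - theta*)^2]
    predicted = 2.0 * g ** 2 / (2.0 * g - 1.0)
    print(f"g = {g}: empirical {empirical:.2f}, predicted {predicted:.2f}")
```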
  • 27. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) Requires Ân ≈ A(θn) := d/dθ f̄(θn) 8 / 31
  • 28. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), X(n)), Ân = Ân−1 + γn(An − Ân−1), An = d/dθ f(θ(n), X(n)) 8 / 31
  • 29. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), X(n)), Ân = Ân−1 + γn(An − Ân−1), An = d/dθ f(θ(n), X(n)) Ân ≈ A(θn) requires high gain: γn/αn → ∞ as n → ∞ 8 / 31
  • 30. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), X(n)), Ân = Ân−1 + γn(An − Ân−1), An = d/dθ f(θ(n), X(n)) Ân ≈ A(θn) requires high gain: γn/αn → ∞ as n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)^ρ, ρ ∈ (0.5, 1) 8 / 31
  • 31. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), X(n)), Ân = Ân−1 + γn(An − Ân−1), An = d/dθ f(θ(n), X(n)) Ân ≈ A(θn) requires high gain: γn/αn → ∞ as n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)^ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR: d/dt xt = (−A(xt))⁻¹ f̄(xt), A(x) = d/dx f̄(x) 8 / 31
  • 32. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), X(n)), Ân = Ân−1 + γn(An − Ân−1), An = d/dθ f(θ(n), X(n)) Ân ≈ A(θn) requires high gain: γn/αn → ∞ as n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)^ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR: d/dt xt = (−A(xt))⁻¹ f̄(xt), A(x) = d/dx f̄(x) Not necessarily stable (just like deterministic Newton-Raphson); general conditions for convergence are open 8 / 31
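A sketch of the two-time-scale Zap-SNR recursion on a toy linear root-finding problem. The problem data, noise level, and use of a pseudo-inverse are our own choices; the step sizes αn = 1/n and γn = n^−0.85 follow the slides.

```python
# Sketch of the Zap-SNR recursion on a toy linear root-finding problem:
# f(theta, W) = A_W theta + b_W with E[A_W] = A0 Hurwitz, E[b_W] = b0, so that
# theta* = -A0^{-1} b0.  Problem data, noise level, and the pseudo-inverse are
# our choices; alpha_n = 1/n and gamma_n = n^(-0.85) follow the slides.
import numpy as np

rng = np.random.default_rng(3)
A0 = np.array([[-2.0, 0.3], [0.1, -1.5]])
b0 = np.array([1.0, -1.0])
theta_star = -np.linalg.solve(A0, b0)

theta = np.zeros(2)
A_hat = -np.eye(2)                                    # initial Jacobian estimate
for n in range(1, 100_001):
    alpha, gamma = 1.0 / n, n ** (-0.85)              # gamma_n / alpha_n -> infinity
    A_sample = A0 + 0.2 * rng.standard_normal((2, 2)) # noisy Jacobian d f / d theta
    b_sample = b0 + 0.2 * rng.standard_normal(2)
    f_n = A_sample @ theta + b_sample                 # noisy observation of f-bar
    A_hat += gamma * (A_sample - A_hat)               # fast time scale: matrix estimate
    theta += alpha * (np.linalg.pinv(-A_hat) @ f_n)   # slow time scale: Zap update

print(theta, theta_star)                              # the two should be close
```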
  • 35. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? 9 / 31
  • 36. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Φ(n) = (state, action) 9 / 31
  • 37. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? 9 / 31
  • 38. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning 9 / 31
  • 39. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning These algorithms are thus special cases of stochastic approximation (as we all know) 9 / 31
  • 40. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor 10 / 31
  • 41. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗(x) = min_U Σ_{n=0}^{∞} β^n E[c(X(n), U(n)) | X(0) = x] 10 / 31
  • 42. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗(x) = min_U Σ_{n=0}^{∞} β^n E[c(X(n), U(n)) | X(0) = x] Bellman equation: h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]} 10 / 31
  • 43. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 11 / 31
  • 44. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] 11 / 31
  • 45. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) 11 / 31
  • 46. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗(x) = min_u {c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗(x, u) := c(x, u) + βE[h∗(X(n + 1)) | X(n) = x, U(n) = u] h∗(x) = min_u Q∗(x, u) Another Bellman equation: Q∗(x, u) = c(x, u) + βE[min_u′ Q∗(X(n + 1), u′) | X(n) = x, U(n) = u] 11 / 31
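For concreteness, when the model is known the Q-function Bellman equation can be solved by fixed-point (Q-value) iteration. A sketch on a small random MDP; the transition law, cost, and discount factor are illustrative choices, not the example used later in the talk.

```python
# Sketch: solve the Q-function Bellman equation by fixed-point (Q-value) iteration
# on a small random MDP.  P[u, x, y] = P_u(x, y); the data below are illustrative.
import numpy as np

rng = np.random.default_rng(4)
nX, nU, beta = 5, 2, 0.9
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nX, nU))

Q = np.zeros((nX, nU))
for _ in range(1000):
    h = Q.min(axis=1)                               # h(x) = min_u Q(x, u)
    Q_new = c + beta * np.einsum('uxy,y->xu', P, h) # Bellman operator
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        break
    Q = Q_new

print(Q.min(axis=1))     # value function h*
print(Q.argmin(axis=1))  # optimal (cost-minimizing) policy
```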
  • 47. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E[c(X(n), U(n)) + β min_u Q∗(X(n + 1), u) − Q∗(X(n), U(n)) | Fn] = 0 12 / 31
  • 48. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E[c(X(n), U(n)) + β min_u Q∗(X(n + 1), u) − Q∗(X(n), U(n)) | Fn] = 0 That is, 0 = E[F(Q∗, Φ(n + 1)) | Φ(0) . . . Φ(n)], with Φ(n + 1) = (X(n + 1), X(n), U(n)). 12 / 31
  • 49. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E[c(X(n), U(n)) + β min_u Q∗(X(n + 1), u) − Q∗(X(n), U(n)) | Fn] = 0 Q-Learning Find θ∗ that solves E[{c(X(n), U(n)) + β min_u Qθ∗(X(n + 1), u) − Qθ∗(X(n), U(n))} ζn] = 0 The family {Qθ} and eligibility vectors {ζn} are part of algorithm design. 12 / 31
  • 50. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E[{c(X(n), U(n)) + β min_u Qθ∗(X(n + 1), u) − Qθ∗(X(n), U(n))} ζn] = 0 13 / 31
  • 51. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E[{c(X(n), U(n)) + β min_u Qθ∗(X(n + 1), u) − Qθ∗(X(n), U(n))} ζn] = 0 Watkins’ algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θᵀψ(x, u) ζn ≡ ψ(X(n), U(n)), with one indicator basis function per state-action pair: ψi(x, u) = 1{x = xi, u = ui} (complete basis) 13 / 31
  • 52. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E[{c(X(n), U(n)) + β min_u Qθ∗(X(n + 1), u) − Qθ∗(X(n), U(n))} ζn] = 0 Watkins’ algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θᵀψ(x, u) ζn ≡ ψ(X(n), U(n)), with one indicator basis function per state-action pair: ψi(x, u) = 1{x = xi, u = ui} (complete basis) Asymptotic covariance is typically infinite 13 / 31
  • 53. Reinforcement Learning Q-Learning Watkins’ Q-learning Big Question: Can we Zap Q-Learning? Find θ∗ that solves E[{c(X(n), U(n)) + β min_u Qθ∗(X(n + 1), u) − Qθ∗(X(n), U(n))} ζn] = 0 Watkins’ algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θᵀψ(x, u) ζn ≡ ψ(X(n), U(n)), with one indicator basis function per state-action pair: ψi(x, u) = 1{x = xi, u = ui} (complete basis) Asymptotic covariance is typically infinite 13 / 31
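A sketch of tabular Watkins Q-learning with the complete basis and scalar step size αn = g/n, on a small random MDP with a uniformly random exploratory input (all of these modelling choices are ours). With g = 1 convergence is very slow; this is the infinite-asymptotic-covariance phenomenon discussed above.

```python
# Sketch: tabular Watkins Q-learning with the complete basis (one parameter per
# state-action pair) and scalar step size alpha_n = g/n.  The random MDP and the
# uniformly random exploratory input are illustrative choices.
import numpy as np

rng = np.random.default_rng(5)
nX, nU, beta, g = 5, 2, 0.9, 1.0
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nX, nU))

Q = np.zeros((nX, nU))                   # theta, one entry per basis function psi(x, u)
x = 0
for n in range(1, 200_001):
    u = rng.integers(nU)                                 # exploration
    x_next = rng.choice(nX, p=P[u, x])
    td = c[x, u] + beta * Q[x_next].min() - Q[x, u]      # temporal-difference term
    Q[x, u] += (g / n) * td                              # alpha_n = g/n
    x = x_next

print(Q.argmin(axis=1))                  # learned policy (compare with Q-value iteration)
```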
  • 54. Zap Q-Learning [Plot: Bellman error vs. iteration n (up to 10⁶) for Watkins’ Q-learning, Speedy Q-learning, Polyak-Ruppert averaging, and Zap]
  • 55. Zap Q-Learning Asymptotic Covariance of Watkins’ Q-Learning Improvements are needed! [Six-node graph example; histogram of the parameter estimate θn(15) vs. θ∗(15) after n = 10⁶ iterations (example from Devraj & M 2017)] 14 / 31
  • 56. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = f̄(θ) = E[f(θ, W(n))], f(θ, W(n)) := ζn{c(X(n), U(n)) + β min_u Qθ(X(n + 1), u) − Qθ(X(n), U(n))} 15 / 31
  • 57. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = f̄(θ) = E[f(θ, W(n))], f(θ, W(n)) := ζn{c(X(n), U(n)) + β min_u Qθ(X(n + 1), u) − Qθ(X(n), U(n))} A(θ) = d/dθ f̄(θ); 15 / 31
  • 58. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = f̄(θ) = E[f(θ, W(n))], f(θ, W(n)) := ζn{c(X(n), U(n)) + β min_u Qθ(X(n + 1), u) − Qθ(X(n), U(n))} A(θ) = d/dθ f̄(θ); At points of differentiability: A(θ) = E[ζn{βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n))}ᵀ], φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u) 15 / 31
  • 59. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = f̄(θ) = E[f(θ, W(n))], f(θ, W(n)) := ζn{c(X(n), U(n)) + β min_u Qθ(X(n + 1), u) − Qθ(X(n), U(n))} A(θ) = d/dθ f̄(θ); At points of differentiability: A(θ) = E[ζn{βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n))}ᵀ], φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u) Algorithm: θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), Φ(n)); Ân = Ân−1 + γn(An − Ân−1); 15 / 31
  • 60. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = f̄(θ) = E[f(θ, W(n))], f(θ, W(n)) := ζn{c(X(n), U(n)) + β min_u Qθ(X(n + 1), u) − Qθ(X(n), U(n))} A(θ) = d/dθ f̄(θ); At points of differentiability: A(θ) = E[ζn{βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n))}ᵀ], φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u) Algorithm: θ(n + 1) = θ(n) + αn(−Ân)⁻¹ f(θ(n), Φ(n)); Ân = Ân−1 + γn(An − Ân−1); An := d/dθ f(θ(n), Φ(n)) = ζn{βψ(X(n + 1), φθ(n)(X(n + 1))) − ψ(X(n), U(n))}ᵀ 15 / 31
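A sketch of the Zap Q-learning recursion in the same tabular setting as the Watkins sketch above (MDP data and exploration are illustrative; the gains αn = 1/n and γn = n^−0.85 follow the slides):

```python
# Sketch of the Zap Q-learning recursion in the tabular/complete-basis case.
# zeta_n = psi(X(n), U(n)) is one-hot; A_hat tracks the running mean of the sample
# matrices A_n = zeta_n (beta psi(X(n+1), phi) - psi(X(n), U(n)))^T.  The MDP data
# and exploration are illustrative; alpha_n = 1/n, gamma_n = n^(-0.85) as in the talk.
import numpy as np

rng = np.random.default_rng(6)
nX, nU, beta = 5, 2, 0.9
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nX, nU))

d = nX * nU
idx = lambda x, u: x * nU + u                 # basis index of the pair (x, u)
theta = np.zeros(d)                           # Q(x, u) = theta[idx(x, u)]
A_hat = -np.eye(d)
x = 0
for n in range(1, 100_001):
    u = rng.integers(nU)
    x_next = rng.choice(nX, p=P[u, x])
    Q = theta.reshape(nX, nU)
    phi = Q[x_next].argmin()                                   # greedy action at X(n+1)
    zeta = np.zeros(d); zeta[idx(x, u)] = 1.0
    psi_next = np.zeros(d); psi_next[idx(x_next, phi)] = 1.0
    f_n = zeta * (c[x, u] + beta * Q[x_next, phi] - Q[x, u])   # f(theta(n), Phi(n))
    A_n = np.outer(zeta, beta * psi_next - zeta)               # sample matrix
    A_hat += n ** (-0.85) * (A_n - A_hat)                      # gamma_n = n^(-0.85)
    theta += (1.0 / n) * (np.linalg.pinv(-A_hat) @ f_n)        # alpha_n = 1/n
    x = x_next

print(theta.reshape(nX, nU).argmin(axis=1))   # learned policy
```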
  • 61. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE Analysis: change of variables q = Q∗(ς) Functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σ_{x′} Pu(x, x′) min_{u′} q(x′, u′) 16 / 31
  • 62. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE Analysis: change of variables q = Q∗(ς) Functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σ_{x′} Pu(x, x′) min_{u′} q(x′, u′) ODE for Zap-Q: qt = Q∗(ςt), d/dt ςt = −ςt + c ⇒ convergence, optimal covariance, ... 16 / 31
  • 63. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path [six-node graph] 17 / 31
  • 64. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path [six-node graph] Convergence with Zap gain γn = n^−0.85 Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n [Plot: Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap] Convergence of Zap-Q Learning Discount factor: β = 0.99 17 / 31
  • 65. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path [six-node graph] Convergence with Zap gain γn = n^−0.85 Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n [Plot: Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, Zap, and Zap with γn = αn] Convergence of Zap-Q Learning Discount factor: β = 0.99 17 / 31
  • 66. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path [six-node graph] Convergence with Zap gain γn = n^−0.85 Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n Optimal scalar gain is approximately αn = 1500/n [Plot: Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, Watkins with g = 1500, Zap, and Zap with γn = αn] Convergence of Zap-Q Learning Discount factor: β = 0.99 17 / 31
  • 67. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe [six-node graph] Convergence with Zap gain γn = n^−0.85 [Histograms: empirical distribution of Wn = √n θ̃n (1000 trials) vs. the theoretical pdf, entries #10 and #18, at n = 10⁴ and n = 10⁶] CLT gives good prediction of finite-n performance 18 / 31
  • 68. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe [six-node graph] Convergence with Zap gain γn = n^−0.85 [Histograms: empirical distribution of Wn = √n θ̃n (1000 trials) vs. the theoretical pdf, entries #10 and #18, at n = 10⁴ and n = 10⁶] CLT gives good prediction of finite-n performance Discount factor: β = 0.99 19 / 31
  • 69. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe Local convergence: θ(0) initialized in a neighborhood of θ∗ [Histograms of the Bellman error at n = 10⁶: Watkins with g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, Zap-Q with γn = αn^0.85 and with γn ≡ αn] 20 / 31
  • 70. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe Local convergence: θ(0) initialized in a neighborhood of θ∗ [Histograms of the Bellman error at n = 10⁶: Watkins with g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, Zap-Q with γn = αn^0.85 and with γn ≡ αn] [Plots: 2σ confidence intervals for the Q-learning algorithms, 10³ ≤ n ≤ 10⁶] 20 / 31
  • 71. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Plot: Re λi(A), i = 1, . . . , 10, on a logarithmic scale] Re λ > −1/2 for every eigenvalue λ of A, so the asymptotic covariance is infinite 21 / 31
  • 72. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Plot: Re λi(A), i = 1, . . . , 10, on a logarithmic scale] Re λ > −1/2 for every eigenvalue λ of A, so the asymptotic covariance is infinite The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details) 21 / 31
  • 73. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100 Parameterized Q-function: Qθ with θ ∈ R^10 [Plot: eigenvalues of A and of GA for the finance example] The favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2 21 / 31
  • 74. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10 [Histograms: empirical distribution of Wn = √n θ̃n (1000 trials) vs. the theoretical pdf at n = 2 × 10⁶, entries #1 and #7, for Zap-Q and G-Q] 22 / 31
  • 75. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10 Histograms of the average reward obtained using the different algorithms: [G-Q(0) with g = 100 and g = 200; Zap-Q with ρ = 0.8, 0.85, 1.0; at n = 2 × 10⁴, 2 × 10⁵, 2 × 10⁶] 23 / 31
  • 76. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance 24 / 31
  • 77. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance 24 / 31
  • 78. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters 24 / 31
  • 79. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters Finite-time analysis 24 / 31
  • 80. Conclusions & Future Work Thank you! 25 / 31
  • 81. References [Book covers: Control Techniques for Complex Networks, Sean Meyn, Cambridge University Press (http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419), and Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie, second edition] 26 / 31
  • 82. References This lecture A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural Information Processing Systems (NIPS). Dec. 2017. A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available on ArXiv. Jul. 2017. 27 / 31
  • 83. References Selected References I [1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017. [2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008. [4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning. 28 / 31
  • 84. References Selected References II [7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990. [10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. 29 / 31
  • 85. References Selected References III [13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997. [19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. 30 / 31
  • 86. References Selected References IV [21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE Conference on Decision and Control, pages 3598–3605, Dec. 2009. 31 / 31