Zap Q-Learning
Fastest Convergent Q-Learning
Advances in Reinforcement Learning Algorithms
Sean Meyn
Department of Electrical and Computer Engineering — University of Florida
Based on joint research with Adithya M. Devraj and Ana Bušić
Thanks to the National Science Foundation
Zap Q-Learning
Essential References
Simons tutorial, March 2018
Part I (Basics, with focus on variance of algorithms)
https://www.youtube.com/watch?v=dhEF5pfYmvc
Part II (Zap Q-learning)
https://www.youtube.com/watch?v=Y3w8f1xIb6s
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
(full tutorial on stochastic approximation)
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning — a user's guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
Zap Q-Learning
Outline
1 Stochastic Approximation
2 Remedies for S L O W Convergence
3 Reinforcement Learning with Momentum
4 Examples
5 Conclusions & Future Work
6 References
Stochastic Approximation
Problem, Algorithm, and Issues

A simple goal: Find the solution θ∗ ∈ Rd to

    f̄(θ∗) := E[f(θ, W)]|θ=θ∗ = 0

A linear example illustrates the challenges:

    E[f(θ, W)]|θ=θ∗ = Aθ∗ − b := E[A(W)]θ∗ − E[b(W)] = 0

Algorithm: Observe An+1 := A(Wn+1), bn+1 := b(Wn+1), and update

    θn+1 = θn + αn+1(An+1θn − bn+1)

CLT Variance:

    √n (θn − θ∗) → N(0, Σθ)

Typically, Σθ = ∞ [Devraj & M, 2017]

[Histogram of θn(15) at time horizon n = one million: θ∗(15) ≈ 500, yet the estimates range over roughly −5000 to 10000.]
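A minimal sketch of this linear recursion, with αn = 1/n (all details here are illustrative assumptions, not from the slides: Gaussian noise, d = 4, and E[A(W)] = −I, so every eigenvalue is −1 < −1/2 and this particular instance happens to have finite CLT covariance):

```python
# Sketch: linear stochastic approximation
#   theta_{n+1} = theta_n + alpha_{n+1} (A_{n+1} theta_n - b_{n+1}),  alpha_n = 1/n
import numpy as np

rng = np.random.default_rng(0)
d = 4
A_mean = -np.eye(d)                            # E[A(W)]: Hurwitz
b_mean = rng.normal(size=d)                    # E[b(W)]
theta_star = np.linalg.solve(A_mean, b_mean)   # solves A theta* - b = 0

theta = np.zeros(d)
for n in range(1, 10**5 + 1):
    A_n = A_mean + 0.1 * rng.normal(size=(d, d))   # observation A_{n+1}
    b_n = b_mean + 0.1 * rng.normal(size=d)        # observation b_{n+1}
    theta += (1.0 / n) * (A_n @ theta - b_n)       # SA update
print(np.linalg.norm(theta - theta_star))          # small: theta_n -> theta*
```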
θ(k)
k
Remedies for Slow Convergence
Stochastic Newton-Raphson: Fixing (Optimizing) the Variance

Goal: Find θ∗ such that Aθ∗ − b = 0, where fn+1(θn) := An+1θn − bn+1

Stochastic Approximation (TD-learning!):

    θn+1 = θn + αn+1fn+1(θn),   taking αn = 1/n

Optimal Asymptotic Variance:

    θn+1 = θn − αn+1A⁻¹fn+1(θn)

Don't know A – estimate it (Monte-Carlo): Ân+1 ≈ A

Stochastic Newton-Raphson (Least Squares TD-learning!):

    θn+1 = θn − αn+1Ân+1⁻¹fn+1(θn)
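A sketch of Stochastic Newton-Raphson on the same toy linear model as in the previous sketch (same illustrative assumptions; here Ân+1 is the running Monte-Carlo average of the observed matrices):

```python
# Sketch: stochastic Newton-Raphson with a Monte-Carlo estimate of A
import numpy as np

rng = np.random.default_rng(0)
d = 4
A_mean = -np.eye(d)
b_mean = rng.normal(size=d)
theta_star = np.linalg.solve(A_mean, b_mean)

theta = np.zeros(d)
A_hat = np.zeros((d, d))
for n in range(1, 10**5 + 1):
    A_n = A_mean + 0.1 * rng.normal(size=(d, d))
    b_n = b_mean + 0.1 * rng.normal(size=d)
    A_hat += (A_n - A_hat) / n                 # running average: A_hat -> A
    f_n = A_n @ theta - b_n                    # f_{n+1}(theta_n)
    if n >= 10:                                # wait until A_hat is well conditioned
        theta -= (1.0 / n) * np.linalg.solve(A_hat, f_n)
print(np.linalg.norm(theta - theta_star))
```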
Fixing (Optimizing) the Variance: the nonlinear case

Goal: Find θ∗ such that f̄(θ∗) = 0; fn+1 is no longer linear

Stochastic Approximation (Q-learning!):

    θn+1 = θn + αn+1fn+1(θn)

Optimal Asymptotic Variance:

    θn+1 = θn − αn+1A(θ∗)⁻¹fn+1(θn)

Don't know A(θ∗) – estimate it (2-time-scale): Ân+1 ≈ A(θn) = ∂f̄(θn)

Zap Stochastic Newton-Raphson (Zap Q-learning!):

    θn+1 = θn − αn+1Ân+1⁻¹fn+1(θn)
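A sketch of the two-time-scale idea on an assumed scalar toy problem (f̄(θ) = −(θ + θ³ − 2) with root θ∗ = 1; the noise levels and the step-size exponent 0.85 are illustrative choices, not values from the slides):

```python
# Sketch: Zap SNR. A_hat tracks the moving derivative A(theta_n) on the faster
# step size gamma_n = n**-0.85, while theta moves on alpha_n = 1/n.
import numpy as np

rng = np.random.default_rng(0)

def f_obs(theta):                  # noisy observation of f-bar(theta)
    return -(theta + theta**3 - 2.0) + 0.5 * rng.normal()

def A_obs(theta):                  # noisy observation of A(theta) = d f-bar / d theta
    return -(1.0 + 3.0 * theta**2) + 0.5 * rng.normal()

theta, A_hat = 0.0, -1.0
for n in range(1, 10**5 + 1):
    A_hat += n**-0.85 * (A_obs(theta) - A_hat)       # fast time scale
    A_safe = min(A_hat, -0.1)                        # guard: keep the gain negative
    theta -= (1.0 / n) * f_obs(theta) / A_safe       # theta step with -A_hat^{-1} gain
print(theta)   # ~ 1.0, the root of f-bar
```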
Zap Q-learning: Convergence for Zapped Watkins

Discounted-cost optimal control problem: Q∗ = Q∗(c)

Watkins' algorithm: variance is infinite for β > 1/2 [Devraj and M., 2017]
Zap Q-learning has optimal variance.

ODE Analysis: change of variables q = Q∗(ς).
The functional Q∗ maps cost functions to Q-functions:

    q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) minu′ q(x′, u′)

ODE for Zap-Q (for Watkins' setting):

    qt = Q∗(ςt),   (d/dt) ςt = −ςt + c

⇒ convergence, optimal covariance, ...
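A sketch of the functional Q∗ on an assumed small random MDP (5 states, 2 actions, β = 0.9, all illustrative): the fixed-point equation above is a β-contraction in q, so plain value iteration computes Q∗(ς).

```python
# Sketch: the functional Q* mapping a cost sigma to the Q-function q solving
#   q(x,u) = sigma(x,u) + beta * sum_{x'} P_u(x,x') * min_{u'} q(x',u')
import numpy as np

rng = np.random.default_rng(0)
nx, nu, beta = 5, 2, 0.9
P = rng.random((nu, nx, nx))
P /= P.sum(axis=2, keepdims=True)          # P[u, x, x'] = P_u(x, x')
c = rng.random((nx, nu))                   # cost sigma(x, u)

def Q_star(sigma, iters=500):
    q = np.zeros((nx, nu))
    for _ in range(iters):                 # contraction: geometric convergence in beta
        v = q.min(axis=1)                  # min_{u'} q(x', u')
        q = sigma + beta * np.einsum('uxy,y->xu', P, v)
    return q

q = Q_star(c)                              # q = Q*(c)
```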
Reinforcement Learning with Momentum
Matrix Momentum: Momentum-based Stochastic Approximation

    ∆θ(n) := θ(n) − θ(n − 1)

Matrix Gain Stochastic Approximation:

    ∆θ(n + 1) = αnGn+1fn+1(θ(n))

Heavy Ball Stochastic Approximation:

    ∆θ(n + 1) = m∆θ(n) + αnfn+1(θ(n))

Matrix Heavy Ball Stochastic Approximation:

    ∆θ(n + 1) = Mn+1∆θ(n) + αnGn+1fn+1(θ(n))
Matrix Heavy Ball Stochastic Approximation: Optimizing {Mn+1} and {Gn+1}

    ∆θ(n + 1) = Mn+1∆θ(n) + αnGn+1fn+1(θ(n))

Heuristic: Assume ∆θ(n) → 0 much faster than θ(n) → θ∗. Then

    ∆θ(n + 1) ≈ Mn+1∆θ(n + 1) + αnGn+1fn+1(θ(n))
              ≈ [I − Mn+1]⁻¹αnGn+1fn+1(θ(n))
              = −αnÂn+1⁻¹fn+1(θ(n))   Zap!

with the choices Mn+1 = I + ζÂn+1 and Gn+1 = ζI, ζ > 0.
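The final equality deserves one explicit line. With these choices, I − Mn+1 = −ζÂn+1, so (a one-line check, assuming Ân+1 is invertible):

```latex
[I - M_{n+1}]^{-1}\,\alpha_n G_{n+1}
  = \left(-\zeta \widehat{A}_{n+1}\right)^{-1} \alpha_n \zeta I
  = -\alpha_n \widehat{A}_{n+1}^{-1}
```

Hence the matrix heavy-ball recursion reproduces the Zap direction, while the recursion itself never inverts a matrix.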
Coupling: Momentum-based Stochastic Approximation

    Ân+1 ≈ (d/dθ) f̄(θn)

SNR:               ∆θ(n + 1) = −αnÂn+1⁻¹fn+1(θ(n))
PolSA:             ∆θ(n + 1) = [I + ζÂn+1]∆θ(n) + αnζfn+1(θ(n))
(linearized) NeSA: ∆θ(n + 1) = [I + ζAn+1]∆θ(n) + αnζfn+1(θ(n))

Coupling of SNR and PolSA: θSNR(n) − θPolSA(n) = O(n⁻¹)

Linear model: fn+1(θ) = A(θ − θ∗) + ∆n+1
    {∆n+1} a square-integrable martingale difference sequence
    A Hurwitz, and eig(I + ζA) ∈ open unit disk

PolSA has optimal asymptotic variance.
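A sketch of the coupling on an assumed linear model satisfying the conditions above (A Hurwitz, eig(I + ζA) inside the unit disk; for simplicity the SNR gain here uses A itself rather than an estimate, and both recursions are driven by the same noise sequence):

```python
# Sketch: SNR and PolSA iterates driven by a shared martingale-difference noise.
import numpy as np

rng = np.random.default_rng(0)
d, zeta = 3, 0.4
A = -np.eye(d) - 0.1 * rng.random((d, d))      # Hurwitz by diagonal dominance
A_inv = np.linalg.inv(A)
theta_star = rng.normal(size=d)

th_snr = np.zeros(d)
th_pol, dth = np.zeros(d), np.zeros(d)
for n in range(1, 10**5 + 1):
    alpha = 1.0 / n
    noise = 0.1 * rng.normal(size=d)           # shared Delta_{n+1}
    # SNR: Delta theta = -alpha A^{-1} f
    f = A @ (th_snr - theta_star) + noise
    th_snr -= alpha * (A_inv @ f)
    # PolSA: Delta theta(n+1) = (I + zeta A) Delta theta(n) + alpha zeta f
    f = A @ (th_pol - theta_star) + noise
    dth = dth + zeta * (A @ dth) + alpha * zeta * f
    th_pol += dth
print(np.linalg.norm(th_snr - th_pol))         # small: the iterates couple
```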
Examples

[Plot: Bellman error vs. n (up to 10⁶), comparing Watkins, Speedy Q-learning, and Polyak-Ruppert averaging with Zap.]
Zap Q-Learning: Fastest Convergent Q-Learning (much like Newton-Raphson)

[Plot: Bellman error vs. n, the number of iterations (up to 10⁶), for Classical Q-Learning, Speedy Q-Learning, R-P Averaging, Polynomial Learning Rate, Q-Learning with Optimal Scalar Gain, Zap O(d), Zap Q-Learning, PolSA, and NeSA.]
Convergence for larger models:

[Log-log plots of maximum Bellman error vs. n (10³ to 10⁷) for d = 19 and d = 117, comparing Watkins (online and clock sampling, with the level 1/(1 − β) marked) against Zap, PolSA, and NeSA.]

Takeaways: coupling is amazing; avoid randomized policies if possible.
Optimal stopping: Zap Q-Learning

Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R100
Parameterized Q-function: Qθ with θ ∈ R10

[Plot: the eigenvalues λi(A) are real, spread over roughly −10⁰ to −10⁻⁶.]

Every eigenvalue λ is real, and the asymptotic covariance is infinite (there are eigenvalues with λ > −1/2).
The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details).

[Plot: eigenvalues of A and of GA for the finance example.]

The favorite choice of gain in [22] barely meets the criterion Re(λ(GA)) < −1/2.
Histograms of the average reward obtained using the different algorithms:

[Histograms at n = 2 × 10⁴, 2 × 10⁵, and 2 × 10⁶ (average reward roughly 1 to 1.25), comparing G-Q(0) with ρ = 0.8 and with ρ = 1.0 (gain doubled) against Zap-Q with ρ = 0.85.]
Conclusions & Future Work

Conclusions

Reinforcement Learning is not just cursed by dimension, but also by variance.
We need better design tools to improve performance.

The asymptotic covariance is an awesome design tool, and it is also predictive of finite-n performance.
    Example: the optimal scalar gain g∗ was chosen based on the asymptotic covariance.

PolSA- and NeSA-based RL provide efficient alternatives to the expensive stochastic Newton-Raphson.
    PolSA: same asymptotic variance as Zap, without matrix inversion.

Future work:
    Q-learning with function approximation
    Obtain conditions for a stable algorithm in a general setting
    Adaptive optimization of algorithm parameters: g∗, ζ∗, G∗, M∗
Thank you!
References

[Covers: Control Techniques for Complex Networks, Sean Meyn (Cambridge University Press); Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie (Cambridge University Press).]
References
This lecture
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning — a user's guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
A. M. Devraj, A. Bušić, and S. Meyn. (stay tuned)
References
Selected References I
[1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), 2008.
[3] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[4] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
Mathematical Library, second edition, 2009.
[5] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
[6] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[7] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
References
Selected References II
[8] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika. Translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
[9] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[10] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[11] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
[12] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[13] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[14] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
References
Selected References III
[15] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[16] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[17] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[18] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[19] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[20] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[21] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
References
Selected References IV
[22] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[23] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[24] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[25] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[26] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.
