1. Zap Q-Learning
Fastest Convergent Q-Learning
Advances in Reinforcement Learning Algorithms
Sean Meyn
Department of Electrical and Computer Engineering — University of Florida
Based on joint research with Adithya M. Devraj and Ana Bušić
Thanks to the National Science Foundation
2. Zap Q-Learning
Essential References
Simons tutorial, March 2018
Part I (Basics, with focus on variance of algorithms)
https://www.youtube.com/watch?v=dhEF5pfYmvc
Part II (Zap Q-learning)
https://www.youtube.com/watch?v=Y3w8f1xIb6s
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv, July 2017.
(full tutorial on stochastic approximation)
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning — a user’s guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
3. Zap Q-Learning
Outline
1 Stochastic Approximation
2 Remedies for S L O W Convergence
3 Reinforcement Learning with Momentum
4 Examples
5 Conclusions & Future Work
6 References
11. Remedies for S L O W Convergence Stochastic Newton Raphson
Fixing (Optimizing) the Variance
Goal: Find θ∗ such that Aθ∗ − b = 0, where fn+1(θn) = An+1θn − bn+1
Stochastic Approximation:
θn+1 = θn + αn+1 fn+1(θn),   we take αn = 1/n
12. Remedies for S L O W Convergence Stochastic Newton Raphson
Fixing (Optimizing) the Variance
Goal: Find θ∗ such that Aθ∗ − b = 0, where fn+1(θn) = An+1θn − bn+1
Stochastic Approximation:
θn+1 = θn + αn+1 fn+1(θn),   we take αn = 1/n
Optimal Asymptotic Variance:
θn+1 = θn − αn+1 A−1 fn+1(θn)
13. Remedies for S L O W Convergence Stochastic Newton Raphson
Fixing (Optimizing) the Variance
Goal: Find θ∗ such that Aθ∗ − b = 0, where fn+1(θn) = An+1θn − bn+1
Stochastic Approximation:
θn+1 = θn + αn+1 fn+1(θn)
Optimal Asymptotic Variance:
θn+1 = θn − αn+1 A−1 fn+1(θn)
Don’t know A
14. Remedies for S L O W Convergence Stochastic Newton Raphson
Fixing (Optimizing) the Variance
Goal: Find θ∗ such that Aθ∗ − b = 0, where fn+1(θn) = An+1θn − bn+1
Stochastic Approximation:
θn+1 = θn + αn+1 fn+1(θn)
Optimal Asymptotic Variance:
θn+1 = θn − αn+1 A−1 fn+1(θn)
Don’t know A – estimate (Monte-Carlo): An+1 ≈ A
Stochastic Newton-Raphson:
θn+1 = θn − αn+1 (An+1)−1 fn+1(θn)
15. Remedies for S L O W Convergence Stochastic Newton Raphson
Fixing (Optimizing) the Variance
Goal: Find θ∗ such that Aθ∗ − b = 0, where fn+1(θn) = An+1θn − bn+1
Stochastic Approximation (TD-learning!):
θn+1 = θn + αn+1 fn+1(θn)
Optimal Asymptotic Variance:
θn+1 = θn − αn+1 A−1 fn+1(θn)
Don’t know A – estimate (Monte-Carlo): An+1 ≈ A
Stochastic Newton-Raphson (Least Squares TD-learning!):
θn+1 = θn − αn+1 (An+1)−1 fn+1(θn)
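A minimal numerical sketch of the two recursions on a synthetic linear model (the matrix A, noise level, and dimension are arbitrary illustrative choices, and the Monte-Carlo estimate of A is safeguarded for the first few iterations):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Synthetic linear problem (illustrative only): find theta* with A theta* - b = 0.
# Eigenvalues of A lie in [-5, -0.6], so plain SA with alpha_n = 1/n only barely
# meets the Re(lambda) < -1/2 criterion for a finite asymptotic covariance.
A = -np.diag(np.linspace(0.6, 5.0, d))
b = rng.standard_normal(d)
theta_star = np.linalg.solve(A, b)

def observation(theta):
    """One noisy observation f_{n+1}(theta) = A_{n+1} theta - b_{n+1}."""
    A_n = A + 0.5 * rng.standard_normal((d, d))
    b_n = b + 0.5 * rng.standard_normal(d)
    return A_n @ theta - b_n, A_n

def run(newton, n_iter=50_000):
    theta = np.zeros(d)
    A_bar = np.zeros((d, d))              # running Monte-Carlo estimate of A
    for n in range(1, n_iter + 1):
        alpha = 1.0 / n
        f, A_n = observation(theta)
        A_bar += (A_n - A_bar) / n        # Monte-Carlo average of the observed A_n
        if newton and n > 100:            # crude safeguard while the estimate of A settles
            theta = theta - alpha * np.linalg.solve(A_bar, f)   # stochastic Newton-Raphson
        else:
            theta = theta + alpha * f                           # plain SA, alpha_n = 1/n
    return np.linalg.norm(theta - theta_star)

print("SA  error:", run(newton=False))
print("SNR error:", run(newton=True))
```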
16. Remedies for S L O W Convergence Stochastic Newton Raphson
Fixing (Optimizing) the Variance
Goal: Find θ∗ such that f̄(θ∗) = 0   (fn+1 no longer linear)
Stochastic Approximation (Q-learning!):
θn+1 = θn + αn+1 fn+1(θn)
Optimal Asymptotic Variance:
θn+1 = θn − αn+1 A(θ∗)−1 fn+1(θn)
Don’t know A(θ∗) – estimate (2-time-scale): An+1 ≈ A(θn) = ∂f̄(θn)
Zap Stochastic Newton-Raphson (Zap Q-learning!):
θn+1 = θn − αn+1 (An+1)−1 fn+1(θn)
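A minimal sketch of the Zap (two-time-scale) recursion on a toy nonlinear root-finding problem; the model, noise levels, and the gain exponent ρ = 0.85 are illustrative assumptions here, not the Q-learning construction from the papers:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
b = rng.standard_normal(d)

# Toy nonlinear problem (illustrative only): fbar(theta) = b - theta - theta**3 = 0.
def fbar(theta):
    return b - theta - theta**3

def jacobian(theta):
    return -np.diag(1.0 + 3.0 * theta**2)      # A(theta) = d fbar / d theta

theta = np.zeros(d)
A_hat = -np.eye(d)                             # matrix-gain estimate, A_hat_n ~ A(theta_n)
rho = 0.85                                     # matrix estimate runs on a faster time scale

for n in range(1, 20_001):
    alpha = 1.0 / n                            # slow step size for theta
    gamma = 1.0 / n**rho                       # faster step size for A_hat (gamma >> alpha)
    f_obs = fbar(theta) + 0.2 * rng.standard_normal(d)           # noisy f_{n+1}(theta_n)
    A_obs = jacobian(theta) + 0.2 * rng.standard_normal((d, d))  # noisy sample of A(theta_n)
    A_hat += gamma * (A_obs - A_hat)           # two-time-scale tracking of A(theta_n)
    if n > 50:                                 # crude safeguard while A_hat settles
        theta = theta - alpha * np.linalg.solve(A_hat, f_obs)    # Zap update
    else:
        theta = theta + alpha * f_obs

print("theta_n      :", theta)
print("fbar(theta_n):", fbar(theta))           # should be close to zero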
17. Remedies for S L O W Convergence Convergence for Zapped Watkins
Zap Q-learning
Discounted-cost optimal control problem: Q∗ = Q∗(c)
Watkins’ algorithm: variance is infinite for β > 1/2   [Devraj and M., 2017]
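For reference, a minimal tabular sketch of Watkins’ algorithm for the discounted-cost problem on a small synthetic MDP; the MDP, the uniform exploration policy, and the 1/visit-count step size are illustrative choices only:

```python
import numpy as np

rng = np.random.default_rng(5)
n_x, n_u, beta = 4, 2, 0.9

# Hypothetical finite MDP: transition kernels P_u(x, x') and cost c(x, u).
P = rng.random((n_u, n_x, n_x))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((n_x, n_u))

Q = np.zeros((n_x, n_u))
visits = np.zeros((n_x, n_u))
x = 0
for n in range(100_000):
    u = rng.integers(n_u)                       # exploratory (uniform) policy
    x_next = rng.choice(n_x, p=P[u, x])
    visits[x, u] += 1
    alpha = 1.0 / visits[x, u]                  # one common step-size choice
    # Watkins' update for the discounted-*cost* problem (min instead of max):
    Q[x, u] += alpha * (c[x, u] + beta * Q[x_next].min() - Q[x, u])
    x = x_next

print(Q)                                        # approximates the fixed point Q*(c)
```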
18. Remedies for S L O W Convergence Convergence for Zapped Watkins
Zap Q-learning
Discounted-cost optimal control problem: Q∗ = Q∗(c)
Zap Q-learning has optimal variance:
19. Remedies for S L O W Convergence Convergence for Zapped Watkins
Zap Q-learning
Discounted-cost optimal control problem: Q∗ = Q∗(c)
Zap Q-learning has optimal variance:
ODE Analysis: change of variables q = Q∗(ς)
Functional Q∗ maps cost functions to Q-functions:
q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) minu′ q(x′, u′)
20. Remedies for S L O W Convergence Convergence for Zapped Watkins
Zap Q-learning
Discounted-cost optimal control problem: Q∗ = Q∗(c)
Zap Q-learning has optimal variance:
ODE Analysis: change of variables q = Q∗(ς)
Functional Q∗ maps cost functions to Q-functions:
q(x, u) = ς(x, u) + β Σx′ Pu(x, x′) minu′ q(x′, u′)
ODE for Zap-Q (for Watkins’ setting)
qt = Q∗(ςt),   (d/dt) ςt = −ςt + c
⇒ convergence, optimal covariance, ...
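A minimal sketch of the functional Q∗ for a small synthetic MDP (random transitions and costs, used only for illustration), obtained by iterating the fixed-point equation above:

```python
import numpy as np

rng = np.random.default_rng(2)
n_x, n_u, beta = 4, 2, 0.9              # small hypothetical MDP

# Transition kernels P_u(x, x') and a cost function varsigma(x, u), chosen at random.
P = rng.random((n_u, n_x, n_x))
P /= P.sum(axis=2, keepdims=True)
cost = rng.random((n_x, n_u))

def Qstar(varsigma, tol=1e-10):
    """The functional Q*: map a cost function to the fixed point
       q(x,u) = varsigma(x,u) + beta * sum_x' P_u(x,x') * min_u' q(x',u')."""
    q = np.zeros((n_x, n_u))
    while True:
        q_min = q.min(axis=1)                               # min_u' q(x', u')
        q_new = varsigma + beta * np.einsum('uxy,y->xu', P, q_min)
        if np.max(np.abs(q_new - q)) < tol:
            return q_new
        q = q_new

print(Qstar(cost))
```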
39. Examples Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R100
Parameterized Q-function: Qθ with θ ∈ R10
[Plot: Real λi(A), i = 1, …, 10]
Asymptotic covariance is infinite: λ > −1/2 for every eigenvalue λ
40. Examples Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R100
Parameterized Q-function: Qθ with θ ∈ R10
[Plot: Real λi(A), i = 1, …, 10]
Asymptotic covariance is infinite: λ > −1/2 for every eigenvalue λ
Authors observed slow convergence; proposed a matrix gain sequence {Gn} (see refs for details)
41. Examples Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R100
Parameterized Q-function: Qθ with θ ∈ R10
[Plot: Eigenvalues of A and GA for the finance example: λi(GA) in the complex plane, and Real λi(A)]
Favorite choice of gain in [22] barely meets the criterion Re(λ(GA)) < −1/2
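A small numerical check of this criterion, using a stand-in Hurwitz matrix A rather than the finance model (which is not reproduced here):

```python
import numpy as np

def meets_variance_criterion(G, A):
    """Check Re(lambda(GA)) < -1/2 for every eigenvalue: the condition for a
    finite asymptotic covariance with step size alpha_n = 1/n and matrix gain G."""
    eigs = np.linalg.eigvals(G @ A)
    return eigs, bool(np.all(eigs.real < -0.5))

# Stand-in linearization (illustrative only), with one eigenvalue barely below -1/2.
A = -np.diag([0.6, 1.0, 2.0, 4.0])

eigs, ok = meets_variance_criterion(np.eye(4), A)            # plain 1/n step size, G = I
print("G = I       :", np.round(eigs.real, 3), "meets criterion:", ok)

eigs, ok = meets_variance_criterion(-np.linalg.inv(A), A)    # Zap-style gain G = -A^{-1}
print("G = -A^{-1} :", np.round(eigs.real, 3), "meets criterion:", ok)
```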
42. Examples Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R100.
Parameterized Q-function: Qθ with θ ∈ R10
Histograms of the average reward obtained using the different algorithms:
[Histograms of average reward at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6, comparing G-Q(0) (including a gain-doubled variant) and Zap-Q with ρ = 0.8, 0.85, 1.0]
43. Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
44. Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool.
It is also predictive of finite-n performance.
Example: optimal g∗
was chosen based on asymptotic covariance
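In the linear setting with step size 1/n and matrix gain G, the asymptotic covariance solves the standard Lyapunov equation (GA + ½I)Σ + Σ(GA + ½I)ᵀ + G Σ_Δ Gᵀ = 0 (see the stochastic approximation references); a sketch of using it as a design tool, with stand-in values of A and Σ_Δ chosen only for illustration:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

d = 4
A = -np.diag([0.6, 1.0, 2.0, 4.0])      # stand-in linearization; eigenvalue -0.6 is barely < -1/2
rng = np.random.default_rng(4)
M = rng.standard_normal((d, d))
Sigma_Delta = M @ M.T                   # stand-in noise covariance

def asymptotic_covariance(G):
    """Sigma for theta_{n+1} = theta_n + (1/n) G f_{n+1}(theta_n), valid when Re lambda(GA) < -1/2:
       (GA + I/2) Sigma + Sigma (GA + I/2)^T + G Sigma_Delta G^T = 0."""
    F = G @ A + 0.5 * np.eye(d)
    assert np.all(np.linalg.eigvals(F).real < 0), "need Re lambda(GA) < -1/2"
    return solve_continuous_lyapunov(F, -G @ Sigma_Delta @ G.T)

Sigma_plain = asymptotic_covariance(np.eye(d))           # plain 1/n step size, G = I
Sigma_opt = asymptotic_covariance(-np.linalg.inv(A))     # optimal gain G* = -A^{-1}
print("trace, G = I      :", np.trace(Sigma_plain))
print("trace, G = -A^{-1}:", np.trace(Sigma_opt))
```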
45. Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool.
It is also predictive of finite-n performance.
Example: optimal g∗
was chosen based on asymptotic covariance
PolSA- and NeSA-based RL provide efficient alternatives to the
expensive stochastic Newton-Raphson
PolSA: same asymptotic variance as Zap, without matrix inversion
46. Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool.
It is also predictive of finite-n performance.
Example: optimal g∗
was chosen based on asymptotic covariance
PolSA- and NeSA-based RL provide efficient alternatives to the
expensive stochastic Newton-Raphson
PolSA: same asymptotic variance as Zap, without matrix inversion
Future work:
Q-learning with function-approximation
Obtain conditions for a stable algorithm in a general setting
Adaptive optimization of algorithm parameters: g∗, ζ∗, G∗, M∗
48. References
Control Techniques for Complex Networks, by Sean Meyn
Markov Chains and Stochastic Stability, by S. P. Meyn and R. L. Tweedie
[Book covers shown on slide]
References
49. References
This lecture
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. arXiv, July 2017.
A. M. Devraj and S. Meyn. Zap Q-Learning. In Advances in Neural Information Processing
Systems, pages 2235–2244, 2017.
A. M. Devraj, A. Bušić, and S. Meyn. Zap Q-Learning — a user’s guide. In Proceedings of
the Fifth Indian Control Conference, 9–11 January 2019.
A. M. Devraj, A. Bušić, and S. Meyn. (stay tuned)
50. References
Selected References I
[1] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[2] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), 2008.
[3] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[4] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
Mathematical Library, second edition, 2009.
[5] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
[6] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[7] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
51. References
Selected References II
[8] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika. Translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990.
[9] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[10] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[11] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
[12] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[13] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[14] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
52. References
Selected References III
[15] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[16] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[17] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[18] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[19] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
[20] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[21] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
53. References
Selected References IV
[22] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[23] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[24] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[25] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[26] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.