Q-learning

Christopher J. C. H. Watkins; Peter Dayan

doi:10.1007/bf00992698

journal article May 01, 1992

Machine Learning Vol. 8 No. 3-4 pp. 279-292 · Springer Science and Business Media LLC

References

14

[1]

Barto, A.G., Bradtke, S.J. & Singh, S.P. (1991). Real-time learning and control using asynchronous dynamic programming (COINS technical report 91-57). Amherst: University of Massachusetts.

[2]

Barto, A.G. & Singh, S.P. (1990). On the computational economics of reinforcement learning. In D.S. Touretzky, J. Elman, T.J. Sejnowski & G.E. Hinton (Eds.), Proceedings of the 1990 Connectionist Models Summer School. San Mateo, CA: Morgan Kaufmann.

[3]

Bellman, R.E. & Dreyfus, S.E. (1962). Applied dynamic programming. RAND Corporation. doi:10.1515/9781400874651

[4]

Chapman, D. & Kaelbling, L.P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. Proceedings of the 1991 International Joint Conference on Artificial Intelligence (pp. 726-731).

[5]

Kushner, H. & Clark, D. (1978). Stochastic approximation methods for constrained and unconstrained systems. Berlin, Germany: Springer-Verlag. doi:10.1007/978-1-4684-9352-8

[6]

Lin, L. (1992). Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8. doi:10.1007/bf00992699

[7]

Mahadevan & Connell (1991). Automatic programming of behavior-based robots using reinforcement learning. Proceedings of the 1991 National Conference on AI (pp. 768-773).

[8]

Ross, S. (1983). Introduction to stochastic dynamic programming. New York: Academic Press.

[9]

Sato, M., Abe, K. & Takeda, H. (1988). Learning control of finite Markov chains with explicit trade-off between estimation and control. IEEE Transactions on Systems, Man and Cybernetics, 18, pp. 677-684. doi:10.1109/21.21595

[10]

Sutton, R.S. (1984). Temporal credit assignment in reinforcement learning. PhD Thesis, University of Massachusetts, Amherst, MA.

[11]

Sutton, R.S. (1988). Learning to predict by the methods of temporal difference. Machine Learning, 3, pp. 9-44.

[12]

Sutton, R.S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. Proceedings of the Seventh International Conference on Machine Learning. San Mateo, CA: Morgan Kaufmann.

[13]

Watkins, C.J.C.H. (1989). Learning from delayed rewards. PhD Thesis, University of Cambridge, England.

[14]

Werbos, P.J. (1977). Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22, pp. 25-38.

Cited By

7,429

Judge: Effective State Abstraction for Guiding Automated Web GUI Testing

Chenxu Liu, Wei Yang · 2026

ACM Transactions on Software Engine...

Universal Stabilization for Maximum Entropy Optimization in Reinforcement Learning

Xing Chen, Xiaofeng Cao · 2026

IEEE Transactions on Neural Network...