# Reinforcement Learning


**Reinforcement Learning** is a learning paradigm inspired by behaviourist psychology and classical conditioning: learning by trial and error, interacting with an environment to map situations to actions in such a way that some notion of cumulative reward is maximized. In computer games, reinforcement learning deals with adjusting feature weights based on game results, or on subsequent predictions of those results, during self play.

Reinforcement learning is indebted to the idea of Markov decision processes (MDPs) from the field of optimal control, where dynamic programming techniques are used. The crucial tradeoff between exploitation and exploration in multi-armed bandit problems, also considered in UCT of Monte-Carlo Tree Search, is faced in reinforcement learning as well: the agent must balance "exploitation" of the machine that has the highest expected payoff against "exploration" to get more information about the expected payoffs of the other machines.
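The exploitation and exploration tradeoff can be illustrated with a small bandit simulation. The following is a minimal sketch of the UCB1 selection rule, on which UCT builds, and not code from any particular engine. The function names, the Bernoulli payoff probabilities and the step count are assumptions chosen for illustration.

```python
import math
import random


def ucb1_select(counts, values, c=math.sqrt(2)):
    """Return the arm index maximizing mean reward plus an exploration bonus."""
    # Play every arm once first so that each mean estimate is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    scores = [
        values[arm] + c * math.sqrt(math.log(total) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(payoff_probs, steps=2000, seed=0):
    """Simulate Bernoulli arms (hypothetical payoffs) and return pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(payoff_probs)
    values = [0.0] * len(payoff_probs)  # running mean reward per arm
    for _ in range(steps):
        arm = ucb1_select(counts, values)
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts
```

Calling `run_bandit([0.2, 0.5, 0.8])` should concentrate most pulls on the last arm while still sampling the others occasionally, since the bonus term grows for arms that have been tried rarely.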


# Q-Learning

Q-Learning, introduced by Chris Watkins in 1989, is a simple way for agents to learn how to act optimally in controlled Markovian domains ^{[2]}. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely ^{[3]}. Q-learning has been successfully applied to deep learning by a Google DeepMind team in playing some Atari 2600 games as published in Nature, 2015, dubbed *deep reinforcement learning* or *deep Q-networks* ^{[4]}, soon followed by the spectacular AlphaGo and AlphaZero breakthroughs.
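The incremental improvement of action-values can be sketched in a few lines. The example below is a minimal tabular Q-learning sketch under stated assumptions: the corridor environment, the reward of 1 at the goal, and the parameters `alpha`, `gamma` and `epsilon` are hypothetical choices for illustration and are not taken from the cited papers.

```python
import random
from collections import defaultdict


def q_update(q, state, action, reward, next_state, next_actions,
             alpha=0.5, gamma=0.9):
    """Apply one Q-learning backup to the table q in place."""
    # Best estimated value of the successor state (0 if it is terminal).
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])


def train_corridor(length=5, episodes=500, epsilon=0.2, seed=0):
    """Learn Q-values on a hypothetical corridor with the goal at the right end."""
    rng = random.Random(seed)
    q = defaultdict(float)
    actions = (-1, +1)  # step left or step right
    goal = length - 1
    for _ in range(episodes):
        state = 0
        while state != goal:
            if rng.random() < epsilon:
                # Explore: pick any action.
                action = rng.choice(actions)
            else:
                # Exploit: pick a best-valued action, breaking ties at random.
                best = max(q[(state, a)] for a in actions)
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state = min(max(state + action, 0), goal)
            reward = 1.0 if next_state == goal else 0.0
            next_actions = () if next_state == goal else actions
            q_update(q, state, action, reward, next_state, next_actions)
            state = next_state
    return q
```

After training, `q[(s, +1)]` should exceed `q[(s, -1)]` in every non-goal state `s`, so the greedy policy walks straight to the goal. The table is a plain dictionary keyed by `(state, action)`, which is the discrete representation the convergence result assumes.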

# Temporal Difference Learning

*see main page Temporal Difference Learning*

Q-learning at its simplest uses tables to store data. This very quickly loses viability with increasing sizes of state/action space of the system it is monitoring/controlling. One solution to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Gerald Tesauro in his Backgammon playing temporal difference learning research ^{[5]} ^{[6]}.

Temporal Difference Learning is a prediction method primarily used for reinforcement learning. In the domain of computer games and computer chess, TD learning is applied through self play, subsequently predicting the probability of winning a game during the sequence of moves from the initial position until the end, to adjust weights for a more reliable prediction.
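The weight adjustment from successive predictions can be sketched as follows. This is a simplified TD(0) sketch with a logistic prediction over a linear evaluation, not the TD(λ) neural network used in TD-Gammon. The feature vectors, the learning rate `alpha` and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict(weights, features):
    """Predicted win probability: logistic of a linear evaluation."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))


def td0_update(weights, positions, outcome, alpha=0.1):
    """Adjust weights in place from the positions of one game.

    positions: feature vectors of successive positions (hypothetical encoding).
    outcome: final result from the same point of view (1.0 win, 0.5 draw, 0.0 loss).
    """
    for t, features in enumerate(positions):
        p = predict(weights, features)
        # The target is the next prediction, or the game result at the last position.
        if t + 1 < len(positions):
            target = predict(weights, positions[t + 1])
        else:
            target = outcome
        error = target - p
        # Derivative of the logistic output with respect to its linear input.
        grad_scale = p * (1.0 - p)
        for i, f in enumerate(features):
            weights[i] += alpha * error * grad_scale * f
```

Each prediction is pulled toward the one that follows it, and the last prediction toward the actual result, so information about the outcome propagates backwards through the move sequence over repeated games.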

# See also

- AlphaZero
- Deep Learning
- Dynamic Programming
- Markov Models by Michael L. Littman
- MENACE by Donald Michie
- Monte-Carlo Tree Search

# Selected Publications

## 1954 ...

- Richard E. Bellman (**1954**). *On a new Iterative Algorithm for Finding the Solutions of Games and Linear Programming Problems*. Technical Report P-473, RAND Corporation, U. S. Air Force Project RAND
- Arthur Samuel (**1959**). *Some Studies in Machine Learning Using the Game of Checkers*. IBM Journal July 1959

## 1960 ...

- Richard E. Bellman (**1960**). *Sequential Machines, Ambiguity, and Dynamic Programming*. Journal of the ACM, Vol. 7, No. 1
- Ronald A. Howard (**1960**). *Dynamic Programming and Markov Processes*. MIT Press, amazon
- Donald Michie (**1961**). *Trial and Error*. Penguin Science Survey
- Donald Michie, Roger A. Chambers (**1968**). *Boxes: An experiment on adaptive control*. Machine Intelligence 2, Edinburgh: Oliver & Boyd, pdf

## 1970 ...

- A. Harry Klopf (**1972**). *Brain Function and Adaptive Systems - A Heterostatic Theory*. Air Force Cambridge Research Laboratories, Special Reports, No. 133, pdf
- John H. Holland (**1975**). *Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence*. amazon.com

## 1980 ...

- Richard Sutton (**1984**). *Temporal Credit Assignment in Reinforcement Learning*. Ph.D. dissertation, University of Massachusetts
- Leslie Valiant (**1984**). *A Theory of the Learnable*. Communications of the ACM, Vol. 27, No. 11, pdf
- Chris Watkins (**1989**). *Learning from Delayed Rewards*. Ph.D. thesis, Cambridge University, pdf

## 1990 ...

- Richard Sutton, Andrew Barto (**1990**). *Time Derivative Models of Pavlovian Reinforcement*. Learning and Computational Neuroscience: Foundations of Adaptive Networks: 497-537
- Chris Watkins, Peter Dayan (**1992**). *Q-learning*. Machine Learning, Vol. 8, No. 2
- Gerald Tesauro (**1992**). *Temporal Difference Learning of Backgammon Strategy*. ML 1992
- Justin A. Boyan, Michael L. Littman (**1993**). *Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach*. NIPS 1993, pdf
- Michael L. Littman (**1994**). *Markov Games as a Framework for Multi-Agent Reinforcement Learning*. International Conference on Machine Learning, pdf

## 1995 ...

- Marco Wiering (**1995**). *TD Learning of Game Evaluation Functions with Hierarchical Neural Architectures*. Master's thesis, University of Amsterdam, pdf
- Gerald Tesauro (**1995**). *Temporal Difference Learning and TD-Gammon*. Communications of the ACM, Vol. 38, No. 3
- Leemon C. Baird III, Mance E. Harmon, A. Harry Klopf (**1996**). *Reinforcement Learning: An Alternative Approach to Machine Intelligence*. pdf
- Leslie Pack Kaelbling, Michael L. Littman, Andrew W. Moore (**1996**). *Reinforcement Learning: A Survey*. JAIR Vol. 4, pdf
- Robert Levinson (**1996**). *General Game-Playing and Reinforcement Learning*. Computational Intelligence, Vol. 12, No. 1
- Ronald Parr, Stuart Russell (**1997**). *Reinforcement Learning with Hierarchies of Machines*. In Advances in Neural Information Processing Systems 10, MIT Press, zipped ps
- William Uther, Manuela M. Veloso (**1997**). *Adversarial Reinforcement Learning*. Carnegie Mellon University, ps
- William Uther, Manuela M. Veloso (**1997**). *Generalizing Adversarial Reinforcement Learning*. Carnegie Mellon University, ps
- Marco Wiering, Jürgen Schmidhuber (**1997**). *HQ-learning*. Adaptive Behavior, Vol. 6, No. 2
- Csaba Szepesvári (**1998**). *Reinforcement Learning: Theory and Practice*. Proceedings of the 2nd Slovak Conference on Artificial Neural Networks, zipped ps
- Richard Sutton, Andrew Barto (**1998**). *Reinforcement Learning: An Introduction*. MIT Press
- Vassilis Papavassiliou, Stuart Russell (**1999**). *Convergence of reinforcement learning with general function approximators*. In Proc. IJCAI-99, Stockholm, ps
- Marco Wiering (**1999**). *Explorations in Efficient Reinforcement Learning*. Ph.D. thesis, University of Amsterdam, advisors Frans Groen and Jürgen Schmidhuber

## 2000 ...

- Sebastian Thrun, Michael L. Littman (**2000**). *A Review of Reinforcement Learning*. AI Magazine, Vol. 21, No. 1
- Robert Levinson, Ryan Weber (**2000**). *Chess Neighborhoods, Function Combination, and Reinforcement Learning*. CG 2000
- Andrew Ng, Stuart Russell (**2000**). *Algorithms for inverse reinforcement learning*. In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, California: Morgan Kaufmann, pdf
- Dean F. Hougen, Maria Gini, James R. Slagle (**2000**). *An Integrated Connectionist Approach to Reinforcement Learning for Robotic Control*. ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
- Jonathan Baxter, Peter Bartlett (**2000**). *Reinforcement Learning on POMDPs via Direct Gradient Ascent*. ICML 2000, pdf
- Doina Precup (**2000**). *Temporal Abstraction in Reinforcement Learning*. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts, Amherst
- Robert Levinson, Ryan Weber (**2001**). *Chess Neighborhoods, Function Combinations and Reinforcement Learning*. In Computers and Games (eds. Tony Marsland and I. Frank). Lecture Notes in Computer Science, Springer, pdf
- Marco Block-Berlitz (**2003**). *Reinforcement Learning in der Schachprogrammierung*. Studienarbeit, Freie Universität Berlin, Dozent: Prof. Dr. Raúl Rojas, pdf (German)
- Henk Mannen (**2003**). *Learning to play chess using reinforcement learning with database games*. Master's thesis, Cognitive Artificial Intelligence, Utrecht University
- Joelle Pineau, Geoffrey Gordon, Sebastian Thrun (**2003**). *Point-based value iteration: An anytime algorithm for POMDPs*. IJCAI, pdf
- Yngvi Björnsson, Vignir Hafsteinsson, Ársæll Jóhannsson, Einar Jónsson (**2004**). *Efficient Use of Reinforcement Learning in a Computer Game*. In Computer Games: Artificial Intelligence, Design and Education (CGAIDE'04), pp. 379–383, 2004. pdf
- Imran Ghory (**2004**). *Reinforcement learning in board games*. CSTR-04-004, Department of Computer Science, University of Bristol. pdf^{[7]}
- Eric Wiewiora (**2004**). *Efficient Exploration for Reinforcement Learning*. MSc thesis, pdf
- Albert Xin Jiang (**2004**). *Multiagent Reinforcement Learning in Stochastic Games with Continuous Action Spaces*. pdf

## 2005 ...

- Sylvain Gelly, Jérémie Mary, Olivier Teytaud (**2006**). *Learning for stochastic dynamic programming*. pdf, pdf
- Sylvain Gelly (**2007**). *A Contribution to Reinforcement Learning; Application to Computer Go*. Ph.D. thesis, pdf
- Yong Duan, Baoxia Cui, Xinhe Xu (**2007**). *State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network*. ISNN 2007
- Yasuhiro Osaki, Kazutomo Shibahara, Yasuhiro Tajima, Yoshiyuki Kotani (**2007**). *Reinforcement Learning of Evaluation Functions Using Temporal Difference-Monte Carlo learning method*. 12th Game Programming Workshop
- David Silver, Richard Sutton, Martin Müller (**2007**). *Reinforcement learning of local shape in the game of Go*. 20th IJCAI, pdf
- Marco Block, Maro Bader, Ernesto Tapia, Marte Ramírez, Ketill Gunnarsson, Erik Cuevas, Daniel Zaldivar, Raúl Rojas (**2008**). *Using Reinforcement Learning in Chess Engines*. CONCIBE SCIENCE 2008, Research in Computing Science: Special Issue in Electronics and Biomedical Engineering, Computer Science and Informatics, ISSN:1870-4069, Vol. 35, pp. 31-40, Guadalajara, Mexico, pdf
- Cécile Germain-Renaud, Julien Pérez, Balázs Kégl, Charles Loomis (**2008**). *Grid Differentiated Services: a Reinforcement Learning Approach*. In 8th IEEE Symposium on Cluster Computing and the Grid. Lyon, pdf
- David Silver (**2009**). *Reinforcement Learning and Simulation-Based Search*. Ph.D. thesis, University of Alberta. pdf

## 2010 ...

- Joel Veness, Kee Siong Ng, Marcus Hutter, David Silver (**2010**). *Reinforcement Learning via AIXI Approximation*. Association for the Advancement of Artificial Intelligence (AAAI), pdf
- Julien Pérez, Cécile Germain-Renaud, Balázs Kégl, Charles Loomis (**2010**). *Multi-objective Reinforcement Learning for Responsive Grids*. In The Journal of Grid Computing. pdf
- Csaba Szepesvári (**2010**). *Algorithms for Reinforcement Learning*. Morgan & Claypool

**2011**

- Peter Auer (**2011**). *Exploration and Exploitation in Online Learning*. ICAIS 2011
- Charles Elkan (**2011**). *Reinforcement Learning with a Bilinear Q Function*. EWRL 2011

**2012**

- Marco Wiering, Martijn Van Otterlo (**2012**). *Reinforcement learning: State-of-the-art*. Adaptation, Learning, and Optimization, Vol. 12, Springer
  - István Szita (**2012**). *Reinforcement Learning in Games*. Chapter 17
- Arthur Guez, David Silver, Peter Dayan (**2012**). *Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search*. NIPS 2012, pdf

**2013**

- Arthur Guez, David Silver, Peter Dayan (**2013**). *Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search*. Journal of Artificial Intelligence Research, Vol. 48, pdf
- Michiel van der Ree, Marco Wiering (**2013**). *Reinforcement Learning in the Game of Othello: Learning Against a Fixed Opponent and Learning from Self-Play*. ADPRL 2013
- Luuk Bom, Ruud Henken, Marco Wiering (**2013**). *Reinforcement Learning to Train Ms. Pac-Man Using Higher-order Action-relative Inputs*. ADPRL 2013^{[8]}
- Peter Auer, Marcus Hutter, Laurent Orseau (**2013**). *Reinforcement Learning*. Dagstuhl Reports, Vol. 3, No. 8, DOI: 10.4230/DagRep.3.8.1, URN: urn:nbn:de:0030-drops-43409
- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller (**2013**). *Playing Atari with Deep Reinforcement Learning*. arXiv:1312.5602^{[9]}^{[10]}
- Ari Weinstein (**2013**). *Local Planning For Continuous Markov Decision Processes*. Ph.D. thesis, Rutgers University, advisor Michael L. Littman, pdf

**2014**

- Marcin Szubert (**2014**). *Coevolutionary Shaping for Reinforcement Learning*. Ph.D. thesis, Poznań University of Technology, supervisor Krzysztof Krawiec, co-supervisor Wojciech Jaśkowski, pdf

## 2015 ...

- Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis (**2015**). *Human-level control through deep reinforcement learning*. Nature, Vol. 518
- Tobias Graf, Marco Platzner (**2015**). *Adaptive Playouts in Monte Carlo Tree Search with Policy Gradient Reinforcement Learning*. Advances in Computer Games 14
- Arun Nair, Praveen Srinivasan, Sam Blackwell, Cagdas Alcicek, Rory Fearon, Alessandro De Maria, Veda Panneershelvam, Mustafa Suleyman, Charles Beattie, Stig Petersen, Shane Legg, Volodymyr Mnih, Koray Kavukcuoglu, David Silver (**2015**). *Massively Parallel Methods for Deep Reinforcement Learning*. arXiv:1507.04296
- Matthew Lai (**2015**). *Giraffe: Using Deep Reinforcement Learning to Play Chess*. M.Sc. thesis, Imperial College London, arXiv:1509.01549v1 » Giraffe
- Hado van Hasselt, Arthur Guez, David Silver (**2015**). *Deep Reinforcement Learning with Double Q-learning*. arXiv:1509.06461

**2016**

- Ziyu Wang, Nando de Freitas, Marc Lanctot (**2016**). *Dueling Network Architectures for Deep Reinforcement Learning*. arXiv:1511.06581
- David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis (**2016**). *Mastering the game of Go with deep neural networks and tree search*. Nature, Vol. 529 » AlphaGo
- Hung Guei, Tinghan Wei, Jin-Bo Huang, I-Chen Wu (**2016**). *An Empirical Study on Applying Deep Reinforcement Learning to the Game 2048*. CG 2016
- Omid E. David, Nathan S. Netanyahu, Lior Wolf (**2016**). *DeepChess: End-to-End Deep Neural Network for Automatic Learning in Chess*. ICANN 2016, Lecture Notes in Computer Science, Vol. 9887, Springer, pdf preprint » DeepChess^{[11]}^{[12]}
- Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, Koray Kavukcuoglu (**2016**). *Asynchronous Methods for Deep Reinforcement Learning*. arXiv:1602.01783v2
- Shixiang Gu, Ethan Holly, Timothy Lillicrap, Sergey Levine (**2016**). *Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates*. arXiv:1610.00633
- Max Jaderberg, Volodymyr Mnih, Wojciech Marian Czarnecki, Tom Schaul, Joel Z. Leibo, David Silver, Koray Kavukcuoglu (**2016**). *Reinforcement Learning with Unsupervised Auxiliary Tasks*. arXiv:1611.05397v1
- Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Rémi Munos, Charles Blundell, Dharshan Kumaran, Matthew Botvinick (**2016**). *Learning to reinforcement learn*. arXiv:1611.05763

**2017**

- Hirotaka Kameko, Jun Suzuki, Naoki Mizukami, Yoshimasa Tsuruoka (**2017**). *Deep Reinforcement Learning with Hidden Layers on Future States*. Computer Games Workshop at IJCAI 2017, pdf
- Marc Lanctot, Vinícius Flores Zambaldi, Audrunas Gruslys, Angeliki Lazaridou, Karl Tuyls, Julien Pérolat, David Silver, Thore Graepel (**2017**). *A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning*. arXiv:1711.00832
- David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, Demis Hassabis (**2017**). *Mastering the game of Go without human knowledge*. Nature, Vol. 550, pdf^{[13]}
- Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger (**2017**). *Deep Reinforcement Learning that Matters*. arXiv:1709.06560
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (**2017**). *Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm*. arXiv:1712.01815 » AlphaZero
- Kei Takada, Hiroyuki Iizuka, Masahito Yamamoto (**2017**). *Reinforcement Learning for Creating Evaluation Function Using Convolutional Neural Network in Hex*. TAAI 2017 » Hex, CNN
- Ari Weinstein, Matthew Botvinick (**2017**). *Structure Learning in Motor Control: A Deep Reinforcement Learning Model*. arXiv:1706.06827
- William Uther (**2017**). *Markov Decision Processes*. In Claude Sammut, Geoffrey I. Webb (eds.) (**2017**). *Encyclopedia of Machine Learning and Data Mining*. Springer

# Postings

- Parameter Tuning by Jonathan Baxter, CCC, October 01, 1998 » KnightCap
- any good experiences with genetic algos or temporal difference learning? by Rafael B. Andrist, CCC, January 01, 2001
- *First release* Giraffe, a new engine based on deep learning by Matthew Lai, CCC, July 08, 2015 » Deep Learning, Giraffe
- Demystifying Deep Reinforcement Learning by Tambet Matiisen, Nervana, December 22, 2015
- Deep Reinforcement Learning with Neon by Tambet Matiisen, Nervana, December 29, 2015
- Google's AlphaGo team has been working on chess by Peter Kappler, CCC, December 06, 2017 » AlphaZero
- Understanding the power of reinforcement learning by Michael Sherwin, CCC, December 12, 2017

# External Links

## Reinforcement Learning

- Reinforcement learning from Wikipedia
- Reinforcement Learning: An Introduction ebook by Richard Sutton and Andrew Barto
- Reinforcement Learning in Classic Board Games (pdf) by David Silver
- Category: Reinforcement Learning - Scholarpedia
- Reinforcement Learning - Scholarpedia
- Reinforcement Learning and Artificial Intelligence – Faculty of Science, University of Alberta

## MDP

- Markov decision process from Wikipedia
- Partially observable Markov decision process
- Reinforcement Learning and POMDPs by Jürgen Schmidhuber

## Q-Learning

- Q-learning from Wikipedia
- A Painless Q-Learning Tutorial
- State–action–reward–state–action from Wikipedia
- Probably approximately correct learning from Wikipedia

## Courses

- Reinforcement Learning Course by David Silver, University College London, 2015, YouTube Videos
  - Lecture 1: Introduction to Reinforcement Learning
  - Lecture 2: Markov Decision Process
  - Lecture 3: Planning by Dynamic Programming
  - Lecture 4: Model-Free Prediction
  - Lecture 5: Model Free Control
  - Lecture 6: Value Function Approximation
  - Lecture 7: Policy Gradient Methods
  - Lecture 8: Integrating Learning and Planning
  - Lecture 9: Exploration and Exploitation
  - Lecture 10: Classic Games
- Introduction to Reinforcement Learning by Joelle Pineau, McGill University, 2016, YouTube Video

# References

- ↑ Example of a simple Markov decision process with three states (green circles) and two actions (orange circles), with two rewards (orange arrows), image by waldoalvarez CC BY-SA 4.0, Wikimedia Commons
- ↑ Q-learning from Wikipedia
- ↑ Chris Watkins, Peter Dayan (**1992**). *Q-learning*. Machine Learning, Vol. 8, No. 2
- ↑ Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis (**2015**). *Human-level control through deep reinforcement learning*. Nature, Vol. 518
- ↑ Q-learning from Wikipedia
- ↑ Gerald Tesauro (**1995**). *Temporal Difference Learning and TD-Gammon*. Communications of the ACM, Vol. 38, No. 3
- ↑ University of Bristol - Department of Computer Science - Technical Reports
- ↑ Ms. Pac-Man from Wikipedia
- ↑ Demystifying Deep Reinforcement Learning by Tambet Matiisen, Nervana, December 22, 2015
- ↑ Patent US20150100530 - Methods and apparatus for reinforcement learning - Google Patents
- ↑ DeepChess: Another deep-learning based chess program by Matthew Lai, CCC, October 17, 2016
- ↑ ICANN 2016 | Recipients of the best paper awards
- ↑ AlphaGo Zero: Learning from scratch by Demis Hassabis and David Silver, DeepMind, October 18, 2017