Difference between revisions of "Reinforcement Learning"

From Chessprogramming wiki
 
Line 164: Line 164:
 
* [https://scholar.google.co.in/citations?user=1QlrvHkAAAAJ&hl=en Chandramouli Kamanchi], [https://scholar.google.co.in/citations?user=nx4NlpsAAAAJ&hl=en Raghuram Bharadwaj Diddigi], [[Shalabh Bhatnagar]] ('''2019'''). ''Successive Over Relaxation Q-Learning''. [https://arxiv.org/abs/1903.03812 arXiv:1903.03812]
 
 
* [https://scholar.google.co.in/citations?user=1QlrvHkAAAAJ&hl=en Chandramouli Kamanchi], [https://scholar.google.co.in/citations?user=nx4NlpsAAAAJ&hl=en Raghuram Bharadwaj Diddigi], [[Shalabh Bhatnagar]] ('''2019'''). ''Second Order Value Iteration in Reinforcement Learning''. [https://arxiv.org/abs/1905.03927 arXiv:1905.03927]
 
* [[Marc Lanctot]], [[Edward Lockhart]], [[Jean-Baptiste Lespiau]], [[Vinicius Zambaldi]], [[Satyaki Upadhyay]], [[Julien Pérolat]], [[Sriram Srinivasan]], [[Finbarr Timbers]], [[Karl Tuyls]], [[Shayegan Omidshafiei]], [[Daniel Hennes]], [[Dustin Morrill]], [[Paul Muller]], [[Timo Ewalds]], [[Ryan Faulkner]], [[János Kramár]], [[Bart De Vylder]], [[Brennan Saeta]], [[James Bradbury]], [[David Ding]], [[Sebastian Borgeaud]], [[Matthew Lai]], [[Julian Schrittwieser]], [[Thomas Anthony]], [[Edward Hughes]], [[Ivo Danihelka]], [[Jonah Ryan-Davis]] ('''2019'''). ''OpenSpiel: A Framework for Reinforcement Learning in Games''. [https://arxiv.org/abs/1908.09453 arXiv:1908.09453] <ref>[https://github.com/deepmind/open_spiel/blob/master/docs/contributing.md open_spiel/contributing.md at master · deepmind/open_spiel · GitHub]</ref>
+
* [[Marc Lanctot]], [[Edward Lockhart]], [[Jean-Baptiste Lespiau]], [[Vinícius Flores Zambaldi]], [[Satyaki Upadhyay]], [[Julien Pérolat]], [[Sriram Srinivasan]], [[Finbarr Timbers]], [[Karl Tuyls]], [[Shayegan Omidshafiei]], [[Daniel Hennes]], [[Dustin Morrill]], [[Paul Muller]], [[Timo Ewalds]], [[Ryan Faulkner]], [[János Kramár]], [[Bart De Vylder]], [[Brennan Saeta]], [[James Bradbury]], [[David Ding]], [[Sebastian Borgeaud]], [[Matthew Lai]], [[Julian Schrittwieser]], [[Thomas Anthony]], [[Edward Hughes]], [[Ivo Danihelka]], [[Jonah Ryan-Davis]] ('''2019'''). ''OpenSpiel: A Framework for Reinforcement Learning in Games''. [https://arxiv.org/abs/1908.09453 arXiv:1908.09453] <ref>[https://github.com/deepmind/open_spiel/blob/master/docs/contributing.md open_spiel/contributing.md at master · deepmind/open_spiel · GitHub]</ref>
 
* [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Thomas Hubert]], [[Karen Simonyan]], [[Laurent Sifre]], [[Simon Schmitt]], [[Arthur Guez]], [[Edward Lockhart]], [[Demis Hassabis]], [[Thore Graepel]], [[Timothy Lillicrap]], [[David Silver]] ('''2019'''). ''Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model''. [https://arxiv.org/abs/1911.08265 arXiv:1911.08265] <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72381 New DeepMind paper] by GregNeto, [[CCC]], November 21, 2019</ref>
 
 
* [[Mathematician#SrbhBose|Sourabh Bose]] ('''2019'''). ''[https://rc.library.uta.edu/uta-ir/handle/10106/28094 Learning Representations Using Reinforcement Learning]''. Ph.D. thesis, [https://en.wikipedia.org/wiki/University_of_Texas_at_Arlington University of Texas at Arlington], advisor [[Mathematician#MHuber|Manfred Huber]] <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=72810&start=6 e: Board adaptive / tuning evaluation function - no NN/AI] by Tony P., [[CCC]], January 15, 2020</ref>
 
Line 226: Line 226:
 
: {{#evu:https://www.youtube.com/watch?v=O_1Z63EDMvQ|alignment=left|valignment=top}}
 
 
==OpenSpiel==
 
* [https://github.com/deepmind/open_spiel GitHub - deepmind/open_spiel: OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games] <ref>[[Marc Lanctot]], [[Edward Lockhart]], [[Jean-Baptiste Lespiau]], [[Vinicius Zambaldi]], [[Satyaki Upadhyay]], [[Julien Pérolat]], [[Sriram Srinivasan]], [[Finbarr Timbers]], [[Karl Tuyls]], [[Shayegan Omidshafiei]], [[Daniel Hennes]], [[Dustin Morrill]], [[Paul Muller]], [[Timo Ewalds]], [[Ryan Faulkner]], [[János Kramár]], [[Bart De Vylder]], [[Brennan Saeta]], [[James Bradbury]], [[David Ding]], [[Sebastian Borgeaud]], [[Matthew Lai]], [[Julian Schrittwieser]], [[Thomas Anthony]], [[Edward Hughes]], [[Ivo Danihelka]], [[Jonah Ryan-Davis]] ('''2019'''). ''OpenSpiel: A Framework for Reinforcement Learning in Games''. [https://arxiv.org/abs/1908.09453 arXiv:1908.09453]</ref>
+
* [https://github.com/deepmind/open_spiel GitHub - deepmind/open_spiel: OpenSpiel is a collection of environments and algorithms for research in general reinforcement learning and search/planning in games] <ref>[[Marc Lanctot]], [[Edward Lockhart]], [[Jean-Baptiste Lespiau]], [[Vinícius Flores Zambaldi]], [[Satyaki Upadhyay]], [[Julien Pérolat]], [[Sriram Srinivasan]], [[Finbarr Timbers]], [[Karl Tuyls]], [[Shayegan Omidshafiei]], [[Daniel Hennes]], [[Dustin Morrill]], [[Paul Muller]], [[Timo Ewalds]], [[Ryan Faulkner]], [[János Kramár]], [[Bart De Vylder]], [[Brennan Saeta]], [[James Bradbury]], [[David Ding]], [[Sebastian Borgeaud]], [[Matthew Lai]], [[Julian Schrittwieser]], [[Thomas Anthony]], [[Edward Hughes]], [[Ivo Danihelka]], [[Jonah Ryan-Davis]] ('''2019'''). ''OpenSpiel: A Framework for Reinforcement Learning in Games''. [https://arxiv.org/abs/1908.09453 arXiv:1908.09453]</ref>
 
** [https://github.com/deepmind/open_spiel/tree/master/open_spiel/algorithms open_spiel/open_spiel/algorithms at master · deepmind/open_spiel · GitHub]
 
 
*** [https://github.com/deepmind/open_spiel/tree/master/open_spiel/algorithms/alpha_zero open_spiel/open_spiel/algorithms/alpha_zero at master · deepmind/open_spiel · GitHub]
 

Latest revision as of 09:26, 17 April 2021


Reinforcement Learning,
a learning paradigm inspired by behaviourist psychology and classical conditioning: an agent learns by trial and error, interacting with an environment to map situations to actions in such a way that some notion of cumulative reward is maximized. In computer games, reinforcement learning adjusts feature weights based on game results, or on successive predictions of those results, during self-play.

Reinforcement learning builds on the idea of Markov decision processes (MDPs) from the field of optimal control, which are solved with dynamic programming techniques. It also faces the crucial exploitation versus exploration tradeoff known from multi-armed bandit problems and considered in UCT of Monte-Carlo Tree Search: the choice between "exploitation" of the machine that has the highest expected payoff and "exploration" to get more information about the expected payoffs of the other machines.
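The tradeoff can be made concrete with a small bandit simulation. The following is a minimal sketch, assuming the UCB1 selection rule (the rule that UCT applies at tree nodes) and Bernoulli payoffs. The function names, the payoff probabilities and the exploration constant are illustrative assumptions, not taken from any particular engine.

```python
import math
import random


def ucb1_select(counts, values, total_pulls, c=math.sqrt(2)):
    """Return the index of the arm with the highest UCB1 score."""
    # Play every arm once before trusting the statistics.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        # Exploitation term (mean payoff) plus exploration bonus,
        # which shrinks as the arm is pulled more often.
        values[arm] + c * math.sqrt(math.log(total_pulls) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(payoff_probs, steps=5000, seed=0):
    """Simulate a Bernoulli bandit (hypothetical payoff probabilities)."""
    rng = random.Random(seed)
    counts = [0] * len(payoff_probs)
    values = [0.0] * len(payoff_probs)  # running mean payoff per arm
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, values, t)
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values
```

Calling `run_bandit([0.2, 0.5, 0.8])` returns the pull counts and the estimated payoffs. Most pulls should go to the best machine, while the weaker ones are still sampled occasionally.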

Q-Learning

Q-Learning, introduced by Chris Watkins in 1989, is a simple way for agents to learn how to act optimally in controlled Markovian domains [2]. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. Q-learning converges to the optimal action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely [3]. Q-learning was successfully combined with deep learning by a Google DeepMind team to play some Atari 2600 games, as published in Nature in 2015. The approach was dubbed deep reinforcement learning or deep Q-networks [4], and was soon followed by the spectacular AlphaGo and AlphaZero breakthroughs.
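The incremental improvement of action-values can be sketched with a table and the standard one-step Q-learning update. This is a minimal illustration, assuming a hypothetical five-state corridor as the environment and an epsilon-greedy behaviour policy. The learning rate `alpha`, the discount factor `gamma` and `epsilon` are arbitrary example values.

```python
import random

N_STATES = 5      # states 0..4; state 4 is terminal
ACTIONS = (0, 1)  # 0 = left, 1 = right


def step(state, action):
    """Hypothetical toy environment: a deterministic corridor."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy keeps sampling every action in every state.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice(
                    [a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Target: reward plus discounted best next action-value.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q
```

In this toy corridor the learned table should prefer moving right in every non-terminal state, with values decaying by `gamma` for each step of distance from the goal.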

Temporal Difference Learning

see main page Temporal Difference Learning

In its simplest form, Q-learning stores its action-values in a table. This quickly becomes impractical as the state/action space of the system being monitored or controlled grows. One solution to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Gerald Tesauro in his temporal difference learning research on Backgammon [5] [6].

Temporal Difference Learning is a prediction method primarily used for reinforcement learning. In the domain of computer games and computer chess, TD learning is applied through self-play: the program predicts the probability of winning at each position along the sequence of moves from the initial position until the end of the game, and adjusts its weights so that successive predictions become more reliable.
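The weight adjustment can be illustrated with TD(0) on a toy problem. The sketch below assumes a symmetric random walk over five positions in place of real self-play games, a linear predictor and a hypothetical one-hot feature vector. A chess program would use evaluation features and positions from actual games instead.

```python
import random

N = 5  # non-terminal positions 1..N; 0 is a loss, N + 1 is a win


def features(state):
    """Hypothetical one-hot feature vector for a non-terminal state."""
    return [1.0 if i == state - 1 else 0.0 for i in range(N)]


def predict(weights, state):
    """Predicted probability of winning from `state`."""
    if state == 0:
        return 0.0      # terminal: game lost
    if state == N + 1:
        return 1.0      # terminal: game won
    return sum(w * f for w, f in zip(weights, features(state)))


def td0_random_walk(episodes=5000, alpha=0.01, seed=0):
    rng = random.Random(seed)
    weights = [0.5] * N
    for _ in range(episodes):
        state = (N + 1) // 2  # every game starts in the middle
        while 0 < state < N + 1:
            next_state = state + rng.choice((-1, 1))
            # Temporal difference: next prediction minus current one.
            delta = predict(weights, next_state) - predict(weights, state)
            # For a linear predictor the gradient is the feature vector.
            for i, f in enumerate(features(state)):
                weights[i] += alpha * delta * f
            state = next_state
    return weights
```

For this walk the true winning probability from position `s` is `s / 6`, so the learned weights should end up close to 1/6, 2/6, ..., 5/6, up to the noise caused by the constant step size.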

See also

UCT

Selected Publications

1954 ...

1960 ...

1970 ...

1980 ...

1990 ...

1995 ...

2000 ...

2005 ...

2010 ...

2011

2012

István Szita (2012). Reinforcement Learning in Games. Chapter 17

2013

2014

2015 ...

2016

2017

2018

2019

2020 ...

Postings

1995 ...

Re: Parameter Tuning by Don Beal, CCC, October 02, 1998

2000 ...

2010 ...

2020 ...

External Links

Reinforcement Learning

MDP

Q-Learning

Courses

  1. Lecture 1: Introduction to Reinforcement Learning
  2. Lecture 2: Markov Decision Process
  3. Lecture 3: Planning by Dynamic Programming
  4. Lecture 4: Model-Free Prediction
  5. Lecture 5: Model-Free Control
  6. Lecture 6: Value Function Approximation
  7. Lecture 7: Policy Gradient Methods
  8. Lecture 8: Integrating Learning and Planning
  9. Lecture 9: Exploration and Exploitation
  10. Lecture 10: Classic Games

OpenSpiel

References

  1. Example of a simple Markov decision process with three states (green circles) and two actions (orange circles), with two rewards (orange arrows), image by waldoalvarez, CC BY-SA 4.0, Wikimedia Commons
  2. Q-learning from Wikipedia
  3. Chris Watkins, Peter Dayan (1992). Q-learning. Machine Learning, Vol. 8, No. 2
  4. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, Demis Hassabis (2015). Human-level control through deep reinforcement learning. Nature, Vol. 518
  5. Q-learning from Wikipedia
  6. Gerald Tesauro (1995). Temporal Difference Learning and TD-Gammon. Communications of the ACM, Vol. 38, No. 3
  7. University of Bristol - Department of Computer Science - Technical Reports
  8. Ms. Pac-Man from Wikipedia
  9. Demystifying Deep Reinforcement Learning by Tambet Matiisen, Nervana, December 22, 2015
  10. Patent US20150100530 - Methods and apparatus for reinforcement learning - Google Patents
  11. DeepChess: Another deep-learning based chess program by Matthew Lai, CCC, October 17, 2016
  12. ICANN 2016 | Recipients of the best paper awards
  13. AlphaGo Zero: Learning from scratch by Demis Hassabis and David Silver, DeepMind, October 18, 2017
  14. AlphaZero: Shedding new light on the grand games of chess, shogi and Go by David Silver, Thomas Hubert, Julian Schrittwieser and Demis Hassabis, DeepMind, December 03, 2018
  15. open_spiel/contributing.md at master · deepmind/open_spiel · GitHub
  16. New DeepMind paper by GregNeto, CCC, November 21, 2019
  17. e: Board adaptive / tuning evaluation function - no NN/AI by Tony P., CCC, January 15, 2020
  18. MuZero: Mastering Go, chess, shogi and Atari without rules
  19. Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinícius Flores Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, Daniel Hennes, Dustin Morrill, Paul Muller, Timo Ewalds, Ryan Faulkner, János Kramár, Bart De Vylder, Brennan Saeta, James Bradbury, David Ding, Sebastian Borgeaud, Matthew Lai, Julian Schrittwieser, Thomas Anthony, Edward Hughes, Ivo Danihelka, Jonah Ryan-Davis (2019). OpenSpiel: A Framework for Reinforcement Learning in Games. arXiv:1908.09453
