Reinforcement Learning

35,090 bytes added, 09:21, 19 May 2018

'''[[Main Page|Home]] * [[Learning]] * Reinforcement Learning'''

[[FILE:Markov Decision Process.svg|border|right|thumb|[https://en.wikipedia.org/wiki/Markov_decision_process Markov decision processes] <ref>Example of a simple [https://en.wikipedia.org/wiki/Markov_decision_process Markov decision process] with three states (green circles) and two actions (orange circles), with two rewards (orange arrows), [https://commons.wikimedia.org/wiki/File:Markov_Decision_Process.svg image] by [https://commons.wikimedia.org/wiki/User:Waldoalvarez waldoalvarez] [https://creativecommons.org/licenses/by-sa/4.0/deed.en CC BY-SA 4.0], [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]

'''Reinforcement Learning''',<br/>
a learning paradigm inspired by [https://en.wikipedia.org/wiki/Behaviorism behaviourist] psychology and [https://en.wikipedia.org/wiki/Classical_conditioning classical conditioning]: an agent learns by [[Trial and Error|trial and error]], interacting with an environment and mapping situations to [https://en.wikipedia.org/wiki/Action_selection actions] in such a way that some notion of cumulative [https://en.wikipedia.org/wiki/Reward_system reward] is maximized. In computer games, reinforcement learning typically adjusts feature weights based on game results, or on successive predictions of those results, during self play.

Reinforcement learning builds on the idea of [https://en.wikipedia.org/wiki/Markov_decision_process Markov decision processes] (MDPs) from the field of [https://en.wikipedia.org/wiki/Optimal_control optimal control], which are solved with [[Dynamic Programming|dynamic programming]] techniques. It also faces the crucial [https://en.wikipedia.org/wiki/Monte_Carlo_tree_search#Exploration_and_exploitation exploitation and exploration] tradeoff known from [https://en.wikipedia.org/wiki/Multi-armed_bandit multi-armed bandit] problems and considered in [[UCT]] of [[Monte-Carlo Tree Search]]: the choice between "[https://en.wikipedia.org/wiki/Exploitation exploitation]" of the machine that has the highest expected payoff and "[https://en.wikipedia.org/wiki/Exploration exploration]" to get more information about the expected payoffs of the other machines.
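The exploitation and exploration tradeoff can be illustrated with the UCB1 rule, on which [[UCT]] is based. The following is only a minimal sketch under stated assumptions: the function names <code>ucb1_select</code> and <code>run_bandit</code>, the Bernoulli payoff machines, and the exploration constant <code>c</code> are hypothetical choices for illustration and are not taken from any particular program.

```python
import math
import random

def ucb1_select(counts, values, c=math.sqrt(2)):
    """Return the index of the machine to play next under the UCB1 rule."""
    # Exploration first: every machine is tried once before bounds are compared.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    total = sum(counts)
    # Mean payoff (exploitation) plus a bonus that shrinks with more plays (exploration).
    scores = [values[a] + c * math.sqrt(math.log(total) / counts[a])
              for a in range(len(counts))]
    return max(range(len(counts)), key=scores.__getitem__)

def run_bandit(payoff_probs, steps=2000, seed=0):
    """Play hypothetical Bernoulli machines; return play counts and mean payoffs."""
    rng = random.Random(seed)
    counts = [0] * len(payoff_probs)
    values = [0.0] * len(payoff_probs)
    for _ in range(steps):
        arm = ucb1_select(counts, values)
        reward = 1.0 if rng.random() < payoff_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean payoff of this machine.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values

if __name__ == "__main__":
    counts, values = run_bandit([0.2, 0.5, 0.8])
    print(counts, [round(v, 2) for v in values])
```

In this sketch the machine with the highest payoff probability ends up being played most often, while the others are still sampled occasionally so that their estimates stay informed.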

=Q-Learning=
Q-Learning, introduced by [[Chris Watkins]] in 1989, is a simple way for [https://en.wikipedia.org/wiki/Intelligent_agent agents] to learn how to act optimally in controlled Markovian domains <ref>[https://en.wikipedia.org/wiki/Q-learning Q-learning from Wikipedia]</ref>. It amounts to an incremental method for dynamic programming which imposes limited computational demands. It works by successively improving its evaluations of the quality of particular actions at particular states. Q-learning converges to the optimum action-values with probability 1 so long as all actions are repeatedly sampled in all states and the action-values are represented discretely <ref>[[Chris Watkins]], [[Peter Dayan]] ('''1992'''). ''[http://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html Q-learning]''. [https://en.wikipedia.org/wiki/Machine_Learning_(journal) Machine Learning], Vol. 8, No. 2</ref>. Q-learning has been successfully combined with [[Deep Learning|deep learning]] by a [[Google]] [[DeepMind]] team to play some [[Atari 8-bit|Atari 2600]] [https://en.wikipedia.org/wiki/List_of_Atari_2600_games games], as published in [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature] in 2015 and dubbed ''deep reinforcement learning'' or ''deep Q-networks'' <ref>[[Volodymyr Mnih]], [[Koray Kavukcuoglu]], [[David Silver]], [[Andrei A. Rusu]], [[Joel Veness]], [[Marc G. Bellemare]], [[Alex Graves]], [[Martin Riedmiller]], [[Andreas K. Fidjeland]], [[Georg Ostrovski]], [[Stig Petersen]], [[Charles Beattie]], [[Amir Sadik]], [[Ioannis Antonoglou]], [[Helen King]], [[Dharshan Kumaran]], [[Daan Wierstra]], [[Shane Legg]], [[Demis Hassabis]] ('''2015'''). ''[http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html Human-level control through deep reinforcement learning]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 518</ref>, soon followed by the spectacular [[AlphaGo]] and [[AlphaZero]] breakthroughs.
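The incremental improvement of action-values can be sketched with tabular Q-learning on a toy problem. Everything specific here is an assumption made for illustration: the five-state chain environment, the function names <code>step</code> and <code>q_learning</code>, and the parameter values (learning rate <code>alpha</code>, discount <code>gamma</code>, exploration rate <code>epsilon</code>) are hypothetical and do not come from the cited papers.

```python
import random

N_STATES = 5          # hypothetical chain of states 0..4; state 4 is terminal
ACTIONS = (-1, +1)    # action 0 moves left, action 1 moves right

def step(state, action_index):
    """Apply an action; the only reward (1.0) is given for reaching the last state."""
    nxt = min(max(state + ACTIONS[action_index], 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn a table q[state][action] of action-values by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy selection; ties are broken randomly.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            nxt, reward, done = step(state, a)
            # Target: immediate reward plus discounted best value of the successor.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])
            state = nxt
    return q

if __name__ == "__main__":
    for s, row in enumerate(q_learning()):
        print(s, [round(v, 3) for v in row])
```

Because every action keeps being sampled in every state through the epsilon-greedy choice, the table in this sketch settles on preferring the move towards the rewarded end of the chain.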

=Temporal Difference Learning=
''see main page [[Temporal Difference Learning]]''

Q-learning at its simplest stores its action-values in a table. This quickly becomes infeasible as the state and action space of the system being monitored or controlled grows. One solution to this problem is to use an (adapted) [[Neural Networks|artificial neural network]] as a function approximator, as demonstrated by [[Gerald Tesauro]] in his temporal difference learning research on [[Backgammon]] <ref>[https://en.wikipedia.org/wiki/Q-learning Q-learning from Wikipedia]</ref> <ref>[[Gerald Tesauro]] ('''1995'''). ''Temporal Difference Learning and TD-Gammon''. [[ACM#Communications|Communications of the ACM]], Vol. 38, No. 3</ref>.

Temporal Difference Learning is a prediction method primarily used for reinforcement learning. In the domain of computer games and computer chess, TD learning is applied through self play: the program successively predicts the [https://en.wikipedia.org/wiki/Probability probability] of winning a [[Chess Game|game]] over the sequence of [[Moves|moves]] from the [[Initial Position|initial position]] until the end, and adjusts its weights so that the predictions become more reliable.
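The weight adjustment can be sketched as a semi-gradient TD(0) update with a simple function approximator. This is a minimal sketch under stated assumptions, not the method of any particular engine: the logistic evaluation over hand-made feature vectors, the function names <code>win_probability</code> and <code>td0_update</code>, and the learning rate <code>alpha</code> are all hypothetical.

```python
import math

def win_probability(weights, features):
    """Map a linear evaluation of the feature vector to a winning probability."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))

def td0_update(weights, positions, result, alpha=0.1):
    """Return weights adjusted after one game.

    positions: feature vectors of the successive positions of one game.
    result: 1.0 for a win and 0.0 for a loss, from the learner's point of view.
    """
    new_w = list(weights)
    for t, feats in enumerate(positions):
        p = win_probability(new_w, feats)
        if t + 1 < len(positions):
            # Inside the game, the next prediction serves as the target.
            target = win_probability(new_w, positions[t + 1])
        else:
            # At the end of the game, the actual result is the target.
            target = result
        delta = target - p          # temporal difference error
        slope = p * (1.0 - p)       # derivative of the logistic function
        for i, f in enumerate(feats):
            new_w[i] += alpha * delta * slope * f
    return new_w

if __name__ == "__main__":
    w = [0.0, 0.0]
    game = [[1.0, 0.0], [2.0, 1.0]]   # hypothetical features of two positions
    for _ in range(200):
        w = td0_update(w, game, result=1.0)
    print([round(x, 3) for x in w], round(win_probability(w, game[-1]), 3))
```

Repeating the update over this one won game raises the predicted winning probability of both positions, the earlier one being pulled towards the later prediction rather than directly towards the result.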

=See also=
* [[AlphaZero]]
* [[Deep Learning]]
* [[Dynamic Programming]]
* [[Michael L. Littman#MarkovModels|Markov Models]] by [[Michael L. Littman]]
* [[Donald Michie#MENACE|MENACE]] by [[Donald Michie]]
* [[Monte-Carlo Tree Search]]
: [[UCT]]
* [[Neural Networks]]
* [[Planning]]
* [[Temporal Difference Learning]]
* [[Trial and Error]]

=Selected Publications=
==1954 ...==
* [[Richard E. Bellman]] ('''1954'''). ''On a new Iterative Algorithm for Finding the Solutions of Games and Linear Programming Problems''. Technical Report P-473, [https://en.wikipedia.org/wiki/RAND_Corporation RAND Corporation], U. S. Air Force Project RAND
* [[Arthur Samuel]] ('''1959'''). ''[http://domino.watson.ibm.com/tchjr/journalindex.nsf/600cc5649e2871db852568150060213c/39a870213169f45685256bfa00683d74!OpenDocument Some Studies in Machine Learning Using the Game of Checkers]''. IBM Journal July 1959
==1960 ...==
* [[Richard E. Bellman]] ('''1960'''). ''[http://dl.acm.org/citation.cfm?id=321011 Sequential Machines, Ambiguity, and Dynamic Programming]''. [[ACM#Journal|Journal of the ACM]], Vol. 7, No. 1
* [[Mathematician#RAHoward|Ronald A. Howard]] ('''1960'''). ''Dynamic Programming and Markov Processes''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.amazon.com/Programming-Processes-Technology-Research-Monographs/dp/0262080095 amazon]
* [[Donald Michie]] ('''1961'''). ''Trial and Error''. Penguin Science Survey
* [[Donald Michie]], Roger A. Chambers ('''1968'''). ''Boxes: An experiment on adaptive control''. [http://www.doc.ic.ac.uk/%7Eshm/MI/mi2.html Machine Intelligence 2], Edinburgh: Oliver & Boyd, [http://aitopics.org/sites/default/files/classic/Machine_Intelligence_2/MI2-Ch9-MichieChambers.pdf pdf]
==1970 ...==
* [[A. Harry Klopf]] ('''1972'''). ''Brain Function and Adaptive Systems - A Heterostatic Theory''. [https://en.wikipedia.org/wiki/Air_Force_Cambridge_Research_Laboratories Air Force Cambridge Research Laboratories], Special Reports, No. 133, [http://www.dtic.mil/dtic/tr/fulltext/u2/742259.pdf pdf]
* [[Mathematician#Holland|John H. Holland]] ('''1975'''). ''Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control, and Artificial Intelligence''. [http://www.amazon.com/Adaptation-Natural-Artificial-Systems-Introductory/dp/0262581116 amazon.com]
==1980 ...==
* [[Richard Sutton]] ('''1984'''). ''[http://scholarworks.umass.edu/dissertations/AAI8410337/ Temporal Credit Assignment in Reinforcement Learning]''. Ph.D. dissertation, [https://en.wikipedia.org/wiki/University_of_Massachusetts University of Massachusetts]
* [[Mathematician#LValiant|Leslie Valiant]] ('''1984'''). ''A Theory of the Learnable''. [[ACM#Communications|Communications of the ACM]], Vol. 27, No. 11, [http://web.mit.edu/6.435/www/Valiant84.pdf pdf]
* [[Chris Watkins]] ('''1989'''). ''Learning from Delayed Rewards''. Ph.D. thesis, [https://en.wikipedia.org/wiki/University_of_Cambridge Cambridge University], [http://www.cs.rhul.ac.uk/~chrisw/new_thesis.pdf pdf]
==1990 ...==
* [[Richard Sutton]], [[Andrew Barto]] ('''1990'''). ''Time Derivative Models of Pavlovian Reinforcement''. Learning and Computational Neuroscience: Foundations of Adaptive Networks: 497-537
* [[Chris Watkins]], [[Peter Dayan]] ('''1992'''). ''[http://www.gatsby.ucl.ac.uk/~dayan/papers/wd92.html Q-learning]''. [https://en.wikipedia.org/wiki/Machine_Learning_(journal) Machine Learning], Vol. 8, No. 2
* [[Gerald Tesauro]] ('''1992'''). ''Temporal Difference Learning of Backgammon Strategy''. [http://www.informatik.uni-trier.de/~ley/db/conf/icml/ml1992.html#Tesauro92 ML 1992]
* [[Justin A. Boyan]], [[Michael L. Littman]] ('''1993'''). ''Packet Routing in Dynamically Changing Networks: A Reinforcement Learning Approach''. [https://papers.nips.cc/book/advances-in-neural-information-processing-systems-6-1993 NIPS 1993], [https://www.cs.cmu.edu/~jab/cv/pubs/boyan.q-routing.pdf pdf]
* [[Michael L. Littman]] ('''1994'''). ''Markov Games as a Framework for Multi-Agent Reinforcement Learning''. International Conference on Machine Learning, [http://www.cs.duke.edu/courses/spring07/cps296.3/littman94markov.pdf pdf]
==1995 ...==
* [[Marco Wiering]] ('''1995'''). ''[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xVas0I8AAAAJ&cstart=20&citation_for_view=xVas0I8AAAAJ:roLk4NBRz8UC TD Learning of Game Evaluation Functions with Hierarchical Neural Architectures]''. Master's thesis, [https://en.wikipedia.org/wiki/University_of_Amsterdam University of Amsterdam], [http://webber.physik.uni-freiburg.de/~hon/vorlss02/Literatur/reinforcement/GameEvaluationWithNeuronal.pdf pdf]
* [[Gerald Tesauro]] ('''1995'''). ''Temporal Difference Learning and TD-Gammon''. [[ACM#Communications|Communications of the ACM]], Vol. 38, No. 3
* [http://dblp.uni-trier.de/pers/hd/b/Baird_III:Leemon_C= Leemon C. Baird III], [http://dblp.uni-trier.de/pers/hd/h/Harmon:Mance_E= Mance E. Harmon], [[A. Harry Klopf]] ('''1996'''). ''Reinforcement Learning: An Alternative Approach to Machine Intelligence''. [http://www.leemon.com/papers/1996bhk.pdf pdf]
* [[Mathematician#LPKaelbling|Leslie Pack Kaelbling]], [[Michael L. Littman]], [[Mathematician#AWMoore|Andrew W. Moore]] ('''1996'''). ''[http://www.cs.washington.edu/research/jair/volume4/kaelbling96a-html/rl-survey.html Reinforcement Learning: A Survey]''. [http://www.jair.org/vol/vol4.html JAIR Vol. 4], [http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a.pdf pdf]
* [[Robert Levinson]] ('''1996'''). ''[http://onlinelibrary.wiley.com/doi/10.1111/j.1467-8640.1996.tb00257.x/abstract General Game-Playing and Reinforcement Learning]''. [http://dblp.uni-trier.de/db/journals/ci/ci12.html#PellEL96 Computational Intelligence, Vol. 12], No. 1
* [[Ronald Parr]], [[Stuart Russell]] ('''1997'''). ''Reinforcement Learning with Hierarchies of Machines.'' In Advances in Neural Information Processing Systems 10, MIT Press, [http://www.cs.berkeley.edu/~russell/papers/nips97-ham.ps.gz zipped ps]
* [[William Uther]], [[Manuela Veloso|Manuela M. Veloso]] ('''1997'''). ''Adversarial Reinforcement Learning''. [[Carnegie Mellon University]], [http://www.cse.unsw.edu.au/~willu/w/papers/Uther97a.ps ps]
* [[William Uther]], [[Manuela Veloso|Manuela M. Veloso]] ('''1997'''). ''Generalizing Adversarial Reinforcement Learning''. [[Carnegie Mellon University]], [http://www.cse.unsw.edu.au/~willu/w/papers/Uther97b.ps ps]
* [[Marco Wiering]], [[Jürgen Schmidhuber]] ('''1997'''). ''[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xVas0I8AAAAJ&citation_for_view=xVas0I8AAAAJ:u5HHmVD_uO8C HQ-learning]''. [https://en.wikipedia.org/wiki/Adaptive_Behavior_%28journal%29 Adaptive Behavior], Vol. 6, No. 2
* [[Csaba Szepesvári]] ('''1998'''). ''Reinforcement Learning: Theory and Practice''. Proceedings of the 2nd Slovak Conference on Artificial Neural Networks, [http://www.sztaki.hu/%7Eszcsaba/papers/scann98.ps.gz zipped ps]
* [[Richard Sutton]], [[Andrew Barto]] ('''1998'''). ''[https://mitpress.mit.edu/books/reinforcement-learning Reinforcement Learning: An Introduction]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press]
* [http://www.ilsp.gr/homepages/papavasiliou_eng.html Vassilis Papavassiliou], [[Stuart Russell]] ('''1999'''). ''Convergence of reinforcement learning with general function approximators.'' In Proc. IJCAI-99, Stockholm, [http://www.cs.berkeley.edu/~russell/papers/ijcai99-bridge.ps ps]
* [[Marco Wiering]] ('''1999'''). ''[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xVas0I8AAAAJ&pagesize=100&citation_for_view=xVas0I8AAAAJ:9yKSN-GCB0IC Explorations in Efficient Reinforcement Learning]''. Ph.D. thesis, [https://en.wikipedia.org/wiki/University_of_Amsterdam University of Amsterdam], advisors [[Mathematician#FGroen|Frans Groen]] and [[Jürgen Schmidhuber]]
==2000 ...==
* [[Sebastian Thrun]], [[Michael L. Littman]] ('''2000'''). ''A Review of Reinforcement Learning''. [http://www.informatik.uni-trier.de/~ley/db/journals/aim/aim21.html#ThrunL00 AI Magazine, Vol. 21], No. 1
* [[Robert Levinson]], [[Ryan Weber]] ('''2000'''). ''[http://link.springer.com/chapter/10.1007/3-540-45579-5_9 Chess Neighborhoods, Function Combination, and Reinforcement Learning]''. [[CG 2000]]
* [[Andrew Ng]], [[Stuart Russell]] ('''2000'''). ''Algorithms for inverse reinforcement learning.'' In Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, California: Morgan Kaufmann, [http://www.cs.berkeley.edu/~russell/papers/ml00-irl.pdf pdf]
* [http://www.cs.ou.edu/~hougen/ Dean F. Hougen], [http://www-users.cs.umn.edu/~gini/ Maria Gini], [[James R. Slagle]] ('''2000'''). ''[http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.23.2633 An Integrated Connectionist Approach to Reinforcement Learning for Robotic Control]''. ICML '00 Proceedings of the Seventeenth International Conference on Machine Learning
* [[Jonathan Baxter]], [[Peter Bartlett]] ('''2000'''). ''Reinforcement Learning on POMDPs via Direct Gradient Ascent''. [http://dblp.uni-trier.de/db/conf/icml/icml2000.html ICML 2000], [https://pdfs.semanticscholar.org/b874/98f0879d312c308889135203b17069aa0486.pdf pdf]
* [[Doina Precup]] ('''2000'''). ''Temporal Abstraction in Reinforcement Learning''. Ph.D. Dissertation, Department of Computer Science, [https://en.wikipedia.org/wiki/University_of_Massachusetts_Amherst University of Massachusetts], [https://en.wikipedia.org/wiki/Amherst,_Massachusetts Amherst].
* [[Robert Levinson]], [[Ryan Weber]] ('''2001'''). ''Chess Neighborhoods, Function Combination, and Reinforcement Learning''. In Computers and Games (eds. [[Tony Marsland]] and I. Frank). [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Springer, [http://users.soe.ucsc.edu/~levinson/Papers/CNFCRL.pdf pdf]
* [[Marco Block-Berlitz]] ('''2003'''). ''Reinforcement Learning in der Schachprogrammierung''. Studienarbeit, Freie Universität Berlin, Dozent: [[Raúl Rojas|Prof. Dr. Raúl Rojas]], [http://page.mi.fu-berlin.de/block/Skripte/Reinforcement.pdf pdf] (German)
* [[Henk Mannen]] ('''2003'''). ''Learning to play chess using reinforcement learning with database games''. Master’s thesis, [http://students.uu.nl/en/hum/cognitive-artificial-intelligence Cognitive Artificial Intelligence], [https://en.wikipedia.org/wiki/Utrecht_University Utrecht University]
* [[Joelle Pineau]], [[Geoffrey Gordon]], [[Sebastian Thrun]] ('''2003'''). ''Point-based value iteration: An anytime algorithm for POMDPs''. [[Conferences#IJCAI2003|IJCAI]], [http://www.fore.robot.cc/papers/Pineau03a.pdf pdf]
* [[Yngvi Björnsson]], Vignir Hafsteinsson, Ársæll Jóhannsson, Einar Jónsson ('''2004'''). ''Efficient Use of Reinforcement Learning in a Computer Game''. In Computer Games: Artificial Intelligence, Design and Education (CGAIDE'04), pp. 379–383, 2004. [http://www.ru.is/faculty/yngvi/pdf/BjornssonHJJ04.pdf pdf]
* [http://imranontech.com/ Imran Ghory] ('''2004'''). ''Reinforcement learning in board games''. CSTR-04-004, [http://www.cs.bris.ac.uk/ Department of Computer Science], [https://en.wikipedia.org/wiki/University_of_Bristol University of Bristol]. [http://www.cs.bris.ac.uk/Publications/Papers/2000100.pdf pdf] <ref>[http://www.cs.bris.ac.uk/Publications/pub_master.jsp?type=117 University of Bristol - Department of Computer Science - Technical Reports]</ref>
* [[Eric Wiewiora]] ('''2004'''). ''Efficient Exploration for Reinforcement Learning''. MSc thesis, [http://cseweb.ucsd.edu/%7Eewiewior/04efficient.pdf pdf]
* [[Albert Xin Jiang]] ('''2004'''). ''Multiagent Reinforcement Learning in Stochastic Games with Continuous Action Spaces''. [http://www.cs.ubc.ca/%7Ejiang/papers/continuous.pdf pdf]
==2005 ...==
* [[Sylvain Gelly]], [[Jérémie Mary]], [[Olivier Teytaud]] ('''2006'''). ''Learning for stochastic dynamic programming''. [http://www.lri.fr/%7Egelly/paper/lfordp.pdf pdf], [http://www.grappa.univ-lille3.fr/~mary/paper/lfordp.pdf pdf]
* [[Sylvain Gelly]] ('''2007'''). ''A Contribution to Reinforcement Learning; Application to Computer Go.'' Ph.D. thesis, [http://www.lri.fr/~gelly/paper/SylvainGellyThesis.pdf pdf]
* [http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/d/Duan:Yong.html Yong Duan], [http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/c/Cui:Baoxia.html Baoxia Cui], [[Xinhe Xu]] ('''2007'''). ''State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network''. [http://www.informatik.uni-trier.de/~ley/db/conf/isnn/isnn2007-2.html#DuanCX07 ISNN 2007]
* [[Yasuhiro Osaki]], [[Kazutomo Shibahara]], [[Yasuhiro Tajima]], [[Yoshiyuki Kotani]] ('''2007'''). ''Reinforcement Learning of Evaluation Functions Using Temporal Difference-Monte Carlo learning method''. [[Conferences#GPW|12th Game Programming Workshop]]
* [[Marco Block-Berlitz|Marco Block]], Maro Bader, [http://page.mi.fu-berlin.de/tapia/ Ernesto Tapia], Marte Ramírez, Ketill Gunnarsson, Erik Cuevas, Daniel Zaldivar, [[Raúl Rojas]] ('''2008'''). ''Using Reinforcement Learning in Chess Engines''. CONCIBE SCIENCE 2008, [http://www.micai.org/rcs/ Research in Computing Science]: Special Issue in Electronics and Biomedical Engineering, Computer Science and Informatics, ISSN:1870-4069, Vol. 35, pp. 31-40, [https://en.wikipedia.org/wiki/Guadalajara Guadalajara], Mexico, [http://page.mi.fu-berlin.de/block/concibe2008.pdf pdf]
* [[Cécile Germain-Renaud]], [[Julien Pérez]], [[Balázs Kégl]], [[Charles Loomis]] ('''2008'''). ''Grid Differentiated Services: a Reinforcement Learning Approach''. In 8th [[IEEE]] Symposium on Cluster Computing and the Grid. Lyon, [http://hal.inria.fr/docs/00/28/78/26/PDF/RLccg08.pdf pdf]
* [[David Silver]] ('''2009'''). ''Reinforcement Learning and Simulation-Based Search''. Ph.D. thesis, [[University of Alberta]]. [http://webdocs.cs.ualberta.ca/~silver/David_Silver/Publications_files/thesis.pdf pdf]
==2010 ...==
* [[Joel Veness]], [[Kee Siong Ng]], [[Marcus Hutter]], [[David Silver]] ('''2010'''). ''Reinforcement Learning via AIXI Approximation''. Association for the Advancement of Artificial Intelligence (AAAI), [http://jveness.info/publications/veness_rl_via_aixi_approx.pdf pdf]
* [[Julien Pérez]], [[Cécile Germain-Renaud]], [[Balázs Kégl]], [[Charles Loomis]] ('''2010'''). ''Multi-objective Reinforcement Learning for Responsive Grids''. In The Journal of Grid Computing. [http://hal.archives-ouvertes.fr/docs/00/49/15/60/PDF/RLGrid_JGC09_V7.pdf pdf]
'''2011'''
* [[Peter Auer]] ('''2011'''). ''Exploration and Exploitation in Online Learning''. [http://dblp.uni-trier.de/db/conf/icais/icais2011.html#Auer11 ICAIS 2011]
* [[Charles Elkan]] ('''2011'''). ''Reinforcement Learning with a Bilinear Q Function''. [http://www.informatik.uni-trier.de/~ley/db/conf/ewrl/ewrl2011.html#Elkan11 EWRL 2011]
'''2012'''
* [[Marco Wiering]], [http://martijnvanotterlo.nl/ Martijn Van Otterlo] ('''2012'''). ''[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xVas0I8AAAAJ&citation_for_view=xVas0I8AAAAJ:abG-DnoFyZgC Reinforcement learning: State-of-the-art]''. [http://link.springer.com/book/10.1007/978-3-642-27645-3 Adaptation, Learning, and Optimization, Vol. 12], [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
: [[István Szita]] ('''2012'''). ''[http://link.springer.com/chapter/10.1007%2F978-3-642-27645-3_17 Reinforcement Learning in Games]''. Chapter 17
* [[Arthur Guez]], [[David Silver]], [[Peter Dayan]] ('''2012'''). ''Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search''. [http://papers.nips.cc/book/advances-in-neural-information-processing-systems-25-2012 NIPS 2012], [https://papers.nips.cc/paper/4767-efficient-bayes-adaptive-reinforcement-learning-using-sample-based-search.pdf pdf]
'''2013'''
* [[Arthur Guez]], [[David Silver]], [[Peter Dayan]] ('''2013'''). ''Scalable and Efficient Bayes-Adaptive Reinforcement Learning Based on Monte-Carlo Tree Search''. [https://en.wikipedia.org/wiki/Journal_of_Artificial_Intelligence_Research Journal of Artificial Intelligence Research], Vol. 48, [https://www.jair.org/media/4117/live-4117-7507-jair.pdf pdf]
* [http://dblp.uni-trier.de/pers/hd/r/Ree:M=_van_der Michiel van der Ree], [[Marco Wiering]] ('''2013'''). ''[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xVas0I8AAAAJ&cstart=60&pagesize=80&citation_for_view=xVas0I8AAAAJ:K3LRdlH-MEoC Reinforcement Learning in the Game of Othello: Learning Against a Fixed Opponent and Learning from Self-Play]''. [http://dblp.uni-trier.de/db/conf/adprl/adprl2013.html#ReeW13 ADPRL 2013]
* [http://dblp.uni-trier.de/pers/hd/b/Bom:Luuk Luuk Bom], [http://dblp.uni-trier.de/pers/hd/h/Henken:Ruud Ruud Henken], [[Marco Wiering]] ('''2013'''). ''[https://scholar.google.com/citations?view_op=view_citation&hl=en&user=xVas0I8AAAAJ&cstart=40&citation_for_view=xVas0I8AAAAJ:l7t_Zn2s7bgC Reinforcement Learning to Train Ms. Pac-Man Using Higher-order Action-relative Inputs]''. [http://dblp.uni-trier.de/db/conf/adprl/adprl2013.html#BomHW13 ADPRL 2013] <ref>[https://en.wikipedia.org/wiki/Ms._Pac-Man Ms. Pac-Man from Wikipedia]</ref>
* [[Peter Auer]], [[Marcus Hutter]], [[Laurent Orseau]] ('''2013'''). ''[http://drops.dagstuhl.de/opus/volltexte/2013/4340/ Reinforcement Learning]''. [http://dblp.uni-trier.de/db/journals/dagstuhl-reports/dagstuhl-reports3.html#AuerHO13 Dagstuhl Reports, Vol. 3, No. 8], DOI: [http://drops.dagstuhl.de/opus/volltexte/2013/4340/ 10.4230/DagRep.3.8.1], URN: [http://drops.dagstuhl.de/opus/volltexte/2013/4340/ urn:nbn:de:0030-drops-43409]
* [[Volodymyr Mnih]], [[Koray Kavukcuoglu]], [[David Silver]], [[Alex Graves]], [[Ioannis Antonoglou]], [[Daan Wierstra]], [[Martin Riedmiller]] ('''2013'''). ''Playing Atari with Deep Reinforcement Learning''. [http://arxiv.org/abs/1312.5602 arXiv:1312.5602] <ref>[http://www.nervanasys.com/demystifying-deep-reinforcement-learning/ Demystifying Deep Reinforcement Learning] by [http://www.nervanasys.com/author/tambet/ Tambet Matiisen], [http://www.nervanasys.com/ Nervana], December 22, 2015</ref> <ref>[http://www.google.com/patents/US20150100530 Patent US20150100530 - Methods and apparatus for reinforcement learning - Google Patents]</ref>
'''2014'''
* [[Marcin Szubert]] ('''2014'''). ''Coevolutionary Shaping for Reinforcement Learning''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology Poznań University of Technology], supervisor [[Krzysztof Krawiec]], co-supervisor [[Wojciech Jaśkowski]], [http://www.cs.put.poznan.pl/mszubert/pub/phdthesis.pdf pdf]
==2015 ...==
* [[Volodymyr Mnih]], [[Koray Kavukcuoglu]], [[David Silver]], [[Andrei A. Rusu]], [[Joel Veness]], [[Marc G. Bellemare]], [[Alex Graves]], [[Martin Riedmiller]], [[Andreas K. Fidjeland]], [[Georg Ostrovski]], [[Stig Petersen]], [[Charles Beattie]], [[Amir Sadik]], [[Ioannis Antonoglou]], [[Helen King]], [[Dharshan Kumaran]], [[Daan Wierstra]], [[Shane Legg]], [[Demis Hassabis]] ('''2015'''). ''[http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html Human-level control through deep reinforcement learning]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 518
* [[Tobias Graf]], [[Marco Platzner]] ('''2015'''). ''Adaptive Playouts in Monte Carlo Tree Search with Policy Gradient Reinforcement Learning''. [[Advances in Computer Games 14]]
* [[Arun Nair]], [[Praveen Srinivasan]], [[Sam Blackwell]], [[Cagdas Alcicek]], [[Rory Fearon]], [[Alessandro De Maria]], [[Veda Panneershelvam]], [[Mustafa Suleyman]], [[Charles Beattie]], [[Stig Petersen]], [[Shane Legg]], [[Volodymyr Mnih]], [[Koray Kavukcuoglu]], [[David Silver]] ('''2015'''). ''Massively Parallel Methods for Deep Reinforcement Learning''. [http://arxiv.org/abs/1507.04296 arXiv:1507.04296]
* [[Matthew Lai]] ('''2015'''). ''Giraffe: Using Deep Reinforcement Learning to Play Chess''. M.Sc. thesis, [https://en.wikipedia.org/wiki/Imperial_College_London Imperial College London], [http://arxiv.org/abs/1509.01549v1 arXiv:1509.01549v1] » [[Giraffe]]
* [[Hado van Hasselt]], [[Arthur Guez]], [[David Silver]] ('''2015'''). ''Deep Reinforcement Learning with Double Q-learning''. [http://arxiv.org/abs/1509.06461 arXiv:1509.06461]
'''2016'''
* [[Ziyu Wang]], [[Nando de Freitas]], [[Marc Lanctot]] ('''2016'''). ''Dueling Network Architectures for Deep Reinforcement Learning''. [http://arxiv.org/abs/1511.06581 arXiv:1511.06581]
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]
* [[Hung Guei]], [[Tinghan Wei]], [[Jin-Bo Huang]], [[I-Chen Wu]] ('''2016'''). ''An Empirical Study on Applying Deep Reinforcement Learning to the Game 2048''. [[CG 2016]]
* [[Omid David|Omid E. David]], [[Nathan S. Netanyahu]], [[Lior Wolf]] ('''2016'''). ''[http://link.springer.com/chapter/10.1007%2F978-3-319-44781-0_11 DeepChess: End-to-End Deep Neural Network for Automatic Learning in Chess]''. [http://icann2016.org/ ICANN 2016], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 9887, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer], [http://www.cs.tau.ac.il/~wolf/papers/deepchess.pdf pdf preprint] » [[DeepChess]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61748 DeepChess: Another deep-learning based chess program] by [[Matthew Lai]], [[CCC]], October 17, 2016</ref> <ref>[http://icann2016.org/index.php/conference-programme/recipients-of-the-best-paper-awards/ ICANN 2016 | Recipients of the best paper awards]</ref>
* [[Volodymyr Mnih]], [[Adrià Puigdomènech Badia]], [[Mehdi Mirza]], [[Alex Graves]], [[Timothy Lillicrap]], [[Tim Harley]], [[David Silver]], [[Koray Kavukcuoglu]] ('''2016'''). ''Asynchronous Methods for Deep Reinforcement Learning''. [https://arxiv.org/abs/1602.01783 arXiv:1602.01783v2]
* [[Shixiang Gu]], [[Ethan Holly]], [[Timothy Lillicrap]], [[Sergey Levine]] ('''2016'''). ''Deep Reinforcement Learning for Robotic Manipulation with Asynchronous Off-Policy Updates''. [https://arxiv.org/abs/1610.00633 arXiv:1610.00633]
* [[Max Jaderberg]], [[Volodymyr Mnih]], [[Wojciech Marian Czarnecki]], [[Tom Schaul]], [[Joel Z. Leibo]], [[David Silver]], [[Koray Kavukcuoglu]] ('''2016'''). ''Reinforcement Learning with Unsupervised Auxiliary Tasks''. [https://arxiv.org/abs/1611.05397v1 arXiv:1611.05397v1]
* [[Jane X Wang]], [[Zeb Kurth-Nelson]], [[Dhruva Tirumala]], [[Hubert Soyer]], [[Joel Z Leibo]], [[Rémi Munos]], [[Charles Blundell]], [[Dharshan Kumaran]], [[Matt Botvinick]] ('''2016'''). ''Learning to reinforcement learn''. [https://arxiv.org/abs/1611.05763 arXiv:1611.05763]
'''2017'''
* [[Hirotaka Kameko]], [[Jun Suzuki]], [[Naoki Mizukami]], [[Yoshimasa Tsuruoka]] ('''2017'''). ''Deep Reinforcement Learning with Hidden Layers on Future States''. [[Conferences#IJCA2017|Computer Games Workshop at IJCAI 2017]], [http://www.lamsade.dauphine.fr/~cazenave/cgw2017/Kameko.pdf pdf]
* [[Marc Lanctot]], [[Vinícius Flores Zambaldi]], [[Audrunas Gruslys]], [[Angeliki Lazaridou]], [[Karl Tuyls]], [[Julien Pérolat]], [[David Silver]], [[Thore Graepel]] ('''2017'''). ''A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning''. [https://arxiv.org/abs/1711.00832 arXiv:1711.00832]
* [[David Silver]], [[Julian Schrittwieser]], [[Karen Simonyan]], [[Ioannis Antonoglou]], [[Shih-Chieh Huang|Aja Huang]], [[Arthur Guez]], [[Thomas Hubert]], [[Lucas Baker]], [[Matthew Lai]], [[Adrian Bolton]], [[Yutian Chen]], [[Timothy Lillicrap]], [[Fan Hui]], [[Laurent Sifre]], [[George van den Driessche]], [[Thore Graepel]], [[Demis Hassabis]] ('''2017'''). ''[https://www.nature.com/nature/journal/v550/n7676/full/nature24270.html Mastering the game of Go without human knowledge]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 550, [https://www.gwern.net/docs/rl/2017-silver.pdf pdf] <ref>[https://deepmind.com/blog/alphago-zero-learning-scratch/ AlphaGo Zero: Learning from scratch] by [[Demis Hassabis]] and [[David Silver]], [[DeepMind]], October 18, 2017</ref>
* [http://www.peterhenderson.co/ Peter Henderson], [https://scholar.google.ca/citations?user=2_4Rs44AAAAJ&hl=en Riashat Islam], [[Philip Bachman]], [[Joelle Pineau]], [[Doina Precup]], [https://scholar.google.ca/citations?user=gFwEytkAAAAJ&hl=en David Meger] ('''2017'''). ''Deep Reinforcement Learning that Matters''. [https://arxiv.org/abs/1709.06560 arXiv:1709.06560]
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]

=Postings=
* [https://www.stmintz.com/ccc/index.php?id=28584 Parameter Tuning] by [[Jonathan Baxter]], [[CCC]], October 01, 1998 » [[KnightCap]]
* [https://www.stmintz.com/ccc/index.php?id=147377 any good experiences with genetic algos or temporal difference learning?] by [[Rafael B. Andrist]], [[CCC]], January 01, 2001
* [http://talkchess.com/forum/viewtopic.php?t=56913 *First release* Giraffe, a new engine based on deep learning] by [[Matthew Lai]], [[CCC]], July 08, 2015 » [[Deep Learning]], [[Giraffe]]
* [http://www.nervanasys.com/demystifying-deep-reinforcement-learning/ Demystifying Deep Reinforcement Learning] by [http://www.nervanasys.com/author/tambet/ Tambet Matiisen], [http://www.nervanasys.com/ Nervana], December 22, 2015
* [https://ai.intel.com/deep-reinforcement-learning-with-neon/ Deep Reinforcement Learning with Neon] by [http://www.nervanasys.com/author/tambet/ Tambet Matiisen], [http://www.nervanasys.com/ Nervana], December 29, 2015
* [http://www.talkchess.com/forum/viewtopic.php?t=65909 Google's AlphaGo team has been working on chess] by [[Peter Kappler]], [[CCC]], December 06, 2017 » [[AlphaZero]]
* [http://www.talkchess.com/forum/viewtopic.php?t=65990 Understanding the power of reinforcement learning] by [[Michael Sherwin]], [[CCC]], December 12, 2017

=External Links=
==Reinforcement Learning==
* [https://en.wikipedia.org/wiki/Reinforcement_learning Reinforcement learning from Wikipedia]
* [http://www.incompleteideas.net/sutton/book/ebook/the-book.html Reinforcement Learning: An Introduction] ebook by [[Richard Sutton]] and [[Andrew Barto]]
* [http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching_files/games.pdf Reinforcement Learning in Classic Board Games] (pdf) by [[David Silver]]
* [http://www.scholarpedia.org/article/Category:Reinforcement_Learning Category: Reinforcement Learning - Scholarpedia]
* [http://www.scholarpedia.org/article/Reinforcement_learning Reinforcement Learning - Scholarpedia]
* [http://spaces.facsci.ualberta.ca/rlai/ Reinforcement Learning and Artificial Intelligence – Faculty of Science], [[University of Alberta]]
==MDP==
* [https://en.wikipedia.org/wiki/Markov_decision_process Markov decision process from Wikipedia]
* [https://en.wikipedia.org/wiki/Partially_observable_Markov_decision_process Partially observable Markov decision process]
* [http://www.idsia.ch/%7Ejuergen/rl.html Reinforcement Learning and POMDPs] by [[Jürgen Schmidhuber]]
==Q-Learning==
* [https://en.wikipedia.org/wiki/Q-learning Q-learning from Wikipedia]
* [http://mnemstudio.org/path-finding-q-learning-tutorial.htm A Painless Q-Learning Tutorial]
* [https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action State–action–reward–state–action from Wikipedia]
* [https://en.wikipedia.org/wiki/Probably_approximately_correct_learning Probably approximately correct learning from Wikipedia]
==Courses==
* <span id="MOOC"></span>[http://www0.cs.ucl.ac.uk/staff/d.silver/web/Teaching.html Reinforcement Learning Course] by [[David Silver]], [https://en.wikipedia.org/wiki/University_College_London University College London], 2015, [https://en.wikipedia.org/wiki/YouTube YouTube] Videos
# [https://youtu.be/2pWv7GOvuf0 Lecture 1: Introduction to Reinforcement Learning]
# [https://youtu.be/lfHX2hHRMVQ Lecture 2: Markov Decision Process]
# [https://youtu.be/Nd1-UUMVfz4 Lecture 3: Planning by Dynamic Programming]
# [https://youtu.be/PnHCvfgC_ZA Lecture 4: Model-Free Prediction]
# [https://youtu.be/0g4j2k_Ggc4 Lecture 5: Model Free Control]
# [https://youtu.be/UoPei5o4fps Lecture 6: Value Function Approximation]
# [https://youtu.be/KHZVXao4qXs Lecture 7: Policy Gradient Methods]
# [https://youtu.be/ItMutbeOHtc Lecture 8: Integrating Learning and Planning]
# [https://youtu.be/sGuiWX07sKw Lecture 9: Exploration and Exploitation]
# [https://youtu.be/kZ_AUmFcZtk Lecture 10: Classic Games]
* [http://videolectures.net/deeplearning2016_pineau_reinforcement_learning/ Introduction to Reinforcement Learning] by [[Joelle Pineau]], [[McGill University]], 2016, [https://en.wikipedia.org/wiki/YouTube YouTube] Video
: {{#evu:https://www.youtube.com/watch?v=O_1Z63EDMvQ|alignment=left|valignment=top}}

=References=
<references />

'''[[Learning|Up one Level]]'''

GerdIsenberg

