Difference between revisions of "Temporal Difference Learning"

From Chessprogramming wiki
 
* [[Don Beal]] ('''2002'''). ''[https://www.researchgate.net/publication/221556841_TD_mu_A_Modificaiton_of_TD_lambda_That_Enables_a_Program_to_Learn_Weights_for_Good_Play_Even_if_It_Observes_Only_Bad_Play TD(µ): A Modification of TD(λ) That Enables a Program to Learn Weights for Good Play Even if It Observes Only Bad Play]''. [https://dblp.org/db/conf/jcis/jcis2002 JCIS 2002]
 
'''2003'''
 
* [[Henk Mannen]] ('''2003'''). ''Learning to play chess using reinforcement learning with database games''. Master’s thesis, Cognitive Artificial Intelligence, [https://en.wikipedia.org/wiki/Utrecht_University Utrecht University], [https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.109.810&rep=rep1&type=pdf pdf]
'''2004'''
* [[Henk Mannen]], [[Marco Wiering]] ('''2004'''). ''[https://www.semanticscholar.org/paper/Learning-to-Play-Chess-using-TD(lambda)-learning-Mannen-Wiering/00a6f81c8ebe8408c147841f26ed27eb13fb07f3 Learning to play chess using TD(λ)-learning with database games]''. Cognitive Artificial Intelligence, [https://en.wikipedia.org/wiki/Utrecht_University Utrecht University], Benelearn’04, [https://www.ai.rug.nl/~mwiering/GROUP/ARTICLES/learning-chess.pdf pdf]
 
 
* [[Marco Block-Berlitz|Marco Block]] ('''2004'''). ''Verwendung von Temporale-Differenz-Methoden im Schachmotor FUSc#''. Diplomarbeit, Betreuer: [[Raúl Rojas]], [[Free University of Berlin]],  [http://page.mi.fu-berlin.de/block/Skripte/diplomarbeit.pdf pdf] (German)
 
* [[Jacek Mańdziuk]], [[Daniel Osman]] ('''2004'''). ''Temporal Difference Approach to Playing Give-Away Checkers''. [http://www.informatik.uni-trier.de/~ley/db/conf/icaisc/icaisc2004.html#MandziukO04 ICAISC 2004], [http://www.mini.pw.edu.pl/~mandziuk/PRACE/ICAISC04-3.pdf pdf]
 
* [[Thomas Philip Runarsson]], [[Simon Lucas]] ('''2005'''). ''Coevolution versus self-play temporal difference learning for acquiring position evaluation in small-board go''. [[IEEE#EC|IEEE Transactions on Evolutionary Computation]], Vol. 9, No. 6
 
'''2006'''
 
* [[Simon Lucas]], [[Thomas Philip Runarsson]] ('''2006'''). ''[http://scholar.google.is/citations?view_op=view_citation&hl=en&user=4eWdc_sAAAAJ&citation_for_view=4eWdc_sAAAAJ:qjMakFHDy7sC Temporal Difference Learning versus Co-Evolution for Acquiring Othello Position Evaluation]''. [[IEEE#CIG|CIG 2006]]
 
'''2007'''
 
* [[Edward P. Manning]] ('''2007'''). ''[http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4219046 Temporal Difference Learning of an Othello Evaluation Function for a Small Neural Network with Shared Weights]''. [[IEEE#CIG|IEEE Symposium on Computational Intelligence and AI in Games]]
 
'''2008'''
 
* [[Yasuhiro Osaki]], [[Kazutomo Shibahara]], [[Yasuhiro Tajima]], [[Yoshiyuki Kotani]] ('''2008'''). ''An Othello Evaluation Function Based on Temporal Difference Learning using Probability of Winning''. [http://www.csse.uwa.edu.au/cig08/Proceedings/toc.html CIG'08], [http://www.csse.uwa.edu.au/cig08/Proceedings/papers/8010.pdf pdf]
 
* [[Richard Sutton]], [[Csaba Szepesvári]], [[Hamid Reza Maei]] ('''2008'''). ''A Convergent O(n) Algorithm for Off-policy Temporal-difference Learning with Linear Function Approximation''. [https://dblp.uni-trier.de/db/conf/nips/nips2008.html#SuttonSM08 NIPS 2008], [https://proceedings.neurips.cc/paper/2008/file/e0c641195b27425bb056ac56f8953d24-Paper.pdf pdf]
 
* [[Sacha Droste]], [[Johannes Fürnkranz]] ('''2008'''). ''Learning of Piece Values for Chess Variants.'' Technical Report TUD–KE–2008-07, Knowledge Engineering Group, [[Darmstadt University of Technology|TU Darmstadt]], [http://www.ke.tu-darmstadt.de/publications/reports/tud-ke-2008-07.pdf pdf]
 
* [[Sacha Droste]], [[Johannes Fürnkranz]] ('''2008'''). ''Learning the Piece Values for three Chess Variants''. [[ICGA Journal#31_4|ICGA Journal, Vol. 31, No. 4]]
 
* [[Albrecht Fiebiger]] ('''2008'''). ''Einsatz von allgemeinen Evaluierungsheuristiken in Verbindung mit der Reinforcement-Learning-Strategie in der Schachprogrammierung''. [https://de.wikipedia.org/wiki/Besondere_Lernleistung Besondere Lernleistung] im [https://de.wikipedia.org/wiki/Fachgebiet Fachbereich] [https://de.wikipedia.org/wiki/Informatik Informatik], [https://en.wikipedia.org/wiki/Federal_School_of_Saxony%E2%80%93Saint_Afra Sächsisches Landesgymnasium Sankt Afra], Internal advisor: Ralf Böttcher, External advisors: [[Stefan Meyer-Kahlen]], [[Marco Block-Berlitz|Marco Block]], [http://page.mi.fu-berlin.de/block/abschlussarbeiten/Fiebiger_BeLL.pdf pdf] (German)
 
'''2009'''
 
* [[Hamid Reza Maei]], [[Csaba Szepesvári]], [[Shalabh Bhatnagar]], [[Doina Precup]], [[David Silver]], [[Richard Sutton]] ('''2009'''). ''Convergent Temporal-Difference Learning with Arbitrary Smooth Function Approximation''. [https://dblp.uni-trier.de/db/conf/nips/nips2009.html#MaeiSBPSS09 NIPS 2009], [https://papers.nips.cc/paper/2009/file/3a15c7d0bbe60300a39f76f8a5ba6896-Paper.pdf pdf]
* [[Joel Veness]], [[David Silver]], [[William Uther]], [[Alan Blair]] ('''2009'''). ''[http://papers.nips.cc/paper/3722-bootstrapping-from-game-tree-search Bootstrapping from Game Tree Search]''. NIPS 2009, [http://jveness.info/publications/nips2009%20-%20bootstrapping%20from%20game%20tree%20search.pdf pdf]
* [[Richard Sutton]], [[Hamid Reza Maei]], [[Doina Precup]], [[Shalabh Bhatnagar]], [[David Silver]], [[Csaba Szepesvári]], [[Eric Wiewiora]]. ('''2009'''). ''[https://dl.acm.org/doi/10.1145/1553374.1553501 Fast Gradient-Descent Methods for Temporal-Difference Learning with Linear Function Approximation]''. [https://dblp.uni-trier.de/db/conf/icml/icml2009.html#SuttonMPBSSW09 ICML 2009]
* [[Simon Lucas]] ('''2009'''). ''[https://ieeexplore.ieee.org/document/5286496 Temporal difference learning with interpolated table value functions]''. [[IEEE#CIG|CIG 2009]]
* [[Marcin Szubert]], [[Wojciech Jaśkowski]], [[Krzysztof Krawiec]] ('''2009'''). ''Coevolutionary Temporal Difference Learning for Othello''. [[IEEE#CIG|CIG 2009]], [http://www.cs.put.poznan.pl/wjaskowski/pub/papers/szubert09coevolutionary.pdf pdf]
* [[Marcin Szubert]] ('''2009'''). ''Coevolutionary Reinforcement Learning and its Application to Othello''. M.Sc. thesis, [https://en.wikipedia.org/wiki/Pozna%C5%84_University_of_Technology Poznań University of Technology], supervisor [[Krzysztof Krawiec]], [https://mszubert.github.io/papers/Szubert_2009_MSC.pdf pdf]
 
* [http://www.cs.cmu.edu/~zkolter/ J. Zico Kolter], [[Andrew Ng]] ('''2009'''). ''Regularization and Feature Selection in Least-Squares Temporal Difference Learning''. [http://www.machinelearning.org/archive/icml2009/ ICML 2009], [http://www.cs.cmu.edu/~zkolter/pubs/kolter-icml09b-full.pdf pdf]
 
==2010 ...==
 
* [[Marco Wiering]] ('''2010'''). ''Self-play and using an expert to learn to play backgammon with temporal difference learning''. [http://www.scirp.org/journal/jilsa/ Journal of Intelligent Learning Systems and Applications], Vol. 2, No. 2
 
* [[Hamid Reza Maei]], [[Richard Sutton]] ('''2010'''). ''[https://www.researchgate.net/publication/215990384_GQlambda_A_general_gradient_algorithm_for_temporal-difference_prediction_learning_with_eligibility_traces GQ(λ): A general gradient algorithm for temporal-difference prediction learning with eligibility traces]''. [https://agi-conf.org/2010/ AGI 2010]
'''2011'''
* [[Hamid Reza Maei]] ('''2011'''). ''[https://era.library.ualberta.ca/items/fd55edcb-ce47-4f84-84e2-be281d27b16a Gradient Temporal-Difference Learning Algorithms]''. Ph.D. thesis, [[University of Alberta]], advisor [[Richard Sutton]]
 
* [[Joel Veness]] ('''2011'''). ''Approximate Universal Artificial Intelligence and Self-Play Learning for Games''. Ph.D. thesis, [https://en.wikipedia.org/wiki/University_of_New_South_Wales University of New South Wales], supervisors: [[Kee Siong Ng]], [[Marcus Hutter]], [[Alan Blair]], [[William Uther]], [[John Lloyd]]; [http://jveness.info/publications/veness_phd_thesis_final.pdf pdf]
 
* [[I-Chen Wu]], [[Hsin-Ti Tsai]], [[Hung-Hsuan Lin]], [[Yi-Shan Lin]], [[Chieh-Min Chang]], [[Ping-Hung Lin]] ('''2011'''). ''[https://www.conftool.net/acg13/index.php?page=browseSessions&form_session=5 Temporal Difference Learning for Connect6]''. [[Advances in Computer Games 13]]
 
* [[Nikolaos Papahristou]], [[Ioannis Refanidis]] ('''2011'''). ''[https://www.conftool.net/acg13/index.php?page=browseSessions&form_session=5 Improving Temporal Difference Performance in Backgammon Variants]''. [[Advances in Computer Games 13]], [http://ai.uom.gr/nikpapa/publications/Improving%20Temporal%20Difference%20Learning%20in%20Backgammon%20Variants_ACG13.pdf pdf]
 
* [[Krzysztof Krawiec]], [[Wojciech Jaśkowski]], [[Marcin Szubert]] ('''2011'''). ''[http://www.degruyter.com/view/j/amcs.2011.21.issue-4/v10006-011-0057-3/v10006-011-0057-3.xml Evolving small-board Go players using Coevolutionary Temporal Difference Learning with Archives]''. [http://www.degruyter.com/view/j/amcs Applied Mathematics and Computer Science], Vol. 21, No. 4
 
* [[Marcin Szubert]], [[Wojciech Jaśkowski]], [[Krzysztof Krawiec]] ('''2011'''). ''Learning Board Evaluation Function for Othello by Hybridizing Coevolution with Temporal Difference Learning''. [http://control.ibspan.waw.pl:3000/mainpage Control and Cybernetics], Vol. 40, No. 3, [http://www.cs.put.poznan.pl/wjaskowski/pub/papers/szubert2011learning.pdf pdf]
'''2012'''
 
 
* [[István Szita]] ('''2012'''). ''[http://link.springer.com/chapter/10.1007%2F978-3-642-27645-3_17 Reinforcement Learning in Games]''. in [[Marco Wiering]], [http://martijnvanotterlo.nl/ Martijn Van Otterlo] (eds.). ''Reinforcement learning: State-of-the-art''. [http://link.springer.com/book/10.1007/978-3-642-27645-3 Adaptation, Learning, and Optimization, Vol. 12], [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
 
'''2013'''
 
 
* [[David Silver]], [[Richard Sutton]], [[Martin Müller|Martin Mueller]] ('''2013'''). ''Temporal-Difference Search in Computer Go''. Proceedings of the [http://icaps13.icaps-conference.org/technical-program/workshop-program/planning-and-learning/ ICAPS-13 Workshop on Planning and Learning], [http://webdocs.cs.ualberta.ca/~sutton/papers/SSM-ICAPS-13.pdf pdf]
 
* [[Florian Kunz]] ('''2013'''). ''An Introduction to Temporal Difference Learning''. Seminar on Autonomous Learning Systems, [[Darmstadt University of Technology|TU Darmstadt]], [http://www.ausy.informatik.tu-darmstadt.de/uploads/Teaching/AutonomousLearningSystems/Kunz_ALS_2013.pdf pdf]
 
'''2014'''
 
 
* [[I-Chen Wu]], [[Kun-Hao Yeh]], [[Chao-Chin Liang]], [[Chia-Chuan Chang]], [[Han Chiang]] ('''2014'''). ''Multi-Stage Temporal Difference Learning for 2048''. [[TAAI 2014]]
 
* [[Wojciech Jaśkowski]], [[Marcin Szubert]], [[Paweł Liskowski]] ('''2014'''). ''Multi-Criteria Comparison of Coevolution and Temporal Difference Learning on Othello''. [http://www.evostar.org/2014/ EvoApplications 2014], [http://www.springer.com/computer/theoretical+computer+science/book/978-3-662-45522-7 Springer, volume 8602]
 
* [[Matthew Lai]] ('''2015'''). ''Giraffe: Using Deep Reinforcement Learning to Play Chess''. M.Sc. thesis, [https://en.wikipedia.org/wiki/Imperial_College_London Imperial College London],  [http://arxiv.org/abs/1509.01549v1 arXiv:1509.01549v1] » [[Giraffe]]
 
* [[Markus Thill]] ('''2015'''). ''Temporal Difference Learning Methods with Automatic Step-Size Adaption for Strategic Board Games: Connect-4 and Dots-and-Boxes''. Master thesis, [https://en.wikipedia.org/wiki/Technical_University_of_Cologne Technical University of Cologne], Campus Gummersbach, [http://www.gm.fh-koeln.de/~konen/research/PaperPDF/MT-Thill2015-final.pdf pdf]
 
'''2016'''
 
 
* [[Kazuto Oka]], [[Kiminori Matsuzaki]] ('''2016'''). ''Systematic Selection of N-tuple Networks for 2048''. [[CG 2016]]
 
'''2017'''

* [[Huizhen Yu]], [[A. Rupam Mahmood]], [[Richard Sutton]] ('''2017'''). ''On Generalized Bellman Equations and Temporal-Difference Learning''. Canadian Conference on AI 2017, [https://arxiv.org/abs/1704.04463 arXiv:1704.04463]

* [[William Uther]] ('''2017'''). ''[https://link.springer.com/referenceworkentry/10.1007/978-1-4899-7687-1_817 Temporal Difference Learning]''. in [https://en.wikipedia.org/wiki/Claude_Sammut Claude Sammut], [https://en.wikipedia.org/wiki/Geoff_Webb Geoffrey I. Webb] (eds) ('''2017'''). ''[https://link.springer.com/referencework/10.1007%2F978-1-4899-7687-1 Encyclopedia of Machine Learning and Data Mining]''. [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
==2020 ...==
* [https://scholar.google.ca/citations?user=yVtSOt8AAAAJ&hl=en Emmanuel Bengio], [[Joelle Pineau]], [[Doina Precup]] ('''2020'''). ''Interference and Generalization in Temporal Difference Learning''. [https://arxiv.org/abs/2003.06350 arXiv:2003.06350]
* [https://scholar.google.ca/citations?user=4C5wrXIAAAAJ&hl=en Joshua Romoff], [https://scholar.google.com/citations?user=dy_JBs0AAAAJ&hl=en Peter Henderson], [https://scholar.google.com/citations?user=HUmLDxcAAAAJ&hl=en David Kanaa], [https://scholar.google.ca/citations?user=yVtSOt8AAAAJ&hl=en Emmanuel Bengio], [https://scholar.google.com/citations?user=D4LT5xAAAAAJ&hl=en Ahmed Touati], [https://scholar.google.ca/citations?user=9H77FYYAAAAJ&hl=en Pierre-Luc Bacon], [[Joelle Pineau]] ('''2020'''). ''TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?'' [https://arxiv.org/abs/2007.02786 arXiv:2007.02786]
  
 
=Forum Posts=
 
* [http://www.talkchess.com/forum/viewtopic.php?t=57860 td-leaf] by [[Alexandru Mosoi]], [[CCC]], October 06, 2015 » [[Automated Tuning]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=62053 TD-leaf(lambda)] by [[Robert Pope]], [[CCC]], November 09, 2016
 
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=695 TD(1)] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 20, 2019 » [[Automated Tuning]]
==2020 ...==
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=72810 Board adaptive / tuning evaluation function - no NN/AI] by Moritz Gedig, [[CCC]], January 14, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=77053 TD learning by self play (TD-Gammon)] by [[Chris Whittington]], [[CCC]], April 10, 2021
  
 

Revision as of 08:53, 15 April 2021


Temporal Difference Learning (TD learning)
is a machine learning method applied to multi-step prediction problems. As a prediction method primarily used for reinforcement learning, TD learning takes into account the fact that subsequent predictions are often correlated in some sense, while in supervised learning one learns only from actually observed values. TD learning combines the sampling of Monte Carlo methods with the bootstrapping of dynamic programming techniques [1]. In the domain of computer games and computer chess, TD learning is applied through self play, predicting the probability of winning during the sequence of moves from the initial position until the end, in order to adjust weights for a more reliable prediction.

Prediction

Each prediction is a single number, derived from a formula using adjustable weights of features, for instance by a neural network, most simply a single-neuron perceptron, that is a linear evaluation function ...

v = \sum_i w_i x_i

[Figure: Sigmoid and Derivative]

... with the pawn advantage v converted to a winning probability by the standard sigmoid squashing function, also a topic in logistic regression in the domain of supervised learning and automated tuning, ...

P = \frac{1}{1 + e^{-v}}

... which has the advantage of its simple derivative:

\frac{dP}{dv} = P\,(1 - P)
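The prediction described above can be sketched in Python; the weights, features and pawn-advantage scale are illustrative only, not taken from any particular program:

```python
import math

def predict(weights, features):
    """Winning probability from a linear evaluation squashed by a sigmoid."""
    v = sum(w * x for w, x in zip(weights, features))  # linear eval, pawn-advantage scale
    return 1.0 / (1.0 + math.exp(-v))                  # sigmoid squashing

def dP_dv(p):
    """Derivative of the sigmoid, conveniently expressed via its own output."""
    return p * (1.0 - p)
```

With all-zero weights every position maps to probability 0.5, where the derivative is largest (0.25), so weight adjustments are most sensitive in balanced positions.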

TD(λ)

Each pair of temporally successive predictions P at time step t and t+1 gives rise to a recommendation for weight changes that move Pt closer to Pt+1, an approach first applied in the late 50s by Arthur Samuel in his Checkers player for automated evaluation tuning [2]. This TD method was improved, generalized and formalized by Richard Sutton et al. in the 80s; the term Temporal Difference Learning was coined in 1988 [3], which also introduced the decay or recency parameter λ, controlling what proportion of the score comes from the outcome of Monte Carlo simulated games, tapering between bootstrapping (λ = 0) and Monte Carlo predictions (λ = 1), the latter equivalent to gradient descent on the mean squared error function. Weight adjustments in TD(λ) are made according to ...

\Delta w_t = \alpha\,(P_{t+1} - P_t)\sum_{k=1}^{t} \lambda^{t-k}\,\nabla_w P_k

... where P is the series of temporally successive predictions and w the set of adjustable weights; α is a parameter controlling the learning rate, also called step size, and ∇wPk [4] is the gradient, the vector of partial derivatives of Pk with respect to w. The process may be applied to any initial set of weights. Learning performance depends on λ and α, which have to be chosen appropriately for the domain. In principle, TD(λ) weight adjustments may be made after each move, or at any arbitrary interval. For game playing tasks, the end of every game is a convenient point to actually alter the evaluation weights [5].
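The adjustment rule can be sketched as a single end-of-game pass, assuming the per-step predictions and their gradients were recorded during the game; the eligibility trace uses the identity e_t = λ·e_{t−1} + ∇wPt, and all names are illustrative:

```python
def td_lambda_update(weights, predictions, gradients, alpha, lam):
    """One end-of-game pass of TD(lambda) weight adjustment.

    predictions[t] is P_t; gradients[t] is the gradient of P_t w.r.t. the
    weights. Each step applies alpha * (P_{t+1} - P_t) * sum_k lambda^(t-k)
    * grad P_k, with the sum kept incrementally in an eligibility trace.
    """
    trace = [0.0] * len(weights)
    w = list(weights)
    for t in range(len(predictions) - 1):
        trace = [lam * e + g for e, g in zip(trace, gradients[t])]
        delta = predictions[t + 1] - predictions[t]  # temporal difference
        w = [wi + alpha * delta * ei for wi, ei in zip(w, trace)]
    return w
```

With λ = 0 only neighbouring predictions interact (pure bootstrapping); with λ = 1 every temporal difference is credited to all earlier positions, matching the Monte Carlo end of the spectrum.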

TD(λ) was famously applied by Gerald Tesauro in his Backgammon program TD-Gammon [6] [7]. In this stochastic game the program picks the action whose successor state minimizes the opponent's expected reward, i.e. it looks only one ply ahead.

TDLeaf(λ)

In games like chess or Othello, due to their tactical nature, deep searches are necessary for expert performance. The problem had already been recognized and solved by Arthur Samuel, but seems to have been forgotten later on [8]; it was rediscovered independently by Don Beal and Martin C. Smith in 1997 [9], and by Jonathan Baxter, Andrew Tridgell, and Lex Weaver [10], who coined the term TD-Leaf. TD-Leaf is the adaptation of TD(λ) to minimax search, where instead of the corresponding root positions, the leaf nodes of the principal variation are considered in the weight adjustments. TD-Leaf was successfully used in evaluation tuning of chess programs [11], with KnightCap [12] and CilkChess as the most prominent examples; the latter used the improved Temporal Coherence Learning [13], which automatically adjusts α and λ [14].
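A TD-Leaf pass over one game can be sketched as follows; `search` and `grad_eval` are hypothetical engine callbacks standing in for the program's own minimax search (returning the principal-variation leaf and its squashed evaluation) and for the gradient of that leaf evaluation:

```python
def td_leaf_update(weights, positions, search, grad_eval, alpha, lam):
    """TD-Leaf sketch: TD(lambda) applied to PV-leaf evaluations, not roots."""
    preds, grads = [], []
    for pos in positions:               # root positions along the game
        leaf, p = search(pos, weights)  # leaf of the principal variation
        preds.append(p)
        grads.append(grad_eval(leaf, weights))
    # TD(lambda) update over the leaf evaluations
    trace = [0.0] * len(weights)
    w = list(weights)
    for t in range(len(preds) - 1):
        trace = [lam * e + g for e, g in zip(trace, grads[t])]
        delta = preds[t + 1] - preds[t]
        w = [wi + alpha * delta * ei for wi, ei in zip(w, trace)]
    return w
```

The only difference from plain TD(λ) is where the predictions and gradients come from: the searched leaf of each root position, which keeps the tactically unstable root evaluations out of the update.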

Quotes

Don Beal

Don Beal in a 1998 CCC discussion with Jonathan Baxter [15]:

With fixed learning rates (aka step size) we found piece values settle to consistent relative ordering in around 500 self-play games. The ordering remains in place despite considerable erratic movements. But piece-square values can take a lot longer - more like 5000.
The learning rate is critical - it has to be as large as one dares for fast learning, but low for stable values.  We've been experimenting with methods for automatically adjusting the learning rate. (Higher rates if the adjustments go in the same direction, lower if they keep changing direction.)
The other problem is learning weights for terms which only occur rarely. Then the learning process doesn't see enough examples to settle on good weights in a reasonable time. I suspect this is the main limitation of the method, but it may be possible to devise ways to generate extra games which exercise the rare conditions. 

Bas Hamstra

Bas Hamstra in a 2002 CCC discussion on TD learning [16]:

I have played with it. I am convinced it has possibilities, but one problem I encountered was the cause-effect problem. For say I am a piece down. After I lost the game TD will conclude that the winner had better mobility and will tune it up. However worse mobility was not the cause of the loss, it was the effect of simply being a piece down. In my case it kept tuning mobility up and up until ridiculous values. 

Don Dailey

Don Dailey in a reply [17] to Ben-Hur Carlos Vieira Langoni Junior, CCC, December 2010 [18]:

Another approach that may be more in line with what you want is called "temporal difference learning", and it is based on feedback from each move to the move that precedes it. For example if you play move 35 and the program thinks the position is equal, but then on move 36 you find that you are winning a pawn, it indicates that the evaluation of move 35 is in error, the position is better than the program thought it was. Little tiny incremental adjustments are made to the evaluation function so that it is ever so slightly biased in favor of being slightly more positive in this case, or slightly more negative in the case where you find your score is dropping. This is done recursively back through the moves of the game so that winning the game gives some credit to all the positions of the game. Look on the web and read up on the "credit assignment problem" and temporal difference learning. It's probably ideal for what you are looking for. It can be done at the end of the game one time and scores then updated. If you are not using floating point evaluation you may have to figure out how to modify this to be workable. 

Chess Programs

RootStrap
TreeStrap


Forum Posts

1995 ...

Re: Parameter Tuning by Don Beal, CCC, October 02, 1998

2000 ...

Re: Temporal Differences by Guy Haworth, CCC, November 04, 2004 [21]

2010 ...

Re: Positional learning by Don Dailey, CCC, December 13, 2010
Re: Pawn Advantage, Win Percentage, and Elo by Don Dailey, CCC, April 15, 2012


External Links

temporal - Wiktionary

References

  1. Temporal difference learning from Wikipedia
  2. Arthur Samuel (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal July 1959
  3. Richard Sutton (1988). Learning to Predict by the Methods of Temporal Differences. Machine Learning, Vol. 3, No. 1, pdf
  4. Nabla symbol from Wikipedia
  5. Don Beal, Martin C. Smith (1998). First Results from Using Temporal Difference Learning in Shogi. CG 1998
  6. Gerald Tesauro (1992). Temporal Difference Learning of Backgammon Strategy. ML 1992
  7. Gerald Tesauro (1994). TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation Vol. 6, No. 2
  8. Sacha Droste, Johannes Fürnkranz (2008). Learning of Piece Values for Chess Variants. Technical Report TUD–KE–2008-07, Knowledge Engineering Group, TU Darmstadt, pdf
  9. Don Beal, Martin C. Smith (1997). Learning Piece Values Using Temporal Differences. ICCA Journal, Vol. 20, No. 3
  10. Jonathan Baxter, Andrew Tridgell, Lex Weaver (1997). KnightCap: A chess program that learns by combining TD(λ) with minimax search. 15th International Conference on Machine Learning, pdf via CiteSeerX
  11. Don Beal, Martin C. Smith (1999). Learning Piece-Square Values using Temporal Differences. ICCA Journal, Vol. 22, No. 4
  12. Jonathan Baxter, Andrew Tridgell, Lex Weaver (1998). Knightcap: A chess program that learns by combining td(λ) with game-tree search. Proceedings of the 15th International Conference on Machine Learning, pdf via citeseerX
  13. The Cilkchess Parallel Chess Program
  14. Don Beal, Martin C. Smith (1999). Temporal Coherence and Prediction Decay in TD Learning. IJCAI 1999, pdf
  15. Re: Parameter Tuning by Don Beal, CCC, October 02, 1998
  16. Re: Hello from Edmonton (and on Temporal Differences) by Bas Hamstra, CCC, August 05, 2002
  17. Re: Positional learning by Don Dailey, CCC, December 13, 2010
  18. Positional learning by Ben-Hur Carlos Vieira Langoni Junior, CCC, December 13, 2010
  19. Tao update by Bas Hamstra, CCC, January 12, 2001
  20. Nici Schraudolph’s go networks, review by Jay Scott
  21. Guy Haworth, Meel Velliste (1998). Chess Endgames and Neural Networks. ICCA Journal, Vol. 21, No. 4
