Home * Search * Monte-Carlo Tree Search

Edvard Munch - At the Roulette Table in Monte Carlo ^[1]

Monte Carlo Tree Search, (Monte-Carlo Tree Search, MCTS)
is a Best-First search algorithm, historically based on random playouts. In conjunction with UCT (Upper Confidence bounds applied to Trees) Monte-Carlo Tree Search has yielded in a breakthrough in Computer Go ^[2], and is also successful in Amazons ^[3] ^[4], Lines of Action ^[5], Havannah ^[6], Hex ^[7], Checkers ^[8] and other Games with some difficulties in position evaluation, but until December 2017, when a Google DeepMind team reported on AlphaZero ^[9], not for Chess ^[10] ^[11]. MCTS is based on randomized explorations of the search space. Using the results of previous explorations, the algorithm gradually grows a game tree in memory, and successively becomes better at accurately estimating the values of the most promising moves ^[12].

Four Phases

MCTS consists of four strategic phases, repeated as long as there is time left ^[13] :

In the Selection phase the tree is traversed from the root node until it selects a leaf node that is not added to the tree yet
The Expansion strategy adds the leaf node to the tree
The Simulation strategy plays moves in self-play until the end of the game. The result is either 1, 0 ,-1
The Backpropagation strategy propagates the results through the tree

Steps of Monte Carlo Tree Search ^[14]

Pure Monte-Carlo search

Pure Monte-Carlo search with parameter T means that for each feasible move T random games are generated. The move with the best average score is played. A game is called “Monte Carlo perfect” when this procedure converges to perfect play for each position, when T goes to infinity. However, with limited time per move, increasing T does not guarantee to find a better move ^[15].

UCT

UCT (Upper Confidence bounds applied to Trees) deals with the flaw of Monte-Carlo Tree Search, when a program may favor a losing move with only one or a few forced refutations, but due to the vast majority of other moves provides a better random playout score than other, better moves ^[16]. In UCT, upper confidence bounds guide the selection of a node, treating selection as a {https://en.wikipedia.org/wiki/Multi-armed_bandit multi-armed bandit] problem. PUCT modifies the original policy by approximately predicting good arms at the start of a sequence of multi-armed bandit trials ^[17].

Playouts by NN

Historically, at the root of MCTS were random and noisy playouts. Many such playouts were necessary to accurately evaluate a state. Since AlphaGo and AlphaZero it is not the case anymore. Strong policies and evaluations are now provided by neural networks that are trained with Reinforcement Learning. In AlphaGo and its descendants the policy is used as a prior in the PUCT bandit to explore first the most promising moves advised by the neural network policy and the evaluations replace the playouts ^[18].

Publications

1987

Bruce Abramson, Richard Korf (1987). A Model of Two-Player Evaluation Functions. AAAI-87, pdf

1990 ...

Bruce Abramson (1990). Expected-Outcome: A General Model of Static Evaluation. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 12, No. 2
Bruce Abramson (1990). An Analysis of Expected-Outcome. Journal of Experimental and Theoretical Artificial Intelligence, Vol. 2
Bruce Abramson (1991). The Expected-Outcome Model of Two-Player Games. Part of the series, Research Notes in Artificial Intelligence, Morgan Kaufmann
Bernd Brügmann (1993). Monte Carlo Go. pdf

2000 ...

Bruno Bouzy, Bernard Helmstetter (2003). Monte Carlo Go Developments. Advances in Computer Games 10, pdf
Bruno Bouzy (2004). Associating Shallow and Selective Global Tree Search with Monte Carlo for 9 × 9 Go. CG 2004
Tristan Cazenave (2004). Monte Carlo Real Time Strategy. pdf

2005 ...

Tristan Cazenave, Bernard Helmstetter (2005). Combining tactical search and Monte-Carlo in the game of Go. IEEE CIG 2005, pdf, pdf
Bruno Bouzy (2005). Move-Pruning Techniques for Monte-Carlo Go. Advances in Computer Games 11
Bruno Bouzy (2005). Associating domain-dependent knowledge and Monte Carlo approaches within a go program. Information Sciences, Heuristic Search and Computer Game Playing IV