AlphaZero,

a chess, shogi and Go playing entity by Google DeepMind, based on a general reinforcement learning algorithm of the same name. On December 5, 2017 ^[1], the DeepMind team around David Silver, Thomas Hubert, and Julian Schrittwieser, along with former Giraffe author Matthew Lai, reported on their generalized algorithm, which combines deep learning with Monte-Carlo Tree Search (MCTS) ^[2]. The final peer-reviewed paper, with various clarifications, was published almost one year later in Science magazine under the title A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play ^[3].

Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved a superhuman level of play in the games of chess and shogi as well as in Go. The algorithm is a more generic version of the AlphaGo Zero algorithm that was first introduced in the domain of Go ^[4]. AlphaZero evaluates positions using non-linear function approximation based on a deep neural network, rather than the linear function approximation used in classical chess programs. This neural network takes the board position as input and outputs a vector of move probabilities (policy) and a position evaluation (value). Once trained, this network is combined with a Monte-Carlo Tree Search (MCTS), using the policy to narrow down the search to high-probability moves, and using the value, in place of Monte-Carlo rollouts, to evaluate positions in the tree. The selection is done by a variation of Rosin's UCT improvement dubbed PUCT.
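The PUCT selection step can be illustrated with a minimal sketch. It assumes the commonly cited form of the rule, in which each child is scored by its mean value Q plus an exploration bonus U proportional to its policy prior and to the square root of the parent's visit count. The function name, the list-based node representation and the constant `c_puct = 1.5` are illustrative assumptions, not values taken from the papers.

```python
import math

def puct_select(priors, visit_counts, total_values, c_puct=1.5):
    """Return the index of the child that maximises Q + U.

    priors       -- policy probabilities P(s, a) from the network
    visit_counts -- visit counts N(s, a) of each child
    total_values -- summed backed-up values W(s, a) of each child
    c_puct       -- exploration constant (assumed value, for illustration only)
    """
    parent_visits = sum(visit_counts)
    best_index, best_score = 0, -math.inf
    for i, (p, n, w) in enumerate(zip(priors, visit_counts, total_values)):
        q = w / n if n > 0 else 0.0  # mean value; unvisited children score 0
        u = c_puct * p * math.sqrt(parent_visits) / (1 + n)  # exploration bonus
        if q + u > best_score:
            best_index, best_score = i, q + u
    return best_index

# The heavily visited first move loses out to a less explored move
# with a better mean value.
choice = puct_select(priors=[0.6, 0.3, 0.1],
                     visit_counts=[10, 1, 0],
                     total_values=[2.0, 0.5, 0.0])
```

With all visit counts at zero every score is zero and the sketch simply returns the first child; real implementations handle that first visit in their own way.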

“Zero
is silence. Zero is the
beginning. Zero is round. Zero spins.
Zero is the moon. The sun is Zero.
Zero is white. The desert Zero. The sky
above Zero. The night –, Zero flows. The eye
Zero. Navel. Mouth. Kiss. The milk is round. The
flower Zero the bird. Silently. Floating. I
eat Zero, I drink Zero, I sleep Zero, I am awake
Zero, I love Zero. Zero is beautiful, dynamo, dynamo,
dynamo. The trees in springtime, the snow, fire,
water, sea. Red orange yellow green indigo blue violet
Zero Zero rainbow. 4 3 2 1 Zero. Gold and
silver, noise and smoke. Travelling circus Zero.
Zero is silence. Zero is the beginning.
Zero is round. Zero is
Zero.” ^[5]

Network Architecture

The deep neural network consists of a “body” with input and hidden layers of spatial NxN planes, 8x8 board arrays for chess, followed by both policy and value “heads” ^[6] ^[7]. Each square cell of the input plane contains 6x2 piece-type and color bits of the current chess position from the current player's point of view, plus two bits of a repetition counter concerning the draw rule. To further address graph history and path-dependency issues, these 14 bits are repeated eight times, covering the current position and up to seven predecessor positions, so that en passant rights and some sense of progress are implicit. An additional 7 input bits encode castling rights, total move count and side to move, yielding 14x8 + 7 = 119 bits per square cell for chess.
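The plane count can be checked with a few lines of arithmetic. The constant names below are illustrative; the numbers are the ones given in the paragraph above.

```python
PIECE_TYPES = 6        # pawn, knight, bishop, rook, queen, king
COLORS = 2             # current player and opponent
REPETITION_BITS = 2    # repetition counter for the draw rule
HISTORY_LENGTH = 8     # current position plus up to seven predecessors
EXTRA_BITS = 7         # castling rights, total move count, side to move

bits_per_position = PIECE_TYPES * COLORS + REPETITION_BITS          # 14
bits_per_square = bits_per_position * HISTORY_LENGTH + EXTRA_BITS   # 119

# The full network input is an 8x8 stack of such cells.
input_shape = (8, 8, bits_per_square)
total_input_bits = 8 * 8 * bits_per_square
```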

The body consists of a rectified batch-normalized convolutional layer followed by 19 residual blocks. Each such block consists of two rectified batch-normalized convolutional layers with a skip connection ^[8] ^[9]. Each convolution applies 256 filters (shared weight vectors) of kernel size 3x3 with stride 1. These layers connect the pieces on different squares to each other through consecutive convolutions: a cell of a layer is connected to the corresponding 3x3 receptive field of the previous layer, so each convolution extends a square's reach by one square in every direction. After 4 convolutions a central square is connected to every cell of the original input layer, and after 7 convolutions every square is ^[10].
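How quickly the receptive field grows can be worked out directly. The sketch below assumes only what the paragraph states (3x3 kernels, stride 1, an 8x8 board) and uses hypothetical helper names. Since each convolution reaches one square further in every direction, the number of layers a square needs to see the whole board equals its largest king-move distance to any edge.

```python
BOARD_SIZE = 8

def receptive_field_side(num_convolutions, kernel_size=3):
    """Side length of the receptive field after stacking stride-1 convolutions."""
    return 1 + num_convolutions * (kernel_size - 1)

def convolutions_to_cover_board(row, col, board_size=BOARD_SIZE):
    """Stacked 3x3 convolutions needed before (row, col) sees every square."""
    return max(row, board_size - 1 - row, col, board_size - 1 - col)

central = convolutions_to_cover_board(3, 3)   # a central square such as d4
corner = convolutions_to_cover_board(0, 0)    # a corner square such as a1

# One input convolution plus two per residual block.
BODY_CONVOLUTIONS = 1 + 19 * 2
```

A central square needs 4 convolutions and a corner square needs 7, both far fewer than the 39 convolutional layers in the body.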

The policy head applies an additional rectified, batch-normalized convolutional layer, followed by a final convolution of 73 filters for chess. The final policy output is represented as an 8x8 board array as well, with up to 73 target square possibilities for every origin square (NRayDirs x MaxRayLength + NKnightDirs + NPawnDirs x NMinorPromotions = 8x7 + 8 + 3x3 = 73), encoding a probability distribution over 64x73 = 4,672 possible moves. Illegal moves are masked out by setting their probabilities to zero and re-normalising the probabilities of the remaining moves. The value head applies an additional rectified, batch-normalized convolution of 1 filter of kernel size 1x1 with stride 1, followed by a rectified linear layer of size 256 and a tanh-linear layer of size 1.
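The move-encoding arithmetic and the masking step can be sketched as follows. The constant names mirror the formula above; `mask_and_renormalise` is a hypothetical helper operating on a flat list of probabilities, and its uniform fallback for the case where all legal moves have zero probability is an assumption of this sketch, not something described in the paper.

```python
N_RAY_DIRS = 8            # queen-like directions
MAX_RAY_LENGTH = 7        # up to seven squares along a ray
N_KNIGHT_DIRS = 8         # knight jumps
N_PAWN_DIRS = 3           # push and two captures when under-promoting
N_MINOR_PROMOTIONS = 3    # knight, bishop, rook

MOVE_PLANES = (N_RAY_DIRS * MAX_RAY_LENGTH
               + N_KNIGHT_DIRS
               + N_PAWN_DIRS * N_MINOR_PROMOTIONS)   # 73
POLICY_SIZE = 64 * MOVE_PLANES                        # 4,672

def mask_and_renormalise(probabilities, legal_indices):
    """Zero out illegal moves and rescale the rest so they sum to 1.

    Assumes at least one legal move is given.
    """
    legal = set(legal_indices)
    masked = [p if i in legal else 0.0 for i, p in enumerate(probabilities)]
    total = sum(masked)
    if total == 0.0:
        # Assumed fallback: spread probability uniformly over the legal moves.
        return [1.0 / len(legal) if i in legal else 0.0
                for i in range(len(probabilities))]
    return [p / total for p in masked]

# Toy example with three moves, of which the second is illegal.
policy = mask_and_renormalise([0.5, 0.25, 0.25], legal_indices=[0, 2])
```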

Training

AlphaZero was trained for 700,000 steps (mini-batches of size 4096 each), starting from randomly initialized parameters, using 5,000 first-generation TPUs ^[11] to generate self-play games and 64 second-generation TPUs ^[12] ^[13] ^[14] to train the neural networks ^[15].

Stockfish Match

As mentioned in the December 2017 paper ^[16], a 100-game match versus Stockfish 8, which used 64 threads and a transposition table size of 1 GiB, was won by AlphaZero, running on a single machine with 4 first-generation TPUs, with a score of +28=72-0. Ten of the games were published. Despite a possible hardware advantage of AlphaZero and criticized playing conditions ^[17], this is a tremendous achievement.

In the final peer-reviewed paper, published in Science magazine in December 2018 ^[18] along with supplementary materials ^[19], a 1000-game match was reported, with about 200 games published. The opponents were the most recent Stockfish versions available at the time of the matches: Stockfish 8, a development version as of January 13, 2018 close to Stockfish 9, Brainfish with Cerebellum book, and Stockfish 9. In total, AlphaZero won 155 games and lost 6.

Stockfish was configured according to its 2016 TCEC Season 9 superfinal settings: 44 threads on 44 cores (two 2.2GHz Intel Xeon Broadwell x86-64 CPUs with 22 cores each, running Linux), a transposition table size of 32 GiB, and 6-men Syzygy bases. The time control was 3 hours per side per game plus a 15-second increment per move. AlphaZero used a simple time control strategy, thinking for 1/20th of the remaining time, and selected moves greedily with respect to the root visit count. Each MCTS was executed on a single machine with 4 first-generation TPUs.
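The time allocation and move choice described above amount to very little code. The following is a minimal sketch with hypothetical function names; whether the increment is counted as part of the remaining time is not stated in the source and is left out here.

```python
def think_time(remaining_seconds, divisor=20):
    """Time budget for the next move: 1/20th of the remaining clock time."""
    return remaining_seconds / divisor

def select_move(root_visit_counts):
    """Greedy choice: the move with the highest visit count at the root.

    root_visit_counts -- dict mapping a move to its root visit count
    """
    return max(root_visit_counts, key=root_visit_counts.get)

# At the start of a game with 3 hours on the clock.
first_budget = think_time(3 * 60 * 60)

# Illustrative visit counts, not data from the match.
move = select_move({"e2e4": 500, "d2d4": 420, "g1f3": 80})
```

Because the budget is a fixed fraction of what is left, the time per move shrinks geometrically as the game goes on, apart from whatever the increment adds back.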

AlphaZero and Stockfish (except Brainfish) used no opening book. Games were played from 12 common human opening positions as well as from the 2016 TCEC Season 9 superfinal positions originally selected by Jeroen Noomen ^[20]. To ensure diversity against opponents (Brainfish) with a deterministic opening book, AlphaZero used a small amount of randomization in its opening moves. This avoided duplicate games but also resulted in more losses by AlphaZero.