NNUE


Toriyama Sekien - Nue (鵺) [1] [2]

NNUE (ƎUИИ, Efficiently Updatable Neural Networks) is a neural network architecture intended to replace the evaluation of alpha-beta searchers playing Shogi, chess, and other board games on a CPU. Inspired by Kunihito Hoki's approach of piece-square tables indexed by king location and two further piece locations, plus side to move, as applied in his Shogi engine Bonanza [3], NNUE was introduced in 2018 by Yu Nasu [4], and was used in Shogi adaptations of Stockfish such as YaneuraOu [5] and Kristallweizen [6], apparently with AlphaZero strength [7].

Stockfish NNUE

As reported by Henk Drost [8], Nodchip incorporated NNUE into the chess-playing Stockfish 10 as a proof of concept. Stockfish NNUE was born, and in summer 2020 the computer chess community burst out enthusiastically over its rapidly rising playing strength, with different networks trained using a mixture of supervised and reinforcement learning methods. Despite the approximately halved search speed, it became stronger than the original [9], and was ultimately responsible for the huge strength improvement of Stockfish 12.

Adoption

see Category:NNUE

Tempted by the success of Stockfish NNUE and attracted by how simple the method and how small the code is, many engine developers have started testing and applying NNUE. For quick trials and evaluation before going into serious development, some of them borrowed and/or rewrote NNUE code and used networks from Stockfish NNUE. Most of them reported positive results, such as David Carteau with Orion [10], Ehsan Rashid with DON [11], various Stockfish derivatives by Michael Byrne [12], and Volodymyr Shcherbyna with Igel [13] using the Night Nurse NNUE net by Dietrich Kappe [14]. Daniel Shawul added NNUE support à la CFish into his egbbdll probing library of Scorpio [15] [16], making it even easier to use NNUE. The promising engines Halogen 7 and 8 by Kieren Pearson, and Seer by Connor McMonigle, came with their own, distinct NNUE implementations, and on November 10, 2020, the commercial Dragon by Komodo Chess aka Komodo NNUE appeared [17], trying to close the gap to Stockfish NNUE. The commercial Fat Fritz 2.0, based on a slightly modified Stockfish 12 using a customized, double-sized network, was released by ChessBase in February 2021.

Architecture

Basic NNUE

The most basic form of an NNUE network consists of three layers: an input layer of length 768, one hidden layer of arbitrary size, and an output layer consisting of one neuron that represents the evaluation of the position. An NNUE network also commonly uses two perspectives, i.e. two hidden layers representing both sides are concatenated into a single hidden layer of twice the length before being forwarded to the output layer.

const int INPUT_SIZE = 768;
const int HL_SIZE = ...;

// These parameters are part of the training configuration
// These are the commonly used values as of 2024
const int SCALE = 400;
const int QA = 255;
const int QB = 64;

struct Network {
    int16_t accumulator_weights[INPUT_SIZE][HL_SIZE];
    int16_t accumulator_biases[HL_SIZE];
    int16_t output_weights[2 * HL_SIZE];
    int16_t output_bias;
};

Input Layer

The input layer of a basic NNUE network contains 768 neurons with binary values, each representing one possible combination of piece, square, and color. Each neuron can be either on or off, depending on whether a piece with the corresponding combination exists on the board. The following is a common encoding scheme seen in most NNUE engines.

// Square: 0 - 63
// PieceType: Pawn = 0, Knight = 1, Bishop, Rook, Queen, King
// Side: White = 0, Black = 1

int calculateIndex(Square square, PieceType pieceType, Side side)
{
    return side * 64 * 6 + pieceType * 64 + square;
}

Accumulator

The first hidden layer of an NNUE is also called the accumulator. The accumulator can be of arbitrary size, but integer multiples of powers of 2 are preferred, as they allow the inference to be fully optimized. A large hidden layer trades speed for higher evaluation accuracy, and vice versa. Larger hidden layers are usually preferred, as the Elo gain from a speedup tends to scale inversely with time and computational power. Most top engines have accumulator sizes of more than 1024 neurons, with some even using accumulators as wide as 3072. A large accumulator is important, as it is the only hidden layer that can be efficiently updated.

struct Accumulator {
    int16_t values[HL_SIZE];
};

Perspective

The vast majority of NNUE architectures require two accumulators to be maintained separately. Each represents the board from the "perspective", or point of view, of one side. The current board position is always given from the white perspective. To calculate the input values from the perspective of black, flip every piece along the horizontal line of symmetry of the board and invert the colors of the pieces. Both perspectives use identical input weights and biases. During inference, the two accumulators are concatenated, with the accumulator representing the side to move coming first. The resulting vector of double the accumulator width is the "true" hidden layer, and is the vector that gets forwarded to the output layer.

int calculateIndex(Side perspective, Square square, PieceType pieceType, Side side)
{
    if (perspective == BLACK)
    {
        side   = side ^ 1;          // Flip side (0 <-> 1)
        square = square ^ 0b111000; // Mirror square
    }
    
    return side * 64 * 6 + pieceType * 64 + square;
}

struct AccumulatorPair {
    Accumulator white;
    Accumulator black;
};

Efficient Updates

The format of the input layer allows the accumulators to be efficiently updated, meaning that the neural network does not need to be fully re-evaluated after making a move. Quiet moves and promotions can only change the values of two neurons in the input layer, while captures and castling can change the values of three and four input neurons, respectively. Therefore, instead of fully recalculating the accumulator values, one can take the values of the accumulator from the previous ply, subtract the weight contributions of neurons that will be turned off, and add the weights of neurons that will be turned on. Due to the nonlinearity introduced by the activation function, these efficient updates can only be performed on the first hidden layer of the neural network.

The following is a basic implementation of efficient single add/sub:

void accumulatorAdd(struct Network* const network, struct Accumulator* accumulator, size_t index)
{
    for (int i = 0; i < HL_SIZE; i ++)
        accumulator->values[i] += network->accumulator_weights[index][i];
}

void accumulatorSub(struct Network* const network, struct Accumulator* accumulator, size_t index)
{
    for (int i = 0; i < HL_SIZE; i ++)
        accumulator->values[i] -= network->accumulator_weights[index][i];
}

Output Layer

Most NNUE networks have exactly one output neuron, indicating the heuristic score of a position. The accumulator values are passed through an activation function, then forwarded to the output layer by taking the dot product with the output weights and adding the output bias.

During training, the final evaluation value is passed through a sigmoid activation to be trained against a target value. The target value is the result of a linear interpolation between the game result (0, 0.5, 1) and sigmoid(evaluation / eval_scale).

However, during inference the sigmoid function is unused, and the final evaluation is scaled by eval_scale. This produces a reasonable evaluation value with a large range and sufficient precision.
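
As an illustration, the training target described above might be computed like this. This is a minimal sketch; training_target, lambda and game_result are hypothetical names for the blended target, the interpolation weight, and the game outcome (requires math.h for exp).

// Sigmoid used on the training side; it is not applied during inference.
double sigmoid(double x)
{
    return 1.0 / (1.0 + exp(-x));
}

// lambda in [0, 1] blends the two targets; game_result is 0, 0.5 or 1.
double training_target(double evaluation, double game_result,
                       double eval_scale, double lambda)
{
    return lambda * sigmoid(evaluation / eval_scale)
         + (1.0 - lambda) * game_result;
}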

Activation Function Notes

ReLU (Rectified Linear Unit): f(x) = max(x, 0). Rarely used due to its potential to overflow.
CReLU (Clipped ReLU): f(x) = clamp(x, 0, 1). Used less commonly. Easy to implement and can be auto-vectorized.
SCReLU (Squared Clipped ReLU): f(x) = clamp(x, 0, 1)^2. Most prevalent and produces the strongest networks. Cannot be auto-vectorized, and requires hand-written SIMD code to optimize.

int16_t CReLU(int16_t value, int16_t min, int16_t max)
{
    if (value <= min)
        return min;

    if (value >= max)
        return max;

    return value;
}

// Example: CReLU activation
int32_t activation(int16_t value)
{
    return CReLU(value, 0, QA);
}

// When forwarding the accumulator values, the network does not consider the color of the perspectives.
// Rather, we are more interested in whether the accumulator is from the perspective of the side-to-move.
int32_t forward(struct Network*     const network,
                struct Accumulator* const stm_accumulator,
                struct Accumulator* const nstm_accumulator)
{
    int32_t eval = 0;

    // Dot product to the weights
    for (int i = 0; i < HL_SIZE; i++)
    {
        // BEWARE of integer overflows here.
        eval += activation( stm_accumulator->values[i]) * network->output_weights[i];
        eval += activation(nstm_accumulator->values[i]) * network->output_weights[i + HL_SIZE];
    }

    // Uncomment the following dequantization step when using SCReLU
    // eval /= QA;
    eval += network->output_bias;

    eval *= SCALE;
    eval /= QA * QB;

    return eval;
}

Quantization

You may be wondering where the constants QA and QB came from, or why the final evaluation is divided by their product. The answer is that most NNUE networks are quantized. During training, the weights of the network are represented as floating-point values, since they offer a continuous range of values and reasonable accuracy. However, floating-point arithmetic is slow, and since floating-point operations are not completely accurate, small errors could build up on each accumulator update, resulting in an inaccurate evaluation. Therefore, the trained floating-point weights are quantized before loading the network.

Quantization of an NNUE goes as follows: multiply the weights and biases by some amount, then round the values to integers. This loses some precision, but not enough to significantly affect evaluation accuracy. Suppose we quantize the input-accumulator weights by QA; then a floating-point value of 1 at the accumulator corresponds to QA after quantization. Therefore, the accumulator biases are also scaled by QA, and the activation function no longer clips the value to 1, but to QA. Suppose we then quantize the accumulator-output weights by QB; then, assuming CReLU activation, a floating-point value of 1 at the output node corresponds to QA * QB after quantization. Thus, the output bias is scaled by QA * QB. At the end of inference, we divide out the quantization, leaving us a very close approximation of the network evaluation.

SCReLU activation makes things messier. A floating-point value of 1 in the accumulator corresponds to QA * QA after SCReLU, so the output bias would have to be scaled by QA * QA * QB. However, this does not fit inside a 16-bit integer. To prevent this, some trainers will still scale the output bias by QA * QB, so you divide the dot product by QA first. This puts the output scaling back to QA * QB, and the dequantization can continue as usual.
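
A minimal sketch of the quantization step described above, assuming CReLU activation; the FloatNetwork struct is a hypothetical stand-in for the floating-point weights produced by the trainer (roundf comes from math.h).

// Hypothetical float-valued network as produced by the trainer
struct FloatNetwork {
    float accumulator_weights[INPUT_SIZE][HL_SIZE];
    float accumulator_biases[HL_SIZE];
    float output_weights[2 * HL_SIZE];
    float output_bias;
};

void quantize(const struct FloatNetwork* in, struct Network* out)
{
    // A value of 1 at the accumulator corresponds to QA after quantization
    for (int i = 0; i < INPUT_SIZE; i++)
        for (int j = 0; j < HL_SIZE; j++)
            out->accumulator_weights[i][j] = (int16_t) roundf(in->accumulator_weights[i][j] * QA);

    for (int j = 0; j < HL_SIZE; j++)
        out->accumulator_biases[j] = (int16_t) roundf(in->accumulator_biases[j] * QA);

    // Output weights are scaled by QB
    for (int j = 0; j < 2 * HL_SIZE; j++)
        out->output_weights[j] = (int16_t) roundf(in->output_weights[j] * QB);

    // With CReLU, a value of 1 at the output corresponds to QA * QB after quantization
    out->output_bias = (int16_t) roundf(in->output_bias * QA * QB);
}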

Advanced NNUE Techniques

Output Buckets

In a network with output buckets, the output weights and biases are split into several versions. The index of the active output weights is typically chosen by considering the number of non-king pieces left on the board. The following code illustrates one simple formula:

const int OUTPUT_BUCKETS = 8; // can be adjusted
const int DIVISOR = (32 + OUTPUT_BUCKETS - 1) / OUTPUT_BUCKETS;

int choose_output_buckets(uint64_t occupancy_bb)
{
    return (bit_count(occupancy_bb) - 2) / DIVISOR;
}
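
With output buckets, the output layer of the earlier Network struct simply gains an extra dimension. The following is a sketch of a possible layout; real engines may order the dimensions differently.

struct BucketedNetwork {
    int16_t accumulator_weights[INPUT_SIZE][HL_SIZE];
    int16_t accumulator_biases[HL_SIZE];
    int16_t output_weights[OUTPUT_BUCKETS][2 * HL_SIZE];
    int16_t output_bias[OUTPUT_BUCKETS];
};

During inference, forward() then uses the output weights and bias selected by choose_output_buckets(occupancy_bb).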

Horizontal Mirroring

In a network with horizontal mirroring, the king is always on the left side of the board from the network's point of view. If the king crosses the middle line, the entire board is flipped (from the accumulator's perspective) so that the king remains on one side of the board. Note that the accumulator cannot be efficiently updated when the king crosses the vertical line of symmetry, as the index of every input neuron would have to be recalculated. Horizontal mirroring doubles the effectiveness of training data, as well as providing more king location information to the network.
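
A sketch of how the feature index could incorporate horizontal mirroring, assuming square 0 = a1, the file stored in the low three bits of the square, and kingSquare being the perspective's own king; the exact flip constants depend on the square encoding.

int calculateIndexMirrored(Side perspective, Square square, PieceType pieceType,
                           Side side, Square kingSquare)
{
    if (perspective == BLACK)
    {
        side       = side ^ 1;              // Flip side
        square     = square ^ 0b111000;     // Flip rank
        kingSquare = kingSquare ^ 0b111000;
    }

    if ((kingSquare & 0b000111) >= 4)       // King on files e-h: mirror horizontally
        square = square ^ 0b000111;

    return side * 64 * 6 + pieceType * 64 + square;
}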

King Input Buckets

An application of the mixture of experts technique. With king input buckets, several versions of the accumulator weights are trained. The board is divided into regions, and the region containing the king determines which set of accumulator weights is used. Like horizontal mirroring, a boundary-crossing king move is not efficiently updatable, and the accumulator must undergo a full refresh. However, the improvement in evaluation accuracy is oftentimes worth the slowdown.
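
As an illustration, a hypothetical 4-bucket scheme indexed by the perspective's king square might look like this; the KING_BUCKETS value and the bucket map are purely example choices.

const int KING_BUCKETS = 4; // Example value

// Example bucket map, indexed by the king square as seen from the perspective
const int KING_BUCKET_MAP[64] = {
    0, 0, 1, 1, 1, 1, 0, 0,
    2, 2, 2, 2, 2, 2, 2, 2,
    3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3,
    3, 3, 3, 3, 3, 3, 3, 3,
};

// The accumulator weights gain an extra dimension:
// int16_t accumulator_weights[KING_BUCKETS][INPUT_SIZE][HL_SIZE];
int kingBucket(Square kingSquare)
{
    return KING_BUCKET_MAP[kingSquare];
}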

Feature Factorizer

A technique for training king-bucketed nets. During training, a "shadow" bucket called the factorizer is added to the network. The factorizer is always active and contributes to the evaluation of every position. When the training run finishes, the factorizer weights are added back into each bucket. Factorizers allow faster learning for buckets, especially ones with low training data coverage.

Pairwise Multiplication

Before concatenating the accumulators, divide each accumulator into two parts of equal length. Multiply each element in the first part with the corresponding element in the second part, resulting in a vector half the size of the original accumulator. The hidden layer becomes the result of concatenating two such vectors. With pairwise multiplication, the hidden layer is reduced to the same width as each of the accumulators. At the same accumulator size, this speeds up inference but decreases evaluation accuracy. However, it allows even wider hidden layers as well as more complex architectures with multiple hidden layers.
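
A scalar sketch of the forward pass with pairwise multiplication, assuming CReLU is applied to each half before the multiplication; the NetworkPairwise struct is a hypothetical layout in which the output layer is only HL_SIZE wide. In a real engine this loop would be vectorized.

// Hypothetical network layout for pairwise multiplication
struct NetworkPairwise {
    int16_t accumulator_weights[INPUT_SIZE][HL_SIZE];
    int16_t accumulator_biases[HL_SIZE];
    int16_t output_weights[HL_SIZE];
    int16_t output_bias;
};

int32_t forward_pairwise(struct NetworkPairwise* const network,
                         struct Accumulator*     const stm_accumulator,
                         struct Accumulator*     const nstm_accumulator)
{
    int32_t eval = 0;

    for (int i = 0; i < HL_SIZE / 2; i++)
    {
        int32_t us_a   = CReLU( stm_accumulator->values[i],               0, QA);
        int32_t us_b   = CReLU( stm_accumulator->values[i + HL_SIZE / 2], 0, QA);
        int32_t them_a = CReLU(nstm_accumulator->values[i],               0, QA);
        int32_t them_b = CReLU(nstm_accumulator->values[i + HL_SIZE / 2], 0, QA);

        // BEWARE of integer overflows here.
        eval += us_a   * us_b   * network->output_weights[i];
        eval += them_a * them_b * network->output_weights[i + HL_SIZE / 2];
    }

    eval /= QA; // Remove the extra QA factor introduced by the multiplication
    eval += network->output_bias;

    eval *= SCALE;
    eval /= QA * QB;

    return eval;
}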

Multiple Layers

Some engines, such as Stockfish, Ethereal and Viridithas, have networks that consist of multiple hidden layers.

Performance Improvements

Fused Updates

Instead of calling accumulatorAdd or accumulatorSub multiple times to update the accumulator, merge these updates into a single function such as accumulatorAddSub or accumulatorAddSubSub. This improves the performance of efficient updates, since it saves loop passes over the accumulator and allows registers to be reused.
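
A minimal sketch of such fused updates: a quiet move adds one feature and removes one, while a capture removes an additional feature for the captured piece.

void accumulatorAddSub(struct Network* const network, struct Accumulator* accumulator,
                       size_t addIndex, size_t subIndex)
{
    for (int i = 0; i < HL_SIZE; i++)
        accumulator->values[i] += network->accumulator_weights[addIndex][i]
                                - network->accumulator_weights[subIndex][i];
}

void accumulatorAddSubSub(struct Network* const network, struct Accumulator* accumulator,
                          size_t addIndex, size_t subIndex1, size_t subIndex2)
{
    for (int i = 0; i < HL_SIZE; i++)
        accumulator->values[i] += network->accumulator_weights[addIndex][i]
                                - network->accumulator_weights[subIndex1][i]
                                - network->accumulator_weights[subIndex2][i];
}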

Copy-Make

Store a ply-indexed array of accumulators and only perform updates once during make-move. This approach is faster compared to make/unmake and allows optimizations such as Lazy Updates.
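
A sketch of the copy-make approach for a quiet move, assuming a hypothetical MAX_PLY limit and the fused accumulatorAddSub from the previous section.

const int MAX_PLY = 256; // Hypothetical search depth limit

struct AccumulatorStack {
    struct AccumulatorPair stack[MAX_PLY];
    int ply;
};

// On make-move: copy the previous accumulators, then apply the update in place.
void pushQuietMove(struct AccumulatorStack* s, struct Network* const network,
                   size_t whiteAdd, size_t whiteSub, size_t blackAdd, size_t blackSub)
{
    s->stack[s->ply + 1] = s->stack[s->ply];
    s->ply++;
    accumulatorAddSub(network, &s->stack[s->ply].white, whiteAdd, whiteSub);
    accumulatorAddSub(network, &s->stack[s->ply].black, blackAdd, blackSub);
}

// On unmake-move: simply pop, nothing needs to be recomputed.
void popMove(struct AccumulatorStack* s)
{
    s->ply--;
}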

Lazy Updates

Delay accumulator updates until the evaluation call. There are variants of lazy updates, but the following implementation is the simplest and most common. On make-move, instead of updating the accumulator, only mark it as dirty and save the features that were changed by the current move. Before the network is evaluated, travel back through the accumulator stack until reaching a clean accumulator. Then perform incremental updates from that clean accumulator forward to the current accumulator, refreshing every accumulator in between. This is a speedup because evaluation is not necessary in certain nodes, as most engines cache the static evaluation in the Transposition Table. Lazy updates are especially important in multi-threaded search because of the shared Transposition Table in Lazy SMP.
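
A rough sketch of the bookkeeping involved; DirtyPiece, applyDirtyPiece and the clean flag are hypothetical names used only for illustration.

// Features added/removed by the move made at this ply (up to two each, e.g. castling)
struct DirtyPiece {
    int numAdds, numSubs;
    size_t adds[2], subs[2];
};

struct LazyEntry {
    struct AccumulatorPair accumulators;
    struct DirtyPiece dirty;
    int clean; // Nonzero once the accumulators reflect the position at this ply
};

// Called just before evaluation: walk back to the last clean entry, then
// replay the stored updates forward, marking every entry clean on the way.
void resolveAccumulators(struct LazyEntry* stack, int ply, struct Network* const network)
{
    int first = ply;
    while (!stack[first].clean)
        first--;

    for (int p = first + 1; p <= ply; p++)
    {
        stack[p].accumulators = stack[p - 1].accumulators;
        applyDirtyPiece(network, &stack[p].accumulators, &stack[p].dirty); // Hypothetical helper
        stack[p].clean = 1;
    }
}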

Accumulator Refresh Table

Introduced by Koivisto author Finn Eggers, and colloquially referred to as Finny Tables. When performing lazy updates on architectures with king-bucketed inputs, one may never reach a clean accumulator that is reachable via efficient updates. Since a full refresh is very expensive, one can save some computation by storing a king-bucket-indexed table of accumulators and their associated bitboards. Whenever one reaches a point without a clean accumulator, probe the cached accumulator and bitboards from the refresh table, and update from the cached accumulator by computing the difference between the cached bitboards and the current bitboards. Once an updated accumulator has been obtained, update the cache entry with the current accumulator and bitboards.
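
A sketch of one way to lay out such a table, assuming king-bucketed inputs with a KING_BUCKETS constant as sketched earlier and one entry per (king bucket, perspective) pair; the per-piece bitboard layout is a simplification.

// Cached accumulator plus the piece placement it was computed from
struct RefreshTableEntry {
    struct Accumulator accumulator;
    uint64_t piece_bb[2][6]; // [color][piece type] bitboards of the cached position
};

// One entry per king bucket and perspective
struct RefreshTable {
    struct RefreshTableEntry entries[KING_BUCKETS][2];
};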

SIMD

SIMD is an acronym for Single Instruction, Multiple Data. It allows the CPU to process multiple data elements simultaneously. Compilers can automatically generate SIMD instructions to improve the speed of NNUE inference. However, auto-vectorization is not perfect, and there are still opportunities to optimize NNUE inference with hand-written SIMD code.

Lizard SCReLU

This optimization was introduced by Liam McGuire in the Lizard chess engine. It significantly speeds up inference of a simple NNUE architecture with SCReLU activation.

Consider the following loop to forward a SCReLU-activated hidden layer:

const int QA = 255;
const int QB = 64;

for (int i = 0; i < HL_SIZE; i++)
{
    const int us_clamped = clamp(stm_accumulator->values[i], 0, QA);
    const int them_clamped = clamp(nstm_accumulator->values[i], 0, QA);
    eval += us_clamped * us_clamped * network->output_weights[i];
    eval += them_clamped * them_clamped * network->output_weights[i + HL_SIZE];
}

How can one vectorize this loop? Since QA = 255, SCReLU(x) can be as large as 65025, which is outside the range of 16-bit integers. One may consider using a lower quantization factor, but the resulting loss of precision costs too much Elo. Fortunately, there is another way to approach this problem.

Say the value after clamping is v and the output weight is w. Instead of calculating (v * v) * w, one can calculate it in a different order: (v * w) * v. This way, as long as v * w is in range, the code will not overflow. Therefore, during training, we can clip w to ensure it always stays within [-1.98, 1.98]. Since v has a maximum of QA, and w is scaled by QB, the maximum of v * w is QA * QB * 1.98, which is within the maximum representable value of 16-bit integers. To finish the last step, we can use a multiply-add-adjacent intrinsic such as vpmaddwd or _mm256_madd_epi16 to multiply corresponding 16-bit elements (implicitly promoting them to 32-bit integers) and sum adjacent pairs of products, yielding a vector of eight 32-bit integers. In code, this looks like the following:

const int QA = 255;
const int QB = 64;

const __m256i vec_zero = _mm256_setzero_si256();
const __m256i vec_qa   = _mm256_set1_epi16(QA);
__m256i sum      = vec_zero;

for (int i = 0; i < ...; i ...)
{
    const __m256i us   = ...; // Load from accumulator
    const __m256i them = ...;
    const __m256i us_weights   = ...; // Load from net
    const __m256i them_weights = ...;

    const __m256i us_clamped   = _mm256_min_epi16( _mm256_max_epi16(   us, vec_zero ), vec_qa );
    const __m256i them_clamped = _mm256_min_epi16( _mm256_max_epi16( them, vec_zero ), vec_qa );

    const __m256i us_results   = _mm256_madd_epi16( _mm256_mullo_epi16(   us_weights,   us_clamped ),   us_clamped );
    const __m256i them_results = _mm256_madd_epi16( _mm256_mullo_epi16( them_weights, them_clamped ), them_clamped );

    sum = _mm256_add_epi32(sum,   us_results);
    sum = _mm256_add_epi32(sum, them_results);
}
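
After the loop, the eight 32-bit lanes of sum still have to be reduced to a single value and dequantized. The following is a possible epilogue, a sketch assuming AVX2 and the same network pointer and dequantization as in the scalar SCReLU example earlier.

// Reduce the eight 32-bit lanes of `sum` to a single scalar
__m128i lo = _mm256_castsi256_si128(sum);
__m128i hi = _mm256_extracti128_si256(sum, 1);
__m128i s  = _mm_add_epi32(lo, hi);
s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
int32_t eval = _mm_cvtsi128_si32(s);

// The (v * w) * v ordering leaves an extra factor of QA in the result,
// so dequantize exactly as in the scalar SCReLU case.
eval /= QA;
eval += network->output_bias;
eval = eval * SCALE / (QA * QB);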

See also

Publications

Forum Posts

2020

January ...

Re: The Stockfish of shogi by Gian-Carlo Pascutto, CCC, January 18, 2020

July

Re: NNUE accessible explanation by Jonathan Rosenthal, CCC, July 23, 2020
Re: NNUE accessible explanation by Jonathan Rosenthal, CCC, July 24, 2020
Re: NNUE accessible explanation by Jonathan Rosenthal, CCC, August 03, 2020

August

September

Re: First success with neural nets by Jonathan Kreuzer, CCC, November 11, 2020 » Checkers

October

Re: Hacking around CFish NNUE by Daniel Shawul, CCC, October 15, 2020 » Scorpio
Re: How to scale stockfish NNUE score? by Daniel Shawul, CCC, October 17, 2020
July 01, 2021 continuation

November

Re: Speculations about NNUE development (was New engine releases 2020) by Connor McMonigle, CCC, November 12, 2020
Re: Speculations about NNUE development (was New engine releases 2020) by Connor McMonigle, CCC, November 12, 2020 » Dragon by Komodo Chess, Seer, Halogen

December

2021

January

February

March

April

May

June

Re: Commercial Release of Ethereal 13.00 (NNUE) for AVX2 Systems by Andrew Grant, CCC, June 04, 2021 » HalfKP
Re: will Tcec allow Stockfish with a Leela net to play? by Daniel Shawul, CCC, June 18, 2021 » Scorpio
Re: will Tcec allow Stockfish with a Leela net to play? by Vivien Clauzon, CCC, June 18, 2021 » Minic

July

Re: NNUE Question - King Placements by Daniel Shawul, July 01, 2021 » ScorpioNNUE

August ...

2022 ...

2024 ...

External Links

NNUE

Learn Next Generation Shogi Thinking Engine, NNUE Function (Part 1. Network Structure) - Computer Shogi
Let's Learn Next Generation Shogi Thinking Engine, NNUE Function (Part 2. Remodeling/Learning) - Computer Shogi

NNUE libraries

Some developers have extracted and rewritten the Stockfish NNUE code into independent libraries, which are much easier to embed into other chess engines.

Source Code

Misc

References

  1. Nue (鵺) from the Konjaku Gazu Zoku Hyakki (今昔画図続百鬼) by Toriyama Sekien, circa 1779, Wikimedia Commons
  2. Re: What does NNUE actually mean by Henk Drost, CCC, July 29, 2020
  3. Introduction to Shogi AI Development for Machine Learning Engineers, Part 1 (機械学習エンジニアのための将棋AI開発入門その1), May 03, 2020 (Japanese)
  4. Yu Nasu (2018). ƎUИИ Efficiently Updatable Neural-Network based Evaluation Functions for Computer Shogi. Ziosoft Computer Shogi Club, pdf (Japanese with English abstract)
  5. GitHub - yaneurao/YaneuraOu: YaneuraOu is the World's Strongest Shogi engine(AI player), WCSC29 1st winner, educational and USI compliant engine
  6. GitHub - Tama4649/Kristallweizen: Kristallweizen, runner-up of the 29th World Computer Shogi Championship (第29回世界コンピュータ将棋選手権)
  7. The Stockfish of shogi by Larry Kaufman, CCC, January 07, 2020
  8. Stockfish NN release (NNUE) by Henk Drost, CCC, May 31, 2020
  9. Can the sardine! NNUE clobbers SF by Henk Drost, CCC, July 16, 2020
  10. Orion 0.7 : NNUE experiment by David Carteau, CCC, August 19, 2020
  11. Re: New engine releases 2020...Don NNUE 2020? by supersharp77, CCC, August 19, 2020
  12. ... the last shall be first ... by MikeB, CCC, August 19, 2020
  13. Introducing Igel chess engine by Volodymyr Shcherbyna, CCC, August 20, 2020
  14. Night Nurse 0.2 by Dietrich Kappe, CCC, August 19, 2020
  15. Re: Hacking around CFish NNUE by Daniel Shawul, CCC, October 15, 2020
  16. Re: How to scale stockfish NNUE score? by Daniel Shawul, CCC, October 17, 2020
  17. Dragon by Komodo Chess by Larry Kaufman, CCC, November 10, 2020
  18. Translation of Yu Nasu's NNUE paper by Dominik Klein, CCC, January 07, 2021
  19. Book about Neural Networks for Chess by dkl, CCC, September 29, 2021
  20. Outer product from Wikipedia
  21. Tensor product from Wikipedia
  22. PyTorch from Wikipedia
  23. TensorFlow from Wikipedia
  24. Vafra by Robert Jurjević
  25. Update default net to nn-8a08400ed089.nnue by Sopel97 · Pull Request #3474 · official-stockfish/Stockfish · GitHub by Tomasz Sobczyk
  26. nnue-trainer by Jon Dart, CCC, March 27, 2021
