GPU

From Chessprogramming wiki


GPU (Graphics Processing Unit),
a specialized processor primarily intended for fast image processing. GPUs may have more raw computing power than general purpose CPUs, but they require a specialized, massively parallel programming model. Leela Chess Zero has demonstrated that a best-first Monte-Carlo Tree Search (MCTS) with deep learning methodology works well on GPU architectures.

GPGPU

The traditional job of a GPU is to take the x,y,z coordinates of triangles and map those triangles to screen space through a matrix multiplication. As the number of triangles and polygons grew to support more sophisticated models, GPU designers built massively parallel architectures capable of performing hundreds of millions of these transformations hundreds of times per second.
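As a rough illustration of that transform, the sketch below writes the screen-space mapping as a general purpose CUDA kernel in which each thread multiplies one vertex (in homogeneous coordinates) by a 4x4 row-major matrix; the kernel and buffer names are illustrative only, not taken from any real driver or engine.

    #include <cuda_runtime.h>

    // Hedged sketch: one thread transforms one vertex by a 4x4 matrix M (row-major),
    // i.e. the classic model-view-projection step of the old fixed-function pipeline,
    // written as a general purpose kernel.
    __global__ void transformVertices(const float4 *in, float4 *out,
                                      const float *M, int nVertices)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= nVertices) return;

        float4 v = in[i];                       // homogeneous coordinates (x, y, z, w)
        float4 r;
        r.x = M[0]*v.x  + M[1]*v.y  + M[2]*v.z  + M[3]*v.w;
        r.y = M[4]*v.x  + M[5]*v.y  + M[6]*v.z  + M[7]*v.w;
        r.z = M[8]*v.x  + M[9]*v.y  + M[10]*v.z + M[11]*v.w;
        r.w = M[12]*v.x + M[13]*v.y + M[14]*v.z + M[15]*v.w;
        out[i] = r;                             // perspective divide and clipping follow later
    }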

These lists of triangles (as well as their colors, textures, reflectivity, and other attributes) were traditionally specified through a graphics API such as DirectX or OpenGL. But video game programmers demanded more and more flexibility from their hardware for effects such as lighting, transparency, reflections, and particles. This flexibility was granted through full-scale programming languages, called vertex shaders or pixel shaders, with which graphics programmers can customize the vertex-processing or pixel-processing portions of their graphics code.

Eventually, the fixed functionality of GPUs disappeared, and GPUs became massively parallel general purpose computers. Graphics APIs such as DirectX and OpenGL still call these capabilities "vertex shaders" or "pixel shaders" for historical reasons. But to properly abstract these general purpose capabilities, modern GPGPU languages have been created.

These general purpose GPU (GPGPU) languages all have the same goal: To expose the SIMD-style architecture to the programmer as directly as possible.
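For instance, the canonical GPGPU "hello world" maps one logical thread to one data element and lets the hardware pack those threads into SIMD lanes. A minimal, self-contained CUDA sketch (array sizes and names are arbitrary):

    #include <cuda_runtime.h>
    #include <cstdio>

    // Minimal GPGPU example: each thread adds one pair of elements.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);      // unified memory keeps the sketch short
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

        int threads = 256;                               // threads per block
        int blocks  = (n + threads - 1) / threads;       // blocks per grid
        vecAdd<<<blocks, threads>>>(a, b, c, n);         // the thread hierarchy is explicit
        cudaDeviceSynchronize();

        printf("c[0] = %f\n", c[0]);                     // expect 3.0
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }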

Khronos OpenCL

The Khronos Group is a standardization committee formed to oversee the OpenGL, OpenCL, and Vulkan standards. Although compute shaders exist in OpenGL and Vulkan as well, OpenCL is Khronos' designated general purpose compute language.

OpenCL 1.2 is widely supported by AMD, NVidia, and Intel. OpenCL 2.0, although specified in 2013, has had a slow rollout, and its features are not yet widespread across modern GPUs. AMD continues to target OpenCL 2.0 support in its ROCm environment, while NVidia has implemented only some OpenCL 2.0 features.

NVidia Software Overview

NVidia CUDA is their general purpose compute framework. CUDA has a C++ compiler based on LLVM/Clang, which compiles into an assembly-like intermediate language called PTX. The NVidia device driver then compiles PTX down to the final machine code, called SASS. NVidia keeps PTX portable between its GPUs, while the SASS instruction set may change from year to year as NVidia releases new GPUs.
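To see that pipeline in practice, a trivial kernel can be compiled to PTX and disassembled to SASS with the standard CUDA tools; the file names below are placeholders and the commands assume an installed CUDA toolkit.

    // square.cu -- trivial kernel used only to inspect the compilation pipeline.
    __global__ void square(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = x[i] * x[i];
    }

    // Typical inspection steps:
    //   nvcc -ptx square.cu -o square.ptx       # C++ -> PTX (portable virtual ISA)
    //   nvcc -cubin -arch=sm_70 square.cu       # PTX -> SASS for one concrete GPU
    //   cuobjdump -sass square.cubin            # disassemble the final machine code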

AMD Software Overview

AMD's original software stack, called AMDGPU-pro, provides OpenCL 1.2 and 2.0 capabilities on Linux and Windows. However, most of AMD's effort today goes into an experimental framework called ROCm, AMD's open source compiler and device driver stack intended for general purpose compute.

ROCm supports two languages: HIP (a CUDA-like interface) and OpenCL 2.0. ROCm only works on Linux machines with modern hardware, such as PCIe 3.0 and relatively recent GPUs (such as the Rx 580 and Vega GPUs).
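Because HIP deliberately mirrors the CUDA runtime API, CUDA code usually ports almost mechanically. The sketch below is written in CUDA, with the commonly documented HIP counterparts noted in comments; treat the exact HIP names as an assumption to be checked against the ROCm documentation.

    #include <cuda_runtime.h>      // HIP: #include <hip/hip_runtime.h>

    __global__ void scale(float *x, float s, int n)   // identical kernel syntax in HIP
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= s;
    }

    void runScale(float *hostData, int n)
    {
        float *d;
        cudaMalloc(&d, n * sizeof(float));                     // HIP: hipMalloc
        cudaMemcpy(d, hostData, n * sizeof(float),
                   cudaMemcpyHostToDevice);                    // HIP: hipMemcpy
        scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);           // HIP: same syntax (or hipLaunchKernelGGL)
        cudaMemcpy(hostData, d, n * sizeof(float),
                   cudaMemcpyDeviceToHost);
        cudaFree(d);                                           // HIP: hipFree
    }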

Other 3rd party tools

  • DirectCompute (GPGPU API by Microsoft)
  • OpenMP 4.5 Device Offload

Inside

Modern GPUs consist of up to hundreds of SIMD or vector units, grouped into compute units. Each compute unit processes multiple Warps (Nvidia term) or Wavefronts (AMD term) in SIMT fashion. Each Warp or Wavefront runs n (32 or 64) threads in lockstep.

The Nvidia GeForce GTX 580, for example, runs 32 threads per Warp and up to 24576 threads in total, spread over 16 compute units with a total of 512 cores. [2] The AMD Radeon HD 7970 runs 64 threads per Wavefront and up to 81920 threads in total, spread over 32 compute units with a total of 2048 cores. [3] In practice, register and shared memory usage limits the total number of resident threads.
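This thread hierarchy is visible directly in kernel code. A hedged CUDA sketch of how a thread determines its global index, its warp, and its SIMD lane (the kernel name and output buffers are illustrative):

    __global__ void whereAmI(int *globalIdx, int *warpIdx, int *laneIdx)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        globalIdx[i] = i;
        warpIdx[i]   = threadIdx.x / warpSize;           // which warp inside the block
        laneIdx[i]   = threadIdx.x % warpSize;           // which SIMD lane inside the warp
        // On a GTX 580 (Fermi): 16 compute units * 48 resident warps * 32 lanes = 24576 threads.
    }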

Memory

The memory hierarchy of a GPU consists mainly of private memory (registers, accessed by a single thread or work-item), local memory (shared by the threads of a block or the work-items of a work-group), constant memory, different types of cache, and global memory. Size, latency and bandwidth vary between vendors and architectures.
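In CUDA these spaces map onto language qualifiers; OpenCL uses __private, __local, __constant and __global analogously. A minimal sketch (the weights array and kernel name are made up for illustration):

    __constant__ float weights[64];        // constant memory: cached, read-only in kernels

    __global__ void memorySpaces(const float *gIn, float *gOut)  // gIn/gOut: global memory
    {
        __shared__ float tile[256];        // local/shared memory, visible to the whole block
        float acc;                         // private memory: a register of this thread

        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = gIn[i];        // stage data from global into shared memory
        __syncthreads();                   // make the tile visible to all threads of the block
                                           // (assumes blockDim.x <= 256)
        acc = tile[threadIdx.x] * weights[threadIdx.x % 64];
        gOut[i] = acc;                     // write the result back to global memory
    }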

Here, as an example, the data for the Nvidia GeForce GTX 580 (Fermi): [4]

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache
  • 1.5 GiB to 3 GiB global memory

Here, as an example, the data for the AMD Radeon HD 7970 (GCN): [5]

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache
  • 3 GiB to 6 GiB global memory

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

  • 32 bit Integer Performance
Depending on architecture and operation, 32 bit integer throughput can be lower than 32 bit floating-point or 24 bit integer throughput.
  • 64 bit Integer Performance
Current GPU registers and Vector-ALUs are 32 bit wide and have to emulate 64 bit integer operations.[6] [7]
  • Mixed Precision Support
Newer architectures like Nvidia Turing and AMD Vega have mixed precision support. Vega doubles the FP16 and quadruples the INT8 throughput. [8] Turing doubles the FP16 throughput of its FPUs. [9]
  • TensorCores
With the Nvidia Volta series, TensorCores were introduced: fp16*fp16+fp32 matrix-multiply-accumulate units, used to accelerate neural networks. [10] Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation. [11] A programming sketch follows after this list.
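As a sketch of how TensorCores are programmed, CUDA exposes them through the warp-level wmma intrinsics in mma.h; the 16x16x16 fp16 fragment shape below is the standard Volta/Turing configuration, the buffer names are illustrative, and the code requires compilation for sm_70 or newer.

    #include <mma.h>
    #include <cuda_fp16.h>
    using namespace nvcuda;

    // One warp computes a 16x16 tile: C = A * B with fp16 inputs and fp32 accumulation.
    __global__ void wmmaTile(const half *a, const half *b, float *c)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> aFrag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> bFrag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> cFrag;

        wmma::fill_fragment(cFrag, 0.0f);         // start from a zero accumulator
        wmma::load_matrix_sync(aFrag, a, 16);     // leading dimension 16
        wmma::load_matrix_sync(bFrag, b, 16);
        wmma::mma_sync(cFrag, aFrag, bFrag, cFrag);                // fp16*fp16 + fp32
        wmma::store_matrix_sync(c, cFrag, 16, wmma::mem_row_major);
    }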

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit [12]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Max theoretical ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element [13]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Max theoretical ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

Host-Device Latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of about 5 microseconds [14] up to hundreds of microseconds [15]. One solution to overcome this limitation is to couple tasks into batches that are executed in one run [16].
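A rough way to observe this overhead is to time null-kernel launches with CUDA events; the sketch below is a hedged micro-benchmark, and the measured value depends heavily on driver, operating system, and whether the launch path is already warm.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void nullKernel() {}        // does nothing: timing it isolates launch overhead

    int main()
    {
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        nullKernel<<<1, 1>>>();            // warm-up launch (context creation, module load)
        cudaDeviceSynchronize();

        const int launches = 1000;
        cudaEventRecord(start);
        for (int i = 0; i < launches; i++) {
            nullKernel<<<1, 1>>>();
            cudaDeviceSynchronize();       // wait for each launch: measures the full round trip
        }
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("average host-device round trip: %.1f microseconds\n",
               1000.0f * ms / launches);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return 0;
    }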

Deep Learning

GPUs are much better suited than CPUs for implementing and training Convolutional Neural Networks (CNN), and were therefore a driving force behind the deep learning boom. This also affects game playing programs that combine CNNs with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and by the open source project Leela Zero, headed by Gian-Carlo Pascutto, for Go, and its Leela Chess Zero adaptation.

Forum Posts

2011

Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011

2018

Re: How good is the RTX 2080 Ti for Leela? by Ankan Banerjee, CCC, September 16, 2018

External Links

Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper

References

  1. Graphics processing unit - Wikimedia Commons
  2. CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability
  3. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  4. CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES
  5. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  6. AMD Vega White Paper
  7. Nvidia Turing White Paper
  8. Vega (GCN 5th generation) from Wikipedia
  9. AnandTech - Nvidia Turing Deep Dive page 4
  10. INSIDE VOLTA
  11. AnandTech - Nvidia Turing Deep Dive page 6
  12. CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
  13. AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
  14. host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
  15. host-device latencies? by Srdja Matovic AMD Developer Community, Feb 28, 2019
  16. Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
  17. Jetson TK1 Embedded Development Kit | NVIDIA
  18. Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
  19. Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
  20. Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
  21. Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
  22. Parallel Thread Execution from Wikipedia
  23. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
  24. ankan-ban/perft_gpu · GitHub
  25. Tensor processing unit from Wikipedia
  26. GeForce 20 series from Wikipedia
  27. Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
  28. Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013
