GPU

From Chessprogramming wiki
Jump to: navigation, search

Home * Hardware * GPU

GPU (Graphics Processing Unit),
a specialized processor primarily intended to fast image processing. GPUs may have more raw computing power than general purpose CPUs but need a specialized and massive parallelized way of programming. Leela Chess Zero has proven that a Best-first Monte-Carlo Tree Search (MCTS) with deep learning methodology will work with GPU architectures.

GPGPU

The traditional job of a GPU is to take the x,y,z coordinates of triangles, and map these triangles to screen space through a matrix multiplication. As video game graphics grew more sophisticated, the number of triangles per scene grew larger. GPUs similarly grew in size to massively parallel behemoths capable of performing billions of transformations hundreds of times per second.

These lists of triangles were specified in Graphics APIs like DirectX. But video game programmers demanded more flexibility from their hardware: such as lighting, transparency, and reflections. This flexibility was granted with specialized programming languages, called vertex shaders or pixel shaders. GPUs evolved to accelerate general purpose compute from pixel shader and vertex shader programmers, and even merged the functionality into "universal" shaders (which can perform either vertex shading or pixel shading).

Today, these universal shaders are flexible enough to provide General Purpose compute for GPUs (GPGPU). GPGPU languages, such as OpenCL or CUDA, is how the programmer can access this capability.

Khronos OpenCL

The Khronos group is a committee formed to oversee the OpenGL, OpenCL, and Vulkan standards. Although compute shaders exist in all languages, OpenCL is the designated general purpose compute language.

OpenCL 1.2 is widely supported by AMD, NVidia, and Intel. OpenCL 2.0, although specified in 2013, has had a slow rollout, and the specific features aren't necessarily widespread in modern GPUs yet. AMD continues to target OpenCL 2.0 support in their ROCm environment, while NVidia has implemented some OpenCL 2.0 features.

NVidia Software overview

NVidia CUDA is their general purpose compute framework. CUDA has a C++ compiler based on LLVM / clang, which compiles into an assembly-like language called PTX. NVidia device drivers take PTX and compile that down to the final machine code (called NVidia SASS). NVidia keeps PTX portable between its GPUs, while its SASS assembly language may change from year-to-year as NVidia releases new GPUs. A defining feature of CUDA was the "single source" C++ compiler, the same compiler would work with both CPU host-code and GPU device-code. This meant that the data-structures and even pointers from the CPU can be shared directly with the GPU code.

AMD Software Overview

AMD's original software stack, called AMDGPU-pro, provides OpenCL 1.2 and 2.0 capabilities on Linux and Windows. However, most of AMD's efforts today is on an experimental framework called ROCm. ROCm is AMD's open source compiler and device driver stack intended for general purpose compute. ROCm supports two languages: HIP (a CUDA-like single-source C++ compiler also based on LLVM/clang), and OpenCL 2.0. ROCm only works on Linux machines supporting modern hardware, such as PCIe 3.0 and relatively recent GPUs (such as the RX 580, and Vega GPUs).

AMD regularly publishes the assembly language details of their architectures. Their "GCN Assembly" changes slightly from generation to generation, but the fundamental principles have remained the same.

AMD's OpenCL documentation, especially the "OpenCL Programming Guide" and the "Optimization Guide" are good places to start for beginners looking to program their GPUs. For Linux developers, the ROCm environment is under active development and has enough features to get code working well.

Other 3rd party tools

  • DirectCompute (GPGPU API by Microsoft)
  • OpenMP 4.5 Device Offload

The SIMT Programming Model

CUDA, OpenCL, ROCm HIP, all have the same model of implicitly parallel programming. All threads are given an identifier: a threadIdx in CUDA or local_id in OpenCL. Aside from this index, all threads of a kernel will execute the same code. The only way to alter the behavior of code is to use this threadIdx to access different data.

The executed code is always implicitly SIMD. Instead of thinking of SIMD-lanes, each lane is considered its own thread. The smallest group of threads is called a CUDA Warp, or OpenCL Wavefront. NVidia GPUs execute 32-threads per warp, while AMD GCN GPUs execute 64-threads per wavefront. All threads within a Warp or Wavefront share an instruction pointer. Consider the following CUDA code:

   if(threadIdx.x == 0){
       doA(); 
   } else {
       doB(); 
   }

While there is only one thread in the warp that has threadIdx == 0, all 32 threads of the warp will have their shared instruction pointer execute doA() together. To keep the code semantically correct, threads #1 through #31 will have their NVidia Predicate-register cleared (or AMD Execution Mask cleared), which means the thread will throw away the work after executing a specific statement. For those familiar with x64 AVX code, a GPU thread is comparable to a SIMD-lane in AVX. All lanes of an AVX instruction will execute any particular instruction, but you may throw away the results of some registers using mask or comparison instructions.

Once doA() is complete, the machine will continue and doB(). In this case, thread#0 will have its execution mask-cleared, while threads #1 through #31 will actually complete the results of doB().

This highlights the fundamental trade off of the GPU platform. GPUs have many threads of execution, but they are forced to execute with their warps or wavefronts. In complicated loops or trees of if-statements, this thread divergence problem can cause your code to potentially leave many hardware threads idle.

Blocks and Workgroups

Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. 1024 threads can work together on a modern GPU Compute Unit (AMD) or Symmetric Multiprocessor (NVidia), sharing L1 cache, shared memory and other resources. Because of the tight coupling of L1 cache and Shared Memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a singular assembly language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.

Grids and NDRange

While warps, blocks, wavefronts and workgroups are concepts that the machine executes... Grids and NDRanges are the scope of the problem specified by a programmer. For example, the 1920x1080 screen could be defined as a Grid with 2073600 threads to execute (likely organized as a 2-dimensional 1920x1080 grid for convenience). Specifying these 2,073,600 work items is the purpose of a CUDA Grid or OpenCL NDRange.

The programmer may choose to cut up the 1920x1080 screen into blocks of size 32x32 pixels. Or maybe an algorithm is horizontal in nature, and it may be more convenient to work with blocks of 1x1024 pixels instead. Or maybe the block-sizes have been set to some video standards, and maybe 8x8 blocks (64-threads) are the biggest you can practically work with (say MPEG-2 decoder 8x8 macroblocks). Regardless, the programmer chooses a block size which is most convenient and optimized for their purposes. To complete this hypothetical example, a 1920x1080 screen could be split up into 60x34 CUDA Blocks (or OpenCL Workgroups), each covering 32x32 pixels with 1024 CUDA Threads (or OpenCL Workitems) each.

These blocks and workgroups will execute with as much parallel processing as the underlying hardware can support. Roughly 150 CUDA Blocks or OpenCL Workgroups at a time on a typical midrange GPU circa from 2019 (such as a NVidia 2060 Super or AMD 5700). The most important note is that blocks within a grid (or workgroups within an NDRange) may not execute concurrently with each other. Some degree of sequential processing may happen. If thread #0 creates a Spinlock waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently with each other, and the code would likely timeout. In practice, the easiest mechanism for Grid or NDRange sized synchronization is to wait for the kernel to finish executing: to have the CPU wait and process the results in between Grid or NDRanges.

For example: LeelaZero will schedule an NDRange for each Convolve operation, as well as merge and other primitives. The convolve operation is over a 3-dimensional NDRange for <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule different operations for the GPU: convolve, merge, transform and more.

Memory Model

OpenCL, CUDA, ROCM, and other GPU-languages all have a similar memory model.

  • __device__ (CUDA) or __global (OpenCL) memory -- OpenCL __global and CUDA __device__ memory exists on the GPU's VRAM. Any threads can access any part of __device__ or __global memory, although memory-ordering and caching details can get quite complicated if multiple threads simultaneously read and write to a particular memory location. Proper memory ordering with __threadfence() (CUDA) or mem_fence() (OpenCL) is essential to preventing memory-consistency issues.
  • __constant__ (CUDA) or __constant (OpenCL) memory -- Constants are not allowed to change during the execution of a particular kernel. Historically, this was used by Pixel Shaders as they read texture data. The texture-data could be computed and loaded onto the GPU, but the data was not allowed to change during the Pixel Shader's execution. Both NVidia and AMD GPUs have special caches (and in AMD's case: special registers called sGPRs) which accelerate constant-data. The caches associated with this memory space is sometimes called K$ (Konstant-cache), and has to be independently flushed if its data ever changes. The main benefit in both AMD and NVidia systems is that K$ values are broadcast extremely efficiently to all threads in a wavefront, but only if all threads in a wavefront are reading from the same memory location. Instead of haing 32-memory reads (NVidia) or 64-memory reads (AMD GCN), a read from K$ can be optimized into a single-read, broadcast to all 32 or 64-threads of a Warp or Wavefront.
  • __shared__ (CUDA) or __local (OpenCL) memory -- This is highly-accelerated memory regions designed for threads to exchange data within a CUDA Block or OpenCL Workgroup. On AMD Systems, there is more Local "LDS" memory than even L1 Cache (GCN) or L0 Cache (RDNA).
  • Default (CUDA) or __private (OpenCL) Memory -- Private memory typically maps to a GPU-register, and is inaccessible to other threads. If a kernel requires more memory than what can exist in GPU-registers, the data will automatically spill over into global VRAM (with an associated performance penalty). In practice, this spillover is well interleaved, well-optimized, and reduced to as small a subset as possible through compiler optimizations.


Here the data for the Nvidia GeForce GTX 580 (Fermi) as an example: [2]

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache
  • 1.5 GiB to 3 GiB global memory

Here the data for the AMD Radeon HD 7970 (GCN) as an example: [3]

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache
  • 3 GiB to 6 GiB global memory

Architectures and Physical Hardware

The market is split into three categories: server, professional, and consumer. Consumer cards are cheapest and are primarily targeted for the video game market. Professional cards have better driver support for 3d programs like Autocad. Finally, server cards provide virtualization services, allowing cloud companies to virtually split their cards between customers.

Consumer class GPUs cost anywhere from $100 to $1000. Professional cards can run to $2000, while server class cards can cost as much as $10,000.

GPUs use high-bandwidth RAM, such as GDDR6 or HBM2. GDDR6 and HBM2 are designed for the extremely parallel nature of GPUs, and can provide 200GBps to 1000GBps throughput. In comparison: a typical DDR4 channel can provide 20GBps. A dual channel desktop will typically have under 50GBps bandwidth to DDR4 main memory.

NVidia

NVidia's consumer line of cards is Geforce, branded with RTX or GTX labels. Nvidia's professional line of cards is "Quadro". Finally, Nvidia's server line of cards is "Tesla".

NVidia's "Titan" line of Geforce cards use consumer drivers, but use professional or server class chips. As such, the Titan line can cost anywhere from $1000 to $3000 per card.

Turing Architecture

Architectural Whitepaper

Turing cards were first released in 2018. They are the first consumer cores to launch with RTX, or raytracing, features. RTX instructions will more quickly traverse an aabb tree to discover ray-intersections with lists of bounding-boxes, accelerating raytracing performance. These are also the first consumer cards to launch with Tensor cores, 4x4 matrix multiplication FP16 instructions to accelerate convolutional neural networks.

  • RTX 2080 Ti
  • RTX 2080
  • RTX 2070 Ti
  • RTX 2070 Super
  • RTX 2070
  • RTX 2060 Super
  • RTX 2060
  • GTX 1660 -- Low-end GPU without Tensor cores or RTX Cores.

Volta Architecture

Architecture Whitepaper

Volta cards were released in 2017. Only Tesla and Titan cards were produced in this generation, aiming only for the most expensive end of the market. They were the first cards to launch with Tensor cores, supporting 4x4 FP16 matrix multiplications to accelerate convolutional neural networks.

  • Tesla V100
  • Titan V

Pascal Architecture

Architecture Whitepaper

Pascal cards were first released in 2016.

  • Tesla P100
  • Titan Xp
  • GTX 1080 Ti
  • GTX 1080
  • GTX 1070 Ti
  • GTX 1060
  • GTX 1050
  • GTX 1030

AMD

Navi RDNA 1.0

RDNA cards were first released in 2019. RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads. Compute Units have 2x32 wide SIMD units, each of which executes 32 threads per clock tick. A Wave64 workgroup will execute on a single SIMD unit, but over two clock ticks. It should be noted that these Wave32 still have 5 cycles of latency before registers can be reused, so a Wave64 executing over two clock ticks will have fewer stalls than a Wave32.

  • Radeon 5700 XT
  • Radeon 5700

Vega GCN 5th gen

Architecture Whitepaper

Vega cards were first released in 2017. Vega is the last in the line of the GCN Architecture: 64 threads per wavefront. Each compute unit contains 4x SIMD units, supporting a total of 40 wavefronts per compute unit (a queue of 10-wavefronts per SIMD Unit). Each SIMD unit contains 16 vALUs for general compute + 1 sALU for branching and constant logic. Each SIMD unit executes the same instruction over four clock ticks (16 vALUs x 4 clock ticks == 64 threads per Wavefront).

Vega specifically added Packed FP16 instructions, such as dot-product and packed add and packed multiply. From a programming level, these packed FP16 instructions are SIMD-within-SIMD, each SIMD thread could operate its own SIMD FP16 instruction akin to AVX or SSE from the x64 architecture.

  • Radeon VII
  • Vega64
  • Vega56

Polaris GCN 4th gen

Polaris cards were first released in 2016 under the AMD Radeon 400 series name.

Architecture Whitepaper

  • RX 580
  • RX 570
  • RX 560

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

  • 32 bit Integer Performance
The 32 bit integer performance can be architecture and operation depended less than 32 bit FLOP or 24 bit integer performance.
  • 64 bit Integer Performance
Current GPU registers and Vector-ALUs are 32 bit wide and have to emulate 64 bit integer operations.[4] [5]
  • Mixed Precision Support
Newer architectures like Nvidia Turing and AMD Vega have mixed precision support. Vega doubles the FP16 and quadruples the INT8 throughput.[6]Turing doubles the FP16 throughput of its FPUs.[7]
  • TensorCores
With Nvidia Volta series TensorCores were introduced. They offer fp16*fp16+fp32, matrix-multiplication-accumulate-units, used to accelerate neural networks.[8] Turings 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.[9]

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit [10]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element [11]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

Host-Device Latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is an measurable latency for null-kernels of 5 microseconds [12] up to 100s of microseconds [13]. One solution to overcome this limitation is to couple tasks to batches to be executed in one run [14].

Deep Learning

GPUs were originally intended to process matrix multiplications for graphical transformations and rendering. Convolutional Neural Networks can have their operations interpreted as a series of matrix multiplications. GPUs are therefore a natural fit to parallelize and process CNNs.

GPUs traditionally operated on 32-bit floating point numbers. However, CNNs can make due with 16-bit half floats (FP16), or even 8-bit or 4-bit numbers. One thousand single-precision floats will take up 4kB of space, while one-thousand FP16 will take up 2kB of space. A half-float uses half the memory, eats only half the memory bandwidth, and only half the space in caches. As such, GPUs such as AMD Vega or NVidia Volta added support for FP16 processing.

Specialized units, such as NVidia Volta's "Tensor cores", can perform an entire 4x4 block of FP16 matrix multiplications in just one PTX assembly language statement. It is with these instructions that CNN operations are accelerated.

GPUs are much more suited than CPUs to implement and train Convolutional Neural Networks (CNN), and were therefore also responsible for the deep learning boom, also affecting game playing programs combining CNN with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and the open source projects Leela Zero headed by Gian-Carlo Pascutto for Go and its Leela Chess Zero adaption.

See also

Publications

2009

2010...

2011

2012

2013

2014

2015 ...

2016

2017

2018

Forum Posts

2005 ...

2010 ...

2011

Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011

2012

2013

2015 ...

2017

2018

Re: How good is the RTX 2080 Ti for Leela? by Ankan Banerjee, CCC, September 16, 2018

2019

External Links

OpenCL

CUDA

 :

Deep Learning

Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper

Chess Programming

References

  1. Graphics processing unit - Wikimedia Commons
  2. CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES
  3. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  4. AMD Vega White Paper
  5. Nvidia Turing White Paper
  6. Vega (GCN 5th generation) from Wikipedia
  7. AnandTech - Nvidia Turing Deep Dive page 4
  8. INSIDE VOLTA
  9. AnandTech - Nvidia Turing Deep Dive page 6
  10. CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
  11. AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
  12. host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
  13. host-device latencies? by Srdja Matovic AMD Developer Community, Feb 28, 2019
  14. Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
  15. Jetson TK1 Embedded Development Kit | NVIDIA
  16. Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
  17. PowerVR from Wikipedia
  18. Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
  19. Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
  20. Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
  21. Parallel Thread Execution from Wikipedia
  22. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
  23. ankan-ban/perft_gpu · GitHub
  24. Tensor processing unit from Wikipedia
  25. GeForce 20 series from Wikipedia
  26. Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
  27. Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013

Up one Level