Difference between revisions of "GPU"

From Chessprogramming wiki
Jump to: navigation, search
m (Memory Model)
m (History)
(35 intermediate revisions by 2 users not shown)
Line 1: Line 1:
 
'''[[Main Page|Home]] * [[Hardware]] * GPU'''
 
'''[[Main Page|Home]] * [[Hardware]] * GPU'''
  
[[FILE:6600GT GPU.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/GeForce_6_series GeForce 6600GT (NV43)] GPU <ref>[https://commons.wikimedia.org/wiki/Graphics_processing_unit Graphics processing unit - Wikimedia Commons]</ref> ]]  
+
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]  
  
 
'''GPU''' (Graphics Processing Unit),<br/>
 
'''GPU''' (Graphics Processing Unit),<br/>
Line 7: Line 7:
  
 
=History=
 
=History=
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer, like  [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.
+
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like  [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.
  
 
=GPU in Computer Chess=  
 
=GPU in Computer Chess=  
Line 56: Line 56:
 
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
 
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
 
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]
 
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]
 +
 +
== Intel ==
 +
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.
 +
 +
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]
 +
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]
  
 
== Nvidia ==
 
== Nvidia ==
Line 67: Line 73:
 
== Further ==  
 
== Further ==  
  
* [https://en.wikipedia.org/wiki/OneAPI_(programming_model) oneAPI] (Intel)
 
 
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
 
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
 
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
 
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
Line 75: Line 80:
 
=Hardware Model=
 
=Hardware Model=
  
A common scheme on GPUs is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities, floating-point and/or integer with specific bit-width of the FPU/ALU. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMACUs (matrix-multiply-accumulate-units) are used to speed up neural networks further.
+
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities - floating-point and/or integer with specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
  
 
=Programming Model=
 
=Programming Model=
  
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (resp. work-items in OpenCL) are coupled to a block (resp. work-group in OpenCL), these can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads a block can hold.
+
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a block (work-group in OpenCL), one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads a block can hold and how many threads can run in total concurrently on the device.
  
 
=Memory Model=
 
=Memory Model=
Line 90: Line 95:
 
* __global - usually VRAM, accessable by all work-items resp. threads.
 
* __global - usually VRAM, accessable by all work-items resp. threads.
  
 +
===Memory Examples===
  
 
Here the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>
 
Here the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>
Line 107: Line 113:
 
* 768 KiB L2 cache
 
* 768 KiB L2 cache
 
* 3 GiB to 6 GiB global memory
 
* 3 GiB to 6 GiB global memory
 +
 +
===Unified Memory===
 +
 +
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
  
 
=Instruction Throughput=  
 
=Instruction Throughput=  
Line 113: Line 123:
 
==Integer Instruction Throughput==
 
==Integer Instruction Throughput==
 
* INT32
 
* INT32
: The 32 bit integer performance can be architecture and operation depended less than 32 bit FLOP or 24 bit integer performance.
+
: The 32-bit integer performance can be architecture and operation depended less than 32-bit FLOP or 24-bit integer performance.
  
 
* INT64
 
* INT64
: In general GPU [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] are 32 bit wide and have to emulate 64 bit integer operations.
+
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.
 
* INT8
 
* INT8
 
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
 
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
  
==Floating Point Instruction Throughput==
+
==Floating-Point Instruction Throughput==
  
 
* FP32
 
* FP32
: Consumer GPU performance is measured usually in single-precision (32 bit) floating point FMA, fused-multiply-add, throughput.
+
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.
  
 
* FP64
 
* FP64
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64 bit) floating point operations throughput than server brand GPUs.
+
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.
  
 
* FP16
 
* FP16
: Some GPGPU architectures offer half-precision (16 bit) floating point operation throughput with an FP32:FP16 ratio of 1:2.
+
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.
 
 
==Tensors==
 
===Nvidia TensorCores===
 
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6  AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>
 
 
 
===AMD Matrix Cores===
 
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA, matrix-fused-multiply-add, operations on various data types like INT8, FP16, BF16, FP32.
 
 
 
===Intel XMX Cores===
 
: Intel plans XMX, Xe Matrix eXtensions, for its upcoming [https://www.anandtech.com/show/15973/the-intel-xelp-gpu-architecture-deep-dive-building-up-from-the-bottom/4 Xe discrete GPU] series.
 
  
 
==Throughput Examples==
 
==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
+
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
  
 
     MAD 16
 
     MAD 16
Line 150: Line 150:
 
     Bitwise XOR 32
 
     Bitwise XOR 32
  
Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec
+
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec
  
AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
+
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
  
 
     MAD 1/4
 
     MAD 1/4
Line 160: Line 160:
 
     Bitwise XOR 1
 
     Bitwise XOR 1
  
Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
+
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec
 +
 
 +
=Tensors=
 +
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have an dedicated neural network engine as MMAC unit.
 +
 
 +
==Nvidia TensorCores==
 +
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6  AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>
 +
 
 +
==AMD Matrix Cores==
 +
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations.
 +
 
 +
==Intel XMX Cores==
 +
: Intel plans XMX, Xe Matrix eXtensions, for its upcoming [https://www.anandtech.com/show/15973/the-intel-xelp-gpu-architecture-deep-dive-building-up-from-the-bottom/4 Xe discrete GPU] series.
  
 
=Host-Device Latencies=  
 
=Host-Device Latencies=  
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is an measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
+
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
  
 
=Deep Learning=
 
=Deep Learning=
Line 175: Line 187:
  
 
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]  
 
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]  
 +
 +
=== CDNA 2 ===
 +
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.
 +
 +
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
  
 
=== CDNA ===  
 
=== CDNA ===  
Line 181: Line 198:
 
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
 
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
  
=== Navi 2X RDNA 2.0 ===  
+
=== Navi 2X RDNA 2 ===  
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2.0] cards were unveiled on October 28, 2020.
+
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.
  
 
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
 
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
  
=== Navi RDNA 1.0 ===  
+
=== Navi RDNA 1 ===  
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1.0] cards were unveiled on July 7, 2019.
+
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.
  
 
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
 
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
Line 212: Line 229:
 
Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
 
Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
  
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M1 on Wikipedia]
+
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series M1 series on Wikipedia]
  
== ARM Mali ==
+
== ARM ==
The Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.
+
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.
  
 
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]
 
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]
Line 232: Line 249:
 
== Intel ==
 
== Intel ==
  
=== Intel Xe 'Gen12' ===
+
=== Xe 'Gen12' ===
  
 
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).
 
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).
Line 270: Line 287:
  
 
== PowerVR ==
 
== PowerVR ==
Imagination Technologies licenses PowerVR IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.
+
PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.
 +
 
 +
=== PowerVR ===
 +
 
 +
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]
 +
 
 +
=== IMG ===
 +
 
 +
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]
 +
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]
  
=== PowerVR Graphics ===
+
== Qualcomm ==
 +
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.
  
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR Series on Wikipedia]
+
=== Adreno ===
 +
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]
  
 
== Vivante Corporation ==
 
== Vivante Corporation ==
Line 281: Line 309:
 
=== GC-Series ===
 
=== GC-Series ===
  
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC-Series on Wikipedia]
+
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]
  
 
=See also=  
 
=See also=  
Line 402: Line 430:
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
 +
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
  
 
=External Links=  
 
=External Links=  

Revision as of 06:28, 13 March 2022

Home * Hardware * GPU

GPU (Graphics Processing Unit),
a specialized processor primarily intended to fast image processing. GPUs may have more raw computing power than general purpose CPUs but need a specialized and parallelized way of programming. Leela Chess Zero has proven that a Best-first Monte-Carlo Tree Search (MCTS) with deep learning methodology will work with GPU architectures.

History

In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like TIAin the Atari VCS gaming system, GTIA+ANTIC in the Atari 400/800 series, or Denise+Agnus in the Commodore Amiga series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the 3dfx Voodoo2, were used by the video game community to play 3D graphics. Some game engines could use instead the SIMD-capabilities of CPUs such as the Intel MMX instruction set or AMD's 3DNow! for real-time rendering. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the unified shader architecture, like in Nvidia Tesla (2006), ATI/AMD TeraScale (2007) or Intel GMA X3000 (2006), GPGPU frameworks like CUDA and OpenCL emerged and gained in popularity.

GPU in Computer Chess

There are in main three approaches how to use a GPU for Chess:

  • As an accelerator in Lc0: run a neural network for position evaluation on GPU.
  • Offload the search in Zeta: run a parallel game tree search with move generation and position evaluation on GPU.
  • As an hybrid in perft_gpu: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree.

GPU Chess Engines

GPGPU

Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like OpenGL or DirextX, followed by first GPGPU frameworks such as Sh/RapidMind or Brook and finally CUDA and OpenCL.

Khronos OpenCL

OpenCL specified by the Khronos Group is widely adopted across all kind of hardware accelerators from different vendors.

AMD

AMD supports language frontends like OpenCL, HIP, C++ AMP and with OpenMP offload directives. It offers with ROCm its own parallel compute platform.

Apple

Since macOS 10.14 Mojave a transition from OpenCL to Metal is recommended by Apple.

Intel

Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the oneAPI platform with DPC++ as frontend language.

Nvidia

CUDA is the parallel computing platform by Nvidia. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via OpenACC and OpenMP.

Further

Hardware Model

A common scheme on GPUs with unified shader architecture is to run multiple threads in SIMT fashion and a multitude of SIMT waves on the same SIMD unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities - floating-point and/or integer with specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as hardware architecture. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.

Programming Model

A parallel programming model for GPGPU can be data-parallel, task-parallel, a mixture of both, or with libraries and offload-directives also implicitly-parallel. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a block (work-group in OpenCL), one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads a block can hold and how many threads can run in total concurrently on the device.

Memory Model

OpenCL offers the following memory model for the programmer:

  • __private - usually registers, accessable only by a single work-item resp. thread.
  • __local - scratch-pad memory shared across work-items of a work-group resp. threads of block.
  • __constant - read-only memory.
  • __global - usually VRAM, accessable by all work-items resp. threads.

Memory Examples

Here the data for the Nvidia GeForce GTX 580 (Fermi) as an example: [2]

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache
  • 1.5 GiB to 3 GiB global memory

Here the data for the AMD Radeon HD 7970 (GCN) as an example: [3]

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache
  • 3 GiB to 6 GiB global memory

Unified Memory

Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's TeraScale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

Integer Instruction Throughput

  • INT32
The 32-bit integer performance can be architecture and operation depended less than 32-bit FLOP or 24-bit integer performance.
  • INT64
In general registers and Vector-ALUs of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.
  • INT8
Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.

Floating-Point Instruction Throughput

  • FP32
Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.
  • FP64
Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.
  • FP16
Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit [4]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element [5]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec

Tensors

MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations [6]. Mobile SoCs usually have an dedicated neural network engine as MMAC unit.

Nvidia TensorCores

With Nvidia Volta series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.[7] Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.[8] Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.[9]

AMD Matrix Cores

AMD released 2020 its server-class CDNA architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations.

Intel XMX Cores

Intel plans XMX, Xe Matrix eXtensions, for its upcoming Xe discrete GPU series.

Host-Device Latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds [10] up to 100s of microseconds [11]. One solution to overcome this limitation is to couple tasks to batches to be executed in one run [12].

Deep Learning

GPUs are much more suited than CPUs to implement and train Convolutional Neural Networks (CNN), and were therefore also responsible for the deep learning boom, also affecting game playing programs combining CNN with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and the open source projects Leela Zero headed by Gian-Carlo Pascutto for Go and its Leela Chess Zero adaption.

Architectures

The market is split into two categories, integrated and discrete GPUs. The first being the most important by quantity, the second by performance. Discrete GPUs are divided as consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offering different feature sets in driver, VRAM, or computation abilities.

AMD

AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.

CDNA 2

CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.

CDNA

CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.

Navi 2X RDNA 2

RDNA 2 cards were unveiled on October 28, 2020.

Navi RDNA 1

RDNA 1 cards were unveiled on July 7, 2019.

Vega GCN 5th gen

Vega cards were unveiled on August 14, 2017.

Polaris GCN 4th gen

Polaris cards were first released in 2016.

Apple

M1

Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.

ARM

The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.

Valhall (2019)

Bifrost (2016)

Midgard (2012)

Intel

Xe 'Gen12'

Intel Xe line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).

Nvidia

Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.

Ampere Architecture

The Ampere microarchitecture was announced on May 14, 2020 [13]. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 [14].

Turing Architecture

Turing cards were first released in 2018. They are the first consumer cores to launch with RTX, for raytracing, features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate convolutional neural networks. The Turing GTX line of chips do not offer RTX or TensorCores.

Architectural Whitepaper

Volta Architecture

Volta cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate convolutional neural networks.

Architecture Whitepaper

Pascal Architecture

Pascal cards were first released in 2016.

Architecture Whitepaper

Maxwell Architecture

Maxwell cards were first released in 2014.

Architecture Whitepaper on archiv.org

PowerVR

PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.

PowerVR

IMG

Qualcomm

Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.

Adreno

Vivante Corporation

Vivante licenses IP to third parties for embedded systems, the GC series offers optional OpenCL support.

GC-Series

See also

Publications

1986

1990

2008 ...

2010...

2011

2012

2013

2014

2015 ...

2016

Chapter 8 in Ross C. Walker, Andreas W. Götz (2016). Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics. John Wiley & Sons

2017

2018

Forum Posts

2005 ...

2010 ...

2011

Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011

2012

2013

2015 ...

2017

2018

Re: How good is the RTX 2080 Ti for Leela? by Ankan Banerjee, CCC, September 16, 2018

2019

2020 ...

External Links

OpenCL

CUDA

 :

Deep Learning

Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper

Chess Programming

References

  1. Image by Mahogny, February 09, 2008, Wikimedia Commons
  2. CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES
  3. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  4. CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
  5. AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
  6. Re: To TPU or not to TPU... by Rémi Coulom, CCC, December 16, 2017
  7. INSIDE VOLTA
  8. AnandTech - Nvidia Turing Deep Dive page 6
  9. Wikipedia - Ampere microarchitecture
  10. host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
  11. host-device latencies? by Srdja Matovic AMD Developer Community, Feb 28, 2019
  12. Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
  13. NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog by Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam and Sridhar Ramaswamy, May 14, 2020
  14. CUDA 11 Features Revealed | NVIDIA Developer Blog by Pramod Ramarao, May 14, 2020
  15. Photon mapping from Wikipedia
  16. Cell (microprocessor) from Wikipedia
  17. Jetson TK1 Embedded Development Kit | NVIDIA
  18. Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
  19. PowerVR from Wikipedia
  20. Density functional theory from Wikipedia
  21. Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
  22. Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
  23. Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
  24. Parallel Thread Execution from Wikipedia
  25. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
  26. ankan-ban/perft_gpu · GitHub
  27. Tensor processing unit from Wikipedia
  28. GeForce 20 series from Wikipedia
  29. Phoronix Test Suite from Wikipedia
  30. kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums by LukeCuda, June 18, 2018
  31. Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
  32. Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013

Up one Level