[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] GPU <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg File:NvidiaTesla.jpg] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]
'''GPU''' (Graphics Processing Unit),<br/>a specialized processor designed for massively parallel computation, originally intended for rendering graphics.
=History=
In the 1970s and 1980s RAM was expensive and home computers used custom graphics chips that operated directly on registers and memory without a dedicated frame buffer or texture buffer, like the [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to render 3D graphics. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD capabilities]] of CPUs, such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]], for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D-capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combination chips like the NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], as in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained popularity.
=GPU in Computer Chess=
== Apple ==
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]
 
== Intel ==
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as its frontend language.
 
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]
== Nvidia ==
== Further ==
* [https://en.wikipedia.org/wiki/OneAPI_(programming_model) oneAPI] (Intel)
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
=Hardware Model=
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have an architecture-dependent number of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
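As an illustration, a minimal CUDA sketch that queries some of these hardware parameters, the number of compute units (streaming multiprocessors in CUDA terminology) and the SIMT width (warp size), via the CUDA runtime API:

<pre>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // "compute units" (OpenCL term) correspond to streaming multiprocessors in CUDA
        printf("Device %d: %s\n", i, prop.name);
        printf("  multiprocessors (compute units): %d\n", prop.multiProcessorCount);
        printf("  warp size (SIMT width):          %d\n", prop.warpSize);
        printf("  clock rate:                      %d kHz\n", prop.clockRate);
        printf("  global memory:                   %zu bytes\n", prop.totalGlobalMem);
    }
    return 0;
}
</pre>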
=Programming Model=
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload-directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with an architecture limit on how many threads a block can hold and how many threads can run concurrently on the device in total.
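A minimal CUDA sketch of this thread/block/grid model, with a purely illustrative kernel in which each thread handles one element, and the threads of a block share scratch-pad memory and synchronize with a block-wide barrier:

<pre>
#include <cuda_runtime.h>

// Each thread squares one element; threads of a block share a scratch-pad
// (__shared__ in CUDA, __local in OpenCL) and synchronize with __syncthreads().
__global__ void square(const float* in, float* out, int n) {
    __shared__ float tile[256];                       // scratch-pad memory of the block
    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                                  // block-wide barrier
    if (gid < n) out[gid] = tile[threadIdx.x] * tile[threadIdx.x];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));             // input left trivial for brevity
    int block = 256;                                  // threads per block (architecture limit applies)
    int grid  = (n + block - 1) / block;              // blocks forming the grid
    square<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in); cudaFree(out);
    return 0;
}
</pre>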
=Memory Model=
* __private - usually registers, accessible only by a single work-item resp. thread.
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.
* __constant - read-only memory.
* __global - usually VRAM, accessible by all work-items resp. threads.
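A minimal CUDA sketch illustrating these address spaces with their CUDA counterparts (registers, __shared__, __constant__ and global memory); the kernel and its coefficients are purely illustrative:

<pre>
#include <cuda_runtime.h>

__constant__ float coeff[4];   // __constant in OpenCL; filled from the host via cudaMemcpyToSymbol

// Address spaces (CUDA term / OpenCL term):
//   registers / __private, __shared__ / __local, __constant__ / __constant, global / __global
__global__ void blend(const float* in, float* out, int n) {
    __shared__ float tile[128];                        // __local: shared by the block/work-group (block size 128 assumed)
    float acc = 0.0f;                                  // __private: per-thread register
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;    // read from __global (VRAM) into __local
    __syncthreads();
    for (int i = 0; i < 4; ++i)
        acc += coeff[i] * tile[threadIdx.x];           // read-only __constant memory
    if (gid < n) out[gid] = acc;                       // write back to __global
}
</pre>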
===Memory Examples===
Here, as an example, the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]): <ref>CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES</ref>
* 768 KiB L2 cache
* 3 GiB to 6 GiB global memory
 
===Unified Memory===
 
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
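As one vendor-specific example, CUDA offers such a unified address space via managed memory (cudaMallocManaged); a minimal sketch, with illustrative kernel and variable names:

<pre>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float f) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n) data[gid] *= f;
}

int main() {
    const int n = 1024;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // one pointer valid on host and device
    for (int i = 0; i < n; ++i) data[i] = float(i); // host writes directly, no explicit copy
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();                        // wait before the host reads the results
    printf("data[42] = %f\n", data[42]);
    cudaFree(data);
    return 0;
}
</pre>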
=Instruction Throughput=
==Integer Instruction Throughput==
* INT32
: Depending on the architecture and operation, 32-bit integer performance can be less than 32-bit floating-point or 24-bit integer performance.
* INT64
: In general, GPU [https://en.wikipedia.org/wiki/Processor_register registers] and vector [https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.
* INT8
: Some architectures offer higher throughput with lower precision, quadrupling the INT8 or octupling the INT4 throughput.
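As an example of such packed lower-precision operations, a minimal CUDA sketch using the __dp4a intrinsic, which performs a dot product of four packed 8-bit integers in one instruction; it assumes a device of compute capability 6.1 or higher, and the kernel name is illustrative:

<pre>
#include <cuda_runtime.h>

// Each int holds four packed INT8 values; __dp4a folds four 8-bit
// multiply-accumulates into one instruction (compile with -arch=sm_61 or higher).
__global__ void dot_int8(const int* a, const int* b, int* out, int n) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        out[gid] = __dp4a(a[gid], b[gid], 0);  // 4 INT8 MACs accumulated into a 32-bit result
}
</pre>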
==Floating-Point Instruction Throughput==
* FP32
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.
* FP64
: In general, consumer GPUs have a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operation throughput than server brand GPUs.
* FP16
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.
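As an example of the packed FP16 arithmetic behind that 1:2 ratio, a minimal CUDA sketch using half2 math, where one instruction operates on two 16-bit values; it assumes a device of compute capability 5.3 or higher, and the kernel name is illustrative:

<pre>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Packed half2 arithmetic: one fused multiply-add operates on two FP16 values,
// which is how some architectures reach an FP32:FP16 throughput ratio of 1:2.
__global__ void axpy_half2(const __half2* x, __half2* y, __half2 a, int n2) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n2)
        y[gid] = __hfma2(a, x[gid], y[gid]);   // FMA on two half-precision values at once
}
</pre>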
==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
* MAD: 16
* Bitwise XOR: 32
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
* MAD: 1/4
* Bitwise XOR: 1
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec

=Tensors=
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.

==Nvidia TensorCores==
With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>

==AMD Matrix Cores==
In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations.

==Intel XMX Cores==
Intel plans XMX, Xe Matrix eXtensions, for its upcoming [https://www.anandtech.com/show/15973/the-intel-xelp-gpu-architecture-deep-dive-building-up-from-the-bottom/4 Xe discrete GPU] series.
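To illustrate the matrix-multiply-accumulate operation such units expose, a minimal CUDA WMMA sketch for Nvidia TensorCores, computing one 16x16x16 tile D = A*B + C with FP16 inputs and FP32 accumulation; it assumes compute capability 7.0 or higher and a launch with one full warp (e.g. wmma_tile<<<1, 32>>>), and the kernel name is illustrative:

<pre>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp cooperatively computes a 16x16x16 tile: C = A*B + C,
// FP16 inputs, FP32 accumulator (TensorCore path, sm_70 or higher).
__global__ void wmma_tile(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);         // start from a zero accumulator
    wmma::load_matrix_sync(a_frag, a, 16);     // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
</pre>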
=Host-Device Latencies=
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
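A minimal CUDA sketch for measuring this overhead by timing repeated launches of an empty kernel with a synchronization after each launch; the measured numbers will vary with driver, operating system and hardware:

<pre>
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void null_kernel() {}              // empty kernel: measures pure launch overhead

int main() {
    const int iters = 1000;
    null_kernel<<<1, 1>>>();                  // warm-up launch, creates the CUDA context
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < iters; ++i) {
        null_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();              // include the host-device round trip per launch
    }
    auto t1 = std::chrono::high_resolution_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    printf("average launch + sync latency: %.2f microseconds\n", us);
    return 0;
}
</pre>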
=Deep Learning=
== AMD ==
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
 
=== CDNA 2 ===
The CDNA 2 architecture, used in the MI200 series of HPC GPUs, with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric, was unveiled in November 2021.
 
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
=== CDNA ===
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
=== Navi 2X RDNA 2.0 ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2.0] cards were unveiled on October 28, 2020.
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
=== Navi RDNA 1.0 ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1.0] cards were unveiled on July 7, 2019.
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
== Apple ==
Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M1 series on Wikipedia]
== ARM Mali ==
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]
== Intel ==
=== Intel Xe 'Gen12' ===
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).
== PowerVR ==
PowerVR (Imagination Technologies) licenses PowerVR IP to third parties (most notably Apple) for system on a chip (SoC) designs. Since the Series5 SGX, OpenCL support via licensees is available.

=== PowerVR ===
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]

=== IMG ===
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]

== Qualcomm ==
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.

=== Adreno ===
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]

== Vivante Corporation ==
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.
=== GC-Series ===
* [https://en.wikipedia.org/wiki/Vivante_Corporation Vivante GC series on Wikipedia]
=See also=
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
=External Links=