'''[[Main Page|Home]] * [[Hardware]] * GPU'''
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] GPU <ref>[https://commons.wikimedia.org/wiki/Graphics_processing_unit Graphics processing unit - File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]
'''GPU''' (Graphics Processing Unit),<br/>
a specialized processor initially intended for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and massively parallel way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.
=History=
In the 1970s and 1980s RAM was expensive and home computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer or texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math emerged, such as [https://en.wikipedia.org/wiki/IMPACT_(computer_graphics) SGI Impact] (1995) in 3D graphics workstations or [https://en.wikipedia.org/wiki/3dfx#Voodoo_Graphics_PCI 3dfx Voodoo] (1996) for playing 3D games on PCs. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip [https://en.wikipedia.org/wiki/PlayStation_technical_specifications#Graphics_processing_unit_(GPU) GTE] used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like [https://en.wikipedia.org/wiki/NV1 NV1] (1995) coined the term GPU for 3D graphics hardware acceleration.
With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.
== Khronos OpenCL ==
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]
== AMD ==
[[AMD]] supports language frontends like OpenCL, HIP, C++ AMP and OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]
* [https://rocmdocs.amd.com/en/latest/index.html AMD ROCm™ documentation]
* [https://manualzz.com/doc/o/cggy6/amd-opencl-programming-user-guide-contents AMD OpenCL Programming Guide]
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]
* [https://gpuopen.com/amd-isa-documentation/ AMD GPU ISA documentation]
== Apple ==
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].
* [https://developer.apple.com/opencl/ Apple OpenCL Developer]
* [https://developer.apple.com/metal/ Apple Metal Developer]
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]
== Intel ==
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.
== Further ==
* [https://en.wikipedia.org/wiki/Vulkan#Planned_features Vulkan] (OpenGL successor of Khronos Group)
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)
=Hardware Model=
{| class="wikitable" style="margin:auto"
|+ Vendor Terminology
|-
! AMD Terminology !! Nvidia Terminology
|-
| Compute Unit || Streaming Multiprocessor
|-
| Stream Core || CUDA Core
|-
| Wavefront || Warp
|}
=Programming Model=
{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Kernel || Kernel
|-
| Compute Unit || Streaming Multiprocessor
|-
| Processing Element || CUDA Core
|-
| Work-Item || Thread
|-
| Work-Group || Block
|-
| NDRange || Grid
|}
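The work-item, work-group and NDRange terms in the table above are related by a simple index calculation. The following is a minimal sketch in Python (not OpenCL or CUDA code), assuming a one-dimensional NDRange with zero global offset and a global size that is a multiple of the work-group size. It mirrors the relation <code>get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)</code> in OpenCL and <code>blockIdx.x * blockDim.x + threadIdx.x</code> in CUDA. The function names are hypothetical and chosen for illustration only.

```python
def decompose_global_id(global_id: int, local_size: int) -> tuple[int, int]:
    """Split a 1-D global work-item ID into (work-group ID, local ID).

    Assumes a zero global offset; local_size is the work-group (block) size.
    """
    if local_size <= 0:
        raise ValueError("local_size must be positive")
    if global_id < 0:
        raise ValueError("global_id must be non-negative")
    return divmod(global_id, local_size)


def compose_global_id(group_id: int, local_id: int, local_size: int) -> int:
    """Inverse mapping: work-group ID and local ID back to the global ID."""
    return group_id * local_size + local_id


def num_work_groups(global_size: int, local_size: int) -> int:
    """Number of work-groups (blocks) in a 1-D NDRange (grid)."""
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    return global_size // local_size


if __name__ == "__main__":
    # Work-item 70 in an NDRange of 256 work-items with work-groups of 64.
    group_id, local_id = decompose_global_id(70, 64)
    print(group_id, local_id, num_work_groups(256, 64))
```

With a work-group size of 64, matching an AMD GCN wavefront, one work-group maps to exactly one wavefront; other sizes are equally valid for the sketch.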
* __private - usually registers, accessible only by a single work-item or thread.
* __local (OpenCL) or __shared__ (CUDA) - scratch-pad memory shared across the work-items of a work-group or the threads of a block. On AMD systems there is more local "LDS" memory than L1 cache (GCN) or L0 cache (RDNA).
* __constant - read-only memory.
* __global - usually VRAM, accessible by all work-items or threads.
===Memory Examples===
Nvidia GeForce GTX 580 (Fermi):
* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 8 KiB constant cache per compute unit
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)
* 768 KiB L2 cache in total
* 1.5 GiB to 3 GiB global memory
AMD Radeon HD 7970 (GCN):
* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
* 16 KiB constant cache per four compute units
* 16 KiB L1 cache per compute unit
* 768 KiB L2 cache in total
* 3 GiB to 6 GiB global memory
===Unified Memory===
Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.

=Instruction Throughput=
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.

==Integer Instruction Throughput==
* INT32: Depending on architecture and operation, the 32-bit integer performance can be lower than the 32-bit FLOP or 24-bit integer performance.
* INT64: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.
* INT8: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
==Floating-Point Instruction Throughput==
* FP32: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.
* FP16: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.

==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
 MAD 16
 MUL 16
 ADD 32
 Bit-shift 16
 Bitwise XOR 32
Max theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
 MAD 1/4
 MUL 1/4
 ADD 1
 Bit-shift 1
 Bitwise XOR 1
Max theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec

=Tensors=
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.

==Nvidia TensorCores==
: TensorCores were introduced with the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref>

==AMD Matrix Cores==
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA 3 architecture adds support for FP8 and sparse matrix data (sparsity).

==Intel XMX Cores==
: Intel added XMX (Xe Matrix eXtensions) cores to some of the [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] GPU series, like [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist] and [https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html Intel Data Center GPU Max Series].

=Host-Device Latencies=
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead.
Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]], AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.

=Deep Learning=
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaptation.
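The batching idea from the Host-Device Latencies section can be illustrated with a rough cost model. The Python sketch below assumes that every kernel launch pays a fixed latency and that the tasks inside a batch are processed at a constant per-task cost; it deliberately ignores any parallel speed-up inside a batch. The launch latencies of 5 and 100 microseconds are the figures quoted in that section, while the per-task compute time of 1 microsecond and the function names are hypothetical values chosen for illustration.

```python
def total_time_us(num_tasks: int, batch_size: int,
                  launch_latency_us: float, compute_per_task_us: float) -> float:
    """Total time if tasks are grouped into batches, one kernel launch per batch."""
    if num_tasks < 0 or batch_size <= 0:
        raise ValueError("num_tasks must be >= 0 and batch_size must be > 0")
    launches = -(-num_tasks // batch_size)  # ceiling division
    return launches * launch_latency_us + num_tasks * compute_per_task_us


def overhead_fraction(num_tasks: int, batch_size: int,
                      launch_latency_us: float, compute_per_task_us: float) -> float:
    """Share of the total time that is spent on launch latency alone."""
    total = total_time_us(num_tasks, batch_size, launch_latency_us, compute_per_task_us)
    launches = -(-num_tasks // batch_size)
    return (launches * launch_latency_us) / total if total else 0.0


if __name__ == "__main__":
    for latency in (5.0, 100.0):          # microseconds, as quoted in the text
        for batch in (1, 16, 256):
            t = total_time_us(1024, batch, latency, 1.0)
            f = overhead_fraction(1024, batch, latency, 1.0)
            print(f"latency={latency:6.1f} us  batch={batch:4d}  "
                  f"total={t:9.1f} us  overhead={f:.1%}")
```

Under these assumptions the launch overhead dominates for single-task launches and becomes a small share of the total once a few hundred tasks are coupled into one batch.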
=Architectures=
The market is split into two categories, integrated and discrete GPUs. The former is the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.
== AMD ==
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
=== CDNA3 ===
The CDNA3 HPC architecture was unveiled in December 2023 with the MI300A APU model (CPU+GPU+HBM) and the MI300X GPU model, both with a multi-chip module design. It features Matrix Cores with support for a broad range of precisions, such as INT8, FP8, BF16, FP16, TF32, FP32 and FP64, as well as sparse matrix data (sparsity). It is supported by AMD's ROCm open software stack for AMD Instinct accelerators.
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf AMD CDNA3 Whitepaper]
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf AMD Instinct MI300/CDNA3 Instruction Set Architecture]
* [https://www.amd.com/en/developer/resources/rocm-hub.html AMD ROCm Developer Hub]
=== Navi 3x RDNA3 ===
RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation acceleration.
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]
=== CDNA2 ===
CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]
=== CDNA ===
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]
=== Navi 2x RDNA2 ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA2] cards were unveiled on October 28, 2020.
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
=== Navi RDNA ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA] cards were unveiled on July 7, 2019.
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set Architecture]
* [https://en.wikipedia.org/wiki/Radeon_RX_5000_series Radeon RX 5000 series on Wikipedia]

RDNA cards were first released in 2019. RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads. Compute Units have 2x32 wide SIMD units, each of which executes 32 threads per clock tick. A Wave64 wavefront will execute on a single SIMD unit, but over two clock ticks. Note that a Wave32 still has 5 cycles of latency before registers can be reused, so a Wave64 executing over two clock ticks will have fewer stalls than a Wave32.

* Radeon 5700 XT
* Radeon 5700
=== Vega GCN 5th gen ===
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017. Vega is the last in the line of the GCN architecture: 64 threads per wavefront. Each compute unit contains 4 SIMD units, supporting a total of 40 wavefronts per compute unit (a queue of 10 wavefronts per SIMD unit). Each SIMD unit contains 16 vALUs for general compute + 1 sALU for branching and constant logic. Each SIMD unit executes the same instruction over four clock ticks (16 vALUs x 4 clock ticks == 64 threads per wavefront). Vega specifically added packed FP16 instructions, such as dot-product, packed add and packed multiply. At the programming level, these packed FP16 instructions are SIMD-within-SIMD: each SIMD thread can operate its own SIMD FP16 instruction, akin to AVX or SSE on the x86-64 architecture.
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set Architecture]
* Radeon VII
* Vega64
* Vega56
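The wavefront arithmetic quoted in this article for GCN (16 vALUs x 4 clock ticks == 64 threads) and for RDNA (Wave32 in one tick, Wave64 in two ticks on a 32-wide SIMD unit) can be checked with a short calculation. The following Python sketch only restates those figures; the function names are hypothetical, and it assumes that one instruction for a wavefront is issued in equal slices of one SIMD width per clock tick.

```python
def issue_cycles(wave_size: int, simd_width: int) -> int:
    """Clock ticks needed to issue one instruction for a whole wavefront."""
    if simd_width <= 0 or wave_size <= 0:
        raise ValueError("wave_size and simd_width must be positive")
    if wave_size % simd_width != 0:
        raise ValueError("wave_size must be a multiple of simd_width")
    return wave_size // simd_width


def wavefronts_per_compute_unit(simd_units: int, queue_depth: int) -> int:
    """Wavefronts a compute unit can hold, given a per-SIMD wavefront queue."""
    return simd_units * queue_depth


if __name__ == "__main__":
    print("GCN  Wave64 on 16-wide SIMD:", issue_cycles(64, 16), "ticks")
    print("RDNA Wave32 on 32-wide SIMD:", issue_cycles(32, 32), "tick")
    print("RDNA Wave64 on 32-wide SIMD:", issue_cycles(64, 32), "ticks")
    print("Vega wavefronts per compute unit:", wavefronts_per_compute_unit(4, 10))
```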
=== Polaris GCN 4th gen ===
=== Southern Islands GCN 1st gen ===
Southern Island cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.
* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]
* [https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/programmer-references/si_programming_guide_v2.pdf Southern Islands Programming Guide]
* [https://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf Southern Islands Instruction Set Architecture]
== Apple ==
=== M series ===
== ARM ==
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]
=== Valhall (2019) ===
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]
=== Midgard (2012) ===
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]
== Intel ==
=== Intel Xe 'Gen12' ===
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Xe 'Gen12' GPUs on Wikipedia]
== Nvidia ==
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]
=== Grace Hopper Superchip ===
The Nvidia GH200 Grace Hopper Superchip was unveiled in August 2023 and combines the Nvidia Grace CPU ([[ARM|ARM v9]]) and Nvidia Hopper GPU architectures via NVLink to deliver a CPU+GPU coherent memory model for accelerated AI and HPC applications.
=== Ada Lovelace Architecture ===
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]
* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]
=== Hopper Architecture ===
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]
* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]
=== Ampere Architecture ===
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]
* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]
=== Turing Architecture ===
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing]. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.
* RTX 2080 Ti
* RTX 2080
* RTX 2070 Super
* RTX 2070
* RTX 2060 Super
* RTX 2060
* GTX 1660 (low-end GPU without Tensor cores or RTX cores)
=== Volta Architecture ===
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]
* Tesla V100
* Titan V
=== Pascal Architecture ===
=== Maxwell Architecture ===
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.
== PowerVR ==
== Qualcomm ==
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.
=== Adreno ===
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]
=See also=
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]
'''2014'''
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]
==2015 ...==
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE#Computer|IEEE Computer]], Vol. 48, No. 11
'''2016'''
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref>
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
* [https://talkchess.com/forum3/viewtopic.php?f=7&t=72566&p=955538#p955538 Re: China boosts in silicon...] by [[Srdja Matovic]], [[CCC]], January 13, 2024
=External Links=
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]
* [https://developer.nvidia.com/ NVIDIA Developer]