Changes

Jump to: navigation, search

GPU

11,336 bytes added, 24 January
m
2020 ...: link to Chinese gpus
'''[[Main Page|Home]] * [[Hardware]] * GPU'''
[[FILE:6600GT GPUNvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/GeForce_6_series GeForce 6600GT (NV43)Nvidia_Tesla Nvidia Tesla] GPU <ref>[https://commons.wikimedia.org/wiki/Graphics_processing_unit Graphics processing unit - File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]
'''GPU''' (Graphics Processing Unit),<br/>
a specialized processor primarily initially intended to for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.
=History=
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 IMPACT_(computer_graphics) SGI Impact] (1995) in 3D graphics-workstations or [https://en.wikipedia.org/wiki/3dfx#Voodoo_Graphics_PCI 3dfx Voodoo2Voodoo](1996) for playing 3D games on PCs, were used by the video game community to play 3D graphicsemerged. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip [https://en.wikipedia.org/wiki/PlayStation_technical_specifications#Graphics_processing_unit_(GPU) GTE] used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like [https://en.wikipedia.org/wiki/NV1 NV1 ] (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.
=GPU in Computer Chess=
There are in main three approaches four ways how to use a GPU for Chesschess:
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.* As an a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree* Neural network training such as [https://github.com/glinscott/nnue-pytorch Stockfish NNUE trainer in Pytorch]<ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=75724 Pytorch NNUE training] by [[Gary Linscott]], [[CCC]], November 08, 2020</ref> or [https://github.com/LeelaChessZero/lczero-training Lc0 TensorFlow Training]
=GPU Chess Engines=
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]
* [https://rocmrocmdocs.githubamd.iocom/en/ ROCm Homepagelatest/index.html AMD ROCm™ documentation]* [httphttps://developer.amdmanualzz.com/wordpressdoc/mediao/2013cggy6/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guideamd-opencl-programming-user-revguide-2.7.pdf contents AMD OpenCL Programming Guide]
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]
* [https://gpuopen.com/wpamd-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]* [https://developer.amd.com/wpisa-content/resourcesdocumentation/Vega_Shader_ISA_28July2017.pdf Vega Instruction SetAMD GPU ISA documentation]
== Apple ==
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal ] is recommended by [[Apple]].
* [https://developer.apple.com/opencl/ Apple OpenCL Developer]
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]
 
== Intel ==
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.
 
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]
== Nvidia ==
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]
* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]
== Further ==
 * [https://en.wikipedia.org/wiki/OneAPI_Vulkan#Planned_features Vulkan] (programming_modelOpenGL sucessor of Khronos Group) oneAPI* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (IntelMicrosoft)
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)
=Hardware Model=
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities, - floating-point and/or integer with specific bit-width of the FPU/ALUand registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMACUs MMAC units (matrix-multiply-accumulate-units) are used to speed up neural networks further. {| class="wikitable" style="margin:auto"|+ Vendor Terminology|-! AMD Terminology !! Nvidia Terminology|-| Compute Unit || Streaming Multiprocessor|-| Stream Core || CUDA Core|-| Wavefront || Warp|} ===Hardware Examples=== Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref> * 512 CUDA cores @1.544GHz* 16 SMs - Streaming Multiprocessors* organized in 2x16 CUDA cores per SM* Warp size of 32 threads AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref> * 2048 Stream cores @0.925GHz* 32 Compute Units* organized in 4xSIMD16, each SIMT4, per Compute Unit* Wavefront size of 64 work-items ===Wavefront and Warp===Generalized the definition of the Wavefront and Warp size is the amount of threads executed in SIMT fashion on a GPU with unified shader architecture.
=Programming Model=
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (resp. work-items in OpenCL) contain the kernel to be computed and are coupled to a block (respwork-group, one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group in OpenCL)execute the same kernel, these can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads work-items a block work-group can holdand how many threads can run in total concurrently on the device{| class="wikitable" style="margin:auto"|+ Terminology|-! OpenCL Terminology !! CUDA Terminology|-| Kernel || Kernel|-| Compute Unit || Streaming Multiprocessor|-| Processing Element || CUDA Core|-| Work-Item || Thread|-| Work-Group || Block|-| NDRange || Grid|-|} ==Thread Examples== Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref> * Warp size: 32* Maximum number of threads per block: 1024* Maximum number of resident blocks per multiprocessor: 32* Maximum number of resident warps per multiprocessor: 64* Maximum number of resident threads per multiprocessor: 2048  AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref> * Wavefront size: 64* Maximum number of work-items per work-group: 1024* Maximum number of work-groups per compute unit: 40* Maximum number of Wavefronts per compute unit: 40* Maximum number of work-items per compute unit: 2560
=Memory Model=
* __global - usually VRAM, accessable by all work-items resp. threads.
{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Private Memory || Registers
|-
| Local Memory || Shared Memory
|-
| Constant Memory || Constant Memory
|-
| Global Memory || Global Memory
|}
Here the data for the ===Memory Examples=== Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>
* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 8 KiB constant cache per compute unit
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)
* 768 KiB L2 cachein total
* 1.5 GiB to 3 GiB global memory
Here the data for the AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) as an example: <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>
* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
* 16 KiB constant cache per four compute units
* 16 KiB L1 cache per compute unit
* 768 KiB L2 cachein total
* 3 GiB to 6 GiB global memory
 
===Unified Memory===
 
Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
=Instruction Throughput=
* INT64
: In general GPU [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.
* INT8
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
==Floating -Point Instruction Throughput==
* FP32
: Consumer GPU performance is measured usually in single-precision (32 -bit) floating -point FMA, (fused-multiply-add, ) throughput.
* FP64
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64 -bit) floating -point operations throughput than server brand GPUs.
* FP16
: Some GPGPU architectures offer half-precision (16 -bit) floating -point operation throughput with an FP32:FP16 ratio of 1:2. ==Tensors=====Nvidia TensorCores===: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> ===AMD Matrix Cores===: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA, matrix-fused-multiply-add, operations on various data types like INT8, FP16, BF16, FP32. ===Intel XMX Cores===: Intel plans XMX, Xe Matrix eXtensions, for its upcoming [https://www.anandtech.com/show/15973/the-intel-xelp-gpu-architecture-deep-dive-building-up-from-the-bottom/4 Xe discrete GPU] series.
==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 -bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
MAD 16
Bitwise XOR 32
Max theoretic ADD operation throughput: 32 Ops * x 16 CUs * x 1544 MHz = 790.528 GigaOps/sec
AMD Radeon HD 7970 (GCN 1.0) - 32 -bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
MAD 1/4
Bitwise XOR 1
Max theoretic ADD operation throughput: 1 Op * x 2048 PEs * x 925 MHz = 1894.4 GigaOps/sec =Tensors=MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have an dedicated neural network engine as MMAC unit. ==Nvidia TensorCores==: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>Ada Lovelaces's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref> ==AMD Matrix Cores==: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA 3 architecture adds support for FP8 and sparse matrix data (sparsity). ==Intel XMX Cores==: Intel added XMX, Xe Matrix eXtensions, cores to some of the [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] GPU series, like [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist] and [https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html Intel Data Center GPU Max Series].
=Host-Device Latencies=
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is an a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
=Deep Learning=
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
 
=== CDNA3 ===
CDNA3 HPC architecture was unveiled in December, 2023. With MI300A APU model (CPU+GPU+HBM) and MI300X GPU model, both with multi-chip modules design. Featuring Matrix Cores with support for a broad type of precision, as INT8, FP8, BF16, FP16, TF32, FP32, FP64, as well as sparse matrix data (sparsity). Supported by AMD's ROCm open software stack for AMD Instinct accelerators.
 
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf AMD CDNA3 Whitepaper]
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf AMD Instinct MI300/CDNA3 Instruction Set Architecture]
* [https://www.amd.com/en/developer/resources/rocm-hub.html AMD ROCm Developer Hub]
 
=== Navi 3x RDNA3 ===
RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation acceleration.
 
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]
 
=== CDNA2 ===
CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.
 
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]
=== CDNA ===
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]
=== Navi 2X RDNA 2.0 2x RDNA2 === [https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2.0RDNA2] cards were unveiled on October 28, 2020.
* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
=== Navi RDNA 1.0 === [https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1.0] cards were unveiled on July 7, 2019.
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction SetArchitecture]
=== Vega GCN 5th gen ===
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction SetArchitecture]
=== Polaris GCN 4th gen ===
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
* [https://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf GCN3/4 Instruction Set Architecture]
 
=== Southern Islands GCN 1st gen ===
 
Southern Island cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.
 
* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]
* [https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/programmer-references/si_programming_guide_v2.pdf Southern Islands Programming Guide]
* [https://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf Southern Islands Instruction Set Architecture]
== Apple ==
=== M1 M series ===
Apple released its M1 M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M1 M series on Wikipedia]
== ARM Mali ==The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]
== Intel ==
=== Intel Xe 'Gen12' ===
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]* [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist series on Wikipedia]
==Nvidia==
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]
 
=== Grace Hopper Superchip ===
The Nvidia GH200 Grace Hopper Superchip was unveiled August, 2023 and combines the Nvidia Grace CPU ([[ARM|ARM v9]]) and Nvidia Hopper GPU architectures via NVLink to deliver a CPU+GPU coherent memory model for accelerated AI and HPC applications.
 
* [https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip NVIDIA Grace Hopper Superchip Data Sheet]
* [https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper NVIDIA Grace Hopper Superchip Architecture Whitepaper]
 
=== Ada Lovelace Architecture ===
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.
 
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]
* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]
 
=== Hopper Architecture ===
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.
 
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]
* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]
=== Ampere Architecture ===
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]
* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]
=== Turing Architecture ===
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.
* [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Turing Architecture Whitepaper]* [https://docs.nvidia.com/cuda/turing-tuning-guide/index.html Turing Tuning Guide]
=== Volta Architecture ===
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Volta Architecture Whitepaper]* [https://docs.nvidia.com/cuda/volta-tuning-guide/index.html Volta Tuning Guide]
=== Pascal Architecture ===
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.
* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Pascal Architecture Whitepaper]* [https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html Pascal Tuning Guide]
=== Maxwell Architecture ===
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.
* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Maxwell Architecture Whitepaper on archiv.org]* [https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html Maxwell Tuning Guide]
== PowerVR ==
PowerVR (Imagination Technologies ) licenses PowerVR IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available. === PowerVR === * [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia] === IMG === * [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]
=== PowerVR Graphics =Qualcomm ==Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.
=== Adreno ===* [https://en.wikipedia.org/wiki/PowerVRAdreno#PowerVR_Graphics PowerVR Series Variants Adreno variants on Wikipedia]
== Vivante Corporation ==
=== GC-Series ===
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC-Series series on Wikipedia]
=See also=
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
* [https://talkchess.com/forum3/viewtopic.php?f=7&t=72566&p=955538#p955538 Re: China boosts in silicon...] by [[Srdja Matovic]], [[CCC]], January 13, 2024
=External Links=
422
edits

Navigation menu