Changes

Jump to: navigation, search

GPU

5,881 bytes added, 22:11, 17 January 2023
m
Nvidia
'''[[Main Page|Home]] * [[Hardware]] * GPU'''
[[FILE:6600GT GPUNvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/GeForce_6_series GeForce 6600GT (NV43)Nvidia_Tesla Nvidia Tesla] GPU <ref>[https://commons.wikimedia.org/wiki/Graphics_processing_unit Graphics processing unit - File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]
'''GPU''' (Graphics Processing Unit),<br/>
=History=
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.
=GPU in Computer Chess=
== Apple ==
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal ] is recommended by [[Apple]].
* [https://developer.apple.com/opencl/ Apple OpenCL Developer]
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]
* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]
== Further ==
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities - floating-point and/or integer with specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
 
{| class="wikitable" style="margin:auto"
|+ Vendor Terminology
|-
! AMD Terminology !! Nvidia Terminology
|-
| Compute Unit || Streaming Multiprocessor
|-
| Stream Core || CUDA Core
|-
| Wavefront || Warp
|}
 
===Hardware Examples===
 
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref>
 
* 512 CUDA cores @1.544GHz
* 16 SMs - Streaming Multiprocessors
* organized in 2x16 CUDA cores per SM
* Warp size of 32 threads
 
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref>
 
* 2048 Stream cores @0.925GHz
* 32 Compute Units
* organized in 4xSIMD16, each SIMT4, per Compute Unit
* Wavefront size of 64 work-items
 
===Wavefront and Warp===
Generalized the definition of the Wavefront and Warp size is the amount of threads executed in SIMT fashion on a GPU with unified shader architecture.
=Programming Model=
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a block (work-group in OpenCL), one or multiple blocks work-groups form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads work-items a block work-group can hold and how many threads can run in total concurrently on the device. {| class="wikitable" style="margin:auto"|+ Terminology|-! OpenCL Terminology !! CUDA Terminology|-| Kernel || Kernel|-| Compute Unit || Streaming Multiprocessor|-| Processing Element || CUDA Core|-| Work-Item || Thread|-| Work-Group || Block|-| NDRange || Grid|-|} ==Thread Examples== Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref> * Warp size: 32* Maximum number of threads per block: 1024* Maximum number of resident blocks per multiprocessor: 32* Maximum number of resident warps per multiprocessor: 64* Maximum number of resident threads per multiprocessor: 2048  AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref> * Wavefront size: 64* Maximum number of work-items per work-group: 1024* Maximum number of work-groups per compute unit: 40* Maximum number of Wavefronts per compute unit: 40* Maximum number of work-items per compute unit: 2560
=Memory Model=
* __constant - read-only memory.
* __global - usually VRAM, accessable by all work-items resp. threads.
 
{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Private Memory || Registers
|-
| Local Memory || Shared Memory
|-
| Constant Memory || Constant Memory
|-
| Global Memory || Global Memory
|}
===Memory Examples===
Here the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>
* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 768 KiB L2 cache
* 1.5 GiB to 3 GiB global memory
Here the data for the AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) as an example: <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>
* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
===Unified Memory===
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
=Instruction Throughput=
==Nvidia TensorCores==
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>Ada Lovelaces's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref>
==AMD Matrix Cores==
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.
==Intel XMX Cores==
: Intel plans added XMX, Xe Matrix eXtensions, for its upcoming cores to the [https://wwwen.anandtechwikipedia.comorg/showwiki/15973/the-intel-xelp-gpu-architecture-deep-dive-building-up-from-the-bottom/4 Xe discrete List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU] series.
=Host-Device Latencies=
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
=== CDNA 2 Navi 3x RDNA3 === CDNA 2 RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators. * [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture] === CDNA2 === CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]
=== CDNA ===
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]
=== Navi 2X RDNA 2 2x RDNA2 === [https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2RDNA2] cards were unveiled on October 28, 2020.
* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
=== Navi RDNA 1 === [https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
 
=== Southern Islands GCN 1st gen ===
 
Southern Island cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.
 
* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]
* [https://amd.wpenginepowered.com/wordpress/media/2013/10/si_programming_guide_v2.pdf Southern Islands Programming Guide]
* [https://amd.wpenginepowered.com/wordpress/media/2013/07/AMD_Southern_Islands_Instruction_Set_Architecture1.pdf Southern Islands Instruction Set Architecture]
== Apple ==
=== M1 M series ===
Apple released its M1 M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series M1 Apple M series on Wikipedia]
== ARM ==
== Intel ==
=== Xe 'Gen12' ===
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]
==Nvidia==
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]
 
=== Ada Lovelace Architecture ===
 
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.
 
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]
* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]
 
=== Hopper Architecture ===
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.
 
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]
* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]
=== Ampere Architecture ===
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]
* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]
=== Turing Architecture ===
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.
* [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Turing Architecture Whitepaper]* [https://docs.nvidia.com/cuda/turing-tuning-guide/index.html Turing Tuning Guide]
=== Volta Architecture ===
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Volta Architecture Whitepaper]* [https://docs.nvidia.com/cuda/volta-tuning-guide/index.html Volta Tuning Guide]
=== Pascal Architecture ===
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.
* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Pascal Architecture Whitepaper]* [https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html Pascal Tuning Guide]
=== Maxwell Architecture ===
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.
* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Maxwell Architecture Whitepaper on archiv.org]* [https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html Maxwell Tuning Guide]
== PowerVR ==
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
422
edits

Navigation menu