Changes

GPU

5,881 bytes added, 22:11, 17 January 2023


'''[[Main Page|Home]] * [[Hardware]] * GPU'''

[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]

'''GPU''' (Graphics Processing Unit),<br/>

=History=

In the 1970s and 1980s RAM was expensive, and home computers used custom graphics chips that operated directly on registers and memory without a dedicated frame buffer or texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. In the 1990s 3D graphics and 3D modeling became more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play games with 3D graphics. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.

=GPU in Computer Chess=

== Apple ==

Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].

* [https://developer.apple.com/opencl/ Apple OpenCL Developer]

* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]

* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]

* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]

* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]

== Further ==

A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. Depending on the architecture, the actual SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with a specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fixed bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.

{| class="wikitable" style="margin:auto"

|+ Vendor Terminology

|-

! AMD Terminology !! Nvidia Terminology

|-

| Compute Unit || Streaming Multiprocessor

|-

| Stream Core || CUDA Core

|-

| Wavefront || Warp

|}

===Hardware Examples===

Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref>

* 512 CUDA cores @1.544GHz

* 16 SMs - Streaming Multiprocessors

* organized in 2x16 CUDA cores per SM

* Warp size of 32 threads

AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN])<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref>

* 2048 Stream cores @0.925GHz

* 32 Compute Units

* organized in 4xSIMD16, each SIMT4, per Compute Unit

* Wavefront size of 64 work-items
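The core counts in the two hardware examples follow from the listed organization. The sketch below recomputes them in Python; the class and field names are hypothetical, and the cycle estimate is a simplified model that assumes one wave instruction is issued over wave size / SIMD width cycles, which matches the SIMT4 note for the HD 7970 but is not a vendor-documented formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuLayout:
    """Hypothetical container for the figures listed in the hardware examples."""
    name: str
    compute_units: int      # Streaming Multiprocessors (Nvidia) or Compute Units (AMD)
    simd_units_per_cu: int  # SIMD units inside one compute unit
    lanes_per_simd: int     # cores (processing elements) per SIMD unit
    wave_size: int          # warp size (Nvidia) or wavefront size (AMD)

    def total_cores(self) -> int:
        # compute units x SIMD units per compute unit x cores per SIMD unit
        return self.compute_units * self.simd_units_per_cu * self.lanes_per_simd

    def cycles_per_wave_instruction(self) -> int:
        # Simplified assumption: a wave wider than the SIMD unit
        # is issued over several consecutive cycles.
        return self.wave_size // self.lanes_per_simd


GTX_580 = GpuLayout("GeForce GTX 580", 16, 2, 16, 32)
HD_7970 = GpuLayout("Radeon HD 7970", 32, 4, 16, 64)

if __name__ == "__main__":
    for gpu in (GTX_580, HD_7970):
        print(gpu.name, gpu.total_cores(), gpu.cycles_per_wave_instruction())
```

Under these assumptions the model reproduces the 512 CUDA cores and 2048 stream cores given in the lists.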

===Wavefront and Warp===

In general, the Wavefront or Warp size is the number of threads executed together in SIMT fashion on a GPU with unified shader architecture.
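One consequence of SIMT execution is that all threads of a Warp or Wavefront share one instruction stream. The following Python sketch is a toy model, not a description of any specific hardware: it assumes that a diverging branch is handled by executing both paths one after the other with inactive lanes masked out. The function name and its parameters are hypothetical.

```python
def simt_branch(values, predicate, then_op, else_op):
    """Toy model of one branch executed by a single warp/wavefront.

    Returns the per-lane results and the number of serialized passes.
    """
    mask = [predicate(v) for v in values]  # one predicate bit per lane
    result = list(values)
    passes = 0
    if any(mask):
        # 'then' path: only lanes with a set mask bit write a result
        passes += 1
        result = [then_op(v) if m else r for v, m, r in zip(values, mask, result)]
    if not all(mask):
        # 'else' path: only the remaining lanes write a result
        passes += 1
        result = [else_op(v) if not m else r for v, m, r in zip(values, mask, result)]
    return result, passes


if __name__ == "__main__":
    warp = list(range(32))
    _, diverged = simt_branch(warp, lambda v: v % 2 == 0, lambda v: v * 2, lambda v: v + 1)
    _, uniform = simt_branch(warp, lambda v: v >= 0, lambda v: v * 2, lambda v: v + 1)
    print(diverged, uniform)
```

In this model a branch on which all lanes agree costs one pass, while a branch that splits the lanes costs two.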

=Programming Model=

A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a work-group, and one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized and have access to the same scratch-pad memory, with an architecture limit of how many work-items a work-group can hold and how many threads can run in total concurrently on the device.

{| class="wikitable" style="margin:auto"

|+ Terminology

|-

! OpenCL Terminology !! CUDA Terminology

|-

| Kernel || Kernel

|-

| Compute Unit || Streaming Multiprocessor

|-

| Processing Element || CUDA Core

|-

| Work-Item || Thread

|-

| Work-Group || Block

|-

| NDRange || Grid

|}

==Thread Examples==

Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref>

* Warp size: 32

* Maximum number of threads per block: 1024

* Maximum number of resident blocks per multiprocessor: 32

* Maximum number of resident warps per multiprocessor: 64

* Maximum number of resident threads per multiprocessor: 2048

AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref>

* Wavefront size: 64

* Maximum number of work-items per work-group: 1024

* Maximum number of work-groups per compute unit: 40

* Maximum number of Wavefronts per compute unit: 40

* Maximum number of work-items per compute unit: 2560
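The relation between work-items, work-groups and the NDRange can be sketched with a little index arithmetic. The Python functions below are hypothetical helpers for a one-dimensional NDRange; the residency estimate assumes that only the Wavefront and work-group limits listed above for the AMD Radeon HD 7970 apply, and it ignores register and local memory pressure.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def num_work_groups(total_items: int, work_group_size: int) -> int:
    # The NDRange is padded up to a whole number of work-groups.
    return ceil_div(total_items, work_group_size)


def global_id(group_id: int, local_id: int, work_group_size: int) -> int:
    # Position of a work-item inside a one-dimensional NDRange.
    return group_id * work_group_size + local_id


def resident_work_groups(work_group_size: int, wave_size: int,
                         max_waves_per_cu: int, max_groups_per_cu: int) -> int:
    # A work-group occupies whole wavefronts; both limits must hold.
    waves_per_group = ceil_div(work_group_size, wave_size)
    return min(max_groups_per_cu, max_waves_per_cu // waves_per_group)


if __name__ == "__main__":
    # Limits from the HD 7970 list: wavefront size 64, 40 wavefronts
    # and 40 work-groups per compute unit.
    groups = resident_work_groups(256, 64, 40, 40)
    print(num_work_groups(1000, 256), global_id(3, 5, 256), groups, groups * 256)
```

With work-groups of 256 work-items the estimate yields 10 resident work-groups, which is the 2560 work-items per compute unit from the list.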

=Memory Model=

* __constant - read-only memory.

* __global - usually VRAM, accessible by all work-items or threads.

{| class="wikitable" style="margin:auto"

|+ Terminology

|-

! OpenCL Terminology !! CUDA Terminology

|-

| Private Memory || Registers

|-

| Local Memory || Shared Memory

|-

| Constant Memory || Constant Memory

|-

| Global Memory || Global Memory

|}

===Memory Examples===

Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G. Compute Capabilities</ref>

* 128 KiB private memory per compute unit

* 48 KiB (16 KiB) local memory per compute unit (configurable)

* 768 KiB L2 cache

* 1.5 GiB to 3 GiB global memory

AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>

* 256 KiB private memory per compute unit

* 64 KiB local memory per compute unit
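The per-compute-unit figures translate into a budget per work-item once the number of resident work-items is known. The Python sketch below is a rough upper bound under stated assumptions: registers are 32 bits wide, the compute unit is fully occupied with the 2560 work-items given in the thread examples, and allocation granularity is ignored. The helper names are hypothetical.

```python
KIB = 1024


def registers_per_work_item(private_bytes_per_cu: int,
                            resident_work_items: int,
                            register_bytes: int = 4) -> int:
    # Private memory is the register file; split it evenly (rounded down).
    total_registers = private_bytes_per_cu // register_bytes
    return total_registers // resident_work_items


def local_bytes_per_work_group(local_bytes_per_cu: int,
                               resident_work_groups: int) -> int:
    # Local memory is shared by all work-groups resident on a compute unit.
    return local_bytes_per_cu // resident_work_groups


if __name__ == "__main__":
    # AMD Radeon HD 7970: 256 KiB private and 64 KiB local memory per compute unit
    print(registers_per_work_item(256 * KIB, 2560))
    print(local_bytes_per_work_group(64 * KIB, 10))
```

A kernel that needs more registers or local memory than this budget can still run, but fewer work-items fit on the compute unit at once.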

===Unified Memory===

Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.

=Instruction Throughput=

==Nvidia TensorCores==

: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref>
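The FP16xFP16+FP32 operation can be emulated on the CPU to see where precision is lost. The following Python sketch is an illustration only: it rounds the inputs to half precision and the running sum to single precision with the standard <code>struct</code> module, and it does not claim to reproduce the accumulation order or rounding behaviour of real TensorCores. The function names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_fp16_fp32(a, b, c):
    """Compute D = A x B + C with FP16 inputs and an FP32 accumulator."""
    rows, inner, cols = len(a), len(b), len(b[0])
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = to_fp32(c[i][j])
            for t in range(inner):
                # inputs are rounded to FP16, the sum is kept in FP32
                prod = to_fp16(a[i][t]) * to_fp16(b[t][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[0.5, 0.0], [0.0, 0.5]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(mma_fp16_fp32(a, b, c))
    print(to_fp16(0.1))  # 0.1 is not exactly representable in FP16
```

Values that are exactly representable in half precision pass through unchanged, while a value such as 0.1 is already rounded before the multiplication takes place.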

==AMD Matrix Cores==

: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.

==Intel XMX Cores==

: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.

=Host-Device Latencies=

* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]

=== Navi 3x RDNA3 ===

RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.

* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]

* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]

=== CDNA2 ===

CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.

* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]

* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]

=== CDNA ===

* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]

* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]

=== Navi 2x RDNA2 ===

[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA2] cards were unveiled on October 28, 2020.

* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]

* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]

=== Navi RDNA 1 ===

[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.

* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]

* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]

=== Southern Islands GCN 1st gen ===

Southern Islands cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.

* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]

* [https://amd.wpenginepowered.com/wordpress/media/2013/10/si_programming_guide_v2.pdf Southern Islands Programming Guide]

* [https://amd.wpenginepowered.com/wordpress/media/2013/07/AMD_Southern_Islands_Instruction_Set_Architecture1.pdf Southern Islands Instruction Set Architecture]

== Apple ==

=== M series ===

Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.

* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]

== ARM ==

== Intel ==

=== Xe ===

The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).

* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]

* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]

==Nvidia==

* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]

=== Ada Lovelace Architecture ===

The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.

* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]

* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]

=== Hopper Architecture ===

The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.

* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]

* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]

=== Ampere Architecture ===

* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]

* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]

* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]

=== Turing Architecture ===

[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing]. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.

* [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Turing Architecture Whitepaper]

* [https://docs.nvidia.com/cuda/turing-tuning-guide/index.html Turing Tuning Guide]

=== Volta Architecture ===

[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].

* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Volta Architecture Whitepaper]

* [https://docs.nvidia.com/cuda/volta-tuning-guide/index.html Volta Tuning Guide]

=== Pascal Architecture ===

[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.

* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Pascal Architecture Whitepaper]

* [https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html Pascal Tuning Guide]

=== Maxwell Architecture ===

[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.

* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Maxwell Architecture Whitepaper on archive.org]

* [https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html Maxwell Tuning Guide]

== PowerVR ==

* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020

* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]

* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021

* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]

Smatovic

422

edits
