Changes

Jump to: navigation, search

GPU

3,491 bytes added, 22:11, 17 January 2023
m
Nvidia
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]
* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]
== Further ==
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities - floating-point and/or integer with specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
 
{| class="wikitable" style="margin:auto"
|+ Vendor Terminology
|-
! AMD Terminology !! Nvidia Terminology
|-
| Compute Unit || Streaming Multiprocessor
|-
| Stream Core || CUDA Core
|-
| Wavefront || Warp
|}
===Hardware Examples===
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref>
* 512 cuda CUDA cores @1.544GHz* 16 SMs - Streaming Multiprocessors (Compute Units)* organized in 2x16 cuda CUDA cores per SM
* Warp size of 32 threads
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref>
* 2048 stream Stream cores @0.925GHz
* 32 Compute Units
* organized in 4xSIMD16/, each SIMT4 , per Compute Unit* Wavefront size of 64 Workwork-Itemsitems ===Wavefront and Warp===Generalized the definition of the Wavefront and Warp size is the amount of threads executed in SIMT fashion on a GPU with unified shader architecture.
=Programming Model=
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a block (work-group in OpenCL), one or multiple blocks work-groups form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads work-items a block work-group can hold and how many threads can run in total concurrently on the device. {| class="wikitable" style="margin:auto"|+ Terminology|-! OpenCL Terminology !! CUDA Terminology|-| Kernel || Kernel|-| Compute Unit || Streaming Multiprocessor|-| Processing Element || CUDA Core|-| Work-Item || Thread|-| Work-Group || Block|-| NDRange || Grid|-|} ==Thread Examples== Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref> * Warp size: 32* Maximum number of threads per block: 1024* Maximum number of resident blocks per multiprocessor: 32* Maximum number of resident warps per multiprocessor: 64* Maximum number of resident threads per multiprocessor: 2048  AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref> * Wavefront size: 64* Maximum number of work-items per work-group: 1024* Maximum number of work-groups per compute unit: 40* Maximum number of Wavefronts per compute unit: 40* Maximum number of work-items per compute unit: 2560
=Memory Model=
* __constant - read-only memory.
* __global - usually VRAM, accessable by all work-items resp. threads.
 
{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Private Memory || Registers
|-
| Local Memory || Shared Memory
|-
| Constant Memory || Constant Memory
|-
| Global Memory || Global Memory
|}
===Memory Examples===
===Unified Memory===
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
=Instruction Throughput=
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
=== Navi 3x RDNA 3 RDNA3 === RDNA 3 RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]
=== CDNA 2 CDNA2 === CDNA 2 CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]
=== CDNA ===
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]
=== Navi 2x RDNA 2 RDNA2 === [https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2RDNA2] cards were unveiled on October 28, 2020.
* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
=== Navi RDNA 1 === [https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
 
=== Southern Islands GCN 1st gen ===
 
Southern Island cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.
 
* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]
* [https://amd.wpenginepowered.com/wordpress/media/2013/10/si_programming_guide_v2.pdf Southern Islands Programming Guide]
* [https://amd.wpenginepowered.com/wordpress/media/2013/07/AMD_Southern_Islands_Instruction_Set_Architecture1.pdf Southern Islands Instruction Set Architecture]
== Apple ==
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]
* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]
=== Hopper Architecture ===
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]
* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]
=== Ampere Architecture ===
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]
* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]
=== Turing Architecture ===
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.
* [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Turing Architecture Whitepaper]* [https://docs.nvidia.com/cuda/turing-tuning-guide/index.html Turing Tuning Guide]
=== Volta Architecture ===
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Volta Architecture Whitepaper]* [https://docs.nvidia.com/cuda/volta-tuning-guide/index.html Volta Tuning Guide]
=== Pascal Architecture ===
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.
* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Pascal Architecture Whitepaper]* [https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html Pascal Tuning Guide]
=== Maxwell Architecture ===
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.
* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Maxwell Architecture Whitepaper on archiv.org]* [https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html Maxwell Tuning Guide]
== PowerVR ==
422
edits

Navigation menu