Changes

GPU

3,491 bytes added, 22:11, 17 January 2023

m

→‎Nvidia

* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]

* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]

* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]

* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]

== Further ==

A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. Depending on the architecture, the SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with a specific bit-width of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to further speed up neural networks.

{| class="wikitable" style="margin:auto"

|+ Vendor Terminology

|-

! AMD Terminology !! Nvidia Terminology

|-

| Compute Unit || Streaming Multiprocessor

|-

| Stream Core || CUDA Core

|-

| Wavefront || Warp

|}

===Hardware Examples===

Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref>

* 512 CUDA cores @1.544GHz

* 16 SMs - Streaming Multiprocessors

* organized in 2x16 CUDA cores per SM

* Warp size of 32 threads

AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref>

* 2048 Stream cores @0.925GHz

* 32 Compute Units

* organized in 4xSIMD16, each SIMT4, per Compute Unit

* Wavefront size of 64 work-items

===Wavefront and Warp===

Generalized, the Wavefront or Warp size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.
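The core counts in the two hardware examples follow from their organization, which can be checked with a short sketch. The function names are illustrative. The last call assumes that one SIMD lane handles one work-item per pass, which is only one possible reading of "SIMT4" and is not stated in the vendor documents cited here.

```python
# Figures are copied from the hardware examples above; function names are hypothetical.

def total_cores(compute_units: int, simd_units_per_cu: int, lanes_per_simd: int) -> int:
    """Cores = compute units x SIMD units per compute unit x lanes per SIMD unit."""
    return compute_units * simd_units_per_cu * lanes_per_simd


def passes_per_wave(wave_size: int, lanes_per_simd: int) -> int:
    """Passes one SIMD unit needs to cover a whole wave.

    Assumption: each lane processes exactly one work-item per pass.
    """
    return -(-wave_size // lanes_per_simd)  # ceiling division


# Nvidia GeForce GTX 580: 16 SMs, 2x16 CUDA cores per SM
print(total_cores(16, 2, 16))   # 512

# AMD Radeon HD 7970: 32 Compute Units, 4xSIMD16 per Compute Unit
print(total_cores(32, 4, 16))   # 2048

# Wavefront of 64 work-items on a SIMD16 unit
print(passes_per_wave(64, 16))  # 4
```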

=Programming Model=

A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized and have access to the same scratch-pad memory, with an architecture limit of how many work-items a work-group can hold and how many threads can run in total concurrently on the device.

{| class="wikitable" style="margin:auto"

|+ Terminology

|-

! OpenCL Terminology !! CUDA Terminology

|-

| Kernel || Kernel

|-

| Compute Unit || Streaming Multiprocessor

|-

| Processing Element || CUDA Core

|-

| Work-Item || Thread

|-

| Work-Group || Block

|-

| NDRange || Grid

|}

==Thread Examples==

Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref>

* Warp size: 32

* Maximum number of threads per block: 1024

* Maximum number of resident blocks per multiprocessor: 32

* Maximum number of resident warps per multiprocessor: 64

* Maximum number of resident threads per multiprocessor: 2048

AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref>

* Wavefront size: 64

* Maximum number of work-items per work-group: 1024

* Maximum number of work-groups per compute unit: 40

* Maximum number of Wavefronts per compute unit: 40

* Maximum number of work-items per compute unit: 2560
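As a rough illustration of how work-items, work-groups and the NDRange relate, the following sketch models a one-dimensional launch on the host. The class and function names are hypothetical. The index formula mirrors the usual one-dimensional global id computation (group id times group size plus local id, without a global offset). The limits are the figures quoted in the thread examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """Per-device limits, copied from the thread examples (illustrative container)."""
    wave_size: int            # warp / wavefront size
    max_items_per_group: int  # threads per block / work-items per work-group
    max_items_per_cu: int     # resident threads per SM / work-items per compute unit


GTX580 = Limits(wave_size=32, max_items_per_group=1024, max_items_per_cu=2048)
HD7970 = Limits(wave_size=64, max_items_per_group=1024, max_items_per_cu=2560)


def groups_needed(n_items: int, group_size: int) -> int:
    """Work-groups (blocks) needed so the NDRange (grid) covers n_items."""
    return -(-n_items // group_size)  # ceiling division


def global_id(group_id: int, local_id: int, group_size: int) -> int:
    """One-dimensional global index of a work-item inside the NDRange."""
    return group_id * group_size + local_id


def resident_waves(limits: Limits) -> int:
    """Warps / wavefronts that fit into the resident work-item limit."""
    return limits.max_items_per_cu // limits.wave_size


print(groups_needed(1_000_000, 256))  # 3907
print(global_id(3, 17, 256))          # 785
print(resident_waves(GTX580))         # 64
print(resident_waves(HD7970))         # 40
```

The last two results agree with the resident warp and Wavefront limits listed for the two cards.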

=Memory Model=

* __constant - read-only memory.

* __global - usually VRAM, accessible by all work-items (threads).

{| class="wikitable" style="margin:auto"

|+ Terminology

|-

! OpenCL Terminology !! CUDA Terminology

|-

| Private Memory || Registers

|-

| Local Memory || Shared Memory

|-

| Constant Memory || Constant Memory

|-

| Global Memory || Global Memory

|}
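When porting kernels between the two frameworks, the memory terms in the table can be kept as a small lookup. This is only a convenience sketch with hypothetical names. It encodes nothing beyond the four rows above.

```python
# Memory terminology from the table above; dictionary and function names are hypothetical.
OPENCL_TO_CUDA_MEMORY = {
    "Private Memory": "Registers",
    "Local Memory": "Shared Memory",
    "Constant Memory": "Constant Memory",
    "Global Memory": "Global Memory",
}

# Reverse lookup; valid because every CUDA term appears exactly once in the table.
CUDA_TO_OPENCL_MEMORY = {cuda: opencl for opencl, cuda in OPENCL_TO_CUDA_MEMORY.items()}


def to_cuda(opencl_term: str) -> str:
    """Translate an OpenCL memory term into the CUDA term from the table."""
    return OPENCL_TO_CUDA_MEMORY[opencl_term]


def to_opencl(cuda_term: str) -> str:
    """Translate a CUDA memory term into the OpenCL term from the table."""
    return CUDA_TO_OPENCL_MEMORY[cuda_term]


print(to_cuda("Local Memory"))     # Shared Memory
print(to_opencl("Registers"))      # Private Memory
```

Note that "Local Memory" in OpenCL corresponds to "Shared Memory" in CUDA, so the term should always be read together with the framework it comes from.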

===Memory Examples===

===Unified Memory===

Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.

=Instruction Throughput=

* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]

=== Navi 3x RDNA3 ===

RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.

* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]

* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]

=== CDNA2 ===

CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.

* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]

* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]

=== CDNA ===

* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]

* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]

=== Navi 2x RDNA2 ===

[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA2] cards were unveiled on October 28, 2020.

* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]

* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]

=== Navi RDNA 1 ===

[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.

* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]

* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]

=== Southern Islands GCN 1st gen ===

Southern Islands cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.

* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]

* [https://amd.wpenginepowered.com/wordpress/media/2013/10/si_programming_guide_v2.pdf Southern Islands Programming Guide]

* [https://amd.wpenginepowered.com/wordpress/media/2013/07/AMD_Southern_Islands_Instruction_Set_Architecture1.pdf Southern Islands Instruction Set Architecture]

== Apple ==

* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]

* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]

=== Hopper Architecture ===

* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]

* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]

=== Ampere Architecture ===

* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]

* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]

* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]

=== Turing Architecture ===

[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They were the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) ray tracing]. They were also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.

* [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Turing Architecture Whitepaper]

* [https://docs.nvidia.com/cuda/turing-tuning-guide/index.html Turing Tuning Guide]

=== Volta Architecture ===

[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].

* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Volta Architecture Whitepaper]

* [https://docs.nvidia.com/cuda/volta-tuning-guide/index.html Volta Tuning Guide]

=== Pascal Architecture ===

[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.

* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Pascal Architecture Whitepaper]

* [https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html Pascal Tuning Guide]

=== Maxwell Architecture ===

[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.

* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Maxwell Architecture Whitepaper on archive.org]

* [https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html Maxwell Tuning Guide]

== PowerVR ==

Smatovic

422

edits
