GPU

57 bytes removed, 17:06, 9 August 2019
This highlights the fundamental trade off of the GPU platform. GPUs have many threads of execution, but they are forced to execute with their warps or wavefronts. In complicated loops or trees of if-statements, this thread divergence problem can cause your code to potentially leave many hardware threads idle.
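The cost of divergence can be sketched with a toy lockstep model (a hypothetical simulation, not any vendor's real scheduler): when lanes of a warp disagree on a branch, the hardware runs each side serially with the non-participating lanes masked off, so a fully divergent if/else pays for both bodies.

```python
# Toy model of SIMD branch divergence (illustrative only; real GPU
# schedulers and masking hardware are far more sophisticated).
WARP_SIZE = 32

def issue_slots(branch_bodies, taken_mask):
    """Count serialized issue slots a warp needs for an if/else.

    branch_bodies: (instructions_in_then, instructions_in_else)
    taken_mask: list of WARP_SIZE booleans, True = lane takes 'then'.
    """
    then_len, else_len = branch_bodies
    slots = 0
    if any(taken_mask):        # at least one lane executes the 'then' side
        slots += then_len
    if not all(taken_mask):    # at least one lane executes the 'else' side
        slots += else_len
    return slots

# All 32 lanes agree: only one side of the branch is executed.
uniform = [True] * WARP_SIZE
# A single lane disagrees: both sides are executed back to back.
divergent = [True] * (WARP_SIZE - 1) + [False]

print(issue_slots((10, 10), uniform))    # 10 slots
print(issue_slots((10, 10), divergent))  # 20 slots, many lanes idle each pass
```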
== Blocks and Workgroups ==
The GPU hardware will execute entire warps or wavefronts at a time. Anything less than 32 threads will force some SIMD lanes to idle. As such, high-performance programmers should try to schedule as many full warps or wavefronts as possible.

Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. 1024 threads can work together on a modern GPU Compute Unit (AMD) or Symmetric Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
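The barrier-synchronized, shared-memory style of cooperation within a workgroup can be mimicked on the CPU with Python threads, using `threading.Barrier` as a stand-in for a GPU barrier instruction (a sketch of the tree-reduction pattern, not real GPU code):

```python
import threading

GROUP_SIZE = 8                            # stand-in for a small workgroup
shared = list(range(GROUP_SIZE))          # stand-in for shared/local memory
barrier = threading.Barrier(GROUP_SIZE)   # stand-in for a thread barrier

def reduce_thread(tid):
    # Classic tree reduction: halve the number of active threads each step.
    stride = GROUP_SIZE // 2
    while stride > 0:
        if tid < stride:
            shared[tid] += shared[tid + stride]
        barrier.wait()   # every thread syncs before the next step
        stride //= 2

threads = [threading.Thread(target=reduce_thread, args=(t,))
           for t in range(GROUP_SIZE)]
for t in threads: t.start()
for t in threads: t.join()

print(shared[0])  # 0+1+...+7 = 28
```

On a real GPU this pattern is cheap precisely because, as noted above, the barrier is a single instruction among threads of the same workgroup; here it only illustrates the dataflow.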
== Grids and NDRange ==
Workgroups are not the end of scaling: many blocks can be specified in a CUDA Grid, while many workgroups operate over an OpenCL NDRange. AMD Vega Compute Units (CUs) can schedule 40 wavefronts per CU (although each only physically executes 4 wavefronts concurrently), and 64 CUs are available on a Vega64 GPU. AMD Vega64 (Vega) Summary: 64 Threads per Wavefront, 1 to 16 Wavefronts per Workgroup. With 64 CUs each supporting 40 wavefronts, a total of 2560 wavefronts (163,840 threads) can be loaded onto an AMD Vega64.
NVidia has a similar mechanism: the underlying hardware supports running many blocks or workgroups in parallel, across different compute units. NVidia Symmetric Multiprocessors (SMs) can schedule 32 warps per SM (although each can only physically execute 1 warp at a time), and 40 SMs are available on an RTX 2070. NVidia RTX 2070 (Turing) Summary: 32 Threads per Warp, 1 to 32 Warps per Block. With 40 SMs each supporting 32 warps, a total of 1280 warps (40,960 threads) can be scheduled per RTX 2070. The hardware scheduler can fit many blocks and workgroups per compute unit; the exact number depends on the amount of registers, memory, and wavefronts a particular workgroup uses.
The challenge of GPU compute languages is to provide the programmer the flexibility to take advantage of memory optimizations at the CUDA Block or OpenCL Workgroup level (~1024 threads), while still being able to specify the tens-of-thousands of physical threads working on the typical GPU. Grids and NDRanges may operate in parallel, or may be traversed sequentially if the GPU doesn't have enough parallel resources.
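The occupancy figures quoted above reduce to simple arithmetic, sketched here with the numbers from this page:

```python
# Maximum resident threads = compute units x groups/unit x threads/group.
def max_resident_threads(compute_units, groups_per_unit, threads_per_group):
    return compute_units * groups_per_unit * threads_per_group

# AMD Vega64: 64 CUs x 40 wavefronts/CU x 64 threads/wavefront
vega64 = max_resident_threads(64, 40, 64)
print(vega64)    # 163840 threads (2560 wavefronts)

# NVidia RTX 2070: 40 SMs x 32 warps/SM x 32 threads/warp
rtx2070 = max_resident_threads(40, 32, 32)
print(rtx2070)   # 40960 threads (1280 warps)
```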
= Architectures and Physical Hardware =
== NVidia ==
NVidia's consumer line of cards is Geforce, branded with RTX or GTX labels. NVidia's professional line of cards is Quadro. Finally, Tesla cards constitute NVidia's server line.
NVidia's "Titan" line of Geforce cards use consumer drivers, but internally use professional or server class chips. As such, the Titan line can cost anywhere from $1000 to $3000 per card.
=== Turing Architecture ===
 
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]
Turing cards were first released in 2018. They are the first consumer cards to launch with RTX, or raytracing, features. RTX instructions accelerate the traversal of an AABB (axis-aligned bounding box) tree to discover ray intersections with lists of objects. These are also the first consumer cards to launch with Tensor cores: 4x4 FP16 matrix-multiplication instructions to accelerate convolutional neural networks.
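The basic Tensor core operation is a fused multiply-accumulate over small matrix tiles, D = A*B + C on 4x4 operands. A plain-Python sketch of that operation (illustrative only; real Tensor cores take FP16 inputs with FP16/FP32 accumulation and execute per warp):

```python
def mma_4x4(A, B, C):
    """D = A @ B + C for 4x4 matrices, the shape of one tensor-core op."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j]
             for j in range(4)]
            for i in range(4)]

I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # identity
Z = [[0] * 4 for _ in range(4)]                                 # zeros
A = [[i + j for j in range(4)] for i in range(4)]

print(mma_4x4(A, I, Z) == A)  # A * I + 0 == A -> True
```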
=== Volta Architecture ===
 
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]
Volta cards were first released in 2017. Only Tesla and Titan cards were produced in this generation, aiming only for the most expensive end of the market. They were the first cards to launch with Tensor cores, supporting 4x4 FP16 matrix multiplications to accelerate convolutional neural networks.
=== Pascal Architecture ===
 
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]
Pascal cards were first released in 2016.
== AMD ==
=== RDNA 1.0 ===
 
[https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
RDNA cards were first released in 2019. RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads.
* 5700
=== Vega GCN 5th gen ===
 
[https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
Vega cards were first released in 2017.
* Vega56
=== Polaris GCN 4th gen ===
 
[https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
* RX 580
