=The Implicitly Parallel SIMD Model=
CUDA, OpenCL, HIP, and other GPU languages such as GLSL, HLSL, and C++AMP, as well as non-GPU languages like Intel [https://ispc.github.io/ ISPC], all share the same model of implicitly parallel programming. Gangs of threads, called warps in CUDA and wavefronts in OpenCL, execute concurrently on a SIMD unit. The GPU executes a whole warp (NVidia) or wavefront (AMD) at a time, with every thread in the gang stepping with the same program counter / instruction pointer. This causes issues with if-statements and while-loops: in the GPU hardware, threads that do not take a branch disable themselves while the rest of the gang executes it. This is called thread divergence and is a common source of GPU inefficiency.
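As a rough illustration, the hypothetical CUDA kernel below contains a data-dependent branch. Whenever the lanes of a single warp disagree on the condition, the hardware executes both sides of the branch one after the other, masking off the lanes that did not take the side currently running.

<syntaxhighlight lang="cuda">
// Illustrative sketch of branch divergence (kernel and names are hypothetical).
__global__ void divergent_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    if (in[i] > 0.0f) {
        // Lanes taking this path execute while the other lanes of the warp sit masked off...
        out[i] = sqrtf(in[i]);
    } else {
        // ...then the roles swap: the warp runs both halves of the branch serially
        // whenever at least one lane takes each path.
        out[i] = 0.0f;
    }
}
</syntaxhighlight>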
 
Even at the lowest machine level, threads are ganged into warps or wavefronts. There is no way to schedule anything smaller than 32 threads at a time on NVidia Turing hardware. As such, the programmer must imagine this group of 32 (NVidia Turing, AMD RDNA) or 64 (AMD GCN) threads working in lockstep throughout their code.
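A sketch of what programming at that granularity looks like: the helper below (name illustrative) sums a value across the 32 lanes of a warp using CUDA's warp-shuffle intrinsics, available since CUDA 9, treating the warp itself as the basic unit of work.

<syntaxhighlight lang="cuda">
// Warp-level reduction: every lane contributes one value, lane 0 ends with the sum.
__inline__ __device__ float warp_reduce_sum(float val)
{
    // Each step halves the number of lanes still holding partial sums;
    // the 0xffffffff mask means all 32 lanes of the warp participate.
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}
</syntaxhighlight>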
 
Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. Up to 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
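A minimal sketch of block-level cooperation in CUDA, assuming a hypothetical kernel launched with 256 threads per block: the threads of one block stage data in shared memory and synchronize with __syncthreads(), which maps to the single barrier instruction mentioned above.

<syntaxhighlight lang="cuda">
// One partial sum per block, computed cooperatively in shared memory.
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float tile[256];              // shared memory visible to the whole block
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    tile[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                         // block-wide barrier: one instruction

    // Tree reduction within the block, synchronizing between steps.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            tile[tid] += tile[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        out[blockIdx.x] = tile[0];           // one result per block
}
</syntaxhighlight>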
 
Workgroups are not the end of scaling, however. GPUs can execute many workgroups in parallel: an AMD Vega compute unit supports up to 40 wavefronts in flight, and a Vega64 GPU has 64 compute units available. It is common to have tens of thousands of threads executing concurrently on a GPU.
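From the programmer's side, reaching that scale is just a matter of launching enough blocks. The hypothetical launch below creates enough 256-thread blocks to cover n elements, so for a few million elements tens of thousands of threads are in flight across the GPU's compute units or SMs.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

// Classic SAXPY kernel: each thread handles one element.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    if (i < n)
        y[i] = a * x[i] + y[i];
}

void launch_saxpy(float a, const float *d_x, float *d_y, int n)
{
    int block = 256;                                 // threads per block (one workgroup)
    int grid  = (n + block - 1) / block;             // enough blocks to cover all n elements
    saxpy<<<grid, block>>>(a, d_x, d_y, n);          // blocks spread across the SMs / CUs
}
</syntaxhighlight>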
 
NVidia RTX 2070 (Turing) summary: 32 threads per warp; 1 to 32 warps per block. The RTX 2070 has 36 SMs, each supporting up to 32 resident warps, for a total of 1152 concurrent warps at max occupancy (36,864 threads).
 
AMD Vega64 (Vega) summary: 64 threads per wavefront; 1 to 16 wavefronts per workgroup. Vega64 has 64 compute units, each supporting up to 40 resident wavefronts, for a total of 2560 concurrent wavefronts at max occupancy (163,840 threads).
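These per-GPU figures can also be derived at runtime instead of hard-coded. The small CUDA host program below (illustrative; device 0 assumed) queries the device properties and computes the same maximum-occupancy numbers.

<syntaxhighlight lang="cuda">
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                 // properties of device 0

    int maxThreads = prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
    int maxWarps   = maxThreads / prop.warpSize;

    std::printf("%s: %d SMs/CUs, %d resident warps/wavefronts (%d threads) at max occupancy\n",
                prop.name, prop.multiProcessorCount, maxWarps, maxThreads);
    return 0;
}
</syntaxhighlight>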
 
The challenge for GPU compute languages is to give the programmer the flexibility to exploit memory optimizations at the CUDA Block or OpenCL Workgroup level (~1024 threads), while still being able to express the tens of thousands of physical threads working on a typical GPU.
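One common way to navigate that trade-off in CUDA is sketched below: ask the runtime's occupancy calculator for a block size that keeps each SM full, then size the grid to cover the whole problem. The kernel and function names are illustrative.

<syntaxhighlight lang="cuda">
#include <cuda_runtime.h>

__global__ void scale(float *data, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= a;
}

void launch_tuned(float *d_data, float a, int n)
{
    int minGridSize = 0, blockSize = 0;
    // Ask the occupancy calculator for a block size (up to 1024 threads) that
    // maximizes resident warps per SM for this particular kernel.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;   // enough blocks for all n elements
    scale<<<gridSize, blockSize>>>(d_data, a, n);
}
</syntaxhighlight>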
=Inside=
