GPU

The Implicitly Parallel SIMD Model
Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. Up to 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
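As a CPU-side analogy (not GPU code), the workgroup barrier described above can be sketched with Python's threading.Barrier: no thread in the "workgroup" proceeds past the barrier until every thread has arrived, which is the guarantee the single PTX/GCN barrier instruction provides in hardware. The sizes and the two-phase pattern here are illustrative only.

```python
import threading

WORKGROUP_SIZE = 8           # stand-in for up to 1024 GPU threads
barrier = threading.Barrier(WORKGROUP_SIZE)
shared = [0] * WORKGROUP_SIZE   # stand-in for shared memory
results = [0] * WORKGROUP_SIZE

def worker(tid):
    # Phase 1: each thread writes its own slot of "shared memory".
    shared[tid] = tid * tid
    # Barrier: no thread continues until all have written
    # (analogous to __syncthreads() in CUDA).
    barrier.wait()
    # Phase 2: every thread can now safely read a neighbour's slot.
    results[tid] = shared[(tid + 1) % WORKGROUP_SIZE]

threads = [threading.Thread(target=worker, args=(t,))
           for t in range(WORKGROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # each slot holds the neighbour's square
```

Without the barrier, a fast thread could read its neighbour's slot before the neighbour had written it; on a GPU the same hazard applies to shared memory within a workgroup.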
Workgroups are not the end of scaling, however: GPUs can run many workgroups in parallel. AMD Vega Compute Units (CUs) can schedule 40 wavefronts per CU (although each CU only physically executes 4 wavefronts concurrently), and a Vega64 GPU has 64 CUs. AMD Vega64 (Vega) Summary: 64 Threads per Wavefront. 1 to 16 Wavefronts per Workgroup. With 64 CUs each supporting 40 wavefronts, a total of 2560 wavefronts (163,840 threads) can be loaded per AMD Vega64.
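The Vega64 totals above follow directly from multiplying out the scheduling limits; a quick sanity check:

```python
# AMD Vega64 occupancy limits, as described above.
threads_per_wavefront = 64
wavefronts_per_cu = 40   # scheduled; only 4 execute physically at once
compute_units = 64

total_wavefronts = wavefronts_per_cu * compute_units
total_threads = total_wavefronts * threads_per_wavefront
print(total_wavefronts, total_threads)  # 2560 163840
```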
NVidia has a similar language and mechanism. NVidia GPUs can run many blocks in parallel. NVidia Streaming Multiprocessors (SMs) can schedule 32 warps per SM (although each SM only physically executes one warp at a time), and there are 40 SMs on an RTX 2070. NVidia RTX 2070 (Turing) Summary: 32 Threads per Warp. 1 to 32 Warps per Block. With 40 SMs each supporting 32 warps, a total of 1280 warps (40,960 threads) can be scheduled at max occupancy on an RTX 2070.
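The same multiplication gives the RTX 2070 totals above:

```python
# NVidia RTX 2070 occupancy limits, as described above.
threads_per_warp = 32
warps_per_sm = 32   # scheduled per Streaming Multiprocessor
sms = 40

total_warps = warps_per_sm * sms
total_threads = total_warps * threads_per_warp
print(total_warps, total_threads)  # 1280 40960
```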
The challenge of GPU compute languages is to give the programmer the flexibility to take advantage of memory optimizations at the CUDA Block or OpenCL Workgroup level (~1024 threads), while still being able to specify the tens of thousands of physical threads working on the typical GPU.
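The two levels can be made concrete with a hypothetical launch-size calculation: the programmer picks a workgroup size (capped at 1024 on current GPUs), and enough workgroups are launched to cover all N work items via ceiling division. The function name and defaults are illustrative, not any particular API.

```python
def launch_shape(n_items, workgroup_size=1024):
    """Return (num_workgroups, workgroup_size) covering n_items threads."""
    if not 1 <= workgroup_size <= 1024:
        raise ValueError("workgroup size limited to 1024 on current GPUs")
    # Ceiling division: the last workgroup may be partially idle.
    num_workgroups = (n_items + workgroup_size - 1) // workgroup_size
    return num_workgroups, workgroup_size

# 163,840 items (a full Vega64 load) fit in 160 workgroups of 1024 threads.
print(launch_shape(163_840))   # (160, 1024)
print(launch_shape(100_000))   # (98, 1024) -- last group partially idle
```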
