GPU

The Implicitly Parallel SIMD Model
Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. Up to 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
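As a CPU-side analogy (not GPU code), the workgroup barrier described above can be sketched with Python's threading.Barrier: no thread in the "workgroup" proceeds past the barrier until every thread has arrived, which is the guarantee the single PTX/GCN barrier instruction provides in hardware. The sizes and the two-phase pattern here are illustrative only.

```python
import threading

WORKGROUP_SIZE = 8           # stand-in for up to 1024 GPU threads
barrier = threading.Barrier(WORKGROUP_SIZE)
shared = [0] * WORKGROUP_SIZE   # stand-in for shared memory
results = [0] * WORKGROUP_SIZE

def worker(tid):
    # Phase 1: each thread writes its own slot of "shared memory".
    shared[tid] = tid * tid
    # Barrier: no thread continues until all have written
    # (analogous to __syncthreads() in CUDA).
    barrier.wait()
    # Phase 2: every thread can now safely read a neighbour's slot.
    results[tid] = shared[(tid + 1) % WORKGROUP_SIZE]

threads = [threading.Thread(target=worker, args=(t,))
           for t in range(WORKGROUP_SIZE)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # each slot holds the neighbour's square
```

Without the barrier, a fast thread could read its neighbour's slot before the neighbour had written it; on a GPU the same hazard applies to shared memory within a workgroup.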
Workgroups are not the end of scaling, however: GPUs can run many workgroups in parallel. AMD Vega Compute Units (CUs) can schedule 40 wavefronts per CU (although each CU only physically executes 4 wavefronts concurrently), and a Vega64 GPU has 64 CUs. AMD Vega64 (Vega) Summary: 64 Threads per Wavefront. 1 to 16 Wavefronts per Workgroup. With 64 CUs each supporting 40 wavefronts, a total of 2560 wavefronts (163,840 threads) can be loaded per AMD Vega64.
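The Vega64 totals above follow directly from multiplying out the scheduling limits; a quick sanity check:

```python
# AMD Vega64 occupancy limits, as described above.
threads_per_wavefront = 64
wavefronts_per_cu = 40   # scheduled; only 4 execute physically at once
compute_units = 64

total_wavefronts = wavefronts_per_cu * compute_units
total_threads = total_wavefronts * threads_per_wavefront
print(total_wavefronts, total_threads)  # 2560 163840
```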
NVidia has a similar language and mechanism. NVidia GPUs can run many blocks in parallel. NVidia Streaming Multiprocessors (SMs) can schedule 32 warps per SM (although each SM only physically executes one warp at a time), and there are 40 SMs on an RTX 2070. NVidia RTX 2070 (Turing) Summary: 32 Threads per Warp. 1 to 32 Warps per Block. With 40 SMs each supporting 32 warps, a total of 1280 warps (40,960 threads) can be scheduled at max occupancy on an RTX 2070.
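The same multiplication gives the RTX 2070 totals above:

```python
# NVidia RTX 2070 occupancy limits, as described above.
threads_per_warp = 32
warps_per_sm = 32   # scheduled per Streaming Multiprocessor
sms = 40

total_warps = warps_per_sm * sms
total_threads = total_warps * threads_per_warp
print(total_warps, total_threads)  # 1280 40960
```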
The challenge of GPU compute languages is to give the programmer the flexibility to take advantage of memory optimizations at the CUDA Block or OpenCL Workgroup level (~1024 threads), while still being able to specify the tens of thousands of physical threads working on the typical GPU.
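The two levels can be made concrete with a hypothetical launch-size calculation: the programmer picks a workgroup size (capped at 1024 on current GPUs), and enough workgroups are launched to cover all N work items via ceiling division. The function name and defaults are illustrative, not any particular API.

```python
def launch_shape(n_items, workgroup_size=1024):
    """Return (num_workgroups, workgroup_size) covering n_items threads."""
    if not 1 <= workgroup_size <= 1024:
        raise ValueError("workgroup size limited to 1024 on current GPUs")
    # Ceiling division: the last workgroup may be partially idle.
    num_workgroups = (n_items + workgroup_size - 1) // workgroup_size
    return num_workgroups, workgroup_size

# 163,840 items (a full Vega64 load) fit in 160 workgroups of 1024 threads.
print(launch_shape(163_840))   # (160, 1024)
print(launch_shape(100_000))   # (98, 1024) -- last group partially idle
```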
