=The Implicitly Parallel SIMT Programming Model=
CUDA, OpenCL, and ROCm HIP all share the same implicitly parallel programming model. All threads are given an identifier: a threadIdx in CUDA or local_id in OpenCL. Aside from this index, all threads of a kernel will execute the same code. The only way to alter the behavior of the code is to use this threadIdx to access different data.
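As a minimal sketch of this pattern (the kernel name and arrays below are illustrative, not from the original text), a vector-add kernel shows the idea: every thread runs exactly the same code, and the index derived from threadIdx is the only thing that decides which elements a thread touches.

__global__ void vectorAdd(const float* a, const float* b, float* c, int n) {
    // Every thread executes this same code; the global index built from
    // blockIdx/threadIdx is what distinguishes one thread's work from another's.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}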
The executed code is always implicitly [[SIMD_Techniques| SIMD]]. Instead of thinking in terms of SIMD lanes, each lane is treated as its own thread. The smallest group of threads is called a CUDA Warp, or OpenCL Wavefront. NVidia GPUs execute 32 threads per warp, while AMD GCN GPUs execute 64 threads per wavefront. All threads within a warp or wavefront share an instruction pointer. Consider the following CUDA code:
if(threadIdx.x == 0){
    doA();
} else {
    doB();
}
While only one thread in the warp has threadIdx.x == 0, all 32 threads of the warp will have their shared instruction pointer execute doA() together. To keep the code semantically correct, threads #1 through #31 will have their NVidia predicate register cleared (or their AMD execution mask bit cleared), which means those threads throw away their results after executing each statement. For those familiar with x64 AVX code, a GPU thread is comparable to a SIMD lane in AVX: all lanes execute any particular instruction together, but you may throw away the results of some lanes using mask or comparison instructions.
Once doA() is complete, the machine continues with doB(). In this case, thread #0 will have its execution mask cleared, while threads #1 through #31 will actually keep the results of doB().
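The following runnable sketch makes this concrete. The kernel and variable names are illustrative: doA() and doB() are replaced with hypothetical stand-ins that write distinct marker values, so the surviving results show which branch each thread kept, even though the whole warp stepped through both branches.

#include <cstdio>

// Hypothetical stand-ins for doA()/doB(): each branch writes a marker value so
// we can see which branch a given thread's surviving result came from.
__global__ void divergeDemo(int* out) {
    if (threadIdx.x == 0) {
        out[threadIdx.x] = 100;   // "doA()": only thread 0 keeps this result
    } else {
        out[threadIdx.x] = 200;   // "doB()": threads 1..31 keep this result
    }
    // The hardware ran both branches over the whole warp; the execution mask
    // simply discarded the writes of the inactive threads in each pass.
}

int main() {
    int* out;
    cudaMallocManaged(&out, 32 * sizeof(int));
    divergeDemo<<<1, 32>>>(out);     // launch exactly one full warp
    cudaDeviceSynchronize();
    for (int i = 0; i < 32; i++) printf("%d ", out[i]);
    printf("\n");
    cudaFree(out);
    return 0;
}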
== Building up to larger thread groups ==
Even at the lowest machine level, threads are ganged together: the GPU hardware executes entire warps or wavefronts at a time. Anything less than 32 threads at a time on NVidia Turing hardware will force some SIMD threads to idle. As such, high-performance programmers should try to schedule as many full warps of 32 threads (NVidia Turing, AMD RDNA) or wavefronts of 64 threads (AMD GCN) as possible.
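One common way to follow this advice is to pick a block size that is a multiple of the warp size and round the grid size up with a ceiling division. The sketch below reuses the hypothetical vectorAdd kernel from the earlier example and is illustrative rather than prescriptive.

// Illustrative host-side launch: a block size that is a multiple of the warp
// size (32 on NVidia, 64 per wavefront on AMD GCN) keeps the SIMD lanes full.
void launchVectorAdd(const float* a, const float* b, float* c, int n) {
    const int blockSize = 256;                              // 8 full warps per block
    const int gridSize  = (n + blockSize - 1) / blockSize;  // ceiling division
    vectorAdd<<<gridSize, blockSize>>>(a, b, c, n);         // kernel from the earlier sketch
}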
Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. Up to 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
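The sketch below illustrates this kind of intra-block cooperation: a block-wide sum through shared memory, with __syncthreads() as the block barrier (bar.sync in PTX, s_barrier on GCN). The kernel name is illustrative, and it assumes a power-of-two block size of at most 1024 threads.

// One block cooperates through shared memory; __syncthreads() is the
// block-wide barrier. Assumes blockDim.x is a power of two, at most 1024.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[1024];
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                       // wait until every thread has stored its value

    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            buf[tid] += buf[tid + stride]; // pairwise partial sums
        }
        __syncthreads();                   // barrier before the next halving step
    }
    if (tid == 0) {
        out[blockIdx.x] = buf[0];          // one partial sum per block
    }
}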
