GPU

396 bytes added, 22:04, 4 November 2019
== Grids and NDRange ==
While warps, blocks, wavefronts, and workgroups are concepts that the machine executes, Grids and NDRanges define the scope of the problem a programmer specifies. For example, a 1920x1080 screen could be defined as a Grid with 2,073,600 threads to execute (likely organized as a 2-dimensional 1920x1080 grid for convenience). Specifying those 2,073,600 work items is the purpose of a CUDA Grid or OpenCL NDRange.
The programmer may choose to cut up the 1920x1080 screen into blocks of 32x32 pixels. Or maybe an algorithm is horizontal in nature, and it is more convenient to work with blocks of 1x1024 pixels instead. Or maybe the block sizes have been set by some video standard, and 8x8 blocks (64 threads) are the biggest you can practically work with (say, the 8x8 macroblocks of an MPEG-2 decoder). Regardless, the programmer chooses the block size which is most convenient and optimized for their purposes. To complete this hypothetical example, a 1920x1080 screen could be split up into 60x34 CUDA Blocks (or OpenCL Workgroups), each covering 32x32 pixels with 1024 CUDA Threads (or OpenCL Workitems) each. The device driver will implicitly iterate these blocks over the entire Grid or NDRange to complete the task the programmer has specified, similar to a for-loop.
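The 60x34 figure in the hypothetical example comes from ceiling division, which the following sketch works through in plain Python:

```python
# Splitting a 1920x1080 range into 32x32-pixel blocks. Block counts
# round up (ceiling division), which is why 1080 rows need 34 blocks
# even though 34 * 32 = 1088 slightly overshoots the screen height.
WIDTH, HEIGHT = 1920, 1080
BLOCK_X, BLOCK_Y = 32, 32

blocks_x = -(-WIDTH // BLOCK_X)        # ceil(1920 / 32) = 60
blocks_y = -(-HEIGHT // BLOCK_Y)       # ceil(1080 / 32) = 34
threads_per_block = BLOCK_X * BLOCK_Y  # 32 * 32 = 1024

print(blocks_x, blocks_y, threads_per_block)  # 60 34 1024
```

The overshoot on the last row of blocks is why GPU kernels conventionally guard with a bounds check before touching memory.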
These blocks and workgroups will execute with as much parallel processing as the underlying hardware can support: roughly 150 CUDA Blocks or OpenCL Workgroups at a time on a typical midrange GPU circa 2019 (such as an NVidia 2060 Super or AMD 5700). The most important note is that blocks within a grid (or workgroups within an NDRange) may not execute concurrently with each other; some degree of sequential processing may happen. As such, communication across a Grid or NDRange is difficult to achieve. If thread #0 creates a spinlock waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently, and the code would likely time out. In practice, the easiest mechanism for Grid- or NDRange-sized synchronization is to wait for the kernel to finish executing: the CPU waits and processes the results in between Grids or NDRanges.
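The kernel-boundary synchronization pattern can be sketched as follows (assumed names; plain Python standing in for a GPU command queue). Instead of a spinlock between work items, the host waits for one whole "kernel" to finish before launching the next, so every work item of the second pass can safely read the completed results of the first:

```python
# Hypothetical sketch of grid-wide synchronization via kernel boundaries.
def launch_kernel(kernel, data):
    """Run `kernel` once per index. A real driver would execute the
    blocks in parallel, in an unspecified order."""
    return [kernel(i, data) for i in range(len(data))]

data = [1, 2, 3, 4]
# Pass 0: each work item doubles its own element.
pass0 = launch_kernel(lambda i, d: d[i] * 2, data)
# Implicit grid-wide barrier here: pass 0 is fully done before pass 1.
# Pass 1: each work item reads ALL of pass 0's output (a reduction),
# which would be unsafe within a single kernel launch.
pass1 = launch_kernel(lambda i, d: d[i] + sum(d), pass0)
print(pass1)
```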
For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation is over a 3-dimensional NDRange for <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule different operations for the GPU: convolve, merge, transform and more.
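A 3-dimensional NDRange like the convolve's <channel, output, row_batch> range is just the Cartesian product of its three dimensions; the sketch below illustrates this with made-up dimension sizes (the real sizes depend on the network):

```python
# Hypothetical 3-D NDRange: every <channel, output, row_batch> triple
# becomes one work item. Sizes here are illustrative only.
import itertools

CHANNELS, OUTPUTS, ROW_BATCHES = 4, 8, 2

work_items = list(itertools.product(range(CHANNELS),
                                    range(OUTPUTS),
                                    range(ROW_BATCHES)))
print(len(work_items))  # 4 * 8 * 2 = 64 work items
```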
