= Grids and NDRange =
A typical midrange GPU will "only" be able to process tens of thousands of threads at a time. In practice, the device driver will cut up a Grid or NDRange (usually consisting of millions of items) into Blocks or Workgroups. These Blocks and Workgroups will execute with as much parallel processing as the underlying hardware can support (maybe 10,000 threads at a time on a midrange GPU). The device driver will implicitly iterate these Blocks over the entire Grid or NDRange to complete the task the programmer has specified, similar to a for-loop.
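A minimal CUDA sketch of this decomposition (the kernel name <code>scale</code> and all sizes are illustrative, not taken from any particular engine): the launch below requests 4096 Blocks, far more than a midrange GPU can run at once, and the hardware drains them Block by Block as execution units free up.

<pre>
#include <cuda_runtime.h>

// Each thread handles one item of the Grid; the Grid is far larger than
// the number of threads the GPU can physically run at once.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index within the Grid
    if (i < n)
        data[i] *= factor;
}

int main(void) {
    int n = 1 << 20;                                  // ~1 million work items
    float *data;
    cudaMalloc(&data, n * sizeof(float));

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // 4096 Blocks
    // The driver/hardware iterates these Blocks over the whole Grid,
    // similar to a for-loop, running as many at a time as resources allow.
    scale<<<blocks, threadsPerBlock>>>(data, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(data);
    return 0;
}
</pre>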
Grids and NDRanges can be 1-dimensional, 2-dimensional, or 3-dimensional. 2-dimensional grids are common for screen-space operations such as pixel shaders, while 3-dimensional grids are useful for specifying many operations per pixel (such as a raytracer, which may launch 5000 rays per pixel).

The most important note is that Grids and NDRanges may not execute concurrently with each other; some degree of sequential processing may happen. As such, communication across a Grid or NDRange is difficult to achieve (if thread #0 creates a Spinlock or Mutex waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently with each other, and the code would deadlock). In practice, the easiest mechanism for Grid- or NDRange-sized synchronization is to wait for the kernel to finish executing: to have the CPU wait and process the results in between Grids or NDRanges.
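A sketch of that kernel-boundary synchronization, in CUDA for concreteness (the kernels <code>produce</code> and <code>consume</code> are hypothetical): <code>consume</code> reads elements written by "far away" threads of <code>produce</code>, which would be unsafe inside a single Grid, so the work is split into two launches with the CPU waiting in between.

<pre>
#include <cuda_runtime.h>

__global__ void produce(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] = (float)i;                    // every thread writes its own slot
}

__global__ void consume(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] + in[n - 1 - i];       // reads a slot written by a distant thread
}

void run(float *in, float *out, int n) {
    int tpb = 256, blocks = (n + tpb - 1) / tpb;
    produce<<<blocks, tpb>>>(in, n);
    // The kernel boundary is the Grid-wide synchronization point: spinning
    // inside produce() on another Block's write could deadlock instead.
    cudaDeviceSynchronize();
    consume<<<blocks, tpb>>>(in, out, n);
    cudaDeviceSynchronize();                  // CPU can now read the results
}
</pre>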
For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation runs over a 3-dimensional NDRange <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule the different operations for the GPU: convolve, merge, transform and more. The CPU traverses the MCTS tree and marks off the positions that need CNN evaluation; when the GPU finishes, the underlying API (CUDA or OpenCL) provides asynchronous or synchronous calls to inform the CPU of the completion.
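LeelaZero's kernels are written in OpenCL; the sketch below is a rough CUDA analogue of launching one such 3-dimensional <channel, output, row_batch> range (all names, sizes, and the placeholder body are illustrative assumptions, not LeelaZero's actual code).

<pre>
#include <cuda_runtime.h>

// One work item per (channel, output, row_batch) coordinate of the 3-D Grid.
__global__ void convolve1(const float *in, float *out,
                          int channels, int outputs, int row_batches) {
    int c = blockIdx.x * blockDim.x + threadIdx.x;   // channel
    int o = blockIdx.y * blockDim.y + threadIdx.y;   // output
    int r = blockIdx.z * blockDim.z + threadIdx.z;   // row_batch
    if (c < channels && o < outputs && r < row_batches) {
        int idx = (r * outputs + o) * channels + c;
        out[idx] = in[idx];                  // placeholder for the real convolution math
    }
}

void cnn_forward(const float *in, float *out) {
    int channels = 256, outputs = 256, row_batches = 8;   // illustrative sizes
    dim3 block(8, 8, 4);
    dim3 grid((channels + block.x - 1) / block.x,
              (outputs + block.y - 1) / block.y,
              (row_batches + block.z - 1) / block.z);
    convolve1<<<grid, block>>>(in, out, channels, outputs, row_batches);
    // merge, transform, etc. would follow as further kernel launches; the CPU
    // learns the whole sequence has completed via a synchronous wait here
    // (or an asynchronous event/callback).
    cudaDeviceSynchronize();
}
</pre>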
= Architectures and Physical Hardware =
