== Grids and NDRange ==
While warps, blocks, wavefronts and workgroups are concepts that the machine executes, Grids and NDRanges describe the scope of the problem specified by the programmer. For example, a 1920x1080 screen has 2,073,600 pixels to process, and GPUs are designed such that each of these pixels can get its own thread of execution. A pixel shader over that screen could therefore be defined as a Grid of 2,073,600 threads (likely organized as a 2-dimensional 1920x1080 grid for convenience). Specifying these 2,073,600 work items is the purpose of a CUDA Grid or OpenCL NDRange.
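A minimal CUDA sketch of this idea (the kernel name and the placeholder "shading" work are assumptions for illustration, not from any real pixel shader): one thread is requested per pixel, organized as a 2-dimensional grid covering the 1920x1080 screen.
<pre>
#include <cuda_runtime.h>

// One thread per pixel, organized as a 2-D Grid. Names and the trivial
// "shade" operation are placeholders for illustration only.
__global__ void shadePixel(float *image, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's pixel column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // this thread's pixel row
    if (x < width && y < height) {
        image[y * width + x] = 1.0f;                 // placeholder shading work for this pixel
    }
}

int main() {
    const int width = 1920, height = 1080;           // 2,073,600 pixels in total
    float *d_image;
    cudaMalloc(&d_image, width * height * sizeof(float));
    dim3 block(16, 16);                              // 256 threads per block/workgroup
    dim3 grid((width + block.x - 1) / block.x,       // enough blocks to cover every pixel
              (height + block.y - 1) / block.y);
    shadePixel<<<grid, block>>>(d_image, width, height);
    cudaDeviceSynchronize();
    cudaFree(d_image);
    return 0;
}
</pre>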
A typical midrange GPU will "only" be able to process tens of thousands of threads at a time. In practice, the device driver will cut up a Grid or NDRange (usually consisting of millions of work items) into Blocks or Workgroups. These blocks and workgroups will execute with as much parallel processing as the underlying hardware can support (perhaps 10,000 threads at a time on a midrange GPU). The device driver then implicitly iterates these blocks over the entire Grid or NDRange to complete the task the programmer has specified, similar to a for-loop.
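The same cut-up-and-iterate behaviour can be made explicit in programmer code with a grid-stride loop, a common CUDA idiom (used here purely as an illustration; the kernel name and the halving operation are placeholders). A fixed, hardware-sized number of threads walks over all work items, like a for-loop over the whole Grid.
<pre>
#include <cuda_runtime.h>

// Grid-stride loop: the launch requests far fewer threads than there are work
// items, and each thread loops until the whole range has been covered.
__global__ void scalePixels(float *data, int n, float factor) {
    int stride = gridDim.x * blockDim.x;                       // total threads actually launched
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        data[i] *= factor;                                     // each thread handles many items
    }
}

// Host side: launch roughly the ~10,000 threads a midrange GPU can keep busy,
// rather than one thread per work item, e.g.
//   scalePixels<<<40, 256>>>(d_data, 1920 * 1080, 0.5f);      // 40 * 256 = 10,240 threads
</pre>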
The most important note is that blocks within a Grid (or workgroups within an NDRange) may not execute concurrently with each other; some degree of sequential processing may happen. As such, communication across a Grid or NDRange is difficult to achieve. If thread #0 creates a spinlock waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently, and the code would deadlock or likely time out. In practice, the easiest mechanism for Grid- or NDRange-sized synchronization is to wait for the kernel to finish executing: have the CPU wait and process the results in between Grids or NDRanges.
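A minimal sketch of this pattern (the kernel names and the two-pass structure are assumptions chosen for illustration): the CPU waits for the first kernel to finish, so every block of the first Grid has executed before the results are read or the next Grid is scheduled.
<pre>
#include <cuda_runtime.h>
#include <vector>

__global__ void passOne(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = 0.5f * i;                    // first pass writes every work item
}

__global__ void passTwo(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] += 1.0f;                       // second pass reads those results
}

int main() {
    const int n = 1920 * 1080;
    float *d_buf;
    cudaMalloc(&d_buf, n * sizeof(float));
    int block = 256, grid = (n + block - 1) / block;

    passOne<<<grid, block>>>(d_buf, n);              // schedule pass 1 over the whole Grid
    cudaDeviceSynchronize();                         // CPU waits: every block of pass 1 has run

    std::vector<float> host(n);                      // CPU may now read and process the results
    cudaMemcpy(host.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);

    passTwo<<<grid, block>>>(d_buf, n);              // only then is the next Grid scheduled
    cudaDeviceSynchronize();
    cudaFree(d_buf);
    return 0;
}
</pre>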
For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation runs over a 3-dimensional NDRange of <channel, output, row_batch>. To build up a full CNN operation, the CPU schedules a sequence of operations for the GPU: convolve, merge, transform and more.
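LeelaZero's real kernels are written in OpenCL; the following CUDA-flavoured sketch, with made-up kernel names and layer dimensions, only illustrates how a 3-dimensional range such as <channel, output, row_batch> might be specified and how the CPU chains several kernels.
<pre>
#include <cuda_runtime.h>

// Sketch only: names and sizes are placeholders, not LeelaZero code.
__global__ void convolveLike(float *out, int channels, int outputs, int rowBatches) {
    int channel  = blockIdx.x * blockDim.x + threadIdx.x;     // input channel
    int output   = blockIdx.y * blockDim.y + threadIdx.y;     // output feature map
    int rowBatch = blockIdx.z * blockDim.z + threadIdx.z;     // batch of board rows
    if (channel < channels && output < outputs && rowBatch < rowBatches) {
        int idx = (rowBatch * outputs + output) * channels + channel;
        out[idx] = 0.0f;                                      // placeholder convolution work
    }
}

void scheduleLayer(float *d_out) {
    const int channels = 256, outputs = 256, rowBatches = 8;  // made-up layer dimensions
    dim3 block(8, 8, 4);
    dim3 grid((channels + block.x - 1) / block.x,             // 3-dimensional range:
              (outputs + block.y - 1) / block.y,              // <channel, output, row_batch>
              (rowBatches + block.z - 1) / block.z);
    convolveLike<<<grid, block>>>(d_out, channels, outputs, rowBatches);
    // ... the CPU would then schedule merge/transform kernels in the same way ...
}
</pre>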
