GPU

== Grids and NDRange ==
While warps, blocks, wavefronts and workgroups are concepts that the machine executes, Grids and NDRanges are the scope of the problem specified by a programmer. For example, a 1920x1080 screen has 2,073,600 pixels to process, and GPUs are designed such that each of these pixels could get its own thread of execution. A pixel shader could therefore execute over a Grid of 2,073,600 threads (likely organized as a 2-dimensional 1920x1080 grid for convenience). Specifying these 2,073,600 work items is the purpose of a CUDA Grid or OpenCL NDRange.
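As a minimal sketch of this idea (the kernel name and toy shading logic below are hypothetical, not taken from any real renderer), the following CUDA program launches a 2-dimensional Grid of 16x16 blocks so that each of the 2,073,600 pixels gets its own thread:

<pre>
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel: each thread handles one of the
// 2,073,600 pixels of a 1920x1080 image.
__global__ void shadePixels(uchar4 *pixels, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;   // guard the ragged edges

    // Toy "pixel shader": write a gradient based on position.
    pixels[y * width + x] = make_uchar4(x % 256, y % 256, 128, 255);
}

int main(void)
{
    const int width = 1920, height = 1080;   // 2,073,600 work items
    uchar4 *pixels;
    cudaMalloc(&pixels, width * height * sizeof(uchar4));

    // The Grid: enough 16x16 blocks to cover every pixel.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    shadePixels<<<grid, block>>>(pixels, width, height);
    cudaDeviceSynchronize();

    cudaFree(pixels);
    return 0;
}
</pre>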
A typical midrange GPU will "only" be able to process tens of thousands of threads at a time. In practice, the device driver cuts a Grid or NDRange (usually consisting of millions of items) into Blocks or Workgroups, which execute with as much parallel processing as the underlying hardware can support (maybe 10,000 threads at a time on a midrange GPU). The device driver implicitly iterates these blocks over the entire Grid or NDRange until the task the programmer specified is complete, similar to a for-loop.
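A conceptual host-side model of that implicit for-loop is sketched below. This is illustrative only: runOneBlock is a hypothetical stand-in for executing one block's threads on a SIMD core, and real hardware runs many blocks concurrently, in an unspecified order.

<pre>
#include <cstdio>

// Hypothetical stand-in for running one block's threads on a SIMD core.
static void runOneBlock(unsigned bx, unsigned by)
{
    printf("executing block (%u, %u)\n", bx, by);
}

int main(void)
{
    // 1920x1080 pixels cut into 16x16 blocks: 120 x 68 = 8,160 blocks.
    const unsigned gridX = 120, gridY = 68;

    // The "for-loop" the driver/hardware implicitly performs over the
    // Grid; hardware runs as many of these iterations in parallel as
    // its resources allow.
    for (unsigned by = 0; by < gridY; ++by)
        for (unsigned bx = 0; bx < gridX; ++bx)
            runOneBlock(bx, by);
    return 0;
}
</pre>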
Grids and NDRanges can be 1-dimensional, 2-dimensional or 3-dimensional. 2-dimensional grids are common for screen-space operations such as pixel shaders, while 3-dimensional grids are useful for specifying many operations per pixel (such as a raytracer, which may launch 5000 rays per pixel).

The most important note is that blocks within a Grid (or workgroups within an NDRange) may not execute concurrently with each other. Some degree of sequential processing may happen. As such, communication across a Grid or NDRange is difficult to achieve. If thread #0 creates a Spinlock waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently with each other, and the code would likely time out. In practice, the easiest mechanism for Grid or NDRange sized synchronization is to wait for the kernel to finish executing: to have the CPU wait and process the results in between Grids or NDRanges.

For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation is over a 3-dimensional NDRange <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule different operations for the GPU: convolve, merge, transform and more.
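A sketch of this pattern in CUDA follows. LeelaZero's actual kernels are OpenCL, and the kernel names and toy bodies here are illustrative stand-ins, not its real math; the point is that Grid-wide synchronization falls out of kernel boundaries rather than locks.

<pre>
#include <cuda_runtime.h>

// Illustrative stand-ins for CNN pipeline stages.
__global__ void convolve(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 0.5f;   // toy "convolve"
}

__global__ void merge(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] += in[i];         // toy "merge"
}

int main(void)
{
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));
    cudaMemset(a, 0, n * sizeof(float));
    cudaMemset(c, 0, n * sizeof(float));

    dim3 block(256), grid((n + block.x - 1) / block.x);

    // On the same stream, the second Grid starts only after the first
    // has fully finished: no spinlocks across the Grid are needed.
    convolve<<<grid, block>>>(a, b, n);
    merge<<<grid, block>>>(b, c, n);

    cudaDeviceSynchronize();   // CPU waits here before using results

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
</pre>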
== Architectures and Physical Hardware ==
Each generation, the manufacturers create a series of cards with set vRAM and SIMD cores. The market is split into three categories: server, professional, and consumer. Consumer cards are cheapest and are primarily targeted at the video game market. Professional cards have better driver support for 3D programs like AutoCAD. Finally, server cards provide virtualization services, allowing cloud companies to split their cards virtually between customers.
Consumer class GPUs cost anywhere from $100 to $1000, professional cards can run to $2000, and server cards can cost as much as $10,000. While server and professional class cards have more vRAM, consumer cards are more than adequate starting points for GPU programmers.
GPUs use high-bandwidth RAM, such as GDDR6 or HBM2. These specialized RAMs are designed for the extremely parallel nature of GPUs and can provide 200 GBps to 1000 GBps of throughput. In comparison: a typical DDR4 channel can provide 20 GBps, and a dual-channel desktop will typically have under 50 GBps of bandwidth to DDR4 main memory.
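A rough way to check such numbers on one's own hardware is to time a large device-to-device copy, as in the CUDA sketch below (the buffer size and the read-plus-write accounting are illustrative assumptions, not a rigorous benchmark):

<pre>
#include <cuda_runtime.h>
#include <cstdio>

// Rough sketch: estimate effective device-memory bandwidth by timing
// a large device-to-device copy. Each byte is read once from src and
// written once to dst, hence the factor of 2.
int main(void)
{
    const size_t bytes = (size_t)256 << 20;   // 256 MiB buffers
    float *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("~%.0f GBps effective bandwidth\n",
           (2.0 * bytes) / (ms / 1000.0) / 1e9);

    cudaFree(src); cudaFree(dst);
    return 0;
}
</pre>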
== NVidia ==
NVidia's consumer line of cards is GeForce, branded with RTX or GTX labels. NVidia's professional line of cards is Quadro. Finally, Tesla cards constitute NVidia's server line.
NVidia's "Titan" line of GeForce cards uses consumer drivers, but professional or server class chips. As such, the Titan line can cost anywhere from $1000 to $3000 per card.
