Changes

Jump to: navigation, search

GPU

2,800 bytes added, 22:43, 26 August 2019
The SIMT Programming Model
For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation is over a 3-dimensional NDRange for <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule different operations for the GPU: convolve, merge, transform and more.
 
==Memory Model==
 
OpenCL, CUDA, ROCM, and other GPU-languages all have a similar memory model.
 
* __device__ (CUDA) or __global (OpenCL) memory -- OpenCL __global and CUDA __device__ memory exists on the GPU's VRAM. Any threads can access any part of __device__ or __global memory, although memory-ordering and caching details can get quite complicated if multiple threads simultaneously read and write to a particular memory location. Proper memory ordering with __threadfence() (CUDA) or mem_fence() (OpenCL) is essential to preventing memory-consistency issues.
 
* __constant__ (CUDA) or __constant (OpenCL) memory -- Constants are not allowed to change during the execution of a particular kernel. Historically, this was used by Pixel Shaders as they read texture data. The texture-data could be computed and loaded onto the GPU, but the data was not allowed to change during the Pixel Shader's execution. Both NVidia and AMD GPUs have special caches (and in AMD's case: special registers called sGPRs) which accelerate constant-data, but have difficult caching issues if the data ever changes.
 
* __shared__ (CUDA) or __local__ (OpenCL) memory -- This is highly-accelerated memory regions designed for threads to exchange data within a CUDA Block or OpenCL Workgroup. On AMD Systems, there is more Local "LDS" memory than even L1 Cache (GCN) or L0 Cache (RDNA).
 
* Default (CUDA) or Private (OpenCL) Memory -- Private memory typically maps to a GPU-register, and is inaccessible to other threads. If a kernel requires more memory than what can exist in GPU-registers, the data will automatically spill over into global VRAM (with an associated performance penalty). In practice, this spillover is well interleaved, well-optimized, and reduced to as small a subset as possible through compiler optimizations.
 
 
Here the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>
* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 64 KiB constant memory
* 8 KiB constant cache per compute unit
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)
* 768 KiB L2 cache
* 1.5 GiB to 3 GiB global memory
Here the data for the AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) as an example: <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>
* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
* 64 KiB constant memory
* 16 KiB constant cache per four compute units
* 16 KiB L1 cache per compute unit
* 768 KiB L2 cache
* 3 GiB to 6 GiB global memory
= Architectures and Physical Hardware =

Navigation menu