GPU

The [https://en.wikipedia.org/wiki/Khronos_Group Khronos group] is a committee formed to oversee the [https://en.wikipedia.org/wiki/OpenGL OpenGL], [[OpenCL]], and [https://en.wikipedia.org/wiki/Vulkan_(API) Vulkan] standards. Although compute shaders exist in all of these APIs, OpenCL is the designated general-purpose compute language.
OpenCL 1.2 is widely supported by [[AMD]], [[Nvidia]], and [[Intel]]. OpenCL 2.0, although specified in 2013, has had a slow rollout, and its specific features are not yet widespread in modern GPUs. AMD continues to target OpenCL 2.0 support in its ROCm environment, while Nvidia has implemented some OpenCL 2.0 features.
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]
== Nvidia Software Overview ==
[[Nvidia]] [https://en.wikipedia.org/wiki/CUDA CUDA] is their general-purpose compute framework. CUDA has a [[Cpp|C++]] compiler based on [https://en.wikipedia.org/wiki/LLVM LLVM] / [https://en.wikipedia.org/wiki/Clang clang], which compiles into an assembly-like language called [https://en.wikipedia.org/wiki/Parallel_Thread_Execution PTX]. Nvidia device drivers take PTX and compile it down to the final machine code (called Nvidia SASS). Nvidia keeps PTX portable between its GPUs, while its SASS assembly language may change from year to year as Nvidia releases new GPUs. A defining feature of CUDA is its "single source" C++ compiler: the same compiler works with both CPU host-code and GPU device-code. This means that data structures and even pointers from the CPU can be shared directly with the GPU code.
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
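As a rough illustration of this single-source model, here is a minimal sketch (the struct name Particle, the kernel name advance, and the use of CUDA's unified memory are choices made for this example, not anything prescribed above): one C++ struct definition is visible to both host and device code, and the same pointer can be dereferenced on either side.
<pre>
#include <cuda_runtime.h>
#include <cstdio>

// One struct definition, compiled for both the CPU (host) and the GPU (device).
struct Particle { float x, v; };

// Device code: each thread advances one particle.
__global__ void advance(Particle *p, int n, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x += p[i].v * dt;
}

int main() {
    const int n = 1024;
    Particle *p;
    cudaMallocManaged(&p, n * sizeof(Particle));    // one pointer, usable on CPU and GPU
    for (int i = 0; i < n; i++) p[i] = {0.0f, float(i)};
    advance<<<(n + 255) / 256, 256>>>(p, n, 0.1f);  // host code launching device code
    cudaDeviceSynchronize();
    printf("p[10].x = %f\n", p[10].x);
    cudaFree(p);
    return 0;
}
</pre>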
== AMD Software Overview ==
CUDA, OpenCL, and ROCm HIP all share the same model of implicitly parallel programming. All threads are given an identifier: a threadIdx in CUDA or local_id in OpenCL. Aside from this index, all threads of a kernel execute the same code. The only way to alter the behavior of the code is to use this index to access different data.
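For instance, in the following minimal CUDA sketch (the kernel name saxpy is just for illustration; in OpenCL, get_global_id(0) would play the same role as the index computed below), the thread index is the only thing that distinguishes one thread from another:
<pre>
// Every thread executes exactly this code; only the index i differs per thread.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's unique position
    if (i < n)                                      // guard threads past the end of the data
        y[i] = a * x[i] + y[i];                     // same instruction, different data
}
</pre>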
The executed code is always implicitly [[SIMD]]. Instead of thinking in terms of SIMD lanes, each lane is considered its own thread. The smallest group of threads is called a CUDA Warp or OpenCL Wavefront. Nvidia GPUs execute 32 threads per warp, while AMD GCN GPUs execute 64 threads per wavefront. All threads within a Warp or Wavefront share an instruction pointer. Consider the following CUDA code:
<pre>
if (threadIdx.x == 0) {
    doA();
} else {
    doB();
}
</pre>
While only one thread in the warp has threadIdx.x == 0, all 32 threads of the warp will have their shared instruction pointer execute doA() together. To keep the code semantically correct, threads #1 through #31 will have their Nvidia predicate register cleared (or their AMD execution mask cleared), which means those threads throw away the results of each statement they execute. For those familiar with x64 AVX code, a GPU thread is comparable to a SIMD lane in AVX: all lanes of an AVX instruction execute any particular instruction, but you may throw away the results of some lanes using mask or comparison instructions.
Once doA() is complete, the machine will continue with doB(). In this case, thread #0 will have its execution mask cleared, while threads #1 through #31 actually commit the results of doB().
== Blocks and Workgroups ==
Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. Up to 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (Nvidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both Nvidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
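A hedged sketch of such intra-workgroup cooperation (the kernel name blockSum is made up, and it assumes a block size of 1024 threads): the block stages data in __shared__ memory and synchronizes with __syncthreads(), which compiles down to the single barrier instruction mentioned above.
<pre>
// Each block of 1024 threads reduces 1024 floats from global memory to one sum.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float buf[1024];                    // visible to every thread in this block
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                               // block-wide barrier

    // Tree reduction: halve the number of active threads every step.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];                  // one result per block / workgroup
}
</pre>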
== Grids and NDRange ==
Blocks or workgroups are themselves organized into a larger CUDA Grid or OpenCL NDRange that covers the whole problem. The programmer may choose to cut up a 1920x1080 screen into blocks of 32x32 pixels. Or maybe an algorithm is horizontal in nature, and it is more convenient to work with blocks of 1x1024 pixels instead. Or maybe the block sizes are dictated by some video standard, and 8x8 blocks (64 threads) are the biggest you can practically work with (say, the 8x8 macroblocks of an MPEG-2 decoder). Regardless, the programmer chooses the block size which is most convenient and best optimized for their purposes. To complete this hypothetical example, a 1920x1080 screen could be split into 60x34 CUDA Blocks (or OpenCL Workgroups), each covering 32x32 pixels with 1024 CUDA Threads (or OpenCL Workitems).
These blocks and workgroups will execute with as much parallelism as the underlying hardware can support: roughly 150 CUDA Blocks or OpenCL Workgroups at a time on a typical midrange GPU circa 2019 (such as an Nvidia RTX 2060 Super or AMD RX 5700). The most important note is that blocks within a grid (or workgroups within an NDRange) are not guaranteed to execute concurrently with each other; some degree of sequential processing may happen. If thread #0 takes a spinlock while waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently, and the code would likely time out. In practice, the easiest mechanism for grid- or NDRange-sized synchronization is to wait for the kernel to finish executing: to have the CPU wait and process the results in between grids or NDRanges.
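Under those assumptions, a host-side launch for the 1920x1080 example might look roughly like the sketch below (processPixels and its framebuffer argument are invented for this illustration):
<pre>
#include <cuda_runtime.h>

__global__ void processPixels(uchar4 *fb, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;           // edge blocks cover a few extra pixels
    fb[y * width + x] = make_uchar4(x % 256, y % 256, 0, 255);
}

void runFrame(uchar4 *fb) {
    dim3 block(32, 32);                              // 1024 threads: one 32x32 pixel tile
    dim3 grid((1920 + 31) / 32, (1080 + 31) / 32);   // 60 x 34 blocks for a 1920x1080 frame
    processPixels<<<grid, block>>>(fb, 1920, 1080);
    // Blocks are not guaranteed to run concurrently, so the simplest grid-wide
    // synchronization is to have the CPU wait for the whole kernel to finish.
    cudaDeviceSynchronize();
}
</pre>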
For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation is over a 3-dimensional NDRange for <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule different operations for the GPU: convolve, merge, transform and more.
== Memory Model ==
* __device__ (CUDA) or __global (OpenCL) memory -- OpenCL __global and CUDA __device__ memory exists in the GPU's VRAM. Any thread can access any part of __device__ or __global memory, although memory-ordering and caching details can get quite complicated if multiple threads simultaneously read and write a particular memory location. Proper memory ordering with __threadfence() (CUDA) or mem_fence() (OpenCL) is essential for preventing memory-consistency issues.
* __constant__ (CUDA) or __constant (OpenCL) memory -- Constants are not allowed to change during the execution of a particular kernel. Historically, this was used by pixel shaders as they read texture data: the texture data could be computed and loaded onto the GPU, but was not allowed to change during the pixel shader's execution. Both Nvidia and AMD GPUs have special caches (and, in AMD's case, special registers called sGPRs) which accelerate constant data. The cache associated with this memory space is sometimes called the K$ (konstant cache), and has to be independently flushed if its data ever changes. The main benefit on both AMD and Nvidia systems is that K$ values are broadcast extremely efficiently to all threads in a wavefront, but only if all threads in the wavefront are reading from the same memory location. Instead of having 32 memory reads (Nvidia) or 64 memory reads (AMD GCN), a read from the K$ can be optimized into a single read, broadcast to all 32 or 64 threads of a Warp or Wavefront.
* __shared__ (CUDA) or __local (OpenCL) memory -- These are highly accelerated memory regions designed for threads to exchange data within a CUDA Block or OpenCL Workgroup. On AMD systems, there is more local "LDS" memory than even L1 cache (GCN) or L0 cache (RDNA). A short sketch combining these memory spaces follows this list.
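A minimal sketch tying these memory spaces together, in CUDA terms (all names are invented, and a block size of 16 threads is assumed just to keep the example tiny): weights sits in constant memory and is read at the same index by every thread, so it can be served as a single K$ broadcast; in and out point into global memory; and row is a per-block shared-memory staging area.
<pre>
__device__   float bias;          // global memory (VRAM): visible to all threads of all kernels
__constant__ float weights[16];   // constant memory: read-only while a kernel runs, cached in the K$

// Circular correlation of one 16-element row per block (block size assumed to be 16).
__global__ void correlateRows(const float *in, float *out) {
    __shared__ float row[16];                       // shared memory: one copy per block, on-chip
    int tid = threadIdx.x;

    row[tid] = in[blockIdx.x * 16 + tid];           // stage global memory into shared memory
    __syncthreads();                                // make the staged row visible to the block

    float acc = bias;                               // plain global-memory read
    for (int k = 0; k < 16; k++)
        acc += weights[k] * row[(tid + k) % 16];    // weights[k] is the same address for every
                                                    //   thread: a single K$ broadcast per iteration
    out[blockIdx.x * 16 + tid] = acc;               // results written back to global memory
}
</pre>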
GPUs use high-bandwidth RAM, such as GDDR6 or HBM2. GDDR6 and HBM2 are designed for the extremely parallel nature of GPUs, and can provide 200 GB/s to 1000 GB/s of throughput. In comparison, a typical DDR4 channel can provide 20 GB/s, and a dual-channel desktop will typically have under 50 GB/s of bandwidth to DDR4 main memory.
== Nvidia ==
Nvidia's consumer line of cards is GeForce, branded with RTX or GTX labels. Nvidia's professional line of cards is "Quadro". Finally, Nvidia's server line of cards is "Tesla". Nvidia's "Titan" line of GeForce cards uses consumer drivers, but uses professional or server class chips. As such, the Titan line can cost anywhere from $1000 to $3000 per card.
=== Turing Architecture ===
 
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]
==Throughput Examples==
 
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations per clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>
=Deep Learning=
 
GPUs were originally intended to process matrix multiplications for graphical transformations and rendering. The operations of [[Neural Networks#Convolutional|Convolutional Neural Networks]] can be interpreted as a series of matrix multiplications, so GPUs are a natural fit to parallelize and process CNNs.
GPUs traditionally operated on 32-bit floating point numbers. However, CNNs can make do with 16-bit half floats (FP16), or even 8-bit or 4-bit numbers. One thousand single-precision floats take up 4 kB of space, while one thousand FP16 values take up only 2 kB. A half float uses half the memory, half the memory bandwidth, and half the space in caches. As such, GPUs such as AMD Vega and Nvidia Volta added support for FP16 processing.
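A small sketch of what FP16 code can look like in CUDA (the kernel name is made up, and native half arithmetic requires hardware of roughly the Volta or Vega class mentioned above): the half2 type packs two 16-bit values into 32 bits, so one instruction processes two numbers while moving half the data.
<pre>
#include <cuda_fp16.h>

// y[i] = a * x[i] + y[i], operating on packed pairs of 16-bit floats.
__global__ void haxpy2(int nPairs, __half2 a, const __half2 *x, __half2 *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPairs)
        y[i] = __hfma2(a, x[i], y[i]);   // fused multiply-add on two half-precision values at once
}
</pre>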
Specialized units, such as Nvidia Volta's "Tensor Cores", can perform an entire 4x4 block of FP16 matrix multiplications in just one PTX assembly-language statement. It is with these instructions that CNN operations are accelerated.
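A hedged sketch of how these units are reached from CUDA: the warp-level WMMA API below works on 16x16x16 tiles, which the compiler lowers to the wmma/mma PTX instructions; the kernel name is invented, and it assumes a single full warp executes it.
<pre>
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp multiplies a 16x16 FP16 tile of A with a 16x16 FP16 tile of B,
// accumulating into a 16x16 FP32 tile of C on the tensor cores.
__global__ void tileMma(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);                    // zero the accumulator tile
    wmma::load_matrix_sync(a, A, 16);                // leading dimension of 16
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);                      // the tensor-core multiply-accumulate
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
</pre>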
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNNs), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. They also affected game-playing programs combining CNNs with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source projects [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]], and its [[Leela Chess Zero]] adaptation.
= History = 
In the early 1980s, chips such as the [https://en.wikipedia.org/wiki/ANTIC ANTIC] or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise] took over a lot of the mechanics of video processing (waiting for scanlines and handling other TV or monitor signals). In the 1990s, 3d graphics and 3d modeling became more popular, especially for video games. Cards specifically designed to accelerate 3d math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to render 3d graphics. Other engines, such as Quake, would instead use the SIMD capabilities of CPUs, such as the Intel MMX instruction set.
The large number of regular matrix multiplications led to natural SIMD-style algorithms. The 3d graphics community drew upon the rich history of vector compute and SIMD compute from 1970s and 1980s supercomputers. As such, many publications relating to Cray vector supercomputers or the Connection Machine easily apply to modern GPUs. For example, all of the algorithms described in the 1986 publication "[https://dl.acm.org/citation.cfm?id=7903 Data Parallel Algorithms]" can be executed efficiently on a modern GPU workgroup (roughly ~256 GPU threads). It is a beginner-level algorithms paper, demonstrating simple and efficient parallel prefix sums, parallel linked-list traversal, and parallel RegEx matching on the 4096-way parallel Connection Machine-2 supercomputer.
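To give a flavour of those building blocks, here is a hedged sketch of the simple (Hillis-Steele style) parallel prefix sum from that era, written for a single CUDA block of 256 threads; the double-buffered shared array avoids read/write races, and all names are invented for this illustration.
<pre>
// Inclusive prefix sum of 256 values inside one block (O(n log n) work),
// in the spirit of the 1986 "Data Parallel Algorithms" scan.
__global__ void scan256(const int *in, int *out) {
    __shared__ int buf[2][256];
    int tid = threadIdx.x;
    int cur = 0;

    buf[cur][tid] = in[blockIdx.x * 256 + tid];
    __syncthreads();

    for (int offset = 1; offset < 256; offset *= 2) {
        int nxt = 1 - cur;                          // write into the other buffer to avoid races
        buf[nxt][tid] = (tid >= offset)
                      ? buf[cur][tid] + buf[cur][tid - offset]
                      : buf[cur][tid];
        cur = nxt;
        __syncthreads();
    }
    out[blockIdx.x * 256 + tid] = buf[cur][tid];
}
</pre>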
Modern papers on GPUs, such as Nvidia's excellent [https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html Chapter 39. Parallel Prefix Sum (Scan) with CUDA (GPU Gems 3)], are built on top of these papers from the 1980s and 1990s. As such, the beginner will find it far easier to read the papers from the 1980s or 90s before attempting to read a modern piece like GPU Gems 3.
=Chess Engines=
[[Category:GPU]]
=See also=
* [[Deep Learning]]
** [[AlphaGo]]
** [[AlphaZero]]
** [[Neural Networks#Convolutional|Convolutional Neural Networks]]
** [[Leela Zero]]
** [[Leela Chess Zero]]
* [[FPGA]]
* [[Graphics Programming]]
* [[SIMD and SWAR Techniques]]
* [[Thread]]
* [[Zeta]]
=Publications=
