'''GPU''' (Graphics Processing Unit),<br/>
a specialized processor primarily designed for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may offer more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs], but require a specialized, massively parallel style of programming. [[Leela Chess Zero]] has demonstrated that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) combined with [[Deep Learning|deep learning]] works well on GPU architectures.
=GPGPU=
The traditional job of a GPU is to take the [https://en.wikipedia.org/wiki/Three-dimensional_space x,y,z coordinates] of [https://en.wikipedia.org/wiki/Triangle_strip triangles], and [https://en.wikipedia.org/wiki/3D_projection map] these triangles to [https://en.wikipedia.org/wiki/Glossary_of_computer_graphics#screen_space screen-space] through a [https://en.wikipedia.org/wiki/Matrix_multiplication matrix multiplication]. As video game graphics grew more sophisticated, the number of triangles per scene grew larger. GPUs similarly grew into massively parallel behemoths capable of performing billions of transformations hundreds of times per second.
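As a simplified sketch of that fixed-function job (here <math>M</math> stands for the combined model-view-projection matrix, an illustrative name, and the divide by <math>w'</math> is the usual perspective division): each vertex is lifted to homogeneous coordinates, multiplied by <math>M</math>, and mapped to screen coordinates:

<math>\begin{pmatrix} x' \\ y' \\ z' \\ w' \end{pmatrix} = M \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}, \qquad (x_s, y_s) = \left( \frac{x'}{w'}, \frac{y'}{w'} \right)</math>

Real pipelines add clipping and a viewport transform on top of this, but the per-vertex work is essentially this one matrix multiplication, repeated millions of times per frame.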
These lists of triangles were specified in graphics APIs like [https://en.wikipedia.org/wiki/DirectX DirectX]. But video game programmers demanded more flexibility from their hardware for effects such as lighting, transparency, and reflections. This flexibility was granted through specialized programming languages, called [https://en.wikipedia.org/wiki/Shader#Vertex_shaders vertex shaders] or [https://en.wikipedia.org/wiki/Shader#Pixel_shaders pixel shaders].
Eventually, the fixed functionality of GPUs disappeared, and GPUs became primarily massively parallel general purpose computers. Instead of vertex shaders embedded in DirectX, general purpose compute languages were designed to make sense outside of a graphical setting.
== Khronos OpenCL ==
The [https://en.wikipedia.org/wiki/Khronos_Group Khronos group] is a committee formed to oversee the [https://en.wikipedia.org/wiki/OpenGL OpenGL], [[OpenCL]], and [https://en.wikipedia.org/wiki/Vulkan_(API) Vulkan] standards. Although compute shaders exist in the graphics APIs as well, OpenCL is Khronos' designated general purpose compute language.
OpenCL 1.2 is widely supported by [[AMD]], [[Nvidia|NVidia]], and [[Intel]]. OpenCL 2.0, although specified in 2013, has had a slow rollout, and its features are not yet widespread in modern GPUs. AMD continues to target OpenCL 2.0 support in its ROCm environment, while NVidia has implemented some OpenCL 2.0 features.
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]
* [https://www.khronos.org/registry/OpenCL/sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]
== NVidia Software Overview ==
[[Nvidia|NVidia]] [https://en.wikipedia.org/wiki/CUDA CUDA] is their general purpose compute framework. CUDA has a [[Cpp|C++]] compiler based on [https://en.wikipedia.org/wiki/LLVM LLVM] / [https://en.wikipedia.org/wiki/Clang clang], which compiles into an assembly-like language called [https://en.wikipedia.org/wiki/Parallel_Thread_Execution PTX]. NVidia device drivers take PTX and compile it down to the final machine code (called NVidia SASS). NVidia keeps PTX portable between its GPUs, while its SASS assembly language may change from year to year as NVidia releases new GPUs. A defining feature of CUDA is its "single source" C++ compiler: the same compiler handles both CPU host code and GPU device code. This means that data structures and even pointers from the CPU can be shared directly with the GPU code.
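A minimal sketch of the single-source model (the kernel and variable names here are illustrative, not part of any CUDA API): host and device code live in the same .cu file, and a pointer obtained from cudaMallocManaged (unified memory) is valid on both sides:

 #include <cstdio>
 #include <cuda_runtime.h>
 
 // Device code: each thread scales one element of the shared array.
 __global__ void scale(float *data, float factor, int n) {
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n) data[i] *= factor;
 }
 
 int main() {
     const int n = 1024;
     float *data;
     cudaMallocManaged(&data, n * sizeof(float));    // one pointer, visible to CPU and GPU
     for (int i = 0; i < n; i++) data[i] = 1.0f;     // host writes
     scale<<<(n + 255) / 256, 256>>>(data, 2.0f, n); // GPU kernel launch
     cudaDeviceSynchronize();                        // wait for the device to finish
     printf("data[0] = %f\n", data[0]);              // host reads the result back
     cudaFree(data);
     return 0;
 }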
* [https://developer.nvidia.com/cuda-zone NVidia CUDA Zone]
== AMD Software Overview ==
[[AMD|AMD's]] original software stack, called [https://en.wikipedia.org/wiki/AMDGPU AMDGPU-pro], provides OpenCL 1.2 and 2.0 capabilities on [[Linux]] and [[Windows]]. However, most of AMD's effort today goes into an experimental framework called [https://en.wikipedia.org/wiki/OpenCL#Implementations ROCm]. ROCm is AMD's open source compiler and device driver stack intended for general purpose compute. ROCm supports two languages: [https://en.wikipedia.org/wiki/GPUOpen#AMD_Boltzmann_Initiative HIP] (a CUDA-like single-source C++ compiler, also based on LLVM/clang) and OpenCL 2.0. ROCm only works on Linux machines with modern hardware, such as [https://en.wikipedia.org/wiki/PCI_Express#3.0 PCIe 3.0] and relatively recent GPUs (such as the [https://en.wikipedia.org/wiki/AMD_Radeon_500_series RX 580] and [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] GPUs).
AMD regularly publishes the assembly language details of their architectures. Their "GCN Assembly" changes slightly from generation to generation, but the fundamental principles have remained the same.
CUDA, OpenCL, and ROCm HIP all share the same model of implicitly parallel programming. Every thread is given an identifier: a threadIdx in CUDA or a local_id in OpenCL. Aside from this index, all threads of a kernel execute the same code. The only way to alter the behavior of the code is to use this index to access different data.
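For example, a sketch in CUDA (the kernel name and arrays are illustrative): every thread runs the identical body, and only the computed index decides which element it touches:

 __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
     // Identical code for every thread; only the index differs.
     int i = blockIdx.x * blockDim.x + threadIdx.x;
     if (i < n)              // guard: the last block may be partially filled
         c[i] = a[i] + b[i];
 }

An OpenCL kernel would look almost the same, with get_global_id(0) playing the role of the blockIdx/threadIdx arithmetic.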
The executed code is always implicitly [[SIMD_Techniques|SIMD]]. Instead of thinking in terms of SIMD lanes, each lane is considered its own thread. The smallest group of threads is called a CUDA Warp, or an OpenCL Wavefront. NVidia GPUs execute 32 threads per warp, while AMD GCN GPUs execute 64 threads per wavefront. All threads within a Warp or Wavefront share an instruction pointer. Consider the following CUDA code (the calls doA() and doB() stand in for arbitrary work):
 if (threadIdx.x == 0) {
     doA(); // hypothetical work: only thread 0 takes this branch
 } else {
     doB(); // hypothetical work: the other 31 threads take this branch
 }
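Because the whole warp shares one instruction pointer, the hardware cannot send thread 0 down one path and the remaining threads down the other simultaneously: it executes both branches one after the other, masking off the threads that did not take the current branch. Avoiding this kind of branch divergence within a warp is a central concern when writing GPU kernels.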
