'''[[Main Page|Home]] * [[Hardware]] * GPU'''
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] GPU <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]
'''GPU''' (Graphics Processing Unit),<br/>
a specialized processor initially intended for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and massively parallel way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.
=History=
In the 1970s and 1980s RAM was expensive and home computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like the [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math emerged, such as [https://en.wikipedia.org/wiki/IMPACT_(computer_graphics) SGI Impact] (1995) in 3D graphics workstations or [https://en.wikipedia.org/wiki/3dfx#Voodoo_Graphics_PCI 3dfx Voodoo] (1996) for playing 3D games on PCs. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD capabilities]] of CPUs, such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]], for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D-capable chip [https://en.wikipedia.org/wiki/PlayStation_technical_specifications#Graphics_processing_unit_(GPU) GTE] used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like the [https://en.wikipedia.org/wiki/NV1 NV1] (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.
=GPGPU=
== Khronos OpenCL ==
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]
== AMD ==
[[AMD]] supports language frontends like OpenCL, HIP, C++ AMP, and OpenMP offload directives. It offers with [https://rocmdocs.amd.com/en/latest/ ROCm] its own parallel compute platform.
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]
* [https://rocmdocs.amd.com/en/latest/index.html AMD ROCm™ documentation]
* [https://manualzz.com/doc/o/cggy6/amd-opencl-programming-user-guide-contents AMD OpenCL Programming Guide]
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]
* [https://gpuopen.com/amd-isa-documentation/ AMD GPU ISA documentation]

== Apple ==
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].
* [https://developer.apple.com/opencl/ Apple OpenCL Developer]
* [https://developer.apple.com/metal/ Apple Metal Developer]
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]

== Intel ==
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.

== Nvidia ==
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]].
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX]
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
== Further ==
* [https://en.wikipedia.org/wiki/Vulkan#Planned_features Vulkan] (OpenGL successor of Khronos Group)
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)
=Hardware Model=
{| class="wikitable" style="margin:auto"
|+ Vendor Terminology
|-
! AMD Terminology !! Nvidia Terminology
|-
| Compute Unit || Streaming Multiprocessor
|-
| Stream Core || CUDA Core
|-
| Wavefront || Warp
|}
=Programming Model=
GPUs are programmed in a [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] (single instruction, multiple threads) fashion: work-items (threads) are grouped into work-groups (blocks), which are dispatched over an NDRange (grid).
{| class="wikitable" style= Grids and "margin:auto"|+ Terminology|-! OpenCL Terminology !! CUDA Terminology|-| Kernel || Kernel|-| Compute Unit || Streaming Multiprocessor|-| Processing Element || CUDA Core|-| Work-Item || Thread|-| Work-Group || Block|-| NDRange ==|| Grid|-|}
=Memory Model=
OpenCL offers the following memory model for the programmer:
* __private - usually registers, accessible only by a single work-item resp. thread
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block
* __constant - read-only memory
* __global - usually VRAM, accessible by all work-items resp. threads

The CUDA equivalent of __local is __shared__ memory: a highly-accelerated memory region designed for threads to exchange data within a CUDA block resp. OpenCL work-group. On AMD systems there is more local "LDS" memory than even L1 cache (GCN) resp. L0 cache (RDNA).
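A minimal CUDA sketch of this model (assuming a launch with 256 threads per block): each block stages data from __global memory into its __shared__ (OpenCL: __local) scratch-pad and reduces it there before one work-item writes the result back.

<pre>
// Each block loads a tile into __shared__ memory (OpenCL __local),
// reduces it, and thread 0 writes the block's partial sum to __global memory.
__global__ void blockSum(const float* in, float* partial, int n) {
  __shared__ float tile[256];                     // scratch-pad per work-group
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // load from __global (VRAM)
  __syncthreads();                                // barrier across the work-group
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in fast LDS/shared
    if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
    __syncthreads();
  }
  if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}
</pre>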
==Memory Examples==
Nvidia GeForce GTX 580 (Fermi)
* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 8 KiB constant cache per compute unit
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)
* 768 KiB L2 cache in total
* 1.5 GiB to 3 GiB global memory

AMD Radeon HD 7970 (GCN)
* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
* 16 KiB constant cache per four compute units
* 16 KiB L1 cache per compute unit
* 768 KiB L2 cache in total
* 3 GiB to 6 GiB global memory
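Rather than hardcoding such figures, they can be queried at runtime. A minimal CUDA sketch using cudaGetDeviceProperties (field names are from the CUDA runtime API):

<pre>
#include <cstdio>
#include <cuda_runtime.h>

// Print the memory sizes listed above for device 0.
int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("registers per block (private): %d\n", prop.regsPerBlock);
  printf("shared (local) mem per block:  %zu KiB\n", prop.sharedMemPerBlock / 1024);
  printf("constant memory:               %zu KiB\n", prop.totalConstMem / 1024);
  printf("L2 cache:                      %d KiB\n", prop.l2CacheSize / 1024);
  printf("global memory:                 %zu MiB\n", prop.totalGlobalMem / (1024 * 1024));
  return 0;
}
</pre>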
=Unified Memory=
Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
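A minimal CUDA sketch of such a unified address space via managed memory (cudaMallocManaged); on discrete GPUs the driver migrates pages behind the scenes, so this is a programming convenience rather than a zero-copy guarantee:

<pre>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* data, int n, float f) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= f;
}

int main() {
  const int n = 1 << 20;
  float* data = nullptr;
  cudaMallocManaged(&data, n * sizeof(float));    // one allocation, visible to host and device
  for (int i = 0; i < n; i++) data[i] = 1.0f;     // host writes directly
  scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f); // device uses the same pointer
  cudaDeviceSynchronize();                        // wait before the host touches data again
  printf("data[0] = %f\n", data[0]);              // host reads result, no explicit copy
  cudaFree(data);
  return 0;
}
</pre>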
=Instruction Throughput=
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.

==Integer Instruction Throughput==
* INT32: The 32-bit integer performance can, depending on architecture and operation, be less than the 32-bit floating-point or 24-bit integer performance.

==Floating-Point Instruction Throughput==
* FP64: Consumer GPUs in general have a lower double-precision (64-bit) floating-point operation throughput than server brand GPUs, with FP32:FP64 ratios like 4:1 down to 32:1, compared to 2:1 up to 4:1 on server brands.
* FP16: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.

==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>

 MAD 16
 MUL 16
 ADD 32
 Bit-shift 16
 Bitwise XOR 32

Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>

 MAD 1/4
 MUL 1/4
 ADD 1
 Bit-shift 1
 Bitwise XOR 1

Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec

=Tensors=
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.

==Nvidia TensorCores==
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref>

==AMD Matrix Cores==
: AMD released in 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA3 architecture adds support for FP8 and sparse matrix data (sparsity).

==Intel XMX Cores==
: Intel added XMX, Xe Matrix eXtensions, cores to some of the [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] GPU series, like [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist] and the [https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html Intel Data Center GPU Max Series].
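To make the MMAC idea concrete, a minimal CUDA sketch driving Nvidia TensorCores through the WMMA intrinsics (compute capability 7.0+; one warp, i.e. a launch with at least 32 threads, computes a 16x16x16 FP16xFP16+FP32 tile). Libraries like cuBLAS/cuDNN normally hide this level:

<pre>
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp (launch e.g. <<<1, 32>>>) computes D = A*B + 0 for 16x16 tiles,
// FP16 inputs with FP32 accumulation, executed on the TensorCores.
__global__ void wmma16x16(const half* a, const half* b, float* d) {
  wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
  wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
  wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc;
  wmma::fill_fragment(acc, 0.0f);        // zero the accumulator tile
  wmma::load_matrix_sync(fa, a, 16);     // leading dimension 16
  wmma::load_matrix_sync(fb, b, 16);
  wmma::mma_sync(acc, fa, fb, acc);      // one TensorCore MMA step
  wmma::store_matrix_sync(d, acc, 16, wmma::mem_row_major);
}
</pre>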
=Host-Device Latencies=
One reason GPUs are not used as accelerators for chess engines is the host-device latency, a.k.a. kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]], AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
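A minimal CUDA timing sketch for this null-kernel overhead (measured figures vary by driver, OS and hardware; back-to-back launches in one stream measure average per-launch cost rather than isolated latency):

<pre>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void nullKernel() {}  // empty kernel: measures pure launch overhead

int main() {
  const int launches = 1000;
  nullKernel<<<1, 1>>>();        // warm-up launch
  cudaDeviceSynchronize();
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  cudaEventRecord(start);
  for (int i = 0; i < launches; i++) nullKernel<<<1, 1>>>();
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("avg per-launch overhead: %.2f microseconds\n", 1000.0f * ms / launches);
  return 0;
}
</pre>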
=Deep Learning=
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaptation.

=Architectures=
The market is split into two categories, integrated and discrete GPUs; the former are the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads, each brand offering different feature sets in driver, VRAM, or computation abilities.
== AMD ==
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional, and Radeon Instinct for server.
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
=== CDNA3 ===
The CDNA3 HPC architecture was unveiled in December 2023, with the MI300A APU model (CPU+GPU+HBM) and the MI300X GPU model, both of multi-chip-module design. It features Matrix Cores with support for a broad range of precisions (INT8, FP8, BF16, FP16, TF32, FP32, FP64) as well as sparse matrix data (sparsity), and is supported by AMD's ROCm open software stack for AMD Instinct accelerators.
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf AMD CDNA3 Whitepaper]
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf AMD Instinct MI300/CDNA3 Instruction Set Architecture]
* [https://www.amd.com/en/developer/resources/rocm-hub.html AMD ROCm Developer Hub]
=== Navi 3x RDNA3 ===
RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation acceleration.
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]
=== CDNA2 ===
The CDNA2 architecture in the MI200 HPC-GPU, with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric, was unveiled in November 2021.
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]
=== CDNA ===
The server-class CDNA architecture with Matrix Cores was unveiled in 2020.
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]
=== Navi 2x RDNA2 ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA2] cards were unveiled on October 28, 2020.
* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
=== Navi RDNA ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA] cards were unveiled on July 7, 2019.
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set Architecture]
* [https://en.wikipedia.org/wiki/Radeon_RX_5000_series Radeon RX 5000 series on Wikipedia]

RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads. Compute units have 2x32-wide SIMD units, each of which executes 32 threads per clock tick. A Wave64 workgroup will execute on a single SIMD unit, but over two clock ticks. Note that Wave32 instructions still have 5 cycles of latency before registers can be reused, so a Wave64 executing over two clock ticks will have fewer stalls than a Wave32.
=== Vega GCN 5th gen ===
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017. Vega is the last in the line of the GCN architecture: 64 threads per wavefront. Each compute unit contains 4x SIMD units, supporting a total of 40 wavefronts per compute unit (a queue of 10 wavefronts per SIMD unit). Each SIMD unit contains 16 vALUs for general compute + 1 sALU for branching and constant logic. Each SIMD unit executes the same instruction over four clock ticks (16 vALUs x 4 clock ticks == 64 threads per wavefront). Vega specifically added packed FP16 instructions, such as dot-product, packed add and packed multiply. From a programming level, these packed FP16 instructions are SIMD-within-SIMD: each SIMD thread can operate its own SIMD FP16 instruction, akin to AVX or SSE on the x64 architecture.
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set Architecture]
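Packed FP16 is not AMD-specific; as a minimal illustration of the same SIMD-within-SIMD idea in CUDA, the __half2 type packs two FP16 lanes into one 32-bit register, processed by a single instruction:

<pre>
#include <cuda_fp16.h>

// Each 32-bit __half2 register packs two FP16 values; __hfma2 issues one
// fused multiply-add that operates on both lanes at once (SIMD-within-SIMD).
__global__ void packedFma(const __half2* a, const __half2* b,
                          const __half2* c, __half2* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) out[i] = __hfma2(a[i], b[i], c[i]);  // out = a * b + c, two lanes
}
</pre>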
=== Polaris GCN 4th gen ===
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016 under the AMD Radeon 400 series name.
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
* [https://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf GCN3/4 Instruction Set Architecture]
== ARM ==
The Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.
=== Bifrost (2016) ===
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]
=== Midgard (2012) ===
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]
== Intel ==
=== Xe ===
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]
* [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist series on Wikipedia]
== Nvidia ==
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]

=== Grace Hopper Superchip ===
The Nvidia GH200 Grace Hopper Superchip was unveiled in August 2023 and combines the Nvidia Grace CPU ([[ARM|ARM v9]]) and Nvidia Hopper GPU architectures via NVLink to deliver a CPU+GPU coherent memory model for accelerated AI and HPC applications.
* [https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip NVIDIA Grace Hopper Superchip Data Sheet]
* [https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper NVIDIA Grace Hopper Superchip Architecture Whitepaper]
=== Ada Lovelace Architecture ===
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.
* FP64[https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]* [https: Consumer GPUs have in general a lower ratio (FP32//www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]* [https:FP64) for double//docs.nvidia.com/cuda/ampere-precision (64 bit) floating point operations than server brand GPUs, like 4:1 down to 32:1 compared to 2:1 to 4:1tuning-guide/index.html Ampere GPU Architecture Tuning Guide]
=== Turing Architecture ===
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018, with 2nd gen TensorCores; the GTX 16 series (like the GTX 1660) is a low-end line without TensorCores or RT cores.

=== Volta Architecture ===
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting 4x4 FP16 matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]
=== Pascal Architecture ===
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.
* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]

=== Maxwell Architecture ===
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.
* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF GeForce GTX 980 Whitepaper on archive.org]
== PowerVR ==
PowerVR (Imagination Technologies) licenses GPU IP used in various mobile SoCs.
== Qualcomm ==
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.
=== Adreno ===
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]
=Publications=
==2010 ...==
'''2013'''
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]
'''2014'''
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]
==2015 ...==
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE#Computer|IEEE Computer]], Vol. 48, No. 11
'''2016'''
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref>
=Forum Posts=
==2020 ...==
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
* [https://talkchess.com/forum3/viewtopic.php?f=7&t=72566&p=955538#p955538 Re: China boosts in silicon...] by [[Srdja Matovic]], [[CCC]], January 13, 2024
=External Links=
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]
* [https://developer.nvidia.com/ NVIDIA Developer]

=References=
<references />