<p>GPU, Chessprogramming wiki, revision of 2022-11-25 by Smatovic: /* Unified Memory */</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]] <br />
<br />
'''GPU''' (Graphics Processing Unit),<br/><br />
a specialized processor primarily intended for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may offer more raw computing power than general-purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs], but require a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology works with GPU architectures.<br />
<br />
=History=<br />
In the 1970s and 1980s RAM was expensive, and home computers used custom graphics chips that operated directly on registers/memory without a dedicated frame buffer or texture buffer, like the [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially in video games. Cards designed specifically to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were adopted by the video game community for 3D graphics. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD capabilities]] of CPUs, such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]], for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D-capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like the NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], as in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained popularity.<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on the CPU and offload the sub-trees to the GPU.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives, and offers with [https://rocmdocs.amd.com/en/latest/ ROCm] its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran and OpenCL, as well as offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled into a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have an architecture-dependent number of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. Note the difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to further speed up neural networks.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN])<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
Generalized, the Wavefront (AMD) or Warp (Nvidia) size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architectural limits on how many work-items a work-group can hold and how many threads can run concurrently on the device in total.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
<br />
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 32<br />
* Maximum number of resident warps per multiprocessor: 64<br />
* Maximum number of resident threads per multiprocessor: 2048<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item (thread).<br />
* __local - scratch-pad memory shared across the work-items of a work-group (threads of a block).<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items (threads).<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|-<br />
| Constant Memory || Constant Memory<br />
|-<br />
| Global Memory || Global Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be copied between a CPU host and a discrete GPU device, but some combinations of architecture, vendor, framework and operating system offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, 32-bit integer performance can be lower than 32-bit floating-point or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general, [https://en.wikipedia.org/wiki/Processor_register registers] and vector [https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer-brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput at lower precision, quadrupling INT8 or octupling INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs in general have a lower double-precision (64-bit) floating-point throughput relative to FP32 than server-brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer-brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server-brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced: FP16xFP16+FP32 matrix-multiply-accumulate units used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd-gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds optimized FP64 throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implementing and training [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This also affected game-playing programs combining CNNs with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The former is the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server markets.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
The RDNA 3 architecture in the Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
The CDNA 2 architecture in the MI200 HPC-GPU, with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric, was unveiled in November 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
The CDNA architecture in the MI100 HPC-GPU with Matrix Cores was unveiled in November 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found in various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server markets.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system on a chip (SoC) designs. Since Series5 SGX, OpenCL support is available via licensees.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov , [[Computer Chess Forums|Rybka Forum]], October 09, 2017 <br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26648GPU2022-11-25T11:30:53Z<p>Smatovic: /* Memory Model */</p>
<hr />
<div>
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook] and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for its different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. Depending on the architecture, the SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed-bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
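The latency-hiding scheme described above comes down to simple arithmetic: enough SIMT waves must be resident on a SIMD unit that, while one wave waits on memory, the others have ALU work to issue. The following is an illustrative, vendor-neutral Python sketch; the function name and cycle counts are made up for the example:<br />

```python
# Illustrative model of SIMT latency hiding (not specific to any vendor):
# how many resident waves a SIMD unit needs so it never idles while one
# wave waits on a memory access.

def waves_to_hide_latency(mem_latency_cycles, alu_cycles_per_wave):
    # One wave is stalled on memory; the remaining waves must together
    # supply at least mem_latency_cycles of ALU work (ceiling division).
    return 1 + -(-mem_latency_cycles // alu_cycles_per_wave)

# e.g. a 400-cycle VRAM access hidden by waves with 10 cycles of ALU work each
assert waves_to_hide_latency(400, 10) == 41
```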
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN])<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
Generalized, the Wavefront resp. Warp size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload-directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture-dependent limits on how many work-items a work-group can hold and how many threads can run concurrently on the whole device.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
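The mapping between the two terminologies can be made concrete with the standard index computation: a work-item's (thread's) global id is derived from its work-group (block) id and its local id within the group. The following Python sketch is illustrative only; the names are not part of any API:<br />

```python
# Illustrative sketch of the OpenCL/CUDA index mapping:
# global id = work-group (block) id * work-group (block) size + local id.

def global_id(group_id, local_size, local_id):
    return group_id * local_size + local_id

# An NDRange (grid) of 4 work-groups with 64 work-items each
# enumerates 256 distinct global ids.
local_size, num_groups = 64, 4
ids = [global_id(g, local_size, l)
       for g in range(num_groups) for l in range(local_size)]
assert ids == list(range(num_groups * local_size))
```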
<br />
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 32<br />
* Maximum number of resident warps per multiprocessor: 64<br />
* Maximum number of resident threads per multiprocessor: 2048<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
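The resident-thread limits listed above are internally consistent: the maximum number of resident threads (work-items) per multiprocessor (compute unit) equals the number of resident warps (wavefronts) times the warp (wavefront) size. A small Python check using the figures from both examples:<br />

```python
# Resident threads per multiprocessor/compute unit follow from
# resident warps (wavefronts) times the warp (wavefront) size.

def resident_threads(resident_warps, warp_size):
    return resident_warps * warp_size

assert resident_threads(64, 32) == 2048   # GTX 580 (Fermi): 64 warps x 32
assert resident_threads(40, 64) == 2560   # HD 7970 (GCN): 40 wavefronts x 64
```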
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
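As a CPU-side sketch of this memory model (illustrative Python, not an OpenCL API): each work-group pulls its slice of __global data into a fast __local copy, performs a group-wide reduction there, and one work-item writes the result back to __global memory:<br />

```python
# Illustrative CPU model of the OpenCL memory regions: per-work-group
# reductions happen in __local scratch-pad memory, results go to __global.

def reduce_ndrange(global_data, work_group_size):
    results = []                                            # __global output
    for start in range(0, len(global_data), work_group_size):
        local_mem = global_data[start:start + work_group_size]  # __local copy
        results.append(sum(local_mem))   # work-group-wide reduction
    return results

# 128 work-items in groups of 64 -> two partial sums
assert reduce_ndrange([1] * 128, 64) == [64, 64]
```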
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|-<br />
| Constant Memory || Constant Memory<br />
|-<br />
| Global Memory || Global Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: The 32-bit integer performance can be, depending on architecture and operation, less than the 32-bit FLOP or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision, quadrupling INT8 or octupling INT4 throughput compared to INT32.<br />
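The INT64 emulation mentioned above amounts to splitting a 64-bit add into a low-word add plus a high-word add-with-carry, i.e. two 32-bit ALU operations. An illustrative Python model (wrapping modulo 2^64 like the hardware would):<br />

```python
# Illustrative model of 64-bit integer addition emulated on 32-bit ALUs:
# add the low 32-bit words, propagate the carry into the high-word add.

MASK32 = 0xFFFFFFFF

def add64_via_32(a, b):
    lo = (a & MASK32) + (b & MASK32)
    carry = lo >> 32
    hi = ((a >> 32) + (b >> 32) + carry) & MASK32
    return (hi << 32) | (lo & MASK32)

assert add64_via_32(0xFFFFFFFF, 1) == 0x100000000   # carry into high word
assert add64_via_32(2**63, 2**63) == 0              # wraps modulo 2**64
```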
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
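Counting one FMA as two floating-point operations, peak FP32 throughput follows from cores x 2 x clock. Using the GTX 580 figures quoted earlier on this page (512 CUDA cores @ 1.544 GHz), an illustrative Python check:<br />

```python
# Peak FP32 throughput in GFLOPS: one fused-multiply-add (FMA) counts as
# two floating-point operations.

def peak_gflops(cores, clock_ghz, flops_per_fma=2):
    return cores * flops_per_fma * clock_ghz

# GTX 580 (Fermi): 512 CUDA cores at 1.544 GHz
assert round(peak_gflops(512, 1.544), 3) == 1581.056
```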
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
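Both peak figures above follow from the same formula, operations per cycle x number of units x clock, as this small Python check shows:<br />

```python
# Peak integer throughput in GigaOps/sec: ops per cycle x units x clock (MHz).

def peak_gigaops(ops_per_cycle, units, clock_mhz):
    return ops_per_cycle * units * clock_mhz / 1000.0

assert round(peak_gigaops(32, 16, 1544), 3) == 790.528  # GTX 580, per CU
assert round(peak_gigaops(1, 2048, 925), 1) == 1894.4   # HD 7970, per PE
```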
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
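The MMAC primitive computes D = A x B + C on small matrix tiles. A plain reference version in Python, illustrative only (real tensor units operate on fixed tile sizes and mixed precisions):<br />

```python
# Reference matrix-multiply-accumulate: D = A * B + C on square tiles,
# the primitive that tensor/MMAC units execute in hardware.

def mma(A, B, C):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
assert mma(A, B, C) == [[20, 22], [43, 51]]
```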
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020, AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
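A simple cost model makes the batching argument concrete (illustrative Python; the task and overhead numbers are made up, but in the ballpark of the measurements cited above):<br />

```python
# Cost model for kernel-launch overhead: batching n tasks into one launch
# amortizes the fixed host-device latency across all of them.

def total_time_us(tasks, work_us_per_task, launch_overhead_us, batch_size):
    launches = -(-tasks // batch_size)   # ceiling division
    return launches * launch_overhead_us + tasks * work_us_per_task

# 1000 tasks of 1 us each, with a 5 us launch overhead:
assert total_time_us(1000, 1, 5, 1) == 6000      # one launch per task
assert total_time_us(1000, 1, 5, 1000) == 1005   # one batched launch
```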
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the former being the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing]. These are also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips offers neither RTX nor TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) used for system on a chip (SoC) designs. Since Series5 SGX, OpenCL support is available via licensees.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro]<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov , [[Computer Chess Forums|Rybka Forum]], October 09, 2017 <br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
:: {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=SIMD_and_SWAR_Techniques&diff=26642SIMD and SWAR Techniques2022-11-18T10:41:26Z<p>Smatovic: /* SIMD Instruction Sets */ added POWER and SVE/SVE2</p>
<hr />
<div>'''[[Main Page|Home]] * [[Programming]] * SIMD and SWAR Techniques'''<br />
<br />
[[FILE:SIMD.svg|border|right|thumb|[https://en.wikipedia.org/wiki/SIMD SIMD] <ref>[https://en.wikipedia.org/wiki/Flynn%27s_taxonomy Flynn's taxonomy from Wikipedia]</ref> ]] <br />
<br />
[[x86]], [[x86-64]], as well as [[PowerPC#G4|PowerPC]] and [https://en.wikipedia.org/wiki/Power_Architecture#Power_ISA_v.2.03 Power ISA v.2.03] processors provide '''Single Instructions''' on '''Multiple Data''' (SIMD), namely on [[Array|vectors]] of [[Float|floats]], [[Double|doubles]] or various integers, [[Byte|bytes]], [[Word|words]], [[Double Word|double words]] or [[Quad Word|quad words]], available through assembly and compiler intrinsics. SIMD-applications related to computer chess cover [[Bitboards|bitboard]] computations and [[Fill Algorithms|fill-algorithms]] like [[Dumb7Fill]] and [[Kogge-Stone Algorithm]], as well as [[Evaluation|evaluation]] related stuff, like this [[SSE2#SSE2dotproduct|SSE2 dot-product]] of 64 bits by a vector of 64 bytes.<br />
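As a minimal illustration (a sketch added here, not code from any of the engines mentioned), the following C function adds sixteen bytes lane by lane with a single SSE2 instruction; it assumes an x86/x86-64 compiler that provides &lt;emmintrin.h&gt;:<br />

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds the 16 byte lanes of a and b element-wise with one
   _mm_add_epi8 instruction (each lane wraps modulo 256). */
void add16_bytes(const uint8_t a[16], const uint8_t b[16], uint8_t out[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi8(va, vb));
}
```

With [[AVX2]] the same pattern widens to 32 byte lanes (<span style="background-color: #e3e3e3;">_mm256_add_epi8</span>).<br />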
<br />
'''SWAR''', an acronym for SIMD Within A Register, was coined by [[Hank Dietz]] and '''Randell J. Fisher''' <ref>[http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]</ref>. It is a processing model which applies SIMD parallel processing across sections of a CPU register; often, vectors of sub-byte entities are processed in [[Parallel Prefix Algorithms|parallel prefix]] manner. <br />
<br />
=SIMD Instruction Sets= <br />
* [[MMX]] on [[x86]] and [[x86-64]]<br />
* [[SSE2]], [[SSE3]], [[SSSE3]] and [[SSE4]] on [[x86]] and [[x86-64]]<br />
* [[SSE5]] by [[AMD]] (proposed but not implemented, replaced by [[XOP]] <ref>[https://en.wikipedia.org/wiki/SSE5 SSE5 from Wikipedia]</ref>)<br />
* [[AltiVec]] on [[PowerPC#G4|PowerPC G4]], [[PowerPC#G5|PowerPC G5]]<br />
* [[VMX]] since [[POWER | POWER6]]<br />
* [[ARM Helium]]<br />
* [[ARM NEON]]<br />
* [[ARM SVE]] <ref>[https://en.wikipedia.org/wiki/AArch64#Scalable_Vector_Extension_(SVE) SVE from Wikipedia]</ref>, [[ARM SVE2]] <ref>[https://en.wikipedia.org/wiki/AArch64#ARMv8.5-A_and_ARMv9.0-A SVE2 from Wikipedia]</ref><br />
* [[AVX]] by [[Intel]] <br />
* [[AVX2]] by [[Intel]]<br />
* [[AVX-512]] by [[Intel]]<br />
* [[XOP]] by [[AMD]] <br />
<span id="SWAR"></span><br />
<br />
=SWAR Arithmetic= <br />
To apply addition and subtraction on vectors of bit-aggregates or [https://en.wikipedia.org/wiki/Bit_field bit-field structures] within a general purpose register, one has to take care that carries and borrows do not cross lane boundaries. Thus the need to mask off all most significant bits (H) and to add in two steps: one 'add' with the MSBs cleared, and one addition modulo 2 aka '[[General Setwise Operations#ExclusiveOr|xor]]' for the MSBs themselves. For bytewise (rankwise) math inside a 64-bit register, H is <span style="background-color: #e3e3e3;">0x8080808080808080</span> and L is <span style="background-color: #e3e3e3;">0x0101010101010101</span>.<br />
<pre><br />
SWAR add z = x + y<br />
z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)<br />
</pre><br />
<pre><br />
SWAR sub z = x - y<br />
z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)<br />
</pre><br />
<pre><br />
SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)<br />
z = (x & y) + (((x ^ y) & ~L) >> 1)<br />
</pre><br />
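The three formulas are easy to verify in plain C; a minimal, self-contained sketch (the function names are illustrative, not established identifiers):<br />

```c
#include <stdint.h>

/* H masks the MSB, L the LSB of each byte lane of a 64-bit register */
#define H 0x8080808080808080ULL
#define L 0x0101010101010101ULL

/* bytewise z = x + y, carries do not cross lane boundaries */
uint64_t swar_add(uint64_t x, uint64_t y) {
    return ((x & ~H) + (y & ~H)) ^ ((x ^ y) & H);
}

/* bytewise z = x - y, borrows do not cross lane boundaries */
uint64_t swar_sub(uint64_t x, uint64_t y) {
    return ((x | H) - (y & ~H)) ^ ((x ^ ~y) & H);
}

/* bytewise z = (x + y) / 2, based on x + y == (x ^ y) + 2*(x & y) */
uint64_t swar_avg(uint64_t x, uint64_t y) {
    return (x & y) + (((x ^ y) & ~L) >> 1);
}
```

For instance, swar_add of the byte vectors ..FF and ..01 yields ..00 - the low byte wraps around without carrying into its neighbour.<br />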
<br />
=Samples= <br />
It is amazing how similar these two SWAR and [[Parallel Prefix Algorithms|parallel prefix wise]] routines are. [[Flipping Mirroring and Rotating#MirrorHorizontally|Mirror horizontally]] and [[Population Count#SWARPopcount|population count]] both act on vectors of duos, [[Nibble|nibbles]] and [[Byte|bytes]]: the first swaps bits, duos and nibbles, while the second adds up their populations.<br />
<pre><br />
U64 mirrorHorizontal (U64 x) {<br />
const U64 k1 = C64(0x5555555555555555);<br />
const U64 k2 = C64(0x3333333333333333);<br />
const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);<br />
x = ((x & k1) << 1) | ((x >> 1) & k1);<br />
x = ((x & k2) << 2) | ((x >> 2) & k2);<br />
x = ((x & k4) << 4) | ((x >> 4) & k4);<br />
return x;<br />
}<br />
</pre><br />
<pre><br />
int popCount (U64 x) {<br />
const U64 k1 = C64(0x5555555555555555);<br />
const U64 k2 = C64(0x3333333333333333);<br />
const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);<br />
x = x - ((x >> 1) & k1);<br />
x = (x & k2) + ((x >> 2) & k2);<br />
x = (x + (x >> 4)) & k4;<br />
x = (x * C64(0x0101010101010101)) >> 56;<br />
return (int) x;<br />
}<br />
</pre><br />
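Both routines compile as-is once the wiki's U64 type and C64 constant macro are defined; a self-contained version for experimenting (the typedef and macro here spell out the usual conventions, for illustration only):<br />

```c
#include <stdint.h>

typedef uint64_t U64;   /* 64-bit bitboard type */
#define C64(x) x##ULL   /* 64-bit constant suffix macro */

U64 mirrorHorizontal (U64 x) {
   const U64 k1 = C64(0x5555555555555555);
   const U64 k2 = C64(0x3333333333333333);
   const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
   x = ((x & k1) << 1) | ((x >> 1) & k1);   /* swap bits   */
   x = ((x & k2) << 2) | ((x >> 2) & k2);   /* swap duos   */
   x = ((x & k4) << 4) | ((x >> 4) & k4);   /* swap nibbles */
   return x;
}

int popCount (U64 x) {
   const U64 k1 = C64(0x5555555555555555);
   const U64 k2 = C64(0x3333333333333333);
   const U64 k4 = C64(0x0f0f0f0f0f0f0f0f);
   x =  x       - ((x >> 1)  & k1);         /* duo counts    */
   x = (x & k2) + ((x >> 2)  & k2);         /* nibble counts */
   x = (x       +  (x >> 4)) & k4;          /* byte counts   */
   x = (x * C64(0x0101010101010101)) >> 56; /* sum of bytes  */
   return (int) x;
}
```

E.g. mirroring the a1-bit (bit 0) yields the h1-bit (bit 7), and popCount of a full board is 64.<br />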
=See also=<br />
* [[GPU]]<br />
* [[NNUE]]<br />
* [[Parallel Prefix Algorithms]]<br />
<br />
=Publications=<br />
==1987 ...==<br />
* [[Alan H. Bond]] ('''1987'''). ''Broadcasting Arrays - A Highly Parallel Computer Architecture Suitable For Easy Fabrication''. [http://www.exso.com/bc.pdf pdf]<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
* [https://dblp.uni-trier.de/pers/f/Fisher:Randall_J=.html Randell J. Fisher], [[Hank Dietz]] ('''1998'''). ''[https://link.springer.com/chapter/10.1007/3-540-48319-5_19 Compiling for SIMD Within a Register]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc1998.html LCPC 1998], [https://link.springer.com/chapter/10.1007/3-540-48319-5_19 pdf]<br />
* [https://www.linkedin.com/in/tom-thompson-500bb7b Tom Thompson] ('''1999'''). ''[http://www.mactech.com/articles/mactech/Vol.15/15.07/AltiVecRevealed/index.html AltiVec Revealed]''. [http://www.mactech.com/ MacTech], Vol. 15, No. 7<br />
==2000 ...==<br />
* [https://dblp.uni-trier.de/pers/f/Fisher:Randall_J=.html Randell J. Fisher] ('''2003'''). ''[https://docs.lib.purdue.edu/dissertations/AAI3108343/ General-Purpose SIMD Within A Register: Parallel Processing on Consumer Microprocessors]''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Purdue_University Purdue University], advisor [[Hank Dietz]], [http://aggregate.org/SWAR/Dis/dissertation.pdf pdf]<br />
* [[Daisuke Takahashi]] ('''2007'''). ''[https://link.springer.com/chapter/10.1007/978-3-540-75755-9_135/ An Implementation of Parallel 1-D FFT Using SSE3 Instructions on Dual-Core Processors]''. Proc. Workshop on State-of-the-Art in Scientific and Parallel Computing, [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], No. 4699, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Daisuke Takahashi]] ('''2008'''). ''Implementation and Evaluation of Parallel FFT Using SIMD Instructions on Multi-Core Processors''. Proc. 2007 International Workshop on Innovative Architecture for Future Generation High-Performance Processors and Systems<br />
* [https://www.researchgate.net/profile/Nicolas_Fritz2 Nicolas Fritz] ('''2009'''). ''SIMD Code Generation in Data-Parallel Programming''. Ph.D. thesis, [https://en.wikipedia.org/wiki/Saarland_University Saarland University], [http://scidok.sulb.uni-saarland.de/volltexte/2009/2563/pdf/Dissertation_9229_Frit_Nico_2009.pdf?q=ibms-cell-processor pdf]<br />
==2010 ...==<br />
* [https://www.rrze.fau.de/wir-ueber-uns/organigramm/mitarbeiter/index.shtml/georg-hager.shtml Georg Hager] <ref>[https://blogs.fau.de/hager/ Georg Hager's Blog | Random thoughts on High Performance Computing]</ref>, [http://dblp.uni-trier.de/pers/hd/t/Treibig:Jan Jan Treibig], [http://dblp.uni-trier.de/pers/hd/w/Wellein:Gerhard Gerhard Wellein] ('''2013'''). ''The Practitioner's Cookbook for Good Parallel Performance on Multi- and Many-Core Systems''. [https://de.wikipedia.org/wiki/Regionales_Rechenzentrum_Erlangen RRZE], [http://sc13.supercomputing.org/ SC13], [https://blogs.fau.de/hager/files/2013/11/sc13_tutorial_134.pdf slides as pdf]<br />
* [https://scholar.google.com/citations?user=4Ab_NBkAAAAJ&hl=en Kaixi Hou], [[Hao Wang]], [http://dblp.uni-trier.de/pers/hd/f/Feng:Wu=chun Wu-chun Feng] ('''2015'''). ''ASPaS: A Framework for Automatic SIMDIZation of Parallel Sorting on x86-based Many-core Processors''. [http://dblp.uni-trier.de/db/conf/ics/ics2015.html#HouWF15 ICS2015]<br />
<br />
=Manuals= <br />
==AMD== <br />
* [http://developer.amd.com/wordpress/media/2012/10/26568_APM_v41.pdf AMD64 Architecture Volume 4: 128-Bit and 256-Bit Media Instructions] (pdf)<br />
* [http://support.amd.com/TechDocs/26569_APM_v5.pdf AMD64 Architecture Volume 5: 64-Bit Media and x87 Floating-Point Instructions] (pdf)<br />
* [http://support.amd.com/TechDocs/43479.pdf AMD64 Architecture Volume 6: 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions] (pdf)<br />
==NXP Semiconductors==<br />
* [http://www.nxp.com/files/32bit/doc/ref_manual/ALTIVECPIM.pdf AltiVec Technology - Programming Interface Manual] (pdf) <ref>On December 7, 2015, [https://en.wikipedia.org/wiki/NXP_Semiconductors NXP Semiconductors] completed its acquisition of Freescale, [https://en.wikipedia.org/wiki/Freescale_Semiconductor Freescale from Wikipedia]</ref><br />
==Intel== <br />
* [http://www.intel.com/design/processor/manuals/248966.pdf Intel 64 and IA32 Architectures Optimization Reference Manual] (pdf)<br />
<br />
=Forum Posts= <br />
==1999==<br />
* [https://www.stmintz.com/ccc/index.php?id=71754 G4 & AltiVec] by [[Will Singleton]], [[CCC]], October 04, 1999 » [[AltiVec]], [[PowerPC #G4|PowerPC G4]]<br />
==2000 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=23860 Superlinear interpolator: a nice novelity ?] by [[Marco Costalba]], [[CCC]], September 20, 2008 » [[Tapered Eval]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?p=301746#301746 Re: talk about IPP's evaluation] by [[Richard Vida]], [[CCC]], November 07, 2009 » [[Ippolit]], [[Tapered Eval]]<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38523 My experience with Linux/GCC] by [[Richard Vida]], [[CCC]], March 23, 2011 » [[C]], [[Linux]], [[Tapered Eval]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39916&start=1 Re: Utilizing Architecture Specific Functions from a HL Language] by [[Wylie Garvin]], [[CCC]], July 31, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42054 two values in one integer] by [[Pierre Bokma]], [[CCC]], January 18, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=59820 Pigeon now using opportunistic SIMD] by [[Stuart Riffle]], [[CCC]], April 11, 2016 » [[Pigeon]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61850 couple of questions about stockfish code ?] by [[Mahmoud Uthman]], [[CCC]], October 26, 2016 » [[Stockfish]], [[Tapered Eval]]<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73126 SIMD methods in TT probing and replacement] by [[Harm Geert Muller]], [[CCC]], February 20, 2020 » [[Transposition Table]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75862 CPU Vector Unit, the new jam for NNs...] by [[Srdja Matovic]], [[CCC]], November 18, 2020 » [[NNUE]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/SIMD SIMD from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/SWAR SWAR from Wikipedia]<br />
* [http://www.aggregate.org/SWAR/ The Aggregate: SWAR, SIMD Within A Register] by [[Hank Dietz]]<br />
==[[x86]]/[[x86-64]]== <br />
* [https://en.wikipedia.org/wiki/MMX_%28instruction_set%29 MMX from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/3DNow 3DNow! from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Streaming_SIMD_Extensions Streaming SIMD Extensions from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/SSE2 SSE2 from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/SSE3 SSE3 from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/SSSE3 SSSE3 from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/SSE4 SSE4 from Wikipedia]<br />
* [https://de.wikipedia.org/wiki/SSE4a SSE4a from Wikipedia.de]<br />
* [https://en.wikipedia.org/wiki/SSE5 SSE5 from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/XOP_instruction_set XOP instruction set from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Advanced_Vector_Extensions Advanced Vector Extensions from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/AVX-512 AVX-512 from Wikipedia]<br />
* [http://developer.amd.com/cpu/Libraries/sseplus/Pages/default.aspx SSEPlus Project] from [http://developer.amd.com/pages/default.aspx AMD Developer Central]<br />
* [http://sseplus.sourceforge.net/index.html SSEPlus Project Documentation]<br />
==Other== <br />
* [https://developer.arm.com/architectures/instruction-sets/simd-isas/neon SIMD ISAs | Neon – Arm Developer]<br />
* [https://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_.28NEON.29 ARM NEON Technology from Wikipedia]<br />
* [https://developer.arm.com/architectures/instruction-sets/simd-isas/helium SIMD ISAs | Arm Helium technology – Arm Developer]<br />
* [https://en.wikipedia.org/wiki/AltiVec AltiVec from Wikipedia]<br />
==Misc==<br />
* [https://en.wikipedia.org/wiki/Explicitly_parallel_instruction_computing Explicitly parallel instruction computing (EPIC) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Instruction-level_parallelism Instruction-level parallelism from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/MIMD MIMD from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia] » [[GPU]], [[Thread]]<br />
* [https://en.wikipedia.org/wiki/SPMD SPMD from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Very_long_instruction_word Very long instruction word (VLIW) from Wikipedia]<br />
<br />
=References= <br />
<references /><br />
'''[[Programming|Up one Level]]'''</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26631GPU2022-11-15T06:21:55Z<p>Smatovic: /* Programming Model */ typo</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]] <br />
<br />
'''GPU''' (Graphics Processing Unit),<br/><br />
a specialized processor primarily intended for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.<br />
<br />
=History=<br />
In the 1970s and 1980s RAM was expensive and home computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like the [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to render 3D graphics. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like the NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-trees.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]] specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group] is widely adopted across all kind of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. Depending on the architecture, the actual SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. The architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN])<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
Generalized, the Wavefront resp. Warp size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with an architecture limit on how many work-items a work-group can hold and how many threads can run in total concurrently on the device.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
<br />
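The index arithmetic behind this model is simple; a hedged C sketch of what OpenCL's get_global_id(0) resp. CUDA's blockIdx/threadIdx computation boils down to for a 1-D NDRange (the function names are illustrative, not API calls):<br />

```c
#include <stddef.h>

/* 1-D NDRange: global work-item id from work-group id and local id */
size_t global_id(size_t group_id, size_t local_size, size_t local_id) {
    return group_id * local_size + local_id;
}

/* number of work-groups needed to cover a global size, rounded up */
size_t num_groups(size_t global_size, size_t local_size) {
    return (global_size + local_size - 1) / local_size;
}
```

With a work-group size of 64, work-item 5 of work-group 2 gets global id 133, and an NDRange of 1000 items needs 16 work-groups.<br />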
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 32<br />
* Maximum number of resident warps per multiprocessor: 64<br />
* Maximum number of resident threads per multiprocessor: 2048<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, the 32-bit integer performance can be less than the 32-bit FLOP or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision: they quadruple the INT8 or octuple the INT4 throughput.<br />
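The 64-bit emulation mentioned above amounts to add-with-carry over two 32-bit halves; a hedged C sketch of what such an emulated addition has to do per vector lane:<br />

```c
#include <stdint.h>

/* emulate a 64-bit add with two 32-bit additions plus a carry,
   as a 32-bit Vector-ALU would have to */
uint64_t add64_via32(uint32_t xlo, uint32_t xhi,
                     uint32_t ylo, uint32_t yhi) {
    uint32_t lo    = xlo + ylo;
    uint32_t carry = lo < xlo;      /* unsigned overflow of the low half */
    uint32_t hi    = xhi + yhi + carry;
    return ((uint64_t)hi << 32) | lo;
}
```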
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
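Both peak numbers above follow the same formula - operations per clock per unit, times number of units, times clock rate; as a small C helper (illustrative only):<br />

```c
/* peak throughput in GigaOps/sec from ops per clock (per unit),
   unit count (compute units resp. processing elements) and clock in MHz */
double peak_gigaops(double ops_per_clock, double units, double mhz) {
    return ops_per_clock * units * mhz / 1000.0;  /* MegaOps -> GigaOps */
}
```

peak_gigaops(32, 16, 1544) reproduces the 790.528 GigaOps/sec of the GTX 580, peak_gigaops(1, 2048, 925) the 1894.4 GigaOps/sec of the HD 7970.<br />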
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced: FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The first being the most important by quantity, the second by performance. Discrete GPUs are divided as consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archiv.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems, the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>
<hr />
<div>
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on the CPU and offload the sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for its different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit, to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. Depending on the architecture, the SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. Note the difference between a vector processor with variable bit-width and SIMD units with fixed-bit-width cores. The architecture white papers of the different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC (matrix-multiply-accumulate) units are used to speed up neural networks further.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN])<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
Generalized, the Wavefront (AMD) resp. Warp (Nvidia) size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload-directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory; the architecture limits how many work-items a work-group can hold and how many threads can run concurrently on the device in total.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
<br />
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 32<br />
* Maximum number of resident warps per multiprocessor: 64<br />
* Maximum number of resident threads per multiprocessor: 2048<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred or copied between the CPU host and a discrete GPU device, but depending on architecture, vendor, framework and operating system, a unified address space accessible by both CPU and GPU may be offered.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, 32-bit integer performance can be lower than 32-bit floating-point or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision, for example quadrupled INT8 or octupled INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs in general have a higher FP32:FP64 throughput ratio - i.e. relatively weaker double-precision (64-bit) floating-point throughput - than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The first being the most important by quantity, the second by performance. Discrete GPUs are divided as consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archiv.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems, the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26629GPU2022-11-15T06:20:38Z<p>Smatovic: /* Programming Model */ typo</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]] <br />
<br />
'''GPU''' (Graphics Processing Unit),<br/><br />
a specialized processor primarily intended for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general-purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.<br />
<br />
=History=<br />
In the 1970s and 1980s RAM was expensive and home computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer or texture buffer, like the [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to render 3D graphics. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD capabilities]] of CPUs, such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]], for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D-capable chip used in the PlayStation (1994) and Nvidia's combined 2D/3D chips like the NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on the CPU and offload sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends such as OpenCL, HIP, and C++ AMP, as well as OpenMP offload directives, and offers with [https://rocmdocs.amd.com/en/latest/ ROCm] its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] has been recommended by [[Apple]].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, as well as the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC (matrix-multiply-accumulate) units are used to speed up neural networks further.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
Generalized, the Wavefront (AMD) or Warp (Nvidia) size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload-directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture limits on how many work-items a work-group can hold and how many threads can run in total concurrently on the device.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
<br />
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 32<br />
* Maximum number of resident warps per multiprocessor: 64<br />
* Maximum number of resident threads per multiprocessor: 2048<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item or thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group or threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items or threads.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: The 32-bit integer performance can be, depending on architecture and operation, lower than the 32-bit floating-point or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower double-precision (64-bit) floating-point throughput ratio (FP32:FP64) than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, a.k.a. kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This also affected game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source project [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The first is the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads, each offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumers, Radeon Pro for professionals and Radeon Instinct for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
The RDNA 3 architecture in the Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
The CDNA 2 architecture in the MI200 HPC-GPU, with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric, was unveiled in November 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
The CDNA architecture in the MI100 HPC-GPU with Matrix Cores was unveiled in November 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. OpenCL support has been offered since Midgard (2012), which introduced a unified shader model.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumers, Quadro for professionals and Tesla for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing]. These are also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system on a chip (SoC) designs. OpenCL support via licensees has been available since the Series5 SGX.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. OpenCL support has been offered since the Adreno 300 series.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro]<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: "Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=Leela_Chess_Zero&diff=26625Leela Chess Zero2022-11-14T15:12:03Z<p>Smatovic: /* Lc0 */</p>
<hr />
<div>'''[[Main Page|Home]] * [[Engines]] * Leela Chess Zero'''<br />
<br />
[[FILE:LC0-Logo.jpg|border|right|thumb|link=https://twitter.com/leelachesszero| Lc0 logo <ref>[https://twitter.com/leelachesszero Leela Chess Zero (@LeelaChessZero) | Twitter]</ref> ]] <br />
<br />
'''Leela Chess Zero''', (LCZero, lc0)<br/><br />
an adaption of [[Gian-Carlo Pascutto|Gian-Carlo Pascutto's]] [[Leela Zero]] [[Go]] project <ref>[https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]</ref> to [[Chess]], initiated and announced by [[Stockfish]] co-author [[Gary Linscott]], who was already responsible for the Stockfish [[Stockfish#TestingFramework|Testing Framework]] called ''Fishtest''. Leela Chess is [[:Category:Open Source|open source]], released under the terms of [[Free Software Foundation#GPL|GPL version 3]] or later, and supports [[UCI]]. <br />
The goal is to build a strong chess playing entity following the same type of [[Deep Learning|deep learning]] along with [[Monte-Carlo Tree Search|Monte-Carlo tree search]] (MCTS) techniques of [[AlphaZero]] as described in [[DeepMind|DeepMind's]] 2017 and 2018 papers <br />
<ref>[[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815]</ref><br />
<ref>[[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419</ref><br />
<ref>[http://blog.lczero.org/2018/12/alphazero-paper-and-lc0-v0191.html AlphaZero paper, and Lc0 v0.19.1] by [[Alexander Lyashuk|crem]], [[Leela Chess Zero|LCZero blog]], December 07, 2018</ref>, <br />
but using distributed training for the weights of the [[Neural Networks#Deep|deep]] [[Neural Networks#Convolutional|convolutional neural network]] (CNN, DNN, DCNN). <br />
<br />
=Lc0=<br />
Leela Chess Zero consists of an executable to play or analyze [[Chess Game|games]], initially dubbed '''LCZero''', soon rewritten by a team around [[Alexander Lyashuk]] for better performance and then called '''Lc0''' <ref>[https://github.com/LeelaChessZero/lc0/wiki/lc0-transition lc0 transition · LeelaChessZero/lc0 Wiki · GitHub]</ref> <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68094&start=91 Re: TCEC season 13, 2 NN engines will be participating, Leela and Deus X] by [[Gian-Carlo Pascutto]], [[CCC]], August 03, 2018</ref>. This executable, the actual chess engine, performs the [[Monte-Carlo Tree Search|MCTS]] and reads the self-taught [[Neural Networks#Convolutional|CNN]], whose weights are stored in a separate file.<br />
Lc0 is written in [[Cpp|C++]] (started with [[Cpp#14|C++14]] then upgraded to [[Cpp#17|C++17]]) and may be compiled for various platforms and backends. Since deep CNN approaches are best suited to run massively in parallel on [[GPU|GPUs]] to perform all the [[Float|floating point]] [https://en.wikipedia.org/wiki/Dot_product dot products] for thousands of neurons, <br />
the preferred target platforms are [[Nvidia]] [[GPU|GPUs]] supporting the [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://en.wikipedia.org/wiki/CuDNN CuDNN] libraries <ref>[https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]</ref>. [[Ankan Banerjee]] wrote the CuDNN backend code, also shared by [[Deus X]] and [[Allie]] <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71822&start=48 Re: My failed attempt to change TCEC NN clone rules] by [[Adam Treat]], [[CCC]], September 19, 2019</ref>, as well as the DX12 backend. GPUs without CUDA support ([[AMD]], [[Intel]]) are served through [[OpenCL]] or DX12, while much slower pure CPU binaries are possible using [https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms BLAS]. Target systems with or without a [https://en.wikipedia.org/wiki/Video_card graphics card] (GPU) are [[Linux]], [[Mac OS]] and [[Windows]] computers, or, with BLAS only, the [[Raspberry Pi]].<br />
<br />
=Description=<br />
Like [[AlphaZero]], Lc0 [[Evaluation|evaluates]] [[Chess Position|positions]] using non-linear function approximation based on a [[Neural Networks|deep neural network]], rather than the [[Evaluation#Linear|linear function approximation]] as used in classical chess programs. <br />
This neural network takes the board position as input and outputs position evaluation (QValue) and a vector of move probabilities (PValue, policy). <br />
Once trained, this network is combined with a [[Monte-Carlo Tree Search]] (MCTS), using the policy to narrow the search down to high-probability moves, <br />
and using the value head to evaluate positions in the tree. The MCTS selection is done by a variation of [[Christopher D. Rosin|Rosin's]] [[UCT]] improvement dubbed [[Christopher D. Rosin#PUCT|PUCT]] (Predictor + UCT).<br />
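The PUCT selection can be sketched as follows. This is a minimal, illustrative Python model rather than Lc0's actual C++ implementation, which adds further refinements; `N`, `W` and `P` denote a child node's visit count, accumulated value and network prior, and `c_puct` is a hypothetical exploration constant:<br />

```python
import math

def puct_select(children, c_puct=1.5):
    """Pick the child maximizing Q + U, a simplified PUCT rule."""
    total_n = sum(ch["N"] for ch in children)

    def score(ch):
        q = ch["W"] / ch["N"] if ch["N"] > 0 else 0.0              # mean value (Q)
        u = c_puct * ch["P"] * math.sqrt(total_n) / (1 + ch["N"])  # exploration (U)
        return q + u

    return max(children, key=score)
```

Unvisited moves with a high prior receive a large exploration term U, so the policy head effectively steers which branches the search expands first.<br />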
<br />
==[[Board Representation]]==<br />
Lc0's color agnostic board is represented by five [[Bitboards|bitboards]] (own pieces, opponent pieces, orthogonal sliding pieces, diagonal sliding pieces, and pawns including [[En passant|en passant]] target information coded as pawns on rank 1 and 8), two king squares, [[Castling Rights|castling rights]], and a flag whether the board is [[Color Flipping|color flipped]]. Getting individual piece bitboards requires some [[General Setwise Operations|setwise operations]] such as [[General Setwise Operations#Intersection|intersection]] and [[General Setwise Operations#RelativeComplement|set theoretic difference]] <ref>[https://github.com/LeelaChessZero/lc0/blob/master/src/chess/board.h lc0/board.h at master · LeelaChessZero/lc0 · GitHub]</ref>.<br />
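A minimal sketch of how individual piece sets fall out of this scheme, using Python integers as 64-bit bitboards (illustrative only, not Lc0's actual code): queens are the intersection of the orthogonal and diagonal slider sets, while rooks are the orthogonal sliders minus that intersection:<br />

```python
# Hypothetical bitboard sets modeled after the description above:
# our_pieces holds the side to move, rooks_queens / bishops_queens hold
# the orthogonal resp. diagonal sliders of BOTH colors.

def own_queens(our_pieces, rooks_queens, bishops_queens):
    # Queens slide both ways: intersection of both slider sets, own side only.
    return our_pieces & rooks_queens & bishops_queens

def own_rooks(our_pieces, rooks_queens, bishops_queens):
    # Rooks: own orthogonal sliders minus the queens (set-theoretic difference).
    return our_pieces & rooks_queens & ~bishops_queens
```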
<br />
==Network==<br />
While [[AlphaGo]] used two disjoint networks for policy and value, [[AlphaZero]] as well as Leela Chess Zero, share a common "body" connected to disjoint policy and value "heads". The “body” consists of spatial 8x8 planes, using B [[Neural Networks#Residual|residual]] blocks with F filters of kernel size 3x3, stride 1. So far, model sizes FxB of 64x6, 128x10, 192x15, and 256x20 were used. <br />
<br />
Concerning [[Nodes per Second|nodes per second]] of the MCTS, smaller models are faster to evaluate than larger ones. They are faster to train and show progress earlier, but they also saturate earlier, so that at some point further training no longer improves the engine. Larger and deeper network models improve the receptivity, that is, the amount of knowledge and patterns extracted from the training samples, with the potential for a [[Playing Strength|stronger]] engine. <br />
As a further improvement, Leela Chess Zero applies the ''Squeeze and Excite'' (SE) extension to the residual block architecture <ref>[https://github.com/LeelaChessZero/lc0/wiki/Technical-Explanation-of-Leela-Chess-Zero Technical Explanation of Leela Chess Zero · LeelaChessZero/lc0 Wiki · GitHub]</ref> <ref>[https://towardsdatascience.com/squeeze-and-excitation-networks-9ef5e71eacd7 Squeeze-and-Excitation Networks – Towards Data Science] by [http://plpp.de/ Paul-Louis Pröve], October 17, 2017</ref>. The body is connected to both the policy "head" for the move probability distribution, and the value "head" for the evaluation score aka [[Pawn Advantage, Win Percentage, and Elo|winning probability]] of the current position and up to seven predecessor positions on the input planes.<br />
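The squeeze-and-excitation idea can be illustrated with a small NumPy sketch over one block's F feature maps; the bottleneck weights `w1`, `b1`, `w2`, `b2` are hypothetical parameters, and this is a simplified illustration of the concept, not Lc0's actual network code:<br />

```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Rescale feature maps x of shape (F, 8, 8) channel-wise."""
    z = x.mean(axis=(1, 2))                    # "squeeze": global average pool, (F,)
    h = np.maximum(w1 @ z + b1, 0.0)           # small bottleneck layer with ReLU
    s = 1.0 / (1.0 + np.exp(-(w2 @ h + b2)))   # sigmoid gate per filter, (F,)
    return x * s[:, None, None]                # "excite": channel-wise rescaling
```

Each filter is summarized to a single number, and the resulting per-filter gate lets the network emphasize or suppress whole feature maps depending on the global board context.<br />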
<br />
==Training==<br />
Like in [[AlphaZero]], the '''Zero''' suffix implies no other initial knowledge than the rules of the game, to build a superhuman player, starting with truly random self-play games to apply [[Reinforcement Learning|reinforcement learning]] based on the outcome of those games. However, there are derived approaches, such as [[Albert Silver|Albert Silver's]] [[Deus X]], trying to take a short-cut by initially using [[Supervised Learning|supervised learning]] techniques, such as feeding in high quality games played by other strong chess playing entities, or huge records of positions with a given preferred move.<br />
The unsupervised training of the NN aims to minimize the [https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm L2-norm] of the [https://en.wikipedia.org/wiki/Mean_squared_error mean squared error] loss of the value output and the policy loss. Furthermore, there are experiments to train the value head not against the game outcome, but against the accumulated value for a position after exploring some number of nodes with [[UCT]] <ref>[https://medium.com/oracledevs/lessons-from-alphazero-part-4-improving-the-training-target-6efba2e71628 Lessons From AlphaZero (part 4): Improving the Training Target] by [https://blogs.oracle.com/author/vish-abrams Vish Abrams], [https://blogs.oracle.com/ Oracle Blog], June 27, 2018</ref>.<br />
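A toy sketch of such a combined training target, squared error on the value head plus cross-entropy on the policy head (regularization and batching omitted; not Lc0's actual pipeline code):<br />

```python
import math

def training_loss(value_pred, game_outcome, policy_pred, search_policy):
    """Per-position loss: value MSE plus policy cross-entropy."""
    value_loss = (value_pred - game_outcome) ** 2
    policy_loss = -sum(p * math.log(q)
                       for p, q in zip(search_policy, policy_pred) if p > 0)
    return value_loss + policy_loss
```

Here the policy target `search_policy` would come from the move distribution produced by the self-play search, pulling the policy head toward the search result while the value head is pulled toward the game outcome.<br />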
<br />
The distributed training is realized with a sophisticated [https://en.wikipedia.org/wiki/Client%E2%80%93server_model client-server model]. The client, written entirely in the [[Go (Programming Language)|Go programming language]], incorporates Lc0 to produce self-play games. Controlled by the server, the client may download the latest network, start self-playing, and upload games to the server, which in turn will regularly produce and distribute new neural network weights after a certain number of games has become available from contributors. The training software consists of [[Python]] code; the pipeline requires [https://en.wikipedia.org/wiki/NumPy NumPy] and [https://en.wikipedia.org/wiki/TensorFlow TensorFlow] running on [[Linux]] <ref>[https://github.com/LeelaChessZero/lczero-training GitHub - LeelaChessZero/lczero-training: For code etc relating to the network training process.]</ref>. <br />
The server is written in [[Go (Programming Language)|Go]] along with [[Python]] and [https://en.wikipedia.org/wiki/Shell_script shell scripts].<br />
<br />
=Structure Diagrams= <br />
[[FILE:lc0diagram.png|none|border|text-bottom]] <br />
Related to [[TCEC]] clone discussions concerning [[Deus X]] and [[Allie]] aka [[Allie#AllieStein|AllieStein]],<br/><br />
[[Alexander Lyashuk]] published diagrams with all components of the affected engines.<br/><br />
The above shows the common legend, and the structure of all Leela Chess Zero's components based on current Lc0 engine <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71822 My failed attempt to change TCEC NN clone rules] by [[Alexander Lyashuk]], [[CCC]], September 14, 2019 » [[TCEC]]</ref><br />
[[FILE:Lczero.png|none|border|text-bottom|670px]]<br />
Same diagram, but initial LCZero engine, which played [[TCEC Season 12]] <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71822 My failed attempt to change TCEC NN clone rules] by [[Alexander Lyashuk]], [[CCC]], September 14, 2019 » [[TCEC]]</ref><br />
<br />
=Tournament Play=<br />
==First Experience==<br />
LCZero gained first tournament experience in April 2018 at [[TCEC Season 12]] and over the board at the [[WCCC 2018]]<br />
in [https://en.wikipedia.org/wiki/Stockholm Stockholm], July 2018. It won the [[TCEC Season 13#Fourth|fourth division]] of [[TCEC Season 13]] in August 2018, with [[#Lc0|Lc0]] finally coming in third in the [[TCEC Season 13#Third|third division]].<br />
==Breakthrough==<br />
[[TCEC Season 14]] from November 2018 until February 2019 became a breakthrough, Lc0 winning the [[TCEC Season 14#Third|third]], [[TCEC Season 14#Second|second]] and [[TCEC Season 14#First|first]] division, <br />
to even [[TCEC Season 14#Premier|qualify]] for the [[TCEC Season 14#Superfinal|superfinal]], losing by the narrow margin of +10 =81 -9, '''50½ - 49½''' versus [[Stockfish]].<br />
Again runner-up at the [[TCEC Season 15#Premier|TCEC Season 15 premier division]] in April 2019,<br />
Lc0 aka '''LCZero v0.21.1-nT40.T8.610''' won the [[TCEC Season 15#Superfinal|superfinal]] in May 2019 versus Stockfish with +14 =79 -7, '''53½-46½''' <ref>[https://blog.lczero.org/2019/05/lc0-won-tcec-15.html Lc0 won TCEC 15] by [[Alexander Lyashuk|crem]], [[Leela Chess Zero|LCZero blog]], May 28, 2019</ref>.<br />
At the [[TCEC Season 16#Premier|TCEC Season 16 premier division]] in September 2019, Lc0 came in third behind Stockfish and the [[Supervised Learning|supervised]]-trained [[Allie#AllieStein|AllieStein]],<br />
but Lc0 took revenge by winning the [[TCEC Season 17#Premier|TCEC Season 17 premier division]] in spring 2020, as '''LCZero v0.24-sv-t60-3010''' defeating Stockfish in a thrilling [[TCEC Season 17#Superfinal|superfinal]] in April 2020 with +17 =71 -12, '''52½-47½''' <ref>[https://lczero.org/blog/2020/04/tcec-s17-super-final-report/ TCEC S17 Super Final report] by @glbchess64, [[Leela Chess Zero|LCZero blog]], April 21, 2020</ref>, but the tables turned again in [[TCEC Season 18#Premier|TCEC Season 18]], when Stockfish won the [[TCEC Season 18#Superfinal|superfinal]].<br />
<br />
=Release Dates=<br />
==2018==<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.14.1 - July 08, 2018<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.16.0 - July 20, 2018<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.17.0 - August 27, 2018<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.18.0 - September 30, 2018<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.18.1 - October 02, 2018<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.19.0 - November 19, 2018<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.19.1.1 - December 10, 2018<br />
==2019==<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.20.9 - January 01, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.20.1 - January 07, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.20.2 - February 01, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.21.1 - March 23, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.21.2 - June 09, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.21.4 - July 28, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.22.0 - August 05, 2019<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.23.2 - December 31, 2019<br />
==2020==<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.23.3 - February 18, 2020<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.24.1 - March 15, 2020<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.25.1 - April 30, 2020<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.26.0 - July 02, 2020<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.26.1 - July 15, 2020<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.26.2 - September 02, 2020<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] v0.26.3 - October 10, 2020<br />
==2021==<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] [https://github.com/LeelaChessZero/lc0/releases/tag/v0.27.0 v0.27.0] - February 21, 2021<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] [https://github.com/LeelaChessZero/lc0/releases/tag/v0.28.0 v0.28.0] - August 26, 2021<br />
* [[Leela Chess Zero]] / [[Leela Chess Zero#Lc0|Lc0]] [https://github.com/LeelaChessZero/lc0/releases/tag/v0.28.2 v0.28.2] - December 13, 2021<br />
<br />
=Authors=<br />
* [[:Category:Leela Chess Contributor|Leela Chess Contributors]]<br />
<br />
=See also=<br />
* [[Allie]]<br />
* [[AlphaZero]]<br />
* [[Ceres]]<br />
* [[Fat Fritz]]<br />
* [[Deep Learning]]<br />
* [[Deus X]]<br />
* [[Leela Zero]]<br />
* [[Leila]]<br />
* [[Maia Chess]]<br />
* [[Monte-Carlo Tree Search]]<br />
: [[UCT]]<br />
: [[Christopher D. Rosin#PUCT|PUCT]]<br />
* [[Stockfish NNUE]]<br />
<br />
=Publications=<br />
* [[Bill Jordan]] ('''2020'''). ''Calculation versus Intuition: Stockfish versus Leela''. [https://www.amazon.com/Calculation-versus-Intuition-Stockfish-Leela-ebook/dp/B08LYBQDMB/ amazon] » [[TCEC]], [[Stockfish]]<br />
* [[Dominik Klein]] ('''2021'''). ''[https://github.com/asdfjkl/neural_network_chess Neural Networks For Chess]''. [https://github.com/asdfjkl/neural_network_chess/releases/tag/v1.1 Release Version 1.1 · GitHub] <ref>[https://www.talkchess.com/forum3/viewtopic.php?f=2&t=78283 Book about Neural Networks for Chess] by dkl, [[CCC]], September 29, 2021</ref><br />
* [[Rejwana Haque]], [[Ting Han Wei]], [[Martin Müller]] ('''2021'''). ''On the Road to Perfection? Evaluating Leela Chess Zero Against Endgame Tablebases''. [[Advances in Computer Games 17]]<br />
<br />
=Forum Posts=<br />
==2018==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=66280&start=67 Re: Announcing lczero] by [[Daniel Shawul]], [[CCC]], January 21, 2018 » [[Bojun Huang#Rollout|Rollout Paradigm]]<br />
* [https://github.com/glinscott/leela-chess/issues/47 Policy and value heads are from AlphaGo Zero, not Alpha Zero Issue #47] by [[Gian-Carlo Pascutto]], [https://github.com/glinscott/leela-chess glinscott/leela-chess · GitHub], January 24, 2018<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66452 LCZero is learning] by [[Gary Linscott|Gary]], [[CCC]], January 30, 2018<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66824 LCZero update] by [[Gary Linscott|Gary]], [[CCC]], March 14, 2018<br />
: [http://talkchess.com/forum/viewtopic.php?t=66929 LCZero update (2)] by [[Rein Halbersma]], [[CCC]], March 25, 2018<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66945 LCZero: Progress and Scaling. Relation to CCRL Elo] by [[Kai Laskos]], [[CCC]], March 28, 2018 » [[Playing Strength]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=67013 What does LCzero learn?] by [[Uri Blass]], [[CCC]], April 05, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67044 How to play vs LCZero with Cute Chess gui] by Hai, [[CCC]], April 08, 2018 » [[Cute Chess]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=67075 LCZero in Aquarium / Fritz] by [[Carl Bicknell]], [[CCC]], April 11, 2018<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=67087 LCZero on 10x128 now] by [[Gary Linscott|Gary]], [[CCC]], April 12, 2018 <br />
* [http://www.talkchess.com/forum/viewtopic.php?t=67092 lczero faq] by Duncan Roberts, [[CCC]], April 13, 2018<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=67104 Run LC Zero in LittleBlitzerGUI] by [[Stefan Pohl]], [[CCC]], April 14, 2018 » [[LittleBlitzer]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67121 LC0 - how to catch up?] by [[Srdja Matovic]], [[CCC]], April 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67300 Leela on more then one GPU?] by [[Karlo Balla]], [[CCC]], May 01, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[GPU]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67646 New CLOP settings give Leela huge tactics boost] by [[Albert Silver]], [[CCC]], June 04, 2018 » [[CLOP]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67668 First Win by Leela Chess Zero against Stockfish dev] by [[Ankan Banerjee]], [[CCC]], June 07, 2018 » [[Stockfish]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67718 what may be two firsts...] by [[Michael Byrne|Michael B]], [[CCC]], June 13, 2018 » [[DGT Pi]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67728 LcZero and STS] by [[Ed Schroder|Ed Schröder]], [[CCC]], June 14, 2018 » [[Strategic Test Suite]]<br />
* [https://groups.google.com/d/msg/lczero/S-rhiPLnbHg/XY9-Z1LWCAAJ Who entered Leela into WCCC? Bad idea!!] by [[Chris Whittington]], [[Computer Chess Forums|LCZero Forum]], June 23, 2018 » [[WCCC 2018]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67949 how will Leela fare at the WCCC?] by dannyb, [[CCC]], July 10, 2018 » [[WCCC 2018]]<br />
* [https://groups.google.com/d/msg/lczero/EgEslxR04wg/6zY7sLiQAwAJ Lc0 will participate at the WCCC? Wow!] by Martin Renneke, [[Computer Chess Forums|LCZero Forum]], July 10, 2018<br />
* [https://groups.google.com/d/msg/lczero/KUwypuefNdY/DDV8hfwCBQAJ How Leela uses history planes] by Tristrom Cooke, [[Computer Chess Forums|LCZero Forum]], July 19, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68072 Why Lc0 eval (in cp) is asymmetric against AB engines?] by [[Kai Laskos]], [[CCC]], July 25, 2018 » [[Asymmetric Evaluation]], [[Pawn Advantage, Win Percentage, and Elo]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68094 TCEC season 13, 2 NN engines will be participating, Leela and Deus X] by Nay Lin Tun, [[CCC]], July 28, 2018<br />
: [http://talkchess.com/forum3/viewtopic.php?f=2&t=68094&start=90#p770006 Re: TCEC season 13, 2 NN engines will be participating, Leela and Deus X] by [[Gian-Carlo Pascutto]], [[CCC]], August 03, 2018<br />
* [https://groups.google.com/d/msg/lczero/vGdNYW-Ou58/Kh0GCj2OCgAJ Has Silver written any code for "his" ZeusX?] by [[Chris Whittington]], [[Computer Chess Forums|LCZero Forum]], July 31, 2018 <br />
: [https://groups.google.com/d/msg/lczero/vGdNYW-Ou58/-icwb0pjDAAJ Re: Has Silver written any code for "his" ZeusX?] by [[Alexander Lyashuk]], [[Computer Chess Forums|LCZero Forum]], August 02, 2018 <br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68441 Some properties of Lc0 playing] by [[Kai Laskos]], [[CCC]], September 14, 2018 <br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=9 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 17, 2018 <ref>[https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation Multiply–accumulate operation - Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=37 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], October 28, 2018<br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=44 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], November 15, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68511 LC0 0.18rc1 released] by [[Günther Simon]], [[CCC]], September 25, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 <ref>[https://www.nvidia.com/en-us/geforce/graphics-cards/rtx-2070/ GeForce RTX 2070 Graphics Card | NVIDIA]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69045 2900 Elo points progress, 10 million games, 330 nets] by [[Kai Laskos]], [[CCC]], November 25, 2018<br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69045&start=8 Re: 2900 Elo points progress, 10 million games, 330 nets] by [[Alexander Lyashuk|crem]], [[CCC]], November 25, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69068 Scaling of Lc0 at high Leela Ratio] by [[Kai Laskos]], [[CCC]], November 27, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69175&start=75 Re: Alphazero news] by [[Gian-Carlo Pascutto]], [[CCC]], December 07, 2018 » [[AlphaZero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69306 Policy training in Alpha Zero, LC0 ..] by [[Chris Whittington]], [[CCC]], December 18, 2018 » [[AlphaZero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69318 Smallnet (128x10) run1 progresses remarkably well] by [[Kai Laskos]], [[CCC]], December 19, 2018<br />
* [https://groups.google.com/d/msg/lczero/EGcJSrZYLiw/netJ4S38CgAJ use multiple neural nets?] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], December 25, 2018<br />
==2019==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[GPU]]<br />
* [https://groups.google.com/d/msg/lczero/CrMiK3OR1og/mcFd0NDKDQAJ "boosting" endgames in leela training? Simple change to learning process proposed: "forked" training games] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 03, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69552 Leela on a weak pc, question] by chessico, [[CCC]], January 09, 2019<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[GPU]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69622 Can somebody explain what makes Leela as strong as Komodo?] by Chessqueen, [[CCC]], January 16, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69668 A0 policy head ambiguity] by [[Daniel Shawul]], [[CCC]], January 21, 2019 » [[AlphaZero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69672 Schizophrenic rating model for Leela] by [[Kai Laskos]], [[CCC]], January 21, 2019 » [[Match Statistics]]<br />
* [http://forum.computerschach.de/cgi-bin/mwf/topic_show.pl?tid=10194 Leela Zero (Lc0) - NVIDIA Geforce RTX 2060] by [[Andreas Strangmüller]], [[Computer Chess Forums|CSS Forum]], January 29, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69820 11258-32x4-se distilled network released] by [[Dietrich Kappe]], [[CCC]], February 03, 2019<br />
* [http://forum.computerschach.de/cgi-bin/mwf/topic_show.pl?tid=10213 Lc0 setup Hilfe] by [[Clemens Keck]], [[Computer Chess Forums|CSS Forum]], February 07, 2019 (German)<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69852 Lc0 - macOS binary requested] by Steppenwolf, [[CCC]], February 09, 2019 » [[Mac OS]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69957 Thanks for LC0] by [[Peter Berger]], [[CCC]], February 19, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70069&start=10 Re: Training the trainer: how is it done for Stockfish?] by Graham Jones, [[CCC]], March 03, 2019 » [[Monte-Carlo Tree Search]], [[Stockfish]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70350 Lc0 51010] by [[Larry Kaufman]], [[CCC]], March 29, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[GPU]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70451 32930 Boost network available] by [[Dietrich Kappe]], [[CCC]], April 09, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71209 Lc0 question] by [[Larry Kaufman]], [[CCC]], July 06, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71651 Some newbie questions about lc0] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], August 25, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71686 Lc0 Evaluation Explanation] by Hamster, [[CCC]], August 29, 2019<br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71686&start=14 Re: Lc0 Evaluation Explanation] by [[Alexander Lyashuk]], [[CCC]], September 03, 2019<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=71822 My failed attempt to change TCEC NN clone rules] by [[Alexander Lyashuk]], [[CCC]], September 14, 2019 » [[TCEC]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72640 Best Nets for Lc0 Page] by [[Ted Summers]], [[CCC]], December 23, 2019 <br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72685 Correct LC0 syntax for multiple GPUs] by [[Dann Corbit]], [[CCC]], December 30, 2019 <br />
==2020==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=73684 Lc0 soon to support chess960 ?] by Modern Times, [[CCC]], April 18, 2020 » [[Chess960]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=73714 How to run rtx 2080ti for leela optimally?] by h1a8, [[CCC]], April 20, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74092 Total noob Leela question] by [[Harm Geert Muller]], [[CCC]], June 03, 2020<br />
* [https://groups.google.com/d/msg/lczero/BvhCa-muLt0/ZzINQk_vCQAJ How strong is Stockfish NNUE compared to Leela..] by OmenhoteppIV, [[Computer Chess Forums|LCZero Forum]], July 13, 2020 » [[Stockfish NNUE]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=74607 LC0 vs. NNUE - some tech details...] by [[Srdja Matovic]], [[CCC]], July 29, 2020 » [[NNUE]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=74915 The next step for LC0?] by [[Srdja Matovic]], [[CCC]], August 28, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75262 Checking the backends with the new lc0 binary] by [[Kai Laskos]], [[CCC]], October 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75950 ZZ-tune conclusively better than the Kiudee default for Lc0] by [[Kai Laskos]], [[CCC]], December 01, 2020 <ref>[https://github.com/kiudee/chess-tuning-tools GitHub - kiudee/chess-tuning-tools] by [[Karlson Pfannschmidt]]</ref><br />
==2021==<br />
* [https://lczero.org/blog/2021/01/announcing-ceres/ Announcing Ceres] by [[Alexander Lyashuk|crem]], [[Leela Chess Zero|LCZero blog]], January 01, 2021 » [[Ceres]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76948 leela] by [[Stuart Cracraft]], [[CCC]], March 29, 2021 » [[Banksia GUI]], [[iPhone]]<br />
* [https://lczero.org/blog/2021/04/joking-ftw-seriously/ Joking FTW, Seriously] by borg, [[Leela Chess Zero|LCZero blog]], April 25, 2021<br />
* [https://lczero.org/blog/2021/06/the-importance-of-open-data/ The importance of open data] by borg, [[Leela Chess Zero|LCZero blog]], June 15, 2021<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=77496 Leela data publicly available for use] by Madeleine Birchfield, [[CCC]], June 15, 2021<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=77503 will Tcec allow Stockfish with a Leela net to play?] by Wilson, [[CCC]], June 17, 2021 » [[TCEC Season 21]]<br />
<br />
=External Links=<br />
==Chess Engine==<br />
* [https://lczero.org/ Leela Chess Zero]<br />
* [https://lczero.org/blog/ Blog - Leela Chess Zero]<br />
* [https://groups.google.com/forum/#!forum/lczero LCZero – Forum]<br />
* [https://training.lczero.org/ Testing instance of LCZero server]<br />
* [https://en.wikipedia.org/wiki/Leela_Chess_Zero Leela Chess Zero from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Leela_(software) Leela (software) from Wikipedia]<br />
* [https://www.facebook.com/LeelaChessZero/ Leela Chess Zero - Facebook]<br />
* [https://twitter.com/leelachesszero Leela Chess Zero (@LeelaChessZero) | Twitter]<br />
* [https://slides.com/crem/lc0#/logo Lc0 Slides] by [[Alexander Lyashuk]]<br />
==GitHub==<br />
* [https://github.com/LeelaChessZero/ LCZero · GitHub]<br />
* [https://github.com/LeelaChessZero/lczero GitHub - LeelaChessZero/lczero: A chess adaption of GCP's Leela Zero]<br />
* [https://github.com/LeelaChessZero/lc0 GitHub - LeelaChessZero/lc0: The rewritten engine, originally for tensorflow. Now all other backends have been ported here]<br />
* [https://github.com/LeelaChessZero/lc0/wiki Home · LeelaChessZero/lc0 Wiki · GitHub]<br />
* [https://github.com/LeelaChessZero/lc0/wiki/Getting-Started Getting Started · LeelaChessZero/lc0 Wiki · GitHub]<br />
* [https://github.com/LeelaChessZero/lc0/wiki/FAQ FAQ · LeelaChessZero/lc0 Wiki · GitHub]<br />
* [https://github.com/LeelaChessZero/lc0/wiki/Technical-Explanation-of-Leela-Chess-Zero Technical Explanation of Leela Chess Zero · LeelaChessZero/lc0 Wiki · GitHub]<br />
* [https://github.com/LeelaChessZero/lc0/graphs/contributors Contributors to LeelaChessZero/lc0 · GitHub]<br />
* [https://github.com/LeelaChessZero/lc0/commit/62741d56252b23f74e8a7200a42812f27fe32b7d Use NHCW layout for fused winograd residual block (#1567) · LeelaChessZero/lc0@62741d5 · GitHub], commit by [[Ankan Banerjee]], June 10, 2021<br />
* [https://github.com/mooskagh/lc0 GitHub - mooskagh/lc0: The rewritten engine, originally for cudnn. Now all other backends have been ported here]<br />
* [https://github.com/dkappe/leela-chess-weights/wiki/Distilled-Networks Distilled Networks · dkappe/leela-chess-weights Wiki · GitHub]<br />
==Rating Lists==<br />
* [http://computerchess.org.uk/ccrl/404/cgi/compare_engines.cgi?family=Leela%20Chess&print=Rating+list&print=Results+table&print=LOS+table&print=Ponder+hit+table&print=Eval+difference+table&print=Comopp+gamenum+table&print=Overlap+table&print=Score+with+common+opponents Leela Chess Zero] in [[CCRL|CCRL Blitz]]<br />
* [http://computerchess.org.uk/ccrl/4040/cgi/compare_engines.cgi?family=Leela%20Chess&print=Rating+list&print=Results+table&print=LOS+table&print=Ponder+hit+table&print=Eval+difference+table&print=Comopp+gamenum+table&print=Overlap+table&print=Score+with+common+opponents Leela Chess Zero] in [[CCRL|CCRL 40/15]]<br />
==ChessBase==<br />
* [https://en.chessbase.com/post/leela-chess-zero-alphazero-for-the-pc Leela Chess Zero: AlphaZero for the PC] by [[Albert Silver]], [[ChessBase|ChessBase News]], April 26, 2018<br />
* [https://en.chessbase.com/post/standing-on-the-shoulders-of-giants Standing on the shoulders of giants] by [[Albert Silver]], [[ChessBase|ChessBase News]], September 18, 2019<br />
* [https://en.chessbase.com/post/running-leela-and-fat-fritz-on-your-notebook Running Leela and Fat Fritz on your notebook] by [https://ratings.fide.com/card.phtml?event=2099713 Evelyn Zhu], [[ChessBase|ChessBase News]], June 14, 2020 » [[Fat Fritz]]<br />
==Chessdom==<br />
* [http://www.chessdom.com/interview-with-alexander-lyashuk-about-the-recent-success-of-lc0/ Interview with Alexander Lyashuk about the recent success of Lc0], [[Chessdom]], February 6, 2019 » [[TCEC Season 14]]<br />
==Tuning==<br />
* [https://github.com/kiudee/bayes-skopt GitHub - kiudee/bayes-skopt: A fully Bayesian implementation of sequential model-based optimization] by [[Karlson Pfannschmidt]] » [[Fat Fritz]] <ref>[https://en.chessbase.com/post/fat-fritz-update-and-fat-fritz-jr Fat Fritz 1.1 update and a small gift] by [[Albert Silver]]. [[ChessBase|ChessBase News]], March 05, 2020</ref><br />
* [https://github.com/kiudee/chess-tuning-tools GitHub - kiudee/chess-tuning-tools] by [[Karlson Pfannschmidt]] <ref>[https://chess-tuning-tools.readthedocs.io/en/latest/ Welcome to Chess Tuning Tools’s documentation!]</ref><br />
==Misc==<br />
* [https://en.wikipedia.org/wiki/Leela Leela from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Leela_(game) Leela (game) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Leela_(name) Leela (name) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Leela_(Doctor_Who) Leela (Doctor Who) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Leela_(Futurama) Leela (Futurama) from Wikipedia]<br />
* [[:Category:Soft Machine|Soft Machine]] - [https://en.wikipedia.org/wiki/Hidden_Details Hidden Details], 2018, [https://en.wikipedia.org/wiki/YouTube YouTube] Video<br />
: Lineup: [https://en.wikipedia.org/wiki/Theo_Travis Theo Travis], [[:Category:Roy Babbington|Roy Babbington]], [https://en.wikipedia.org/wiki/John_Etheridge John Etheridge], [[:Category:John Marshall|John Marshall]]<br />
: {{#evu:https://www.youtube.com/watch?v=uGIf97m243M|alignment=left|valignment=top}}<br />
<br />
=References= <br />
<references /><br />
'''[[Engines|Up one level]]'''<br />
[[Category:UCI]]<br />
[[Category:Open Source]]<br />
[[Category:GPL]]<br />
[[Category:GPU]]<br />
[[Category:DCNN]]<br />
[[Category:MCTS]]<br />
[[Category:PC]]<br />
[[Category:Windows]]<br />
[[Category:Linux]]<br />
[[Category:Mac]]<br />
[[Category:Fiction]]<br />
[[Category:Given Name]]<br />
[[Category:Soft Machine]]<br />
[[Category:Roy Babbington]]<br />
[[Category:John Marshall]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=Talk:GPU&diff=26624Talk:GPU2022-11-14T13:09:32Z<p>Smatovic: /* Legacy GPGPU */</p>
<hr />
<div>== AMD architectures ==<br />
<br />
My own conclusions are:<br />
<br />
* TeraScale has VLIW design.<br />
* GCN has 16 wide SIMD, executing a Wavefront of 64 threads over 4 cycles.<br />
* RDNA has 32 wide SIMD, executing a Wavefront:32 over 1 cycle and Wavefront:64 over two cycles.<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 10:16, 22 April 2021 (CEST)<br />
<br />
== Nvidia architectures ==<br />
<br />
Afaik Nvidia did never official mention SIMD in their papers as hardware architecture, with Tesla they only referred to as SIMT.<br />
<br />
Nevertheless, my own conclusions are:<br />
<br />
* Tesla has 8 wide SIMD, executing a Warp of 32 threads over 4 cycles.<br />
<br />
* Fermi has 16 wide SIMD, executing a Warp of 32 threads over 2 cycles.<br />
<br />
* Kepler is somehow odd, not sure how the compute units are partitioned.<br />
<br />
* Maxwell and Pascal have 32 wide SIMD, executing a Warp of 32 threads over 1 cycle.<br />
<br />
* Volta and Turing seem to have 16 wide FPU SIMDs, but my own experiments show 32 wide VALU.<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 10:17, 22 April 2021 (CEST)<br />
<br />
== SIMD + Scalar Unit ==<br />
<br />
It seems every SIMD unit has one scalar unit on GPU architectures, executing things like branch-conditions or special functions the SIMD ALUs are not capable of.<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 20:21, 22 April 2021 (CEST)<br />
<br />
== embedded CPU controller ==<br />
<br />
It is not documented in the whitepapers, but it seems that every discrete GPU has an embedded CPU controller (e.g. Nvidia Falcon) who (speculation) launches the kernels.<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 10:36, 22 April 2021 (CEST)<br />
<br />
== GPUs and Duncan's taxonomy ==<br />
It is not clear to me how the underlying hardware of GPU SIMD units of architectures with unified shader architecture is realized by different vendors, there is the concept of bit-sliced ALUs, there is the concept of pipelined vector processors, there is the concept of SIMD units with fix bit-width ALUs. The white papers from different vendors leave room for speculation, the different instruction throughputs for higher precision and lower precision too, what is left to the programmer is to do microbenchmarking and make conclusions on their own.<br />
<br />
https://en.wikipedia.org/wiki/Duncan%27s_taxonomy<br />
<br />
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 13:58, 16 December 2021 (CET)<br />
<br />
== CPW GPU article ==<br />
<br />
A suggestion of mine, keep this GPU article as an generalized overview of GPUs, with incremental updates for different frameworks and architectures. GPUs and GPGPU is a moving target with different platforms offering new feature sets, better open own articles for things like GPGPU, SIMT, CUDA, ROCm, oneAPI, Metal or simply link to Wikipedia containing the newest specs and infos.<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 21:29, 27 April 2021 (CEST)<br />
<br />
== GPGPU architectures ==<br />
Regarding GPGPU architectures or frameworks, a link to the architecture white paper, instruction set architecture, programming guide, and link to Wikipedia with a list of the concrete models with specs would be nice, if available.<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 09:21, 25 October 2021 (CEST)<br />
<br />
== Legacy GPGPU ==<br />
<br />
This article does not cover legacy, pre 2007, GPGPU methods, how to use pixel, vertex, geometry, tessellation and compute shaders via OpenGL or DirectX for GPGPU. I can imagine it is possible to backport a neural network Lc0 backend to a certain DirextX/OpenGL API, but I doubt it has real contemporary relevance (running Lc0 on an SGI Indy or alike).<br />
<br />
[[User:Smatovic|Smatovic]] ([[User talk:Smatovic|talk]]) 14:09, 14 November 2022 (CET)</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26623GPU2022-11-14T12:40:51Z<p>Smatovic: /* Thread Examples */</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]] <br />
<br />
'''GPU''' (Graphics Processing Unit),<br/><br />
a specialized processor primarily intended to fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.<br />
<br />
=History=<br />
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are in main three approaches how to use a GPU for Chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As an hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirextX], followed by first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook] and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]] specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group] is widely adopted across all kind of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP, C++ AMP and with OpenMP offload directives. It offers with [https://rocmdocs.amd.com/en/latest/ ROCm] its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities - floating-point and/or integer with specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
Generalized the definition of the Wavefront and Warp size is the amount of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a work-group, one or multiple work-group form NDRange to be executed on the GPU device. The members work-group execute the same kernel, can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many work-items a work-group can hold and how many threads can run in total concurrently on the device.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
<br />
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 32<br />
* Maximum number of resident warps per multiprocessor: 64<br />
* Maximum number of resident threads per multiprocessor: 2048<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but depending on architecture, vendor, framework and operating system, a unified address space accessible by both CPU and GPU may be offered.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, 32-bit integer throughput can be lower than 32-bit floating-point or 24-bit integer throughput.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput at lower precision, e.g. quadrupled INT8 or octupled INT4 throughput relative to INT32.<br />
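The INT64 emulation mentioned above boils down to an add-with-carry over two 32-bit halves; a minimal sketch in Python (modelling unsigned 32-bit ALU operations with explicit masking):

```python
# Sketch of 64-bit addition emulated with 32-bit operations, as a GPU
# with 32-bit registers/ALUs would do: add low halves, propagate the carry.

MASK32 = 0xFFFFFFFF

def add64_via_32bit(a, b):
    a_lo, a_hi = a & MASK32, (a >> 32) & MASK32
    b_lo, b_hi = b & MASK32, (b >> 32) & MASK32
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0           # unsigned overflow of the low add
    hi = (a_hi + b_hi + carry) & MASK32
    return (hi << 32) | lo

print(hex(add64_via_32bit(0xFFFFFFFF, 1)))  # 0x100000000
```

This is why 64-bit heavy workloads such as [[Bitboards|bitboard]] operations pay an emulation cost on 32-bit wide vector ALUs.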
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
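The peak figures above follow directly from operations per clock times number of units times clock rate; a quick check in Python:

```python
def peak_gigaops(ops_per_clock, units, clock_mhz):
    # ops/clock per unit * number of units * clock in MHz, scaled to GigaOps/sec
    return ops_per_clock * units * clock_mhz / 1000

# Nvidia GeForce GTX 580: 32 ADD ops/clock per CU, 16 CUs, 1544 MHz
print(peak_gigaops(32, 16, 1544))   # 790.528
# AMD Radeon HD 7970: 1 ADD op/clock per PE, 2048 PEs, 925 MHz
print(peak_gigaops(1, 2048, 925))   # 1894.4
```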
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
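As a minimal illustration of how a convolution reduces to a matrix product, the following Python sketch uses plain im2col lowering (simpler than the Winograd transform mentioned above, but the same idea: one matrix multiplication computes all output pixels):

```python
# Lowering a 2D convolution to a matrix-vector product via im2col.
# Pure-Python sketch for illustration; MMAC units do this with large matrices.

def im2col(img, k):
    # collect every k x k patch of img as one row of a matrix
    n = len(img)
    return [[img[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(n - k + 1) for j in range(n - k + 1)]

def conv2d_as_matmul(img, kernel):
    k = len(kernel)
    flat_kernel = [kernel[i][j] for i in range(k) for j in range(k)]
    # one matrix-vector product yields every output pixel at once
    return [sum(a * b for a, b in zip(row, flat_kernel))
            for row in im2col(img, k)]

img = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]  # adds each pixel to its lower-right neighbour
print(conv2d_as_matmul(img, kernel))  # [6, 8, 12, 14]
```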
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, a.k.a. kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency of 5 microseconds for null-kernels <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
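A back-of-the-envelope model shows why batching amortizes the launch overhead; the numbers below are illustrative assumptions, not vendor figures:

```python
# Why batching amortizes kernel-launch overhead (illustrative numbers only).

launch_overhead_us = 5.0    # assumed per-launch latency (null-kernel figure)
work_per_task_us = 0.1      # assumed GPU compute time per task

def total_time_us(num_tasks, batch_size):
    launches = -(-num_tasks // batch_size)   # ceiling division
    return launches * launch_overhead_us + num_tasks * work_per_task_us

# 10,000 tasks launched one by one vs. in batches of 1,000:
print(total_time_us(10_000, 1))      # 51000.0 us - launch overhead dominates
print(total_time_us(10_000, 1_000))  # 1050.0 us - overhead nearly amortized
```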
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the first being the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support has been offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing] features, and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system on a chip (SoC) designs. Since the Series5 SGX, OpenCL support is available via licensees.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support has been offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro]<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
'''2016'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, [[CCC]], September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
'''2021'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
'''2022'''<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: "Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26622GPU2022-11-14T12:37:07Z<p>Smatovic: /* Programming Model */ added terminology and examples</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]] specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group] is widely adopted across all kind of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 (Mojave), [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as its frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer with specific bit-widths of the FPU/ALU and registers. Note the difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
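The listed core counts follow directly from the per-unit organization; a quick CPU-side sanity check of the arithmetic (an illustrative sketch, not vendor code):

```c
#include <assert.h>

/* Total shader cores = number of multiprocessors (SMs) or compute units
 * times the cores organized per unit. */
static int total_cores(int units, int cores_per_unit) {
    return units * cores_per_unit;
}
```

For the GTX 580 this gives 16 × (2 × 16) = 512 CUDA cores; for the HD 7970, 32 × (4 × 16) = 2048 stream cores.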
<br />
===Wavefront and Warp===<br />
Generalized, the wavefront (AMD) resp. warp (Nvidia) size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload-directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory - with architecture limits on how many work-items a work-group can hold and how many threads can run in total concurrently on the device.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Kernel || Kernel<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Processing Element || CUDA Core<br />
|-<br />
| Work-Item || Thread<br />
|-<br />
| Work-Group || Block<br />
|-<br />
| NDRange || Grid<br />
|-<br />
|}<br />
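The NDRange decomposition can be sketched on the CPU: every work-item executes the same kernel body on its own global index. The helpers below mirror OpenCL's 1D <code>get_global_id</code> computation; the function names are illustrative, not part of any API:

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL 1D indexing: each work-group of local_size work-items covers a
 * contiguous slice of the NDRange. */
static size_t global_id(size_t group_id, size_t local_size, size_t local_id) {
    return group_id * local_size + local_id;
}

/* CPU simulation of an NDRange: the "kernel body" here doubles one array
 * element per work-item. */
static void run_ndrange(float *data, size_t num_groups, size_t local_size) {
    for (size_t g = 0; g < num_groups; g++)        /* work-groups (blocks)  */
        for (size_t l = 0; l < local_size; l++)    /* work-items (threads)  */
            data[global_id(g, local_size, l)] *= 2.0f;
}
```

On a real device the work-groups run concurrently across compute units rather than in a sequential loop, but the index mapping is the same.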
<br />
==Thread Examples==<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref><br />
<br />
* Warp size: 32<br />
* Maximum number of threads per block: 1024<br />
* Maximum number of resident blocks per multiprocessor: 8<br />
* Maximum number of resident warps per multiprocessor: 48<br />
* Maximum number of resident threads per multiprocessor: 1536<br />
<br />
<br />
AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref><br />
<br />
* Wavefront size: 64<br />
* Maximum number of work-items per work-group: 1024<br />
* Maximum number of work-groups per compute unit: 40<br />
* Maximum number of Wavefronts per compute unit: 40<br />
* Maximum number of work-items per compute unit: 2560<br />
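These per-multiprocessor limits interact: for a chosen block (work-group) size, the number of resident blocks is the tightest of the block, warp, and thread limits. A minimal sketch that takes the limits as parameters, ignoring register and local-memory pressure, which also constrain residency in practice:

```c
#include <assert.h>

/* Resident blocks per multiprocessor, limited only by the architecture's
 * block, warp and thread maxima (register/local-memory use ignored). */
static int resident_blocks(int threads_per_block, int warp_size,
                           int max_blocks, int max_warps, int max_threads) {
    int warps_per_block = (threads_per_block + warp_size - 1) / warp_size;
    int by_warps   = max_warps / warps_per_block;
    int by_threads = max_threads / threads_per_block;
    int r = max_blocks;                 /* take the tightest of the three */
    if (by_warps < r)   r = by_warps;
    if (by_threads < r) r = by_threads;
    return r;
}
```

Plugging in the figures listed for a given architecture shows why very large blocks can leave a multiprocessor underoccupied.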
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|}<br />
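The address spaces above appear directly as qualifiers in OpenCL C. A sketch of a work-group reduction kernel (illustrative device code; the kernel and argument names are made up, not tied to any engine):

```c
// OpenCL C device code (kernel), not host C
__kernel void partial_sum(__global const float *in,   // __global: VRAM, visible to all work-items
                          __global float *partial,    // one partial result per work-group
                          __constant float *scale,    // __constant: read-only memory
                          __local float *scratch)     // __local: work-group scratch-pad
{
    size_t lid = get_local_id(0);                     // index within the work-group
    size_t gid = get_global_id(0);                    // index within the NDRange

    float x = in[gid] * scale[0];                     // x resides in __private registers

    scratch[lid] = x;
    barrier(CLK_LOCAL_MEM_FENCE);                     // synchronize the work-group

    if (lid == 0) {                                   // naive: work-item 0 sums the scratch-pad
        float s = 0.0f;
        for (size_t i = 0; i < get_local_size(0); i++)
            s += scratch[i];
        partial[get_group_id(0)] = s;
    }
}
```

In CUDA the same pattern uses registers, <code>__shared__</code>, <code>__constant__</code> and global memory, matching the terminology table above.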
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G, Compute Capabilities</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: The 32-bit integer performance can be architecture and operation depended less than 32-bit FLOP or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have an dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>Ada Lovelaces's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The first being the most important by quantity, the second by performance. Discrete GPUs are divided as consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archiv.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems, the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, [[CCC]], September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX]. These were followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] and [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally by [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends such as OpenCL, HIP, and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as its frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, and OpenCL, as well as offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit, to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computational abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. Note the difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
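The core counts above follow directly from the hierarchy: compute units x SIMD units per compute unit x lanes per SIMD unit. A minimal sketch (plain Python, with the figures taken from the two examples above):<br />

```python
def total_cores(compute_units, simds_per_unit, lanes_per_simd):
    # GPU "cores" are SIMD lanes: units x SIMDs per unit x lanes per SIMD
    return compute_units * simds_per_unit * lanes_per_simd

# GeForce GTX 580: 16 SMs  x 2 SIMD groups  x 16 lanes = 512 CUDA cores
# Radeon HD 7970:  32 CUs  x 4 SIMD16 units x 16 lanes = 2048 stream cores
```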
<br />
===Wavefront and Warp===<br />
Generalized, the wavefront (AMD) or warp (Nvidia) size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
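As an illustration (a sketch, not a vendor API): a launch of N threads occupies ceil(N / wavefront size) wavefronts resp. warps, and a partially filled wavefront still occupies a full SIMT slot:<br />

```python
def wavefront_count(work_items, wavefront_size):
    # ceiling division; a partially filled wavefront still occupies a slot
    return (work_items + wavefront_size - 1) // wavefront_size
```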
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly parallel]. Single GPU threads (work-items in OpenCL) execute the kernel and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory; the architecture limits how many threads a block can hold and how many threads can run concurrently on the device in total.<br />
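The flat global index of a work-item resp. thread decomposes into a group id and a local id; a sketch of this mapping (hypothetical helper name, mirroring OpenCL's get_group_id/get_local_id and CUDA's blockIdx/threadIdx):<br />

```python
def thread_coords(global_id, block_size):
    # OpenCL: (work-group id, local id); CUDA: (blockIdx, threadIdx)
    return global_id // block_size, global_id % block_size
```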
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Terminology<br />
|-<br />
! OpenCL Terminology !! CUDA Terminology<br />
|-<br />
| Private Memory || Registers<br />
|-<br />
| Local Memory || Shared Memory<br />
|}<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but depending on architecture, vendor, framework, and operating system, a unified address space accessible to both CPU and GPU may be offered.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: The 32-bit integer performance can be architecture and operation depended less than 32-bit FLOP or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have an dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>Ada Lovelaces's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The first being the most important by quantity, the second by performance. Discrete GPUs are divided as consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archiv.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems, the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>
<hr />
<div>
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the computation of the sub-trees to the GPU.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by early GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for its different GPU architectures, as well as the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, with a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled into a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent core counts (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. The architecture white papers of the various vendors leave room for speculation about the concrete underlying hardware implementation and its exact classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
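The lockstep SIMT execution described above can be sketched on a CPU: one "instruction" is issued for all lanes of a warp or wavefront, and on a branch both paths are issued while an execution mask decides which lanes commit results. This is a simplified illustration of why divergence within a wave costs throughput, not a vendor-accurate model; all names are illustrative.<br />

```c
#include <stdint.h>

#define WARP_SIZE 32  /* Nvidia warp size; AMD GCN wavefronts use 64 */

/* Execute "if (v[i] < t) v[i] += a; else v[i] -= b;" for one warp in
 * SIMT fashion: both branch paths are issued in lockstep, and an
 * execution mask selects which lanes commit. Divergent lanes are
 * masked off, not skipped, so both paths consume issue slots. */
void simt_branch(int32_t v[WARP_SIZE], int32_t t, int32_t a, int32_t b) {
    uint32_t mask = 0;
    /* Step 1: evaluate the predicate for all lanes in lockstep. */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (v[lane] < t) mask |= 1u << lane;
    /* Step 2: issue the "then" path; only masked lanes write back. */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (mask & (1u << lane)) v[lane] += a;
    /* Step 3: issue the "else" path for the complementary mask. */
    for (int lane = 0; lane < WARP_SIZE; lane++)
        if (!(mask & (1u << lane))) v[lane] -= b;
}
```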
<br />
{| class="wikitable" style="margin:auto"<br />
|+ Vendor Terminology<br />
|-<br />
! AMD Terminology !! Nvidia Terminology<br />
|-<br />
| Compute Unit || Streaming Multiprocessor<br />
|-<br />
| Stream Core || CUDA Core<br />
|-<br />
| Wavefront || Warp<br />
|}<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN])<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 Stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16, each SIMT4, per Compute Unit<br />
* Wavefront size of 64 work-items<br />
<br />
===Wavefront and Warp===<br />
In general terms, the wavefront (AMD) or warp (Nvidia) size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with an architecture limit on how many threads a block can hold and how many threads can run concurrently on the device in total.<br />
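The thread hierarchy above can be sketched as a serial CPU loop: the global id of a work-item is derived from its block id, the block size and its local id, mirroring CUDA's blockIdx.x * blockDim.x + threadIdx.x and OpenCL's get_global_id(0). `launch_grid` and `square_kernel` are illustrative names, not part of any API; on a real GPU the iterations run concurrently.<br />

```c
/* Serial CPU walk over a 1-D grid: grid_dim blocks of block_dim
 * work-items each. The kernel receives its global id, computed the
 * same way CUDA and OpenCL expose it to device code. */
typedef void (*kernel_fn)(int global_id, void *args);

void launch_grid(int grid_dim, int block_dim, kernel_fn k, void *args) {
    for (int block = 0; block < grid_dim; block++)
        for (int thread = 0; thread < block_dim; thread++)
            k(block * block_dim + thread, args);  /* global id */
}

/* Example kernel: data[gid] = gid * gid for every work-item. */
void square_kernel(int gid, void *args) {
    int *data = (int *)args;
    data[gid] = gid * gid;
}
```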
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
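A minimal sketch of how these qualifiers appear in an OpenCL-style kernel; defining them as no-op macros lets the code compile as plain C for illustration only. On a real device the loops over local ids run as parallel work-items, and a barrier(CLK_LOCAL_MEM_FENCE) would be required between the write and read phases of the scratch-pad.<br />

```c
/* No-op stand-ins so this OpenCL-style sketch compiles as plain C;
 * on a GPU these map to registers, scratch-pad, constant memory and VRAM. */
#define __private
#define __local
#define __constant const
#define __global

/* Work-group sum sketch: each work-item copies one value from
 * __global memory into __local scratch-pad, then the values are
 * reduced. The loops stand in for the parallel work-items. */
int workgroup_sum(__global const int *input, int group_size) {
    __local int scratch[64];                    /* shared per work-group */
    for (int lid = 0; lid < group_size; lid++)  /* "all work-items"      */
        scratch[lid] = input[lid];
    __private int sum = 0;                      /* per-work-item register */
    for (int lid = 0; lid < group_size; lid++)
        sum += scratch[lid];
    return sum;
}
```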
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, 32-bit integer throughput can be lower than 32-bit floating-point or 24-bit integer throughput.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput at lower precision, quadrupling INT8 or octupling INT4 throughput relative to INT32.<br />
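The INT8 speedup comes from packing four 8-bit values into one 32-bit register and processing all four lanes with a single instruction, as in Nvidia's dp4a (dot product of four 8-bit values with 32-bit accumulate). A CPU emulation of such a packed operation, as a sketch rather than the exact hardware semantics:<br />

```c
#include <stdint.h>

/* Emulates a dp4a-style instruction: treat a and b as four packed
 * signed 8-bit lanes, multiply lane-wise and accumulate into c.
 * One such instruction replaces four separate INT32 multiply-adds. */
int32_t dp4a_emulated(uint32_t a, uint32_t b, int32_t c) {
    for (int lane = 0; lane < 4; lane++) {
        int8_t ai = (int8_t)(a >> (8 * lane));  /* extract signed lane */
        int8_t bi = (int8_t)(b >> (8 * lane));
        c += (int32_t)ai * (int32_t)bi;
    }
    return c;
}
```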
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs generally have a lower throughput ratio (FP32:FP64) for double-precision (64-bit) floating-point operations than server-brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
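Both peak figures above follow from the same formula: operations per clock per unit, times the number of units, times the clock rate. A small helper to reproduce them (the function name is illustrative):<br />

```c
/* Peak throughput = ops/clock/unit x units x clock.
 * Clock is given in MHz; dividing by 1000 converts
 * MegaOps/sec to GigaOps/sec. */
double peak_gigaops(double ops_per_clock_per_unit, int units, double clock_mhz) {
    return ops_per_clock_per_unit * units * clock_mhz / 1000.0;
}
```

For example, `peak_gigaops(32, 16, 1544)` reproduces the GTX 580 figure of 790.528 GigaOps/sec, and `peak_gigaops(1, 2048, 925)` the HD 7970 figure of 1894.4 GigaOps/sec.<br />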
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer-brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server-brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as their MMAC unit.<br />
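The primitive these units execute is D = A×B + C on small fixed-size tiles; larger matrix multiplications are composed from many such tile operations. A scalar sketch of that primitive, with an illustrative 4×4 tile size (actual tile shapes and data types vary by architecture):<br />

```c
#define TILE 4  /* illustrative tile size; hardware tiles vary */

/* One matrix-multiply-accumulate tile operation: D = A*B + C.
 * In hardware the whole tile is computed by one MMAC instruction
 * issued per warp/wavefront; here it is spelled out scalar-wise. */
void mma_tile(float A[TILE][TILE], float B[TILE][TILE],
              float C[TILE][TILE], float D[TILE][TILE]) {
    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++) {
            float acc = C[i][j];            /* accumulate into C */
            for (int k = 0; k < TILE; k++)
                acc += A[i][k] * B[k][j];
            D[i][j] = acc;
        }
}
```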
<br />
==Nvidia TensorCores==<br />
: TensorCores were introduced with the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series. They are FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels, ranging from 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]], AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
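The effect of batching on launch overhead can be shown with a back-of-the-envelope model (all numbers hypothetical, not measurements): batching n tasks into one kernel pays the launch latency once instead of n times.

```python
# Back-of-the-envelope model of kernel-launch overhead amortization.
# Each launch costs launch_us, each task costs work_us on the device.
def total_time_us(tasks, work_us, launch_us, batch_size):
    launches = -(-tasks // batch_size)  # ceiling division
    return launches * launch_us + tasks * work_us

# Hypothetical: 1000 tasks of 1 us each, 5 us launch overhead.
unbatched = total_time_us(1000, 1.0, 5.0, 1)     # one launch per task
batched   = total_time_us(1000, 1.0, 5.0, 1000)  # one launch for all
```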
<br />
=Deep Learning=<br />
GPUs are much better suited than CPUs for implementing and training [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This also affected game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the former being the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads, each offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumers, Radeon Pro for professionals and Radeon Instinct for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
The RDNA 3 architecture in the Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
The CDNA 2 architecture in the MI200 HPC GPU, with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric, was unveiled in November 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
The CDNA architecture in the MI100 HPC GPU with Matrix Cores was unveiled in November 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. OpenCL support has been offered since Midgard (2012), which introduced a unified shader model.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumers, Quadro for professionals and Tesla for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system on a chip (SoC) designs. OpenCL support has been available via licensees since the Series5 SGX.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. OpenCL support has been offered since the Adreno 300 series.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro]<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov , [[Computer Chess Forums|Rybka Forum]], October 09, 2017 <br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26619GPU2022-11-14T10:56:51Z<p>Smatovic: /* Hardware Examples */</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]] <br />
<br />
'''GPU''' (Graphics Processing Unit),<br/><br />
a specialized processor primarily intended to fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.<br />
<br />
=History=<br />
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are in main three approaches how to use a GPU for Chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As an hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirextX], followed by first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook] and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]] specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group] is widely adopted across all kind of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP, C++ AMP and with OpenMP offload directives. It offers with [https://rocmdocs.amd.com/en/latest/ ROCm] its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture dependent different numbers of cores (SIMD8, SIMD16, SIMD32), and different computation abilities - floating-point and/or integer with specific bit-width of the FPU/ALU and registers. There is a difference between a vector-processor with variable bit-width and SIMD units with fix bit-width cores. Different architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and the concrete classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads (number of SIMT threads)<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16/SIMT4 per Compute Unit<br />
* Wavefront size of 64 work-items (number of SIMT threads)<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a block (work-group in OpenCL), one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can be usually synchronized and have access to the same scratch-pad memory, with an architecture limit of how many threads a block can hold and how many threads can run in total concurrently on the device.<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessable only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessable by all work-items resp. threads.<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: The 32-bit integer performance can be architecture and operation depended less than 32-bit FLOP or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have an dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>Ada Lovelaces's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka. kernel-launch-overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks to batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs. The first being the most important by quantity, the second by performance. Discrete GPUs are divided as consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offering different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012) with unified-shader-model OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided as Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performace) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transfomer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. These are also the first consumer cards to launch with TensorCores used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips do not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archiv.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notable Apple) used for system on a chip (SoC) designs. Since Series5 SGX OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since Adreno 300 series OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems, the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov , [[Computer Chess Forums|Rybka Forum]], October 09, 2017 <br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: "Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>
<div>
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid, as in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by early GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] and [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP, and C++ AMP, as well as OpenMP offload directives, and offers its own parallel compute platform, [https://rocmdocs.amd.com/en/latest/ ROCm].<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like Beignet and NEO for its different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as its frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its precise classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to further speed up neural networks.<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 cuda cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors (Compute Units)<br />
* organized in 2x16 cuda cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16/SIMT4 per Compute Unit<br />
* Wavefront size of 64 Work-Items<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture limits on how many threads a block can hold and how many threads can run concurrently on the device in total.<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 16 KiB L1 cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, 32-bit integer throughput can be lower than 32-bit floating-point or 24-bit integer throughput.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput at lower precision, quadrupling INT8 or octupling INT4 throughput.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs generally have a lower throughput ratio (FP32:FP64) for double-precision (64-bit) floating-point operations than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as their MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture in 2020 with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels ranging from 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to hundreds of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This also affected game-playing programs combining CNNs with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the first being the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server use.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
The RDNA 3 architecture in the Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
The CDNA 2 architecture in the MI200 HPC GPU, with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric, was unveiled in November 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
The CDNA architecture in the MI100 HPC GPU with Matrix Cores was unveiled in November 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found in various systems on a chip (SoCs) from different vendors. Since Midgard (2012), which introduced a unified shader model, OpenCL support has been offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance gaming), Xe-HP (high-performance) and Xe-HPC (high-performance computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded GeForce for the consumer market, Quadro for the professional market, and Tesla for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], and the first consumer cards to launch with Tensor Cores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or Tensor Cores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with Tensor Cores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
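The mixed-precision operation a Tensor Core performs (D = A*B + C with FP16 inputs and wider accumulation) can be illustrated numerically. The following is a minimal pure-Python sketch, not vendor code; <code>to_fp16</code> and <code>mma_tile</code> are hypothetical helper names:<br />

```python
import struct

def to_fp16(x):
    # Round a Python float to IEEE-754 half precision,
    # the input format of a tensor-core operand.
    return struct.unpack('e', struct.pack('e', x))[0]

def mma_tile(a, b, c):
    # Sketch of a tensor-core style fused multiply-accumulate on one tile:
    # inputs are rounded to FP16, products are accumulated in wider precision.
    n = len(a)
    return [[sum(to_fp16(a[i][k]) * to_fp16(b[k][j]) for k in range(n)) + c[i][j]
             for j in range(n)] for i in range(n)]

# Multiplying by the identity returns the FP16-rounded input tile.
ident = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
tile  = [[0.5, 1.5, 2.5, 3.5] for _ in range(4)]
zero  = [[0.0] * 4 for _ in range(4)]
d = mma_tile(ident, tile, zero)
```

Real hardware performs this per warp on small fixed-size tiles (e.g. 16x16 via the CUDA WMMA API); the sketch only mimics the rounding and accumulation behavior.<br />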
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system on a chip (SoC) designs. Since the Series5 SGX, OpenCL support has been available via licensees.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support has been offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro]<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, [[CCC]], September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for Chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on the CPU and offload the sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]] specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group] is widely adopted across all kind of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, with a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, and multiple SIMD units are coupled into a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. Note the difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to further speed up neural networks.<br />
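The latency-hiding scheme can be illustrated with a back-of-the-envelope calculation. The numbers below (4 issue cycles per wave and instruction, 400 cycles memory latency) are assumed for illustration only and do not describe a specific vendor's hardware:

```python
def waves_to_hide_latency(issue_cycles, memory_latency_cycles):
    """Minimum number of SIMT waves a SIMD unit must interleave so that
    arithmetic issue slots cover one outstanding memory access."""
    # While one wave waits on memory, the other waves keep the SIMD unit busy.
    return -(-memory_latency_cycles // issue_cycles)  # ceiling division

# Assumed illustrative numbers: each wave occupies the SIMD unit for 4
# cycles per instruction, and a VRAM access takes ~400 cycles.
print(waves_to_hide_latency(4, 400))  # 100 waves keep the unit saturated
```

This is why GPUs keep many more waves resident per SIMD unit than there are physical cores.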
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 CUDA cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors (Compute Units)<br />
* organized in 2x16 CUDA cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN)]<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16/SIMT4 per Compute Unit<br />
* Wavefront size of 64 Work-Items<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture limits on how many threads a block can hold and how many threads can run concurrently in total on the device.<br />
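The thread indexing implied by this model can be sketched in plain Python. The function mirrors CUDA's blockIdx.x * blockDim.x + threadIdx.x resp. OpenCL's get_group_id(0) * get_local_size(0) + get_local_id(0); the parameter names are hypothetical:

```python
def global_id(block_id, block_size, thread_id):
    # CUDA:   blockIdx.x * blockDim.x + threadIdx.x
    # OpenCL: get_group_id(0) * get_local_size(0) + get_local_id(0)
    return block_id * block_size + thread_id

# A grid of 4 blocks with 256 threads each covers 1024 data elements;
# every thread derives its own unique index into the input this way.
ids = [global_id(b, 256, t) for b in range(4) for t in range(256)]
assert ids == list(range(1024))
```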
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 32 KiB L1 data cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, the 32-bit integer throughput can be lower than the 32-bit floating-point or 24-bit integer throughput.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput at lower precision, e.g. quadrupled INT8 or octupled INT4 throughput compared to INT32.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs in general have a lower double-precision (64-bit) floating-point throughput, relative to their FP32 throughput, than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
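The two peak figures above follow from operations per clock x number of units x clock frequency; a quick sanity check of the arithmetic in Python:

```python
def max_add_throughput_gops(ops_per_clock, units, clock_mhz):
    """Theoretical peak 32-bit ADD throughput in GigaOps/sec."""
    return ops_per_clock * units * clock_mhz / 1000.0

# Nvidia GeForce GTX 580: 32 ADD ops/clock per compute unit, 16 CUs, 1544 MHz
print(max_add_throughput_gops(32, 16, 1544))   # 790.528
# AMD Radeon HD 7970: 1 ADD op/clock per processing element, 2048 PEs, 925 MHz
print(max_add_throughput_gops(1, 2048, 925))   # 1894.4
```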
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
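The primitive these units implement is a fused D = AxB + C on small matrix tiles (e.g. 4x4 FP16 inputs with FP32 accumulation on Volta). A minimal Python sketch of the arithmetic itself, not of any vendor API:

```python
def mma(A, B, C):
    """D = A*B + C on small square tiles - the primitive a
    matrix-multiply-accumulate unit computes in one pass."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) + C[i][j]
             for j in range(n)]
            for i in range(n)]

# Tiny 2x2 example tiles; hardware works on e.g. 4x4 tiles in parallel.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = [[1, 0], [0, 1]]
print(mma(A, B, C))  # [[20, 22], [43, 51]]
```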
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64-optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, a.k.a. kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
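The benefit of batching can be estimated with a simple cost model. The 5 microseconds of launch overhead matches the null-kernel figure above; the 1 microsecond of work per task is an assumed illustrative value:

```python
def total_time_us(tasks, tasks_per_launch, launch_overhead_us=5.0,
                  work_per_task_us=1.0):
    """Simple cost model: every kernel launch pays a fixed host-device
    overhead, while the work itself scales with the number of tasks."""
    launches = -(-tasks // tasks_per_launch)  # ceiling division
    return launches * launch_overhead_us + tasks * work_per_task_us

# 1000 tasks launched one by one vs. coupled into batches of 100:
print(total_time_us(1000, 1))    # 6000.0 us - launch overhead dominates
print(total_time_us(1000, 100))  # 1050.0 us - overhead amortized
```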
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the first being the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in driver, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server use.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. OpenCL support has been offered since Midgard (2012), the first Mali generation with a unified shader model.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server use.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], and the first consumer cards to launch with Tensor Cores, which perform matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or Tensor Cores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with Tensor Cores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
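Tensor Cores speed up convolutional networks because a convolution can be lowered to a matrix multiplication (the classic "im2col + GEMM" approach), and dense matrix multiply is the operation these units execute natively. A dependency-free Python sketch of the lowering for a single-channel 2D input — illustrative only, not Nvidia's implementation (and, as usual in deep learning, "convolution" here means cross-correlation, without kernel flipping):<br />
<br />
```python
def im2col(img, kh, kw):
    """Unroll every kh x kw patch of a 2D image into a row of a matrix."""
    h, w = len(img), len(img[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([img[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv2d_as_gemm(img, kernel):
    """2D 'valid' convolution expressed as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    flat = [kernel[di][dj] for di in range(kh) for dj in range(kw)]
    patches = im2col(img, kh, kw)
    # Matrix-vector product: one dot product per output pixel.
    flat_out = [sum(p * k for p, k in zip(row, flat)) for row in patches]
    out_w = len(img[0]) - kw + 1
    return [flat_out[r * out_w:(r + 1) * out_w]
            for r in range(len(flat_out) // out_w)]

if __name__ == "__main__":
    img = [[1, 2, 3],
           [4, 5, 6],
           [7, 8, 9]]
    box = [[1, 1],
           [1, 1]]  # unnormalized 2x2 box filter
    print(conv2d_as_gemm(img, box))  # [[12, 16], [24, 28]]
```
<br />
With many output channels, the flattened kernels form the columns of a second matrix and the whole layer becomes one large GEMM, which is exactly the shape of work a Tensor Core consumes.<br />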
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system-on-a-chip (SoC) designs. Since the Series5 SGX, OpenCL support is available via licensees.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various configurations as a component of their Snapdragon SoCs. OpenCL support has been offered since the Adreno 300 series.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010 ...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro], Vol. 30, No. 2<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, [[CCC]], September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU.<br />
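The hybrid scheme can be sketched with a toy uniform game tree. The branching factor and the <code>perft_device</code> stand-in are invented for illustration; a real engine would generate legal chess moves and launch a GPU kernel per offloaded sub-tree:

```python
# Toy game tree: every node has BRANCH children down to depth DEPTH.
BRANCH, DEPTH = 3, 6

def perft_device(depth):
    # Stand-in for the GPU kernel that counts leaves of a sub-tree.
    return BRANCH ** depth

def perft_hybrid(depth, split):
    # Expand `split` plies on the host, then offload each sub-tree.
    if split == 0:
        return perft_device(depth)
    return sum(perft_hybrid(depth - 1, split - 1) for _ in range(BRANCH))

total = perft_hybrid(DEPTH, 2)  # same count as a pure host-side walk
```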
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives, and offers its own parallel compute platform, [https://rocmdocs.amd.com/en/latest/ ROCm].<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends transitioning from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and offers the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran and OpenCL, as well as offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and to schedule a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its exact classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2.0) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi microarchitecture on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 cuda cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors (Compute Units)<br />
* organized in 2x16 cuda cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 (GCN 1.0)<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16/SIMT4 per Compute Unit<br />
* Wavefront size of 64 Work-Items<br />
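The marketing numbers above map directly onto this hardware model; a small sanity check of the figures from the two lists (the issue-cycle counts are the commonly stated ones for these architectures):

```python
# GTX 580 (Fermi): 16 SMs, each with 2 groups of 16 CUDA cores.
fermi_cores = 16 * 2 * 16
# A warp of 32 threads on a 16-wide core group issues over 2 cycles.
fermi_warp_cycles = 32 // 16

# HD 7970 (GCN): 32 compute units, each with 4 SIMD16 units.
gcn_cores = 32 * 4 * 16
# A wavefront of 64 work-items on a SIMD16 unit issues over 4 cycles.
gcn_wavefront_cycles = 64 // 16
```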
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture-dependent limits on how many threads a block can hold and how many threads can run concurrently on the device in total.<br />
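The index arithmetic behind this model can be emulated on the CPU; a minimal, framework-free Python sketch (the name <code>run_ndrange</code> and the loop structure are invented for illustration, no GPU is involved):

```python
# Toy emulation of the grid/block/thread (NDRange/work-group/work-item)
# index arithmetic described above.
def run_ndrange(global_size, local_size, kernel):
    results = [0] * global_size
    for group_id in range(global_size // local_size):   # blocks / work-groups
        for local_id in range(local_size):              # threads / work-items
            global_id = group_id * local_size + local_id
            results[global_id] = kernel(global_id, local_id, group_id)
    return results

# Each "thread" squares its global id, like a trivial data-parallel kernel.
out = run_ndrange(global_size=8, local_size=4, kernel=lambda g, l, grp: g * g)
```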
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
<br />
===Memory Examples===<br />
<br />
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi)] <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 32 KiB L1 data cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
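The private-memory figures above translate into a per-thread register budget; a back-of-the-envelope calculation for the Fermi example (the resident-thread count of 1536 per SM is an assumption based on that compute-capability generation):

```python
# Fermi: 128 KiB private memory per compute unit = 32768 32-bit registers.
regs_per_sm = (128 * 1024) // 4

# With e.g. 1536 resident threads per SM, each thread can use about
# 21 registers before the number of resident warps (occupancy) drops.
resident_threads = 1536
regs_per_thread = regs_per_sm // resident_threads
```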
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: The 32-bit integer performance can, depending on architecture and operation, be lower than the 32-bit floating-point or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput at lower precision, quadrupling INT8 or octupling INT4 throughput.<br />
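Packed low-precision throughput is typically exposed as a four-way dot product with 32-bit accumulate (Nvidia's dp4a instruction is one example); a bit-twiddling emulation of the idea, with invented function names:

```python
def pack4(bs):
    # Pack four unsigned 8-bit values into one 32-bit word.
    return bs[0] | bs[1] << 8 | bs[2] << 16 | bs[3] << 24

def dp4a(a_packed, b_packed, acc):
    # Four INT8 multiplies accumulated into a 32-bit sum, dp4a-style.
    for i in range(4):
        acc += ((a_packed >> 8 * i) & 0xFF) * ((b_packed >> 8 * i) & 0xFF)
    return acc

result = dp4a(pack4([1, 2, 3, 4]), pack4([5, 6, 7, 8]), 0)
```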
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs in general have a lower double-precision (64-bit) floating-point throughput ratio (FP32:FP64) than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
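Both peak figures follow from operations/cycle × execution units × clock; checking the arithmetic:

```python
# GTX 580: 32 ADD ops/clock per SM, 16 SMs, 1544 MHz clock.
fermi_giga_ops = 32 * 16 * 1544e6 / 1e9

# HD 7970: 1 ADD op/clock per processing element, 2048 PEs, 925 MHz clock.
gcn_giga_ops = 1 * 2048 * 925e6 / 1e9
```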
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
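The lowering of a convolution to matrix multiplication can be illustrated with the simpler im2col reshaping (Winograd transforms additionally reduce the number of multiplications; this toy 1-D sketch only shows the reshaping idea, with invented function names):

```python
def conv1d_direct(x, w):
    # Plain sliding-window convolution (correlation form).
    n, k = len(x), len(w)
    return [sum(x[i + j] * w[j] for j in range(k)) for i in range(n - k + 1)]

def conv1d_as_matmul(x, w):
    # im2col: unroll each window into a matrix row, then multiply by w.
    n, k = len(x), len(w)
    rows = [[x[i + j] for j in range(k)] for i in range(n - k + 1)]
    return [sum(row[j] * w[j] for j in range(k)) for row in rows]

signal, taps = [1, 2, 3, 4, 5], [1, 0, -1]
```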
<br />
==Nvidia TensorCores==<br />
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiplication-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels, ranging from 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to hundreds of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the first being the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads, with each brand offering different feature sets in drivers, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found in various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) used for system on a chip (SoC) designs. Since Series5 SGX, OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro], Vol. 30, No. 2<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2014'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, [[CCC]], September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>
<div>
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP, and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for its different GPU architectures, as well as the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, with a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, and multiple SIMD units are coupled into a compute unit, with up to hundreds of compute units present on a discrete GPU. The actual SIMD units may have different, architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its concrete classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
===Hardware Examples===<br />
<br />
Nvidia GeForce GTX 580 (Fermi, CC2.0) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi microarchitecture on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref><br />
<br />
* 512 cuda cores @1.544GHz<br />
* 16 SMs - Streaming Multiprocessors (Compute Units)<br />
* organized in 2x16 cuda cores per SM<br />
* Warp size of 32 threads<br />
<br />
AMD Radeon HD 7970 (GCN 1.0)<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref><br />
<br />
* 2048 stream cores @0.925GHz<br />
* 32 Compute Units<br />
* organized in 4xSIMD16/SIMT4 per Compute Unit<br />
* Wavefront size of 64 Work-Items<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or with libraries and offload-directives also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) execute the kernel to be computed and are coupled into a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture limits on how many threads a block can hold and how many threads can run concurrently on the device in total.<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
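As a sketch, a hypothetical OpenCL kernel using all four address spaces (kernel and parameter names are made up; building it requires an OpenCL runtime):<br />

```c
// Illustrative OpenCL C kernel, not tied to any engine mentioned above.
__kernel void scale_sum(__global const float *in,   // VRAM, visible to all work-items
                        __global float *out,
                        __constant float *factor,   // read-only memory
                        __local float *scratch)     // shared within one work-group
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    float x = in[gid] * factor[0];  // x resides in __private registers
    scratch[lid] = x;               // stage in scratch-pad memory
    barrier(CLK_LOCAL_MEM_FENCE);   // synchronize the work-group
    out[gid] = scratch[lid];
}
```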
<br />
===Memory Examples===<br />
<br />
Here is the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) as an example: <ref>CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
Here is the data for the AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) as an example: <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 32 KiB L1 data cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, the 32-bit integer performance can be lower than the 32-bit FLOP or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision, quadrupling the INT8 or octupling the INT4 throughput relative to INT32.<br />
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Max theoretic ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Max theoretic ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix-multiplications via Winograd-transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.<br />
<br />
==Nvidia TensorCores==<br />
: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer FP16xFP16+FP32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>Ada Lovelaces's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: AMD released 2020 its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16, FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to hundreds of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]] AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore a driving force behind the [[Deep Learning|deep learning]] boom. This also affected game-playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the first being the more important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads, with each brand offering different feature sets in drivers, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumers, Radeon Pro for professionals and Radeon Instinct for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support has been offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
[https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumers, Quadro for professionals and Tesla for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) used for system on a chip (SoC) designs. Since Series5 SGX, OpenCL support via licensees is available.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]<br />
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]<br />
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]] <br />
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]<br />
'''2012'''<br />
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]<br />
'''2013'''<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]<br />
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]<br />
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]<br />
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]<br />
'''2014'''<br />
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]<br />
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]<br />
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]<br />
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]<br />
==2015 ...==<br />
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]<br />
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]<br />
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11<br />
'''2016'''<br />
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref><br />
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]<br />
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]<br />
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]<br />
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]<br />
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref><br />
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]<br />
'''2017'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]<br />
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''. [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]<br />
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]<br />
'''2018'''<br />
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419<br />
<br />
=Forum Posts= <br />
==2005 ...==<br />
* [http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=5480 Hardware assist] by [[Nicolai Czempin]], [[Computer Chess Forums|Winboard Forum]], August 27, 2006<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=22732 Monte carlo on a NVIDIA GPU ?] by [[Marco Costalba]], [[CCC]], August 01, 2008<br />
==2010 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010<br />
'''2011'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011<br />
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011<br />
'''2012'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref><br />
'''2013'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref><br />
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013 » [[Perft]], [[Kogge-Stone Algorithm]] <ref>[https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub]</ref><br />
==2015 ...==<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=60386 GPU chess update, local memory...] by [[Srdja Matovic]], [[CCC]], June 06, 2016<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]<br />
'''2017'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]<br />
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov , [[Computer Chess Forums|Rybka Forum]], October 09, 2017 <br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref><br />
'''2018'''<br />
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref><br />
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]<br />
'''2019'''<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]<br />
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by Percival Tiglao, [[CCC]], June 06, 2019<br />
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019<br />
==2020 ...==<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref><br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020<br />
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]<br />
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021<br />
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]<br />
<br />
=External Links= <br />
* [https://en.wikipedia.org/wiki/Graphics_processing_unit Graphics processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Video_card Video card from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture Heterogeneous System Architecture from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]<br />
* [https://developer.nvidia.com/ NVIDIA Developer]<br />
* [https://developer.nvidia.com/nvidia-gpu-programming-guide NVIDIA GPU Programming Guide]<br />
==OpenCL==<br />
* [https://en.wikipedia.org/wiki/OpenCL OpenCL from Wikipedia]<br />
* [https://www.codeproject.com/Articles/110685/Part-1-OpenCL-Portable-Parallelism Part 1: OpenCL™ – Portable Parallelism - CodeProject]<br />
* [https://www.codeproject.com/Articles/122405/Part-2-OpenCL-Memory-Spaces Part 2: OpenCL™ – Memory Spaces - CodeProject]<br />
==CUDA==<br />
* [https://en.wikipedia.org/wiki/CUDA CUDA from Wikipedia]<br />
* [https://developer.nvidia.com/cuda-zone CUDA Zone | NVIDIA Developer]<br />
* [https://en.wikipedia.org/wiki/NVIDIA_CUDA_Compiler Nvidia CUDA Compiler (NVCC) from Wikipedia]<br />
* [https://llvm.org/docs/CompileCudaWithLLVM.html Compiling CUDA with clang] — [https://en.wikipedia.org/wiki/LLVM LLVM] [https://en.wikipedia.org/wiki/Clang Clang] documentation <br />
* [https://github.com/cppcon/cppcon2016 CppCon 2016]: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by [https://github.com/jlebar Justin Lebar], [https://en.wikipedia.org/wiki/YouTube YouTube] Video <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447&start=1 Re: Generate EGTB with graphics cards?] by [http://www.indriid.com/ Graham Jones], [[CCC]], January 01, 2019</ref><br />
: : {{#evu:https://www.youtube.com/watch?v=KHa-OSrZPGo|alignment=left|valignment=top}}<br />
==Deep Learning==<br />
* [https://developer.nvidia.com/deep-learning Deep Learning | NVIDIA Developer] » [[Deep Learning]]<br />
* [https://developer.nvidia.com/cudnn NVIDIA cuDNN | NVIDIA Developer]<br />
* [http://parse.ele.tue.nl/education/cluster2 Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster]<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/ Deep Learning in a Nutshell: Core Concepts] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], November 3, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-history-training/ Deep Learning in a Nutshell: History and Training] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], December 16, 2015<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-sequence-learning/ Deep Learning in a Nutshell: Sequence Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], March 7, 2016<br />
* [https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-reinforcement-learning/ Deep Learning in a Nutshell: Reinforcement Learning] by [http://timdettmers.com/ Tim Dettmers], [https://devblogs.nvidia.com/parallelforall/ Parallel Forall], September 8, 2016<br />
* [https://blog.dominodatalab.com/gpu-computing-and-deep-learning/ Faster deep learning with GPUs and Theano] <br />
* [https://en.wikipedia.org/wiki/Theano_(software) Theano (software) from Wikipedia]<br />
* [https://en.wikipedia.org/wiki/TensorFlow TensorFlow from Wikipedia]<br />
==Game Programming==<br />
* [http://andy-thomason.github.io/lecture_notes/agp/agp_gpgpu_programming.html Advanced game programming | Session 5 - GPGPU programming] by [[Andy Thomason]]<br />
* [https://zero.sjeng.org/ Leela Zero] by [[Gian-Carlo Pascutto]] » [[Leela Zero]]<br />
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]<br />
==Chess Programming==<br />
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]<br />
* [http://gpuchess.blogspot.com/ GPU Chess Blog]<br />
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref><br />
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]<br />
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]<br />
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]<br />
<br />
=References= <br />
<references /><br />
'''[[Hardware|Up one Level]]'''<br />
[[Category:Videos]]</div>Smatovichttps://www.chessprogramming.org/index.php?title=GPU&diff=26614GPU2022-11-14T06:32:42Z<p>Smatovic: /* Memory Examples */</p>
<hr />
<div>'''[[Main Page|Home]] * [[Hardware]] * GPU'''<br />
<br />
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]] <br />
<br />
'''GPU''' (Graphics Processing Unit),<br/><br />
a specialized processor primarily intended to fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.<br />
<br />
=History=<br />
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer resp. texture buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]] for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained in popularity.<br />
<br />
=GPU in Computer Chess= <br />
<br />
There are three main approaches to using a GPU for chess:<br />
<br />
* As an accelerator in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on GPU.<br />
* Offload the search in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on GPU.<br />
* As a hybrid in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain degree on the CPU and offload the sub-trees to the GPU for computation.<br />
<br />
=GPU Chess Engines=<br />
* [[:Category:GPU]]<br />
<br />
=GPGPU= <br />
<br />
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX], followed by early GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]].<br />
<br />
== Khronos OpenCL ==<br />
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.<br />
<br />
* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]<br />
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]<br />
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]<br />
<br />
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]<br />
<br />
== AMD ==<br />
<br />
[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.<br />
<br />
* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]<br />
* [https://rocm.github.io/ ROCm Homepage]<br />
* [http://developer.amd.com/wordpress/media/2013/07/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide-rev-2.7.pdf AMD OpenCL Programming Guide]<br />
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
== Apple ==<br />
Since macOS 10.14 Mojave, [[Apple]] recommends a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal].<br />
<br />
* [https://developer.apple.com/opencl/ Apple OpenCL Developer] <br />
* [https://developer.apple.com/metal/ Apple Metal Developer]<br />
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]<br />
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]<br />
<br />
== Intel ==<br />
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as its frontend language.<br />
<br />
* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]<br />
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]<br />
<br />
== Nvidia ==<br />
<br />
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran and OpenCL, as well as offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].<br />
<br />
* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]<br />
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]<br />
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]<br />
<br />
== Further == <br />
<br />
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)<br />
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)<br />
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)<br />
<br />
=Hardware Model=<br />
<br />
A common scheme on GPUs with a unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, with a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. Depending on the architecture, the actual SIMD units may have different numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed-bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its classification as a [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy hardware architecture]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.<br />
<br />
=Programming Model=<br />
<br />
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or - with libraries and offload directives - also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly-parallel]. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled to a block (work-group in OpenCL); one or multiple blocks form the grid (NDRange in OpenCL) to be executed on the GPU device. The members of a block resp. work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture limits on how many threads a block can hold and how many threads can run concurrently on the device in total.<br />
<br />
=Memory Model=<br />
<br />
OpenCL offers the following memory model for the programmer:<br />
<br />
* __private - usually registers, accessible only by a single work-item resp. thread.<br />
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.<br />
* __constant - read-only memory.<br />
* __global - usually VRAM, accessible by all work-items resp. threads.<br />
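The memory model above, together with the work-item and work-group IDs from the programming model, can be illustrated with a minimal OpenCL C kernel sketch (the kernel and parameter names are hypothetical):<br />

```c
// Hypothetical kernel: scale a buffer, staging values in local memory.
__kernel void scale(__global float *data,      // usually VRAM, visible to all work-items
                    __constant float *coeff,   // read-only memory
                    __local float *scratch)    // scratch-pad shared by one work-group
{
    int gid = get_global_id(0);     // unique index within the whole NDRange
    int lid = get_local_id(0);      // index within this work-group
    float x = data[gid];            // __private by default: x lives in registers

    scratch[lid] = x * coeff[0];
    barrier(CLK_LOCAL_MEM_FENCE);   // synchronize the work-group
    data[gid] = scratch[lid];
}
```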
<br />
===Memory Examples===<br />
<br />
Here is the data for the Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) as an example: <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref><br />
* 128 KiB private memory per compute unit<br />
* 48 KiB (16 KiB) local memory per compute unit (configurable)<br />
* 64 KiB constant memory<br />
* 8 KiB constant cache per compute unit<br />
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)<br />
* 768 KiB L2 cache<br />
* 1.5 GiB to 3 GiB global memory<br />
Here is the data for the AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) as an example: <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref><br />
* 256 KiB private memory per compute unit<br />
* 64 KiB local memory per compute unit<br />
* 64 KiB constant memory<br />
* 16 KiB constant cache per four compute units<br />
* 32 KiB L1 data cache per compute unit<br />
* 768 KiB L2 cache<br />
* 3 GiB to 6 GiB global memory<br />
<br />
===Unified Memory===<br />
<br />
Usually data has to be transferred/copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.<br />
<br />
=Instruction Throughput= <br />
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.<br />
<br />
==Integer Instruction Throughput==<br />
* INT32<br />
: Depending on architecture and operation, 32-bit integer performance can be lower than 32-bit floating-point or 24-bit integer performance.<br />
<br />
* INT64<br />
: In general, the [https://en.wikipedia.org/wiki/Processor_register registers] and vector [https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer-brand GPUs are 32 bits wide and have to emulate 64-bit integer operations.<br />
* INT8<br />
: Some architectures offer higher throughput with lower precision, quadrupling INT8 or octupling INT4 throughput compared to INT32.<br />
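The emulation mentioned for INT64 can be sketched in plain C: a 64-bit addition is split into two 32-bit additions with carry propagation (the type and function names are illustrative, and a 32-bit <code>unsigned int</code> is assumed):<br />

```c
/* Illustrative sketch of a 64-bit integer addition built from 32-bit
   operations, as a GPU with 32-bit registers/ALUs has to emulate it.
   Assumes unsigned int is 32 bits wide. */
typedef struct { unsigned int lo, hi; } u64emu;

u64emu add64(u64emu a, u64emu b)
{
    u64emu r;
    r.lo = a.lo + b.lo;                  /* low word, wraps modulo 2^32 */
    r.hi = a.hi + b.hi + (r.lo < a.lo);  /* high word plus carry-out */
    return r;
}
```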
<br />
==Floating-Point Instruction Throughput==<br />
<br />
* FP32<br />
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.<br />
<br />
* FP64<br />
: Consumer GPUs in general have a lower relative double-precision (64-bit) floating-point throughput (FP64:FP32 ratio) than server-brand GPUs.<br />
<br />
* FP16<br />
: Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.<br />
<br />
==Throughput Examples==<br />
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref><br />
<br />
MAD 16<br />
MUL 16<br />
ADD 32<br />
Bit-shift 16<br />
Bitwise XOR 32<br />
<br />
Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec<br />
<br />
AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref><br />
<br />
MAD 1/4<br />
MUL 1/4<br />
ADD 1<br />
Bit-shift 1<br />
Bitwise XOR 1<br />
<br />
Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec<br />
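Both peak numbers above follow from the same simple formula, sketched here in C (the function name is illustrative):<br />

```c
/* Peak throughput in GigaOps/s: operations per clock cycle per unit,
   times the number of units, times the clock in MHz (divide MHz by
   1000 to convert mega-ops to giga-ops). */
double theoretical_gops(double ops_per_clock, double units, double clock_mhz)
{
    return ops_per_clock * units * clock_mhz / 1000.0;
}

/* GTX 580: theoretical_gops(32, 16, 1544) -> 790.528 GigaOps/s
   HD 7970: theoretical_gops(1, 2048, 925) -> 1894.4  GigaOps/s */
```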
<br />
=Tensors=<br />
MMAC (matrix-multiply-accumulate) units are used in consumer-brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server-brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as their MMAC unit.<br />
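The matrix-multiply-accumulate primitive these units implement, D = A*B + C, can be sketched in plain C for a small tile (the 4x4 tile size and function name are only illustrative; real tensor units execute such tiles in hardware, typically on FP16/INT8 inputs with wider accumulators):<br />

```c
/* Illustrative MMAC on a 4x4 tile: D = A * B + C. */
void mmac4x4(const float A[4][4], const float B[4][4],
             const float C[4][4], float D[4][4])
{
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++) {
            float acc = C[i][j];           /* start from the accumulator C */
            for (int k = 0; k < 4; k++)
                acc += A[i][k] * B[k][j];  /* multiply-add per element */
            D[i][j] = acc;
        }
}
```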
<br />
==Nvidia TensorCores==<br />
: TensorCores were introduced with the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd-gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref><br />
<br />
==AMD Matrix Cores==<br />
: In 2020, AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64-optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation accelerators.<br />
<br />
==Intel XMX Cores==<br />
: Intel added XMX, Xe Matrix eXtensions, cores to the [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist] GPU series.<br />
<br />
=Host-Device Latencies= <br />
One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]], AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to group tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.<br />
<br />
=Deep Learning=<br />
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This also affected game-playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source project [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]] and its [[Leela Chess Zero]] adaption.<br />
<br />
= Architectures =<br />
The market is split into two categories, integrated and discrete GPUs, the former being the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.<br />
<br />
== AMD ==<br />
AMD's line of discrete GPUs is branded as Radeon for consumers, Radeon Pro for professionals and Radeon Instinct for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia] <br />
<br />
=== Navi 3x RDNA 3 === <br />
RDNA 3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation accelerators.<br />
<br />
* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]<br />
<br />
=== CDNA 2 === <br />
CDNA 2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]<br />
<br />
=== CDNA === <br />
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.<br />
<br />
* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]<br />
<br />
=== Navi 2x RDNA 2 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2] cards were unveiled on October 28, 2020.<br />
<br />
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]<br />
<br />
=== Navi RDNA 1 === <br />
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1] cards were unveiled on July 7, 2019.<br />
<br />
* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]<br />
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set]<br />
<br />
=== Vega GCN 5th gen ===<br />
<br />
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.<br />
<br />
* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]<br />
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]<br />
<br />
=== Polaris GCN 4th gen === <br />
<br />
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.<br />
<br />
* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]<br />
<br />
== Apple ==<br />
<br />
=== M series ===<br />
<br />
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.<br />
<br />
* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]<br />
<br />
== ARM ==<br />
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support has been offered.<br />
<br />
* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]<br />
<br />
=== Valhall (2019) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Bifrost (2016) ===<br />
<br />
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]<br />
<br />
=== Midgard (2012) ===<br />
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]<br />
<br />
== Intel ==<br />
<br />
=== Xe ===<br />
<br />
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Arc_Alchemist Arc Alchemist series on Wikipedia]<br />
<br />
==Nvidia==<br />
Nvidia's line of discrete GPUs is branded as GeForce for consumers, Quadro for professionals and Tesla for servers.<br />
<br />
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]<br />
<br />
=== Ada Lovelace Architecture ===<br />
<br />
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.<br />
<br />
* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]<br />
<br />
=== Hopper Architecture ===<br />
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.<br />
<br />
* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]<br />
<br />
=== Ampere Architecture ===<br />
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.<br />
<br />
* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]<br />
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]<br />
<br />
=== Turing Architecture ===<br />
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing]. These are also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.<br />
<br />
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]<br />
<br />
=== Volta Architecture === <br />
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].<br />
<br />
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Pascal Architecture ===<br />
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.<br />
<br />
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]<br />
<br />
=== Maxwell Architecture ===<br />
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.<br />
<br />
[https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]<br />
<br />
== PowerVR ==<br />
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) used for system on a chip (SoC) designs. Since the Series5 SGX, OpenCL support has been available via licensees.<br />
<br />
=== PowerVR ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]<br />
<br />
=== IMG ===<br />
<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]<br />
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]<br />
<br />
== Qualcomm ==<br />
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support has been offered.<br />
<br />
=== Adreno ===<br />
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]<br />
<br />
== Vivante Corporation ==<br />
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.<br />
<br />
=== GC-Series ===<br />
<br />
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]<br />
<br />
=See also= <br />
* [[Deep Learning]]<br />
* [[FPGA]]<br />
* [[Graphics Programming]]<br />
* [[Monte-Carlo Tree Search]]<br />
** [[MCαβ]]<br />
** [[UCT]]<br />
* [[Parallel Search]]<br />
* [[Perft#15|Perft(15)]] <br />
* [[SIMD and SWAR Techniques]]<br />
* [[Thread]]<br />
<br />
=Publications= <br />
<br />
==1986== <br />
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism<br />
==1990==<br />
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]<br />
==2008 ...==<br />
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref><br />
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]<br />
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]<br />
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]<br />
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref><br />
==2010...==<br />
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]<br />
* John Nickolls, William J. Dally ('''2010'''). ''[https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]''. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro]<br />
'''2011'''<br />
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]<br />
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]<br />
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of C