'''[[Main Page|Home]] * [[Hardware]] * GPU'''
 
[[FILE:6600GT GPU.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/GeForce_6_series GeForce 6600GT (NV43)] GPU <ref>[https://commons.wikimedia.org/wiki/Graphics_processing_unit Graphics processing unit - Wikimedia Commons]</ref> ]]  
[[FILE:NvidiaTesla.jpg|border|right|thumb| [https://en.wikipedia.org/wiki/Nvidia_Tesla Nvidia Tesla] <ref>[https://commons.wikimedia.org/wiki/File:NvidiaTesla.jpg Image] by Mahogny, February 09, 2008, [https://en.wikipedia.org/wiki/Wikimedia_Commons Wikimedia Commons]</ref> ]]  
  
 
'''GPU''' (Graphics Processing Unit),<br/>
a specialized processor initially intended for fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general-purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs], but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.

=History=
In the 1970s and 1980s RAM was expensive, and home computers used custom graphics chips that operated directly on registers and memory without a dedicated frame buffer or texture buffer, like the [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math emerged, such as the [https://en.wikipedia.org/wiki/IMPACT_(computer_graphics) SGI Impact] (1995) in 3D graphics workstations or the [https://en.wikipedia.org/wiki/3dfx#Voodoo_Graphics_PCI 3dfx Voodoo] (1996) for playing 3D games on PCs. Some game engines could instead use the [[SIMD and SWAR Techniques|SIMD capabilities]] of CPUs, such as [[Intel|Intel's]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]], for [https://en.wikipedia.org/wiki/Real-time_computer_graphics real-time rendering]. Sony's 3D-capable chip [https://en.wikipedia.org/wiki/PlayStation_technical_specifications#Graphics_processing_unit_(GPU) GTE] used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like the [https://en.wikipedia.org/wiki/NV1 NV1] (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], as in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]] emerged and gained popularity.

=GPU in Computer Chess=
There are four main ways to use a GPU for chess:

* As an accelerator, as in [[Leela_Chess_Zero|Lc0]]: run a neural network for position evaluation on the GPU
* Offloading the search, as in [[Zeta|Zeta]]: run a parallel game tree search with move generation and position evaluation on the GPU
* As a hybrid, as in [http://www.talkchess.com/forum3/viewtopic.php?t=64983&start=4#p729152 perft_gpu]: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU
* Neural network training, such as the [https://github.com/glinscott/nnue-pytorch Stockfish NNUE trainer in Pytorch]<ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=75724 Pytorch NNUE training] by [[Gary Linscott]], [[CCC]], November 08, 2020</ref> or [https://github.com/LeelaChessZero/lczero-training Lc0 TensorFlow Training]

=GPU Chess Engines=
* [[:Category:GPU]]
  
 
=GPGPU=
There are various frameworks for [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units GPGPU], General Purpose computing on Graphics Processing Units. Apart from language wrappers and mobile devices with special APIs, there are mainly three ways to make use of GPGPU: mapping to a graphics API, native compilers, and intermediate languages.
 
  
==Mapping to an API==
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX]. These were followed by the first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook], and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [[OpenCL|OpenCL]].

* [https://en.wikipedia.org/wiki/BrookGPU BrookGPU] (translates to [https://en.wikipedia.org/wiki/OpenGL OpenGL] and [https://en.wikipedia.org/wiki/DirectX DirectX])
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (open standard by [[Microsoft|Microsoft]] that extends [[Cpp|C++]])
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (GPGPU API by Microsoft)
 
==Native Compilers==
 
* [https://en.wikipedia.org/wiki/CUDA CUDA] (GPGPU framework by [https://en.wikipedia.org/wiki/Nvidia Nvidia])
* [https://en.wikipedia.org/wiki/OpenCL OpenCL] (Open Compute Language specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group])
 
==Intermediate Languages==
 
* [https://en.wikipedia.org/wiki/Heterogeneous_System_Architecture#HSA_Intermediate_Layer HSAIL]
* [https://en.wikipedia.org/wiki/Parallel_Thread_Execution PTX]
* [https://en.wikipedia.org/wiki/Standard_Portable_Intermediate_Representation SPIR]
 
  
== Khronos OpenCL ==
[[OpenCL|OpenCL]], specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group], is widely adopted across all kinds of hardware accelerators from different vendors.

* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]
* [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]
* [https://www.khronos.org/registry/OpenCL//sdk/1.2/docs/man/xhtml/ OpenCL 1.2 Reference]
* [https://www.khronos.org/registry/OpenCL/specs/opencl-2.0.pdf OpenCL 2.0 Specification]
* [https://www.khronos.org/registry/OpenCL/specs/2.2/pdf/OpenCL_C.pdf OpenCL 2.0 C Language Specification]
* [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]
* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]
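
As a minimal, self-contained sketch of the portable OpenCL host API (illustrative code, not taken from any particular engine), the following C program lists the available GPU devices and their compute units; it only assumes an installed OpenCL runtime and linking with -lOpenCL:

<pre>
/* Minimal sketch: enumerate OpenCL GPU devices and their compute units. */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platforms[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, platforms, &nplat);
    for (cl_uint p = 0; p < nplat; p++) {
        cl_device_id devs[8];
        cl_uint ndev = 0;
        if (clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_GPU, 8, devs, &ndev) != CL_SUCCESS)
            continue;                       /* platform without GPU devices */
        for (cl_uint d = 0; d < ndev; d++) {
            char name[256];
            cl_uint cus = 0;
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(name), name, NULL);
            clGetDeviceInfo(devs[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cus), &cus, NULL);
            printf("%s: %u compute units\n", name, cus);
        }
    }
    return 0;
}
</pre>
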
== AMD ==
[[AMD]] supports language frontends like OpenCL, HIP, C++ AMP and OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm], it offers its own parallel compute platform.

* [https://community.amd.com/t5/opencl/bd-p/opencl-discussions AMD OpenCL Developer Community]
* [https://rocmdocs.amd.com/en/latest/index.html AMD ROCm™ documentation]
* [https://manualzz.com/doc/o/cggy6/amd-opencl-programming-user-guide-contents AMD OpenCL Programming Guide]
* [http://developer.amd.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf AMD OpenCL Optimization Guide]
* [https://gpuopen.com/amd-isa-documentation/ AMD GPU ISA documentation]
== Apple ==
Since macOS 10.14 Mojave, a transition from OpenCL to [https://en.wikipedia.org/wiki/Metal_(API) Metal] is recommended by [[Apple]].

* [https://developer.apple.com/opencl/ Apple OpenCL Developer]
* [https://developer.apple.com/metal/ Apple Metal Developer]
* [https://developer.apple.com/library/archive/documentation/Miscellaneous/Conceptual/MetalProgrammingGuide/Introduction/Introduction.html Apple Metal Programming Guide]
* [https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf Metal Shading Language Specification]
== Intel ==
Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and the [https://en.wikipedia.org/wiki/OneAPI_(compute_acceleration) oneAPI] platform with [https://en.wikipedia.org/wiki/DPC++ DPC++] as frontend language.

* [https://www.intel.com/content/www/us/en/developer/overview.html#gs.pu62bi Intel Developer Zone]
* [https://www.intel.com/content/www/us/en/develop/documentation/oneapi-programming-guide/top.html Intel oneAPI Programming Guide]
== Nvidia ==
[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran and OpenCL, as well as offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].

* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]
* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
* [https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html Nvidia CUDA C++ Programming Guide]
* [https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html Nvidia CUDA C++ Best Practices Guide]
== Further ==
* [https://en.wikipedia.org/wiki/Vulkan#Planned_features Vulkan] (OpenGL successor by the Khronos Group)
* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)
* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)
=Hardware Model=
A common scheme on GPUs with unified shader architecture is to run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion, and a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit, to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have an architecture-dependent number of cores (SIMD8, SIMD16, SIMD32) and different computation abilities - floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed-bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its exact classification in [https://en.wikipedia.org/wiki/Flynn%27s_taxonomy Flynn's taxonomy]. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.

{| class="wikitable" style="margin:auto"
|+ Vendor Terminology
|-
! AMD Terminology !! Nvidia Terminology
|-
| Compute Unit || Streaming Multiprocessor
|-
| Stream Core || CUDA Core
|-
| Wavefront || Warp
|}
===Hardware Examples===
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>[https://www.nvidia.com/content/PDF/fermi_white_papers/NVIDIA_Fermi_Compute_Architecture_Whitepaper.pdf Fermi white paper from Nvidia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_series GeForce 500 series on Wikipedia]</ref>

* 512 CUDA cores @1.544GHz
* 16 SMs - Streaming Multiprocessors
* organized in 2x16 CUDA cores per SM
* Warp size of 32 threads

AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next Graphics Core Next on Wikipedia]</ref><ref>[https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units#Radeon_HD_7000_series Radeon HD 7000 series on Wikipedia]</ref>

* 2048 Stream cores @0.925GHz
* 32 Compute Units
* organized in 4xSIMD16, each SIMT4, per Compute Unit
* Wavefront size of 64 work-items
===Wavefront and Warp===
Generalized, the Wavefront resp. Warp size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.
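
A practical consequence of this execution model is branch divergence: when threads of one Warp resp. Wavefront take different paths, both paths are executed by the whole warp with inactive lanes masked off. A small illustrative OpenCL C sketch (not taken from any engine):

<pre>
// Sketch: gid % 2 splits the lanes of each warp/wavefront, so both
// branches are executed serially with half of the lanes masked off.
__kernel void divergent(__global int *out) {
    int gid = get_global_id(0);
    if (gid % 2 == 0)
        out[gid] = gid * 2;   // even lanes active, odd lanes masked
    else
        out[gid] = gid + 1;   // odd lanes active, even lanes masked
}
</pre>

Branching on values that are uniform within a warp or work-group avoids this serialization, which is one reason branchy chess code maps poorly to GPUs.
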
=Programming Model=
A [https://en.wikipedia.org/wiki/Parallel_programming_model parallel programming model] for GPGPU can be [https://en.wikipedia.org/wiki/Data_parallelism data-parallel], [https://en.wikipedia.org/wiki/Task_parallelism task-parallel], a mixture of both, or, with libraries and offload directives, also [https://en.wikipedia.org/wiki/Implicit_parallelism implicitly parallel]. Single GPU threads (work-items in OpenCL) execute the kernel to be computed and are coupled into work-groups; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with architecture-dependent limits on how many work-items a work-group can hold and how many threads can run concurrently on the device in total.
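
As a minimal sketch of this model (illustrative names, not code from a particular engine), the following OpenCL C kernel computes one SAXPY element per work-item; the host would enqueue it via clEnqueueNDRangeKernel over an NDRange of, for example, 1048576 work-items in work-groups of 256:

<pre>
// Every work-item executes the same kernel body on its own index.
__kernel void saxpy(const float a,
                    __global const float *x,
                    __global float *y) {
    size_t gid = get_global_id(0);  // unique index within the NDRange
    y[gid] = a * x[gid] + y[gid];
}
</pre>
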
{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Kernel || Kernel
|-
| Compute Unit || Streaming Multiprocessor
|-
| Processing Element || CUDA Core
|-
| Work-Item || Thread
|-
| Work-Group || Block
|-
| NDRange || Grid
|}
==Thread Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) <ref>[https://en.wikipedia.org/wiki/CUDA#Technical_Specification CUDA Technical_Specification on Wikipedia]</ref>

* Warp size: 32
* Maximum number of threads per block: 1024
* Maximum number of resident blocks per multiprocessor: 8
* Maximum number of resident warps per multiprocessor: 48
* Maximum number of resident threads per multiprocessor: 1536

AMD Radeon HD 7970 (GCN) <ref>[https://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf AMD GPU Hardware Basics]</ref>

* Wavefront size: 64
* Maximum number of work-items per work-group: 1024
* Maximum number of work-groups per compute unit: 40
* Maximum number of Wavefronts per compute unit: 40
* Maximum number of work-items per compute unit: 2560
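
These limits interact: on the GTX 580, 48 resident warps of 32 threads give the 1536 resident threads per multiprocessor, so a kernel launched with 256 threads per block can keep at most 6 blocks resident per multiprocessor, provided its register and shared memory usage do not impose a lower bound. Occupancy arithmetic of this kind is usually the first step in tuning a GPU kernel.
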
=Memory Model=
OpenCL offers the following memory model to the programmer:

* __private - usually registers, accessible only by a single work-item resp. thread
* __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block
* __constant - read-only memory
* __global - usually VRAM, accessible by all work-items resp. threads
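
A hedged sketch of these address spaces (standard reduction idiom with illustrative names; assumes a power-of-two work-group size): each work-group sums its slice of a __global input through __local scratch-pad memory, with per-work-item indices held in __private registers and a read-only factor in __constant memory:

<pre>
// Work-group reduction: each group writes one partial sum to __global
// memory; __local scratch is shared only within the group and must be
// synchronized with a barrier before lanes read each other's values.
__kernel void partial_sums(__global const float *in,
                           __constant float *scale,     // read-only factor
                           __global float *out,
                           __local float *scratch) {
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);          // __private, lives in registers
    scratch[lid] = in[gid] * scale[0];
    barrier(CLK_LOCAL_MEM_FENCE);
    for (size_t s = get_local_size(0) / 2; s > 0; s >>= 1) {
        if (lid < s)
            scratch[lid] += scratch[lid + s];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (lid == 0)
        out[get_group_id(0)] = scratch[0];  // one result per work-group
}
</pre>
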
{| class="wikitable" style="margin:auto"
|+ Terminology
|-
! OpenCL Terminology !! CUDA Terminology
|-
| Private Memory || Registers
|-
| Local Memory || Shared Memory
|-
| Constant Memory || Constant Memory
|-
| Global Memory || Global Memory
|}
===Memory Examples===
Nvidia GeForce GTX 580 ([https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi]) <ref>CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES</ref>

* 128 KiB private memory per compute unit
* 48 KiB (16 KiB) local memory per compute unit (configurable)
* 8 KiB constant cache per compute unit
* 16 KiB (48 KiB) L1 cache per compute unit (configurable)
* 768 KiB L2 cache in total
* 1.5 GiB to 3 GiB global memory

AMD Radeon HD 7970 ([https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]) <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref>

* 256 KiB private memory per compute unit
* 64 KiB local memory per compute unit
* 16 KiB constant cache per four compute units
* 16 KiB L1 cache per compute unit
* 768 KiB L2 cache in total
* 3 GiB to 6 GiB global memory
  
===Unified Memory===
Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors, with different frameworks on different operating systems, may offer a unified and accessible address space between CPU and GPU.
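
Where such a unified address space is offered, for instance via shared virtual memory in OpenCL 2.0, host and device can exchange pointers instead of copying buffers. A hedged sketch using the coarse-grained SVM API (assumes an already created context ctx, command queue and kernel; n is the element count):

<pre>
// One allocation visible to host and device; coarse-grained SVM still
// requires map/unmap around host access, but no explicit buffer copies.
float *data = (float *)clSVMAlloc(ctx, CL_MEM_READ_WRITE, n * sizeof(float), 0);
clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data, n * sizeof(float), 0, NULL, NULL);
for (size_t i = 0; i < n; i++) data[i] = (float)i;    // host writes directly
clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);
clSetKernelArgSVMPointer(kernel, 0, data);            // pass the raw pointer
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
clFinish(queue);            // afterwards the host may map and read again
clSVMFree(ctx, data);
</pre>
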
=Instruction Throughput=
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.

==Integer Instruction Throughput==
* INT32
: The 32-bit integer performance can be, depending on architecture and operation, less than the 32-bit FLOP or 24-bit integer performance.
* INT64
: In general, [https://en.wikipedia.org/wiki/Processor_register registers] and vector [https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] of consumer brand GPUs are 32-bit wide and have to emulate 64-bit integer operations.
* INT8
: Some architectures offer higher throughput with lower precision, quadrupling the INT8 resp. octupling the INT4 throughput.
==Floating-Point Instruction Throughput==
* FP32
: Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.
* FP64
: Consumer GPUs in general have a lower FP32:FP64 ratio for double-precision (64-bit) floating-point throughput than server brand GPUs.
* FP16
: Some GPGPU architectures offer half-precision (16-bit) floating-point throughput with an FP32:FP16 ratio of 1:2.
==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations per clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>

    MAD 16
    MUL 16
    ADD 32
    Bit-shift 16
    Bitwise XOR 32

Max theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations per clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>

    MAD 1/4
    MUL 1/4
    ADD 1
    Bit-shift 1
    Bitwise XOR 1

Max theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec
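
Such theoretical peaks can be checked against measurements with OpenCL profiling events. A hedged C sketch, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and a hypothetical kernel add_kernel that performs a known number adds_per_item of dependent 32-bit ADDs per work-item:

<pre>
cl_event ev;
clEnqueueNDRangeKernel(queue, add_kernel, 1, NULL, &global, &local, 0, NULL, &ev);
clWaitForEvents(1, &ev);
cl_ulong t0, t1;   // device timestamps, in nanoseconds
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
/* ops per nanosecond equals GigaOps per second */
double gigaops = (adds_per_item * (double)global) / (double)(t1 - t0);
</pre>
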
=Tensors=
MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural-network-based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations <ref>[https://talkchess.com/forum3/viewtopic.php?f=7&t=66025&p=743355#p743355 Re: To TPU or not to TPU...] by [[Rémi Coulom]], [[CCC]], December 16, 2017</ref>. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.
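
Even without Winograd, the standard im2col lowering shows why matrix units help: each receptive field becomes one matrix column, and the convolution collapses into a single large matrix multiplication. A hedged C sketch for stride 1 and no padding (illustrative layout, not from any particular framework):

<pre>
// in:  C x H x W input;  col: (C*R*S) x (outH*outW) matrix, so that
// conv = filter matrix (K x C*R*S) times col. Stride 1, no padding.
void im2col(const float *in, int C, int H, int W, int R, int S, float *col) {
    int outH = H - R + 1, outW = W - S + 1;
    for (int c = 0; c < C; c++)
        for (int r = 0; r < R; r++)
            for (int s = 0; s < S; s++)            // one matrix row per (c,r,s)
                for (int y = 0; y < outH; y++)
                    for (int x = 0; x < outW; x++) // one column per output pixel
                        col[(((c * R + r) * S + s) * outH + y) * outW + x] =
                            in[(c * H + (y + r)) * W + (x + s)];
}
</pre>
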
==Nvidia TensorCores==
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6  AnandTech - Nvidia Turing Deep Dive page 6]</ref> Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref> Ada Lovelace's 4th gen adds support for FP8.<ref>[https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Wikipedia - Ada Lovelace microarchitecture]</ref>

==AMD Matrix Cores==
: In 2020 AMD released its server-class [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf CDNA] architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA 3 architecture adds support for FP8 and sparse matrix data (sparsity).

==Intel XMX Cores==
: Intel added XMX, Xe Matrix eXtensions, cores to some of the [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] GPU series, like [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist] and the [https://www.intel.com/content/www/us/en/products/sku/232876/intel-data-center-gpu-max-1100/specifications.html Intel Data Center GPU Max Series].
  
=Host-Device Latencies=
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency of 5 microseconds for null kernels <ref>[https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latencies-/post/5318041/#5318041 host-device latencies?] by [[Srdja Matovic]], Nvidia CUDA ZONE, Feb 28, 2019</ref> up to 100s of microseconds <ref>[https://community.amd.com/thread/237337#comment-2902071 host-device latencies?] by [[Srdja Matovic]], AMD Developer Community, Feb 28, 2019</ref>. One solution to overcome this limitation is to couple tasks into batches to be executed in one run <ref>[http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347#p761239 Re: GPU ANN, how to deal with host-device latencies?] by [[Milos Stanisavljevic]], [[CCC]], May 06, 2018</ref>.
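
The launch overhead can be estimated with a simple host-side loop. A hedged C sketch, assuming an existing command queue and a hypothetical empty kernel null_kernel:

<pre>
#include <time.h>

struct timespec a, b;
size_t one = 1;
clock_gettime(CLOCK_MONOTONIC, &a);
for (int i = 0; i < 1000; i++) {
    clEnqueueNDRangeKernel(queue, null_kernel, 1, NULL, &one, NULL, 0, NULL, NULL);
    clFinish(queue);   // synchronous: pay the full launch latency each time
}
clock_gettime(CLOCK_MONOTONIC, &b);
double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
double us_per_launch = ns / 1000 / 1000.0;   // 1000 launches, ns to microseconds
</pre>

Batching many positions per launch, as mentioned above, amortizes this fixed cost.
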
=Deep Learning=
GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom, also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaption.

= Architectures =
The market is split into two categories, integrated and discrete GPUs; the former is the most important by quantity, the latter by performance. Discrete GPUs are further divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, and computation abilities.

== AMD ==
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server use.

* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]

=== CDNA3 ===
The CDNA3 HPC architecture was unveiled in December 2023, with the MI300A APU model (CPU+GPU+HBM) and the MI300X GPU model, both with multi-chip module designs. It features Matrix Cores supporting a broad range of precisions (INT8, FP8, BF16, FP16, TF32, FP32, FP64) as well as sparse matrix data (sparsity), and is supported by AMD's ROCm open software stack for AMD Instinct accelerators.

* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf AMD CDNA3 Whitepaper]
* [https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/amd-instinct-mi300-cdna3-instruction-set-architecture.pdf AMD Instinct MI300/CDNA3 Instruction Set Architecture]
* [https://www.amd.com/en/developer/resources/rocm-hub.html AMD ROCm Developer Hub]
=== Navi 3x RDNA3 ===
The RDNA3 architecture in the Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation acceleration.

* [https://en.wikipedia.org/wiki/Radeon_RX_7000_series AMD Radeon RX 7000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA3_Shader_ISA_December2022.pdf RDNA3 Instruction Set Architecture]

=== CDNA2 ===
The CDNA2 architecture in the MI200 HPC-GPU, with optimized FP64 throughput (matrix and vector), multi-chip module design and Infinity Fabric, was unveiled in November 2021.

* [https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf AMD CDNA2 Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf CDNA2 Instruction Set Architecture]

=== CDNA ===
The CDNA architecture in the MI100 HPC-GPU with Matrix Cores was unveiled in November 2020.

* [https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf AMD CDNA Whitepaper]
* [https://developer.amd.com/wp-content/resources/CDNA1_Shader_ISA_14December2020.pdf CDNA Instruction Set Architecture]

=== Navi 2x RDNA2 ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA2] cards were unveiled on October 28, 2020.

* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series AMD Radeon RX 6000 on Wikipedia]
* [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]

=== Navi RDNA ===
[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA] cards were unveiled on July 7, 2019.

* [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
* [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Shader_ISA_5August2019.pdf RDNA Instruction Set Architecture]
=== Vega GCN 5th gen ===
[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.

* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
* [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set Architecture]

=== Polaris GCN 4th gen ===
[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.

* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
* [https://developer.amd.com/wordpress/media/2013/12/AMD_GCN3_Instruction_Set_Architecture_rev1.1.pdf GCN3/4 Instruction Set Architecture]

=== Southern Islands GCN 1st gen ===
Southern Islands cards introduced the [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN] architecture in 2012.

* [https://en.wikipedia.org/wiki/Radeon_HD_7000_series AMD Radeon HD 7000 on Wikipedia]
* [https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/programmer-references/si_programming_guide_v2.pdf Southern Islands Programming Guide]
* [https://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf Southern Islands Instruction Set Architecture]
== Apple ==

=== M series ===
Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.

* [https://en.wikipedia.org/wiki/Apple_silicon#M_series Apple M series on Wikipedia]

== ARM ==
The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.

* [https://en.wikipedia.org/wiki/Mali_(GPU)#Variants Mali variants on Wikipedia]

=== Valhall (2019) ===
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]

=== Bifrost (2016) ===
* [https://developer.arm.com/documentation/101574/latest Bifrost and Valhall OpenCL Developer Guide]

=== Midgard (2012) ===
* [https://developer.arm.com/documentation/100614/latest Midgard OpenCL Developer Guide]
== Intel ==

=== Xe ===
The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low power), Xe-HPG (high performance gaming), Xe-HP (high performance) and Xe-HPC (high performance computing).

* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]
* [https://en.wikipedia.org/wiki/Intel_Arc#Alchemist Arc Alchemist series on Wikipedia]

== Nvidia ==
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server use.

* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]
=== Grace Hopper Superchip ===
The Nvidia GH200 Grace Hopper Superchip was unveiled in August 2023 and combines the Nvidia Grace CPU ([[ARM|ARM v9]]) and Nvidia Hopper GPU architectures via NVLink to deliver a CPU+GPU coherent memory model for accelerated AI and HPC applications.

* [https://resources.nvidia.com/en-us-grace-cpu/grace-hopper-superchip NVIDIA Grace Hopper Superchip Data Sheet]
* [https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper NVIDIA Grace Hopper Superchip Architecture Whitepaper]

=== Ada Lovelace Architecture ===
The [https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture) Ada Lovelace microarchitecture] was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.

* [https://images.nvidia.com/aem-dam/Solutions/geforce/ada/nvidia-ada-gpu-architecture.pdf Ada GPU Whitepaper]
* [https://docs.nvidia.com/cuda/ada-tuning-guide/index.html Ada Tuning Guide]

=== Hopper Architecture ===
The [https://en.wikipedia.org/wiki/Hopper_(microarchitecture) Hopper GPU Datacenter microarchitecture] was announced on March 22, 2022, featuring Transformer Engines for large language models.

* [https://resources.nvidia.com/en-us-tensor-core Hopper H100 Whitepaper]
* [https://docs.nvidia.com/cuda/hopper-tuning-guide/index.html Hopper Tuning Guide]

=== Ampere Architecture ===
The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.

* [https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper.pdf Ampere GA100 Whitepaper]
* [https://www.nvidia.com/content/PDF/nvidia-ampere-ga-102-gpu-architecture-whitepaper-v2.pdf Ampere GA102 Whitepaper]
* [https://docs.nvidia.com/cuda/ampere-tuning-guide/index.html Ampere GPU Architecture Tuning Guide]

=== Turing Architecture ===
[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They were the first consumer cards to launch with RTX, for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing] features, and the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.

* [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Turing Architecture Whitepaper]
* [https://docs.nvidia.com/cuda/turing-tuning-guide/index.html Turing Tuning Guide]
=== Volta Architecture ===
[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].

* [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Volta Architecture Whitepaper]
* [https://docs.nvidia.com/cuda/volta-tuning-guide/index.html Volta Tuning Guide]

=== Pascal Architecture ===
[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.

* [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Pascal Architecture Whitepaper]
* [https://docs.nvidia.com/cuda/pascal-tuning-guide/index.html Pascal Tuning Guide]

=== Maxwell Architecture ===
[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.

* [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Maxwell Architecture Whitepaper on archive.org]
* [https://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html Maxwell Tuning Guide]
== PowerVR ==
PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) for system on a chip (SoC) designs. Since the Series5 SGX, OpenCL support via licensees is available.

=== PowerVR ===
* [https://en.wikipedia.org/wiki/PowerVR#PowerVR_Graphics PowerVR series on Wikipedia]

=== IMG ===
* [https://en.wikipedia.org/wiki/PowerVR#IMG_A-Series_(Albiorix) IMG A series on Wikipedia]
* [https://en.wikipedia.org/wiki/PowerVR#IMG_B-Series IMG B series on Wikipedia]

== Qualcomm ==
Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.

=== Adreno ===
* [https://en.wikipedia.org/wiki/Adreno#Variants Adreno variants on Wikipedia]

== Vivante Corporation ==
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.

=== GC-Series ===
* [https://en.wikipedia.org/wiki/Vivante_Corporation#Products GC series on Wikipedia]
 
=See also=
* [[Deep Learning]]
** [[AlphaGo]]
** [[AlphaZero]]
** [[Neural Networks#Convolutional|Convolutional Neural Networks]]
** [[Leela Zero]]
** [[Leela Chess Zero]]
* [[FPGA]]
* [[Graphics Programming]]
  
 
=Publications=  
 
=Publications=  
==2009==
+
 
 +
==1986==  
 +
* [[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism
 +
==1990==
 +
* [[Mathematician#GEBlelloch|Guy E. Blelloch]] ('''1990'''). ''[https://dl.acm.org/citation.cfm?id=91254 Vector Models for Data-Parallel Computing]''. [https://en.wikipedia.org/wiki/MIT_Press MIT Press], [https://www.cs.cmu.edu/~guyb/papers/Ble90.pdf pdf]
 +
==2008 ...==
 +
* [[Vlad Stamate]] ('''2008'''). ''Real Time Photon Mapping Approximation on the GPU''. in [http://shaderx6.com/TOC.html ShaderX6 - Advanced Rendering Techniques] <ref>[https://en.wikipedia.org/wiki/Photon_mapping Photon mapping from Wikipedia]</ref>
 
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]
 
* [[Ren Wu]], [http://www.cedar.buffalo.edu/~binzhang/ Bin Zhang], [http://www.hpl.hp.com/people/meichun_hsu/ Meichun Hsu] ('''2009'''). ''[http://portal.acm.org/citation.cfm?id=1531668 Clustering billions of data points using GPUs]''. [http://www.computingfrontiers.org/2009/ ACM International Conference on Computing Frontiers]
* [http://www.esrl.noaa.gov/research/review/bios/mark.govett.html Mark Govett], [http://www.esrl.noaa.gov/gsd/media/tierney.html Craig Tierney], [[Jacques Middlecoff]], [http://www.cira.colostate.edu/people/view.php?id=297 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop], [http://www.cisl.ucar.edu/dir/CAS2K9/Presentations/govett.pdf pdf]
+
* [https://github.com/markgovett Mark Govett], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2009'''). ''Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models''. [http://www.cisl.ucar.edu/dir/CAS2K9/ CAS2K9 Workshop]
 +
* [[Hank Dietz]], [https://dblp.uni-trier.de/pers/hd/y/Young:Bobby_Dalton Bobby Dalton Young] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-13374-9_5 MIMD Interpretation on a GPU]''. [https://dblp.uni-trier.de/db/conf/lcpc/lcpc2009.html LCPC 2009], [http://aggregate.ee.engr.uky.edu/EXHIBITS/SC09/mogsimlcpc09final.pdf pdf], [http://aggregate.org/GPUMC/mogsimlcpc09slides.pdf slides.pdf]
 +
* [https://dblp.uni-trier.de/pid/28/7183.html Sander van der Maar], [[Joost Batenburg]], [https://scholar.google.com/citations?user=TtXZhj8AAAAJ&hl=en Jan Sijbers] ('''2009'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-03138-0_33 Experiences with Cell-BE and GPU for Tomography]''. [https://dblp.uni-trier.de/db/conf/samos/samos2009.html#MaarBS09 SAMOS 2009] <ref>[https://en.wikipedia.org/wiki/Cell_(microprocessor) Cell (microprocessor) from Wikipedia]</ref>
 
==2010...==
 
==2010...==
 
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]
 
* [https://www.linkedin.com/in/avi-bleiweiss-456a5644 Avi Bleiweiss] ('''2010'''). ''Playing Zero-Sum Games on the GPU''. [https://en.wikipedia.org/wiki/Nvidia NVIDIA Corporation], [http://www.nvidia.com/object/io_1269574709099.html GPU Technology Conference 2010], [http://www.nvidia.com/content/gtc-2010/pdfs/2207_gtc2010.pdf slides as pdf]
* [http://www.esrl.noaa.gov/research/review/bios/mark.govett.html Mark Govett], [[Jacques Middlecoff]], [http://www.cira.colostate.edu/people/view.php?id=297 Tom Henderson] ('''2010'''). ''[http://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [http://www.informatik.uni-trier.de/~ley/db/conf/ccgrid/ccgrid2010.html#GovettMH10 CCGRID 2010]
+
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson] ('''2010'''). ''[https://dl.acm.org/citation.cfm?id=1845128 Running the NIM Next-Generation Weather Model on GPUs]''. [https://dblp.uni-trier.de/db/conf/ccgrid/ccgrid2010.html CCGRID 2010]
* [http://www.esrl.noaa.gov/research/review/bios/mark.govett.html Mark Govett], [[Jacques Middlecoff]], [http://www.cira.colostate.edu/people/view.php?id=297 Tom Henderson], [http://www.cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [http://www.linkedin.com/pub/craig-tierney/5/854/956 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/documents/Govett.pdf slides as pdf]
+
* John Nickolls, William J. Dally ('''2010'''). [https://ieeexplore.ieee.org/document/5446251 The GPU Computing Era]. [https://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=40 IEEE Micro].
 +
'''2011'''
 +
* [https://github.com/markgovett Mark Govett], [[Jacques Middlecoff]], [https://www.researchgate.net/profile/Tom_Henderson4 Tom Henderson], [https://cug.org/5-publications/proceedings_attendee_lists/CUG09CD/S09_Proceedings/pages/authors/11-15Wednesday/12A-Rosinski/Rosinski-paper.html Jim Rosinski], [https://www.linkedin.com/in/craig-tierney-9568545 Craig Tierney] ('''2011'''). ''Parallelization of the NIM Dynamical Core for GPUs''. [https://is.enes.org/archive-1/archive/documents/Govett.pdf slides as pdf]
 
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]
 
* [[Ľubomír Lackovič]] ('''2011'''). ''[https://hgpu.org/?p=5772 Parallel Game Tree Search Using GPU]''. Institute of Informatics and Software Engineering, [https://en.wikipedia.org/wiki/Faculty_of_Informatics_and_Information_Technologies Faculty of Informatics and Information Technologies], [https://en.wikipedia.org/wiki/Slovak_University_of_Technology_in_Bratislava Slovak University of Technology in Bratislava], [http://acmbulletin.fiit.stuba.sk/vol3num2/lackovic.pdf pdf]
 
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]
 
* [[Dan Anthony Feliciano Alcantara]] ('''2011'''). ''Efficient Hash Tables on the GPU''. Ph. D. thesis, [https://en.wikipedia.org/wiki/University_of_California,_Davis University of California, Davis], [http://idav.ucdavis.edu/~dfalcant//downloads/dissertation.pdf pdf] » [[Hash Table]]
 
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]
 
* [[Damian Sulewski]] ('''2011'''). ''Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks''. Ph.D. thesis, [[University of Dortmund]], [https://eldorado.tu-dortmund.de/dspace/bitstream/2003/29418/1/Dissertation.pdf pdf]
 
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]]  
 
* [[Damjan Strnad]], [[Nikola Guid]] ('''2011'''). ''[http://cit.fer.hr/index.php/CIT/article/view/2029 Parallel Alpha-Beta Algorithm on the GPU]''. [http://cit.fer.hr/index.php/CIT CIT. Journal of Computing and Information Technology], Vol. 19, No. 4 » [[Parallel Search]], [[Othello|Reversi]]  
 +
* [[Balázs Jako|Balázs Jákó]] ('''2011'''). ''Fast Hydraulic and Thermal Erosion on GPU''. M.Sc. thesis, Supervisor [https://hu.linkedin.com/in/bal%C3%A1zs-t%C3%B3th-1b151329 Balázs Tóth], [http://eg2011.bangor.ac.uk/ Eurographics 2011], [http://old.cescg.org/CESCG-2011/papers/TUBudapest-Jako-Balazs.pdf pdf]
 +
'''2012'''
 
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]
 
* [[Liang Li]], [[Hong Liu]], [[Peiyu Liu]], [[Taoying Liu]], [[Wei Li]], [[Hao Wang]] ('''2012'''). ''[https://www.semanticscholar.org/paper/A-Node-based-Parallel-Game-Tree-Algorithm-Using-Li-Liu/be21d7b9b91957b700aab4ce002e6753b826ff54 A Node-based Parallel Game Tree Algorithm Using GPUs]''. CLUSTER 2012 » [[Parallel Search]]
* [[S. Ali Mirsoleimani]], [http://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [http://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]
+
'''2013'''
 +
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://scholar.google.de/citations?view_op=view_citation&hl=en&user=VvkRESgAAAAJ&citation_for_view=VvkRESgAAAAJ:ufrVoPGSRksC A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments]''. [http://www.sigevo.org/gecco-2013/program.html GECCO '13]
 +
* [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami], [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2013'''). ''[https://ieeexplore.ieee.org/document/6714232 A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs]''. [https://ieeexplore.ieee.org/xpl/mostRecentIssue.jsp?punumber=6708586 CADS 2013]
 +
* [[Diego Rodríguez-Losada]], [[Pablo San Segundo]], [[Miguel Hernando]], [https://dblp.uni-trier.de/pers/hd/p/Puente:Paloma_de_la Paloma de la Puente], [https://dblp.uni-trier.de/pers/hd/v/Valero=Gomez:Alberto Alberto Valero-Gomez] ('''2013'''). ''GPU-Mapping: Robotic Map Building with Graphical Multiprocessors''. [https://dblp.uni-trier.de/db/journals/ram/ram20.html IEEE Robotics & Automation Magazine, Vol. 20, No. 2], [https://www.acin.tuwien.ac.at/fileadmin/acin/v4r/v4r/GPUMap_RAM2013.pdf pdf]
 +
* [https://dblp.org/pid/28/977-2.html David Williams], [[Valeriu Codreanu]], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://dblp.org/pid/54/784.html Baoquan Liu], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [https://dblp.org/pid/136/5430.html Burhan Yasar], [https://scholar.google.com/citations?user=FZVGYiQAAAAJ&hl=en Babak Mahdian], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini], [https://zhaoxiahust.github.io/ Xia Zhao], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink] ('''2013'''). ''[https://link.springer.com/chapter/10.1007/978-3-642-55224-3_42 Evaluation of Autoparallelization Toolkits for Commodity GPUs]''. [https://dblp.org/db/conf/ppam/ppam2013-1.html#WilliamsCYLDYMCZR13 PPAM 2013]
 +
'''2014'''
 
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]
 
* [https://dblp.uni-trier.de/pers/hd/d/Dang:Qingqing Qingqing Dang], [https://dblp.uni-trier.de/pers/hd/y/Yan:Shengen Shengen Yan], [[Ren Wu]] ('''2014'''). ''[https://ieeexplore.ieee.org/document/7097862 A fast integral image generation algorithm on GPUs]''. [https://dblp.uni-trier.de/db/conf/icpads/icpads2014.html ICPADS 2014]
 +
* [[S. Ali Mirsoleimani]], [https://dblp.uni-trier.de/pers/hd/k/Karami:Ali Ali Karami Ali Karami], [https://dblp.uni-trier.de/pers/hd/k/Khunjush:Farshad Farshad Khunjush] ('''2014'''). ''[https://link.springer.com/chapter/10.1007/978-3-319-04891-8_12 A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor]''. [https://dblp.uni-trier.de/db/conf/arcs/arcs2014.html ARCS 2014], [https://en.wikipedia.org/wiki/Lecture_Notes_in_Computer_Science Lecture Notes in Computer Science], Vol. 8350, [https://en.wikipedia.org/wiki/Springer_Science%2BBusiness_Media Springer]
 +
* [[Steinar H. Gunderson]] ('''2014'''). ''[https://archive.fosdem.org/2014/schedule/event/movit/ Movit: High-speed, high-quality video filters on the GPU]''. [https://en.wikipedia.org/wiki/FOSDEM FOSDEM] [https://archive.fosdem.org/2014/ 2014], [https://movit.sesse.net/movit-fosdem2014.pdf pdf]
 +
* [https://dblp.org/pid/54/784.html Baoquan Liu], [https://scholar.google.com/citations?user=VspO6ZUAAAAJ&hl=en Alexandru Telea], [https://scholar.google.com/citations?user=jCFYHlkAAAAJ&hl=en Jos Roerdink], [https://dblp.org/pid/87/6797.html Gordon Clapworthy], [https://dblp.org/pid/28/977-2.html David Williams], [https://dblp.org/pid/88/5343-1.html Po Yang], [https://www.strath.ac.uk/staff/dongfengprofessor/ Feng Dong], [[Valeriu Codreanu]], [https://scholar.google.com/citations?user=8WO6cVUAAAAJ&hl=en Alessandro Chiarini] ('''2018'''). ''Parallel centerline extraction on the GPU''. [https://www.journals.elsevier.com/computers-and-graphics Computers & Graphics], Vol. 41, [https://strathprints.strath.ac.uk/70614/1/Liu_etal_CG2014_Parallel_centerline_extraction_GPU.pdf pdf]
 
==2015 ...==
 
* [[Peter H. Jin]], [[Kurt Keutzer]] ('''2015'''). ''Convolutional Monte Carlo Rollouts in Go''. [http://arxiv.org/abs/1512.03375 arXiv:1512.03375] » [[Deep Learning]], [[Go]], [[Monte-Carlo Tree Search|MCTS]]
 
* [[Liang Li]], [[Hong Liu]], [[Hao Wang]], [[Taoying Liu]], [[Wei Li]] ('''2015'''). ''[https://ieeexplore.ieee.org/document/6868996 A Parallel Algorithm for Game Tree Search Using GPGPU]''. [[IEEE#TPDS|IEEE Transactions on Parallel and Distributed Systems]], Vol. 26, No. 8 » [[Parallel Search]]
 
* [[Simon Portegies Zwart]], [https://github.com/jbedorf Jeroen Bédorf] ('''2015'''). ''[https://www.computer.org/csdl/magazine/co/2015/11/mco2015110050/13rRUx0Pqwe Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics]''. [[IEEE #Computer|IEEE Computer]], Vol. 48, No. 11
'''2016'''
 
* <span id="Astro"></span>[https://www.linkedin.com/in/sean-sheen-b99aba89 Sean Sheen] ('''2016'''). ''[https://digitalcommons.calpoly.edu/theses/1567/ Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1]''. Master's thesis, [https://en.wikipedia.org/wiki/California_Polytechnic_State_University California Polytechnic State University], [https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?referer=&httpsredir=1&article=2723&context=theses pdf] <ref>[http://www.nvidia.com/object/jetson-tk1-embedded-dev-kit.html Jetson TK1 Embedded Development Kit | NVIDIA]</ref> <ref>[http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016</ref>
 
* [https://scholar.google.com/citations?user=YyD7mwcAAAAJ&hl=en Jingyue Wu], [https://scholar.google.com/citations?user=EJcIByYAAAAJ&hl=en Artem Belevich], [https://scholar.google.com/citations?user=X5WAGdEAAAAJ&hl=en Eli Bendersky], [https://www.linkedin.com/in/mark-heffernan-873b663/ Mark Heffernan], [https://scholar.google.com/citations?user=Guehv9sAAAAJ&hl=en Chris Leary], [https://scholar.google.com/citations?user=fAmfZAYAAAAJ&hl=en Jacques Pienaar], [http://www.broune.com/ Bjarke Roune], [https://scholar.google.com/citations?user=Der7mNMAAAAJ&hl=en Rob Springer], [https://scholar.google.com/citations?user=zvfOH0wAAAAJ&hl=en Xuetian Weng], [https://scholar.google.com/citations?user=s7VCtl8AAAAJ&hl=en Robert Hundt] ('''2016'''). ''[https://dl.acm.org/citation.cfm?id=2854041 gpucc: an open-source GPGPU compiler]''. [https://cgo.org/cgo2016/ CGO 2016]
 
* [[David Silver]], [[Shih-Chieh Huang|Aja Huang]], [[Chris J. Maddison]], [[Arthur Guez]], [[Laurent Sifre]], [[George van den Driessche]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Veda Panneershelvam]], [[Marc Lanctot]], [[Sander Dieleman]], [[Dominik Grewe]], [[John Nham]], [[Nal Kalchbrenner]], [[Ilya Sutskever]], [[Timothy Lillicrap]], [[Madeleine Leach]], [[Koray Kavukcuoglu]], [[Thore Graepel]], [[Demis Hassabis]] ('''2016'''). ''[http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html Mastering the game of Go with deep neural networks and tree search]''. [https://en.wikipedia.org/wiki/Nature_%28journal%29 Nature], Vol. 529 » [[AlphaGo]]
 
* [[Balázs Jako|Balázs Jákó]] ('''2016'''). ''[https://www.semanticscholar.org/paper/Hardware-accelerated-hybrid-rendering-on-PowerVR-J%C3%A1k%C3%B3/d9d7f5784263c5abdcd6c1bf93267e334468b9b2 Hardware accelerated hybrid rendering on PowerVR GPUs]''. <ref>[https://en.wikipedia.org/wiki/PowerVR PowerVR from Wikipedia]</ref> [[IEEE]] [https://ieeexplore.ieee.org/xpl/conhome/7547434/proceeding 20th Jubilee International Conference on Intelligent Engineering Systems]
* [[Diogo R. Ferreira]], [https://dblp.uni-trier.de/pers/hd/s/Santos:Rui_M= Rui M. Santos] ('''2016'''). ''[https://github.com/diogoff/transition-counting-gpu Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs]''. [https://dblp.uni-trier.de/db/conf/bpm/bpmw2016.html BPM 2016]
* [https://dblp.org/pers/hd/s/Sch=uuml=tt:Ole Ole Schütt], [https://developer.nvidia.com/blog/author/peter-messmer/ Peter Messmer], [https://scholar.google.ch/citations?user=ajbBWN0AAAAJ&hl=en Jürg Hutter], [[Joost VandeVondele]] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/10.1002/9781118670712.ch8 GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory]''. [https://www.cp2k.org/_media/gpu_book_chapter_submitted.pdf pdf] <ref>[https://en.wikipedia.org/wiki/Density_functional_theory Density functional theory from Wikipedia]</ref>
: Chapter 8 in [https://scholar.google.com/citations?user=AV307ZUAAAAJ&hl=en Ross C. Walker], [https://scholar.google.com/citations?user=PJusscIAAAAJ&hl=en Andreas W. Götz] ('''2016'''). ''[https://onlinelibrary.wiley.com/doi/book/10.1002/9781118670712 Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics]''. [https://en.wikipedia.org/wiki/Wiley_(publisher) John Wiley & Sons]
'''2017'''
 
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2017'''). ''Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm''. [https://arxiv.org/abs/1712.01815 arXiv:1712.01815] » [[AlphaZero]]
 
* [[Tristan Cazenave]] ('''2017'''). ''[http://ieeexplore.ieee.org/document/7875402/ Residual Networks for Computer Go]''.  [[IEEE#TOCIAIGAMES|IEEE Transactions on Computational Intelligence and AI in Games]], Vol. PP, No. 99, [http://www.lamsade.dauphine.fr/~cazenave/papers/resnet.pdf pdf]
 
* [https://scholar.google.com/citations?user=zLksndkAAAAJ&hl=en Jayvant Anantpur], [https://dblp.org/pid/09/10702.html Nagendra Gulur Dwarakanath], [https://dblp.org/pid/16/4410.html Shivaram Kalyanakrishnan], [[Shalabh Bhatnagar]], [https://dblp.org/pid/45/3592.html R. Govindarajan] ('''2017'''). ''RLWS: A Reinforcement Learning based GPU Warp Scheduler''. [https://arxiv.org/abs/1712.04303 arXiv:1712.04303]
'''2018'''
 
* [[David Silver]], [[Thomas Hubert]], [[Julian Schrittwieser]], [[Ioannis Antonoglou]], [[Matthew Lai]], [[Arthur Guez]], [[Marc Lanctot]], [[Laurent Sifre]], [[Dharshan Kumaran]], [[Thore Graepel]], [[Timothy Lillicrap]], [[Karen Simonyan]], [[Demis Hassabis]] ('''2018'''). ''[http://science.sciencemag.org/content/362/6419/1140 A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play]''. [https://en.wikipedia.org/wiki/Science_(journal) Science], Vol. 362, No. 6419
 
 
==2010 ...==
 
* [http://www.talkchess.com/forum/viewtopic.php?t=32750 Using the GPU] by [[Louis Zulli]], [[CCC]], February 19, 2010
 
'''2011'''
 
* [http://www.talkchess.com/forum/viewtopic.php?t=38002 GPGPU and computer chess] by Wim Sjoho, [[CCC]], February 09, 2011
 
* [http://www.talkchess.com/forum/viewtopic.php?t=38478 Possible Board Presentation and Move Generation for GPUs?] by [[Srdja Matovic]], [[CCC]], March 19, 2011
 
: [http://www.talkchess.com/forum/viewtopic.php?t=38478&start=8 Re: Possible Board Presentation and Move Generation for GPUs] by [[Steffan Westcott]], [[CCC]], March 20, 2011
 
* [http://www.talkchess.com/forum/viewtopic.php?t=39459 Zeta plays chess on a gpu] by [[Srdja Matovic]], [[CCC]], June 23, 2011 » [[Zeta]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=39606 GPU Search Methods] by [[Joshua Haglund]], [[CCC]], July 04, 2011
 
'''2012'''
* [http://www.talkchess.com/forum/viewtopic.php?topic_view=threads&p=442052&t=41853 Possible Search Algorithms for GPUs?] by [[Srdja Matovic]], [[CCC]], January 07, 2012 <ref>[[Yaron Shoham]], [[Sivan Toledo]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S0004370202001959 Parallel Randomized Best-First Minimax Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_(journal) Artificial Intelligence], Vol. 137, Nos. 1-2</ref> <ref>[[Alberto Maria Segre]], [[Sean Forman]], [[Giovanni Resta]], [[Andrew Wildenberg]] ('''2002'''). ''[https://www.sciencedirect.com/science/article/pii/S000437020200228X Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search]''. [https://en.wikipedia.org/wiki/Artificial_Intelligence_%28journal%29 Artificial Intelligence], Vol. 140, Nos. 1-2</ref>
 
* [http://www.talkchess.com/forum/viewtopic.php?t=42590 uct on gpu] by [[Daniel Shawul]], [[CCC]], February 24, 2012 » [[UCT]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=43971 Is there such a thing as branchless move generation?] by [[John Hamlen]], [[CCC]], June 07, 2012 » [[Move Generation]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=44014 Choosing a GPU platform: AMD and Nvidia] by [[John Hamlen]], [[CCC]], June 10, 2012
 
* [http://www.talkchess.com/forum/viewtopic.php?t=46277 Nvidias K20 with Recursion] by [[Srdja Matovic]], [[CCC]], December 04, 2012 <ref>[http://www.techpowerup.com/173846/Tesla-K20-GPU-Compute-Processor-Specifications-Released.html Tesla K20 GPU Compute Processor Specifications Released | techPowerUp]</ref>
 
'''2013'''
 
* [http://www.talkchess.com/forum/viewtopic.php?t=46974 Kogge Stone, Vector Based] by [[Srdja Matovic]], [[CCC]], January 22, 2013 » [[Kogge-Stone Algorithm]] <ref>[https://en.wikipedia.org/wiki/Parallel_Thread_Execution Parallel Thread Execution from Wikipedia]</ref> <ref>NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, [http://www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf pdf]</ref>
 
* [http://www.talkchess.com/forum/viewtopic.php?t=47344 GPU chess engine] by Samuel Siltanen, [[CCC]], February 27, 2013
 
 
* [http://www.talkchess.com/forum/viewtopic.php?t=61761 Jetson GPU architecture] by [[Dann Corbit]], [[CCC]], October 18, 2016 » [[GPU#Astro|Astro]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=61925 Pigeon is now running on the GPU] by [[Stuart Riffle]], [[CCC]], November 02, 2016 » [[Pigeon]]
 
'''2017'''
 
* [http://www.talkchess.com/forum/viewtopic.php?t=63346 Back to the basics, generating moves on gpu in parallel...] by [[Srdja Matovic]], [[CCC]], March 05, 2017 » [[Move Generation]]
 
* [http://www.talkchess.com/forum/viewtopic.php?t=64983&start=9 Re: Perft(15): comparison of estimates with Ankan's result] by [[Ankan Banerjee]], [[CCC]], August 26, 2017 » [[Perft#15|Perft(15)]]
 
* [http://rybkaforum.net/cgi-bin/rybkaforum/topic_show.pl?tid=32317 Chess Engine and GPU] by Fishpov, [[Computer Chess Forums|Rybka Forum]], October 09, 2017
 
* [http://www.talkchess.com/forum/viewtopic.php?t=66025 To TPU or not to TPU...] by [[Srdja Matovic]], [[CCC]], December 16, 2017 » [[Deep Learning]] <ref>[https://en.wikipedia.org/wiki/Tensor_processing_unit Tensor processing unit from Wikipedia]</ref>
 
'''2018'''
 
* [http://www.talkchess.com/forum/viewtopic.php?t=66280 Announcing lczero] by [[Gary Linscott|Gary]], [[CCC]], January 09, 2018 » [[Leela Chess Zero]]
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=67347 GPU ANN, how to deal with host-device latencies?] by [[Srdja Matovic]], [[CCC]], May 06, 2018 » [[Neural Networks]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=67357 GPU contention] by [[Ian Kennedy]], [[CCC]], May 07, 2018 » [[Leela Chess Zero]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448 How good is the RTX 2080 Ti for Leela?] by Hai, September 15, 2018 » [[Leela Chess Zero]] <ref>[https://en.wikipedia.org/wiki/GeForce_20_series GeForce 20 series from Wikipedia]</ref>
: [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68448&start=2 Re: How good is the RTX 2080 Ti for Leela?] by [[Ankan Banerjee]], [[CCC]], September 16, 2018
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=68973 My non-OC RTX 2070 is very fast with Lc0] by [[Kai Laskos]], [[CCC]], November 19, 2018 » [[Leela Chess Zero]]
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69400 LC0 using 4 x 2080 Ti GPU's on Chess.com tourney?] by M. Ansari, [[CCC]], December 28, 2018 » [[Leela Chess Zero]]
 
'''2019'''
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=69447 Generate EGTB with graphics cards?] by [[Pham Hong Nguyen|Nguyen Pham]], [[CCC]], January 01, 2019 » [[Endgame Tablebases]]
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=69478 LCZero FAQ is missing one important fact] by [[Jouni Uski]], [[CCC]], January 01, 2019 » [[Leela Chess Zero]]
 
* [https://groups.google.com/d/msg/lczero/I0lTgR-fFFU/NGC3kJDzAwAJ Michael Larabel benches lc0 on various GPUs] by [[Warren D. Smith]], [[Computer Chess Forums|LCZero Forum]], January 14, 2019 » [[Leela Chess Zero#Lc0|Lc0]] <ref>[https://en.wikipedia.org/wiki/Phoronix_Test_Suite Phoronix Test Suite from Wikipedia]</ref>
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=70362 Using LC0 with one or two GPUs - a guide] by [[Srdja Matovic]], [[CCC]], March 30, 2019 » [[Leela Chess Zero#Lc0|Lc0]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=70584 Wouldn't it be nice if C++ GPU] by [[Chris Whittington]], [[CCC]], April 25, 2019 » [[Cpp|C++]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=71058 Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search] by  Percival Tiglao, [[CCC]], June 06, 2019
* [https://www.game-ai-forum.org/viewtopic.php?f=21&t=694 My home-made CUDA kernel for convolutions] by [[Rémi Coulom]], [[Computer Chess Forums|Game-AI Forum]], November 09, 2019 » [[Deep Learning]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=72320 GPU rumors 2020] by [[Srdja Matovic]], [[CCC]], November 13, 2019
==2020 ...==
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=74771 AB search with NN on GPU...] by [[Srdja Matovic]], [[CCC]], August 13, 2020 » [[Neural Networks]] <ref>[https://forums.developer.nvidia.com/t/kernel-launch-latency/62455 kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums] by LukeCuda, June 18, 2018</ref>
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75073 I stumbled upon this article on the new Nvidia RTX GPUs] by [[Kai Laskos]], [[CCC]], September 10, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75639 Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0?] by [[Srdja Matovic]], [[CCC]], November 01, 2020
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=76986 Zeta with NNUE on GPU?] by [[Srdja Matovic]], [[CCC]], March 31, 2021 » [[Zeta]], [[NNUE]]
* [https://talkchess.com/forum3/viewtopic.php?f=2&t=77097 GPU rumors 2021] by [[Srdja Matovic]], [[CCC]], April 16, 2021
* [https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078 Comparison of all known Sliding lookup algorithms <nowiki>[CUDA]</nowiki>] by [[Daniel Infuehr]], [[CCC]], January 08, 2022 » [[Sliding Piece Attacks]]
* [https://talkchess.com/forum3/viewtopic.php?f=7&t=72566&p=955538#p955538 Re: China boosts in silicon...] by [[Srdja Matovic]], [[CCC]], January 13, 2024
  
 
=External Links=  
 
 
* [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]
 
* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]
 
* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]
 
* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]
 
* [https://developer.nvidia.com/ NVIDIA Developer]
 
 
: [https://github.com/gcp/leela-zero GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper]
 
==Chess Programming==
 
* [https://chessgpgpu.blogspot.com/ Chess on a GPGPU]
 
* [http://gpuchess.blogspot.com/ GPU Chess Blog]
 
* [https://github.com/ankan-ban/perft_gpu ankan-ban/perft_gpu · GitHub] » [[Perft]] <ref>[http://www.talkchess.com/forum/viewtopic.php?t=48387 Fast perft on GPU (upto 20 Billion nps w/o hashing)] by [[Ankan Banerjee]], [[CCC]], June 22, 2013</ref>
 
 
* [https://github.com/LeelaChessZero LCZero · GitHub] » [[Leela Chess Zero]]
 
* [https://github.com/StuartRiffle/Jaglavak GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine] » [[Jaglavak]]
* [https://zeta-chess.app26.de/ Zeta OpenCL Chess] » [[Zeta]]
  
 

Latest revision as of 10:34, 24 January 2024


GPU in Computer Chess

There are four main ways to use a GPU for chess:

  • As an accelerator in Lc0: run a neural network for position evaluation on GPU
  • Offload the search in Zeta: run a parallel game tree search with move generation and position evaluation on GPU
  • As a hybrid in perft_gpu: expand the game tree to a certain degree on CPU and offload to GPU to compute the sub-tree
  • Neural network training such as Stockfish NNUE trainer in Pytorch[2] or Lc0 TensorFlow Training

GPU Chess Engines

GPGPU

Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like OpenGL or DirectX. These were followed by the first GPGPU frameworks such as Sh/RapidMind and Brook, and finally CUDA and OpenCL.

Khronos OpenCL

OpenCL, specified by the Khronos Group, is widely adopted across all kinds of hardware accelerators from different vendors.

AMD

AMD supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With ROCm it offers its own parallel compute platform.

Apple

Since macOS 10.14 Mojave, Apple recommends a transition from OpenCL to Metal.

Intel

Intel supports OpenCL with implementations like BEIGNET and NEO for different GPU architectures, and the oneAPI platform with DPC++ as frontend language.

Nvidia

CUDA is the parallel computing platform by Nvidia. It supports language frontends like C, C++, Fortran, OpenCL and offload directives via OpenACC and OpenMP.

Further

Hardware Model

A common scheme on GPUs with unified shader architecture is to run multiple threads in SIMT fashion, and a multitude of SIMT waves on the same SIMD unit, to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have different numbers of cores depending on the architecture (SIMD8, SIMD16, SIMD32) and different computation abilities: floating-point and/or integer, with specific bit-widths of the FPU/ALU and registers. There is a difference between a vector processor with variable bit-width and SIMD units with fixed-bit-width cores. Architecture white papers from different vendors leave room for speculation about the concrete underlying hardware implementation and its exact classification. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
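
How this hardware model maps to a concrete device can be queried at runtime. A minimal CUDA host program (an illustrative sketch, not from the original article; assumes an installed CUDA runtime) prints the compute units and warp size the hardware reports:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // query the first CUDA device
        printf("name: %s\n", prop.name);
        printf("compute units (SMs): %d\n", prop.multiProcessorCount);
        printf("warp size: %d\n", prop.warpSize);
        printf("clock rate: %d kHz\n", prop.clockRate);
        return 0;
    }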

Vendor Terminology
AMD Terminology      Nvidia Terminology
Compute Unit         Streaming Multiprocessor
Stream Core          CUDA Core
Wavefront            Warp

Hardware Examples

Nvidia GeForce GTX 580 (Fermi) [3][4]

  • 512 CUDA cores @1.544GHz
  • 16 SMs - Streaming Multiprocessors
  • organized in 2x16 CUDA cores per SM
  • Warp size of 32 threads

AMD Radeon HD 7970 (GCN)[5][6]

  • 2048 Stream cores @0.925GHz
  • 32 Compute Units
  • organized in 4xSIMD16, each SIMT4, per Compute Unit
  • Wavefront size of 64 work-items

Wavefront and Warp

Generalized, the Wavefront or Warp size is the number of threads executed in SIMT fashion on a GPU with unified shader architecture.

Programming Model

A parallel programming model for GPGPU can be data-parallel, task-parallel, a mixture of both, or, with libraries and offload directives, also implicitly parallel. Single GPU threads (work-items in OpenCL) contain the kernel to be computed and are coupled into a work-group; one or multiple work-groups form the NDRange to be executed on the GPU device. The members of a work-group execute the same kernel, can usually be synchronized, and have access to the same scratch-pad memory, with an architecture limit on how many work-items a work-group can hold and how many threads can run concurrently on the device in total.
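
As a minimal sketch of this model (illustrative only; the kernel name and sizes are hypothetical), a CUDA vector addition shows how work-items/threads, work-groups/blocks and the NDRange/grid relate:

    #include <cuda_runtime.h>

    // each thread (work-item) computes one element of c
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index within the grid/NDRange
        if (i < n)                                      // guard the tail of the last block
            c[i] = a[i] + b[i];
    }

    // host-side launch: work-groups/blocks of 256 threads tile the problem
    // int blocks = (n + 255) / 256;
    // vec_add<<<blocks, 256>>>(d_a, d_b, d_c, n);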

Terminology
OpenCL Terminology   CUDA Terminology
Kernel               Kernel
Compute Unit         Streaming Multiprocessor
Processing Element   CUDA Core
Work-Item            Thread
Work-Group           Block
NDRange              Grid

Thread Examples

Nvidia GeForce GTX 580 (Fermi, CC2) [7]

  • Warp size: 32
  • Maximum number of threads per block: 1024
  • Maximum number of resident blocks per multiprocessor: 32
  • Maximum number of resident warps per multiprocessor: 64
  • Maximum number of resident threads per multiprocessor: 2048


AMD Radeon HD 7970 (GCN) [8]

  • Wavefront size: 64
  • Maximum number of work-items per work-group: 1024
  • Maximum number of work-groups per compute unit: 40
  • Maximum number of Wavefronts per compute unit: 40
  • Maximum number of work-items per compute unit: 2560

Memory Model

OpenCL offers the following memory model for the programmer:

  • __private - usually registers, accessible only by a single work-item resp. thread.
  • __local - scratch-pad memory shared across work-items of a work-group resp. threads of a block.
  • __constant - read-only memory.
  • __global - usually VRAM, accessible by all work-items resp. threads.
Terminology
OpenCL Terminology   CUDA Terminology
Private Memory       Registers
Local Memory         Shared Memory
Constant Memory      Constant Memory
Global Memory        Global Memory
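
A small CUDA sketch (a hypothetical kernel, for illustration) using the counterparts of these memory spaces - registers, __shared__, __constant__ and global memory:

    #include <cuda_runtime.h>

    __constant__ float weight;                 // constant memory, read-only for the kernel

    // assumes a block size of 256 (a power of two) for the tree reduction
    __global__ void scaled_block_sum(const float *in, float *out, int n) {
        __shared__ float tile[256];            // local/shared memory of the work-group/block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        float v = (i < n) ? in[i] * weight : 0.0f;  // v lives in a register (__private)
        tile[threadIdx.x] = v;
        __syncthreads();                       // synchronize the work-group/block
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (threadIdx.x < s)
                tile[threadIdx.x] += tile[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            out[blockIdx.x] = tile[0];         // result goes to global memory (VRAM)
    }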

Memory Examples

Nvidia GeForce GTX 580 (Fermi) [9]

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache in total
  • 1.5 GiB to 3 GiB global memory

AMD Radeon HD 7970 (GCN) [10]

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache in total
  • 3 GiB to 6 GiB global memory

Unified Memory

Usually data has to be copied between a CPU host and a discrete GPU device, but different architectures from different vendors with different frameworks on different operating systems may offer a unified and accessible address space between CPU and GPU.
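
In CUDA, for example, such a unified address space is exposed as managed memory (a sketch, assuming a GPU and driver with unified memory support):

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void inc(int *x) { *x += 1; }   // device-side update

    int main() {
        int *x;
        cudaMallocManaged(&x, sizeof(int));    // one pointer, valid on host and device
        *x = 41;                               // host write
        inc<<<1, 1>>>(x);
        cudaDeviceSynchronize();               // make the device write visible to the host
        printf("%d\n", *x);                    // prints 42
        cudaFree(x);
        return 0;
    }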

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's TeraScale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

Integer Instruction Throughput

  • INT32
Depending on architecture and operation, 32-bit integer performance can be lower than 32-bit floating-point or 24-bit integer performance.
  • INT64
In general, registers and vector ALUs of consumer brand GPUs are 32-bit wide, so 64-bit integer operations have to be emulated (see the sketch after this list).
  • INT8
Some architectures offer higher throughput with lower precision, quadrupling INT8 or octupling INT4 throughput.
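
A sketch of such an emulation in plain CUDA C (a hypothetical helper; real compilers emit dedicated carry instructions such as PTX add.cc/addc): the 64-bit addition is split into two 32-bit halves with a manual carry:

    // emulate a 64-bit add with 32-bit registers
    __device__ void add64(unsigned a_lo, unsigned a_hi,
                          unsigned b_lo, unsigned b_hi,
                          unsigned *r_lo, unsigned *r_hi) {
        unsigned lo = a_lo + b_lo;
        unsigned carry = (lo < a_lo) ? 1u : 0u;  // unsigned wrap-around signals a carry
        *r_lo = lo;
        *r_hi = a_hi + b_hi + carry;             // carry propagates into the high half
    }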

Floating-Point Instruction Throughput

  • FP32
Consumer GPU performance is usually measured in single-precision (32-bit) floating-point FMA (fused multiply-add) throughput.
  • FP64
Consumer GPUs in general have a lower double-precision (64-bit) floating-point throughput, relative to FP32, than server brand GPUs.
  • FP16
Some GPGPU architectures offer half-precision (16-bit) floating-point throughput with an FP32:FP16 ratio of 1:2.

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32-bit integer operations/clock cycle per compute unit [11]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Maximum theoretical ADD operation throughput: 32 Ops x 16 CUs x 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32-bit integer operations/clock cycle per processing element [12]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Maximum theoretical ADD operation throughput: 1 Op x 2048 PEs x 925 MHz = 1894.4 GigaOps/sec
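
The formula behind both numbers, as a tiny host-side sketch (values hard-coded from the two examples above):

    #include <cstdio>

    // theoretical throughput = ops per clock per unit x number of units x clock in Hz
    double peak_ops(double ops_per_clock, double units, double clock_hz) {
        return ops_per_clock * units * clock_hz;
    }

    int main() {
        printf("GTX 580: %.3f GigaOps/s\n", peak_ops(32, 16, 1544e6) / 1e9);  // 790.528
        printf("HD 7970: %.1f GigaOps/s\n", peak_ops(1, 2048, 925e6) / 1e9);  // 1894.4
        return 0;
    }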

Tensors

MMAC (matrix-multiply-accumulate) units are used in consumer brand GPUs for neural network based upsampling of video game resolutions, in professional brands for upsampling of images and videos, and in server brand GPUs for accelerating convolutional neural networks in general. Convolutions can be implemented as a series of matrix multiplications via Winograd transformations [13]. Mobile SoCs usually have a dedicated neural network engine as MMAC unit.

Nvidia TensorCores

TensorCores were introduced with the Nvidia Volta series. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.[14] Turing's 2nd-gen TensorCores add FP16, INT8 and INT4 optimized computation.[15] Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.[16] Ada Lovelace's 4th gen adds support for FP8.[17]
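
As a sketch of how these units are programmed directly (an illustrative kernel, assuming a TensorCore-capable GPU of compute capability 7.0 or higher), CUDA exposes them through the warp-level wmma API; one warp computes a single 16x16x16 FP16xFP16+FP32 tile:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    __global__ void tile_mma(const half *a, const half *b, float *c) {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
        wmma::fill_fragment(fc, 0.0f);         // zero the FP32 accumulator
        wmma::load_matrix_sync(fa, a, 16);     // leading dimension 16
        wmma::load_matrix_sync(fb, b, 16);
        wmma::mma_sync(fc, fa, fb, fc);        // the TensorCore multiply-accumulate
        wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
    }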

AMD Matrix Cores

In 2020 AMD released its server-class CDNA architecture with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32. AMD's CDNA 2 architecture adds FP64 optimized throughput for matrix operations. AMD's RDNA 3 architecture features dedicated AI tensor operation acceleration. AMD's CDNA 3 architecture adds support for FP8 and sparse matrix data (sparsity).

Intel XMX Cores

Intel added XMX, Xe Matrix eXtensions, cores to some of the Intel Xe GPU series, like Arc Alchemist and Intel Data Center GPU Max Series.

Host-Device Latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels, from 5 microseconds [18] up to hundreds of microseconds [19]. One solution to overcome this limitation is to couple tasks into batches to be executed in one run [20].
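
The overhead can be measured with a null-kernel on one's own hardware (a minimal sketch, assuming the CUDA toolkit; numbers vary with driver and operating system):

    #include <cstdio>
    #include <chrono>
    #include <cuda_runtime.h>

    __global__ void null_kernel() {}           // empty kernel, measures pure launch cost

    int main() {
        null_kernel<<<1, 1>>>();               // warm-up launch
        cudaDeviceSynchronize();
        const int n = 1000;
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < n; ++i) {
            null_kernel<<<1, 1>>>();
            cudaDeviceSynchronize();           // host waits each time, as a per-position evaluation would
        }
        auto t1 = std::chrono::steady_clock::now();
        double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / n;
        printf("average host-device round trip: %.1f microseconds\n", us);
        return 0;
    }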

Deep Learning

GPUs are much better suited than CPUs to implement and train convolutional neural networks (CNNs), and were therefore also responsible for the deep learning boom. This also affected game playing programs combining CNNs with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and by the open source projects Leela Zero, headed by Gian-Carlo Pascutto, for Go, and its Leela Chess Zero adaptation.

Architectures

The market is split into two categories, integrated and discrete GPUs; the former is the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.

AMD

AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server use.

CDNA3

The CDNA3 HPC architecture was unveiled in December 2023, with the MI300A APU model (CPU+GPU+HBM) and the MI300X GPU model, both of multi-chip module design. It features Matrix Cores with support for a broad range of precisions (INT8, FP8, BF16, FP16, TF32, FP32, FP64) as well as sparse matrix data (sparsity), and is supported by AMD's ROCm open software stack for AMD Instinct accelerators.

Navi 3x RDNA3

RDNA3 architecture in Radeon RX 7000 series was announced on November 3, 2022, featuring dedicated AI tensor operation acceleration.

CDNA2

CDNA2 architecture in MI200 HPC-GPU with optimized FP64 throughput (matrix and vector), multi-chip-module design and Infinity Fabric was unveiled in November, 2021.

CDNA

CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.

Navi 2x RDNA2

RDNA2 cards were unveiled on October 28, 2020.

Navi RDNA

RDNA cards were unveiled on July 7, 2019.

Vega GCN 5th gen

Vega cards were unveiled on August 14, 2017.

Polaris GCN 4th gen

Polaris cards were first released in 2016.

Southern Islands GCN 1st gen

Southern Island cards introduced the GCN architecture in 2012.

Apple

M series

Apple released its M series SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.

ARM

The ARM Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.

Valhall (2019)

Bifrost (2016)

Midgard (2012)

Intel

Xe

Intel's Xe line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).

Nvidia

Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server use.

Grace Hopper Superchip

The Nvidia GH200 Grace Hopper Superchip was unveiled in August 2023 and combines the Nvidia Grace CPU (ARM v9) and Nvidia Hopper GPU architectures via NVLink to deliver a CPU+GPU coherent memory model for accelerated AI and HPC applications.

Ada Lovelace Architecture

The Ada Lovelace microarchitecture was announced on September 20, 2022, featuring 4th-generation Tensor Cores with FP8, FP16, BF16, TF32 and sparsity acceleration.

Hopper Architecture

The Hopper GPU datacenter microarchitecture was announced on March 22, 2022, featuring Transformer Engines for large language models.

Ampere Architecture

The Ampere microarchitecture was announced on May 14, 2020 [21]. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 [22].

Turing Architecture

Turing cards were first released in 2018. They are the first consumer cards to launch with RTX raytracing features, and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate convolutional neural networks. The Turing GTX line of chips does not offer RTX or TensorCores.

Volta Architecture

Volta cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate convolutional neural networks.

Pascal Architecture

Pascal cards were first released in 2016.

Maxwell Architecture

Maxwell cards were first released in 2014.

PowerVR

PowerVR (Imagination Technologies) licenses IP to third parties (most notably Apple) used for system on a chip (SoC) designs. Since the Series5 SGX, OpenCL support is available via licensees.

PowerVR

IMG

Qualcomm

Qualcomm offers Adreno GPUs in various types as a component of their Snapdragon SoCs. Since the Adreno 300 series, OpenCL support is offered.

Adreno

Vivante Corporation

Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.

GC-Series


References

  1. Image by Mahogny, February 09, 2008, Wikimedia Commons
  2. Pytorch NNUE training by Gary Linscott, CCC, November 08, 2020
  3. Fermi white paper from Nvidia
  4. GeForce 500 series on Wikipedia
  5. Graphics Core Next on Wikipedia
  6. Radeon HD 7000 series on Wikipedia
  7. CUDA Technical_Specification on Wikipedia
  8. AMD GPU Hardware Basics
  9. CUDA C Programming Guide v7.0, Appendix G. Compute Capabilities
  10. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  11. CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
  12. AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
  13. Re: To TPU or not to TPU... by Rémi Coulom, CCC, December 16, 2017
  14. INSIDE VOLTA
  15. AnandTech - Nvidia Turing Deep Dive page 6
  16. Wikipedia - Ampere microarchitecture
  17. Wikipedia - Ada Lovelace microarchitecture
  18. host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
  19. host-device latencies? by Srdja Matovic, AMD Developer Community, Feb 28, 2019
  20. Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
  21. NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog by Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam and Sridhar Ramaswamy, May 14, 2020
  22. CUDA 11 Features Revealed | NVIDIA Developer Blog by Pramod Ramarao, May 14, 2020
  23. Photon mapping from Wikipedia
  24. Cell (microprocessor) from Wikipedia
  25. Jetson TK1 Embedded Development Kit | NVIDIA
  26. Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
  27. PowerVR from Wikipedia
  28. Density functional theory from Wikipedia
  29. Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
  30. Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
  31. Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
  32. Parallel Thread Execution from Wikipedia
  33. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
  34. ankan-ban/perft_gpu · GitHub
  35. Tensor processing unit from Wikipedia
  36. GeForce 20 series from Wikipedia
  37. Phoronix Test Suite from Wikipedia
  38. kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums by LukeCuda, June 18, 2018
  39. Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
  40. Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013

Up one Level