Revision as of 11:08, 23 October 2021
GPU (Graphics Processing Unit),
a specialized processor primarily intended for fast image processing. GPUs may have more raw computing power than general-purpose CPUs but need a specialized and parallelized way of programming. Leela Chess Zero has demonstrated that a best-first Monte-Carlo Tree Search (MCTS) with deep learning methodology works with GPU architectures.
- 1 History
- 2 GPU in Computer Chess
- 3 GPU Chess Engines
- 4 GPGPU
- 5 Hardware Model
- 6 Programming Model
- 7 Memory Model
- 8 Instruction Throughput
- 9 Host-Device Latencies
- 10 Deep Learning
- 11 Architectures
- 11.1 AMD
- 11.2 Apple
- 11.3 ARM Mali
- 11.4 Intel
- 11.5 Nvidia
- 11.6 PowerVR
- 11.7 Vivante Corporation
- 12 See also
- 13 Publications
- 14 Forum Posts
- 15 External Links
- 16 References
History
In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer, like TIA in the Atari VCS gaming system, GTIA+ANTIC in the Atari 400/800 series, or Denise+Agnus in the Commodore Amiga series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the 3dfx Voodoo2, were used by the video game community to render 3D graphics. Some game engines could instead use the SIMD capabilities of CPUs such as the Intel MMX instruction set or AMD's 3DNow! for real-time rendering. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the unified shader architecture, like in Nvidia Tesla (2006), ATI/AMD TeraScale (2007) or Intel GMA X3000 (2006), GPGPU frameworks like CUDA and OpenCL emerged and gained in popularity.
GPU in Computer Chess
There are three main approaches to using a GPU for chess:
- As an accelerator in Lc0: run a neural network for position evaluation on GPU.
- Offload the search in Zeta: run a parallel game tree search with move generation and position evaluation on GPU.
- As a hybrid in perft_gpu: expand the game tree to a certain depth on the CPU and offload the sub-trees to the GPU for computation.
GPU Chess Engines
GPGPU
Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like OpenGL or DirectX, followed by early GPGPU frameworks such as Sh/RapidMind or Brook, and finally CUDA and OpenCL.
- AMD OpenCL Developer Community
- ROCm Homepage
- AMD OpenCL Programming Guide
- AMD OpenCL Optimization Guide
- RDNA Instruction Set
- Vega Instruction Set
Since macOS 10.14 Mojave, Apple recommends a transition from OpenCL to Metal.
- Apple OpenCL Developer
- Apple Metal Developer
- Apple Metal Programming Guide
- Metal Shading Language Specification
- oneAPI (Intel)
- C++ AMP (Microsoft)
- DirectCompute (Microsoft)
- OpenACC (offload directives)
- OpenMP (offload directives)
Hardware Model
A common scheme on GPUs is to run multiple threads in SIMT fashion and a multitude of SIMT waves on the same SIMD unit to hide memory latencies. Multiple processing elements (GPU cores) are members of a SIMD unit, multiple SIMD units are coupled to a compute unit, and up to hundreds of compute units are present on a discrete GPU. The actual SIMD units may have architecture-dependent numbers of cores (SIMD8, SIMD16, SIMD32) and different computation abilities, floating-point and/or integer, with specific bit-widths of the FPU/ALU. Scalar units present in the compute unit perform special functions the SIMD units are not capable of, and MMAC units (matrix-multiply-accumulate units) are used to speed up neural networks further.
Programming Model
A parallel programming model for GPGPU can be data-parallel, task-parallel, a mixture of both, or, with libraries and offload directives, also implicitly parallel. Single GPU threads (work-items in OpenCL) are coupled into a block (work-group in OpenCL); these can usually be synchronized and have access to the same scratch-pad memory, with an architecture limit on how many threads a block can hold.
Memory Model
OpenCL offers the following memory model for the programmer:
- __private - usually registers, accessible only by a single work-item (thread).
- __local - scratch-pad memory shared across work-items of a work-group (threads of a block).
- __constant - read-only memory.
- __global - usually VRAM, accessible by all work-items (threads).
Nvidia GeForce GTX 580 (Fermi) as an example:
- 128 KiB private memory per compute unit
- 48 KiB (16 KiB) local memory per compute unit (configurable)
- 64 KiB constant memory
- 8 KiB constant cache per compute unit
- 16 KiB (48 KiB) L1 cache per compute unit (configurable)
- 768 KiB L2 cache
- 1.5 GiB to 3 GiB global memory
AMD Radeon HD 7970 (GCN) as an example:
- 256 KiB private memory per compute unit
- 64 KiB local memory per compute unit
- 64 KiB constant memory
- 16 KiB constant cache per four compute units
- 16 KiB L1 cache per compute unit
- 768 KiB L2 cache
- 3 GiB to 6 GiB global memory
Instruction Throughput
GPUs are used in HPC environments because of their good FLOPS/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's TeraScale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.
Integer Instruction Throughput
- The 32-bit integer performance can, depending on architecture and operation, be lower than the 32-bit FLOP or 24-bit integer performance.
- In general, GPU registers and vector ALUs are 32-bit wide and have to emulate 64-bit integer operations.
- Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
Floating-Point Instruction Throughput
- Consumer GPU performance is measured usually in single-precision (32-bit) floating-point FMA (fused-multiply-add) throughput.
- Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64-bit) floating-point operations throughput than server brand GPUs.
- Some GPGPU architectures offer half-precision (16-bit) floating-point operation throughput with an FP32:FP16 ratio of 1:2.
- With the Nvidia Volta series, TensorCores were introduced: FP16×FP16+FP32 matrix-multiply-accumulate units used to accelerate neural networks. Turing's 2nd-gen TensorCores add FP16, INT8 and INT4 optimized computation. Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.
AMD Matrix Cores
- AMD released its server-class CDNA architecture in 2020 with Matrix Cores, which support MFMA (matrix-fused-multiply-add) operations on various data types like INT8, FP16, BF16 and FP32.
Intel XMX Cores
- Intel plans XMX, Xe Matrix eXtensions, for its upcoming Xe discrete GPU series.
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit 
MAD: 16, MUL: 16, ADD: 32, Bit-shift: 16, Bitwise XOR: 32
Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec
AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element 
MAD: 1/4, MUL: 1/4, ADD: 1, Bit-shift: 1, Bitwise XOR: 1
Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
Host-Device Latencies
One reason GPUs are not used as accelerators for chess engines is the host-device latency, aka kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds up to 100s of microseconds. One solution to overcome this limitation is to group tasks into batches to be executed in one run.
Deep Learning
GPUs are much better suited than CPUs to implement and train convolutional neural networks (CNN), and were therefore also responsible for the deep learning boom. This also affected game-playing programs combining CNN with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and by the open-source projects Leela Zero, headed by Gian-Carlo Pascutto, for Go and its Leela Chess Zero adaptation.
Architectures
The market is split into two categories, integrated and discrete GPUs, the former being the most important by quantity, the latter by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs, and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in driver, VRAM, or computation abilities.
AMD
AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.
CDNA architecture in MI100 HPC-GPU with Matrix Cores was unveiled in November, 2020.
RDNA 2.0 cards were unveiled on October 28, 2020.
RDNA 1.0 cards were unveiled on July 7, 2019.
Vega GCN 5th gen
Vega cards were unveiled on August 14, 2017.
Polaris GCN 4th gen
Polaris cards were first released in 2016.
Apple
Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
ARM Mali
The Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. Since Midgard (2012), with its unified shader model, OpenCL support is offered.
Intel
Intel Xe 'Gen12'
Intel's Xe line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).
Nvidia
Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.
The Ampere microarchitecture was announced on May 14, 2020 . The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 .
Turing cards were first released in 2018. They are the first consumer cards to launch with RTX raytracing features, and also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate convolutional neural networks. The Turing GTX line of chips does not offer RTX or TensorCores.
Pascal cards were first released in 2016.
Maxwell cards were first released in 2014.
PowerVR
Imagination Technologies licenses PowerVR IP to third parties (most notably Apple) for system on a chip (SoC) designs. Since Series5 SGX, OpenCL support is available via licensees.
Vivante Corporation
Vivante licenses IP to third parties for embedded systems; the GC series offers optional OpenCL support.
See also
- Deep Learning
- Graphics Programming
- Monte-Carlo Tree Search
- Parallel Search
- SIMD and SWAR Techniques
Publications
- W. Daniel Hillis, Guy L. Steele, Jr. (1986). Data parallel algorithms. Communications of the ACM, Vol. 29, No. 12, Special Issue on Parallelism
- Vlad Stamate (2008). Real Time Photon Mapping Approximation on the GPU. in ShaderX6 - Advanced Rendering Techniques 
- Ren Wu, Bin Zhang, Meichun Hsu (2009). Clustering billions of data points using GPUs. ACM International Conference on Computing Frontiers
- Mark Govett, Craig Tierney, Jacques Middlecoff, Tom Henderson (2009). Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models. CAS2K9 Workshop
- Hank Dietz, Bobby Dalton Young (2009). MIMD Interpretation on a GPU. LCPC 2009, pdf, slides.pdf
- Sander van der Maar, Joost Batenburg, Jan Sijbers (2009). Experiences with Cell-BE and GPU for Tomography. SAMOS 2009 
- Avi Bleiweiss (2010). Playing Zero-Sum Games on the GPU. NVIDIA Corporation, GPU Technology Conference 2010, slides as pdf
- Mark Govett, Jacques Middlecoff, Tom Henderson (2010). Running the NIM Next-Generation Weather Model on GPUs. CCGRID 2010
- John Nickolls, William J. Dally (2010). The GPU Computing Era. IEEE Micro.
- Mark Govett, Jacques Middlecoff, Tom Henderson, Jim Rosinski, Craig Tierney (2011). Parallelization of the NIM Dynamical Core for GPUs. slides as pdf
- Ľubomír Lackovič (2011). Parallel Game Tree Search Using GPU. Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, pdf
- Dan Anthony Feliciano Alcantara (2011). Efficient Hash Tables on the GPU. Ph. D. thesis, University of California, Davis, pdf » Hash Table
- Damian Sulewski (2011). Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks. Ph.D. thesis, University of Dortmund, pdf
- Damjan Strnad, Nikola Guid (2011). Parallel Alpha-Beta Algorithm on the GPU. CIT. Journal of Computing and Information Technology, Vol. 19, No. 4 » Parallel Search, Reversi
- Balázs Jákó (2011). Fast Hydraulic and Thermal Erosion on GPU. M.Sc. thesis, Supervisor Balázs Tóth, Eurographics 2011, pdf
- Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012). A Node-based Parallel Game Tree Algorithm Using GPUs. CLUSTER 2012 » Parallel Search
- S. Ali Mirsoleimani, Ali Karami, Farshad Khunjush (2013). A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments. GECCO '13
- Ali Karami, S. Ali Mirsoleimani, Farshad Khunjush (2013). A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs. CADS 2013
- Diego Rodríguez-Losada, Pablo San Segundo, Miguel Hernando, Paloma de la Puente, Alberto Valero-Gomez (2013). GPU-Mapping: Robotic Map Building with Graphical Multiprocessors. IEEE Robotics & Automation Magazine, Vol. 20, No. 2, pdf
- David Williams, Valeriu Codreanu, Po Yang, Baoquan Liu, Feng Dong, Burhan Yasar, Babak Mahdian, Alessandro Chiarini, Xia Zhao, Jos Roerdink (2013). Evaluation of Autoparallelization Toolkits for Commodity GPUs. PPAM 2013
- Qingqing Dang, Shengen Yan, Ren Wu (2014). A fast integral image generation algorithm on GPUs. ICPADS 2014
- S. Ali Mirsoleimani, Ali Karami, Farshad Khunjush (2014). A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor. ARCS 2014, Lecture Notes in Computer Science, Vol. 8350, Springer
- Steinar H. Gunderson (2014). Movit: High-speed, high-quality video filters on the GPU. FOSDEM 2014, pdf
- Baoquan Liu, Alexandru Telea, Jos Roerdink, Gordon Clapworthy, David Williams, Po Yang, Feng Dong, Valeriu Codreanu, Alessandro Chiarini (2018). Parallel centerline extraction on the GPU. Computers & Graphics, Vol. 41, pdf
- Peter H. Jin, Kurt Keutzer (2015). Convolutional Monte Carlo Rollouts in Go. arXiv:1512.03375 » Deep Learning, Go, MCTS
- Liang Li, Hong Liu, Hao Wang, Taoying Liu, Wei Li (2015). A Parallel Algorithm for Game Tree Search Using GPGPU. IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 8 » Parallel Search
- Simon Portegies Zwart, Jeroen Bédorf (2015). Using GPUs to Enable Simulation with Computational Gravitational Dynamics in Astrophysics. IEEE Computer, Vol. 48, No. 11
- Sean Sheen (2016). Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1. Master's thesis, California Polytechnic State University, pdf  
- Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt (2016). gpucc: an open-source GPGPU compiler. CGO 2016
- David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, Vol. 529 » AlphaGo
- Balázs Jákó (2016). Hardware accelerated hybrid rendering on PowerVR GPUs.  IEEE 20th Jubilee International Conference on Intelligent Engineering Systems
- Diogo R. Ferreira, Rui M. Santos (2016). Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs. BPM 2016
- Ole Schütt, Peter Messmer, Jürg Hutter, Joost VandeVondele (2016). GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory. pdf 
- Chapter 8 in Ross C. Walker, Andreas W. Götz (2016). Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics. John Wiley & Sons
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815 » AlphaZero
- Tristan Cazenave (2017). Residual Networks for Computer Go. IEEE Transactions on Computational Intelligence and AI in Games, Vol. PP, No. 99, pdf
- Jayvant Anantpur, Nagendra Gulur Dwarakanath, Shivaram Kalyanakrishnan, Shalabh Bhatnagar, R. Govindarajan (2017). RLWS: A Reinforcement Learning based GPU Warp Scheduler. arXiv:1712.04303
- David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, Vol. 362, No. 6419
Forum Posts
- Hardware assist by Nicolai Czempin, Winboard Forum, August 27, 2006
- Monte carlo on a NVIDIA GPU ? by Marco Costalba, CCC, August 01, 2008
- GPGPU and computer chess by Wim Sjoho, CCC, February 09, 2011
- Possible Board Presentation and Move Generation for GPUs? by Srdja Matovic, CCC, March 19, 2011
- Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011
- Zeta plays chess on a gpu by Srdja Matovic, CCC, June 23, 2011 » Zeta
- GPU Search Methods by Joshua Haglund, CCC, July 04, 2011
- Possible Search Algorithms for GPUs? by Srdja Matovic, CCC, January 07, 2012  
- uct on gpu by Daniel Shawul, CCC, February 24, 2012 » UCT
- Is there such a thing as branchless move generation? by John Hamlen, CCC, June 07, 2012 » Move Generation
- Choosing a GPU platform: AMD and Nvidia by John Hamlen, CCC, June 10, 2012
- Nvidias K20 with Recursion by Srdja Matovic, CCC, December 04, 2012 
- Kogge Stone, Vector Based by Srdja Matovic, CCC, January 22, 2013 » Kogge-Stone Algorithm  
- GPU chess engine by Samuel Siltanen, CCC, February 27, 2013
- Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013 » Perft, Kogge-Stone Algorithm 
- GPU chess update, local memory... by Srdja Matovic, CCC, June 06, 2016
- Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016 » Astro
- Pigeon is now running on the GPU by Stuart Riffle, CCC, November 02, 2016 » Pigeon
- Back to the basics, generating moves on gpu in parallel... by Srdja Matovic, CCC, March 05, 2017 » Move Generation
- Re: Perft(15): comparison of estimates with Ankan's result by Ankan Banerjee, CCC, August 26, 2017 » Perft(15)
- Chess Engine and GPU by Fishpov, Rybka Forum, October 09, 2017
- To TPU or not to TPU... by Srdja Matovic, CCC, December 16, 2017 » Deep Learning 
- Announcing lczero by Gary, CCC, January 09, 2018 » Leela Chess Zero
- GPU ANN, how to deal with host-device latencies? by Srdja Matovic, CCC, May 06, 2018 » Neural Networks
- GPU contention by Ian Kennedy, CCC, May 07, 2018 » Leela Chess Zero
- How good is the RTX 2080 Ti for Leela? by Hai, September 15, 2018 » Leela Chess Zero 
- My non-OC RTX 2070 is very fast with Lc0 by Kai Laskos, CCC, November 19, 2018 » Leela Chess Zero
- LC0 using 4 x 2080 Ti GPU's on Chess.com tourney? by M. Ansari, CCC, December 28, 2018 » Leela Chess Zero
- Generate EGTB with graphics cards? by Nguyen Pham, CCC, January 01, 2019 » Endgame Tablebases
- LCZero FAQ is missing one important fact by Jouni Uski, CCC, January 01, 2019 » Leela Chess Zero
- Michael Larabel benches lc0 on various GPUs by Warren D. Smith, LCZero Forum, January 14, 2019 » Lc0 
- Using LC0 with one or two GPUs - a guide by Srdja Matovic, CCC, March 30, 2019 » Lc0
- Wouldn't it be nice if C++ GPU by Chris Whittington, CCC, April 25, 2019 » C++
- Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search by Percival Tiglao, CCC, June 06, 2019
- My home-made CUDA kernel for convolutions by Rémi Coulom, Game-AI Forum, November 09, 2019 » Deep Learning
- GPU rumors 2020 by Srdja Matovic, CCC, November 13, 2019
- AB search with NN on GPU... by Srdja Matovic, CCC, August 13, 2020 » Neural Networks 
- I stumbled upon this article on the new Nvidia RTX GPUs by Kai Laskos, CCC, September 10, 2020
- Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0? by Srdja Matovic, CCC, November 01, 2020
- Zeta with NNUE on GPU? by Srdja Matovic, CCC, March 31, 2021 » Zeta, NNUE
External Links
- Graphics processing unit from Wikipedia
- Video card from Wikipedia
- Heterogeneous System Architecture from Wikipedia
- Tensor processing unit from Wikipedia
- General-purpose computing on graphics processing units (GPGPU) from Wikipedia
- List of AMD graphics processing units from Wikipedia
- List of Intel graphics processing units from Wikipedia
- List of Nvidia graphics processing units from Wikipedia
- NVIDIA Developer
- NVIDIA GPU Programming Guide
- OpenCL from Wikipedia
- Part 1: OpenCL™ – Portable Parallelism - CodeProject
- Part 2: OpenCL™ – Memory Spaces - CodeProject
- CUDA from Wikipedia
- CUDA Zone | NVIDIA Developer
- Nvidia CUDA Compiler (NVCC) from Wikipedia
- Compiling CUDA with clang — LLVM Clang documentation
- CppCon 2016: “Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by Justin Lebar, YouTube Video 
- Deep Learning | NVIDIA Developer » Deep Learning
- NVIDIA cuDNN | NVIDIA Developer
- Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster
- Deep Learning in a Nutshell: Core Concepts by Tim Dettmers, Parallel Forall, November 3, 2015
- Deep Learning in a Nutshell: History and Training by Tim Dettmers, Parallel Forall, December 16, 2015
- Deep Learning in a Nutshell: Sequence Learning by Tim Dettmers, Parallel Forall, March 7, 2016
- Deep Learning in a Nutshell: Reinforcement Learning by Tim Dettmers, Parallel Forall, September 8, 2016
- Faster deep learning with GPUs and Theano
- Theano (software) from Wikipedia
- TensorFlow from Wikipedia
- Advanced game programming | Session 5 - GPGPU programming by Andy Thomason
- Leela Zero by Gian-Carlo Pascutto » Leela Zero
- GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper
- Chess on a GPGPU
- GPU Chess Blog
- ankan-ban/perft_gpu · GitHub » Perft 
- LCZero · GitHub » Leela Chess Zero
- GitHub - StuartRiffle/Jaglavak: Corvid Chess Engine » Jaglavak
- Zeta OpenCL Chess » Zeta
- Graphics processing unit - Wikimedia Commons
References
- CUDA C Programming Guide v7.0, Appendix G, Compute Capabilities
- AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
- INSIDE VOLTA
- AnandTech - Nvidia Turing Deep Dive page 6
- Wikipedia - Ampere microarchitecture
- CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
- AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
- host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
- host-device latencies? by Srdja Matovic AMD Developer Community, Feb 28, 2019
- Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
- NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog by Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam and Sridhar Ramaswamy, May 14, 2020
- CUDA 11 Features Revealed | NVIDIA Developer Blog by Pramod Ramarao, May 14, 2020
- Photon mapping from Wikipedia
- Cell (microprocessor) from Wikipedia
- Jetson TK1 Embedded Development Kit | NVIDIA
- Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
- PowerVR from Wikipedia
- Density functional theory from Wikipedia
- Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
- Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
- Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
- Parallel Thread Execution from Wikipedia
- NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
- ankan-ban/perft_gpu · GitHub
- Tensor processing unit from Wikipedia
- GeForce 20 series from Wikipedia
- Phoronix Test Suite from Wikipedia
- kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums by LukeCuda, June 18, 2018
- Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
- Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013