Revision as of 11:43, 18 April 2021

GPU (Graphics Processing Unit),
a specialized processor primarily intended for fast image processing. GPUs may have more raw computing power than general purpose CPUs but need a specialized and parallelized way of programming. Leela Chess Zero has proven that a Best-first Monte-Carlo Tree Search (MCTS) with deep learning methodology will work with GPU architectures.

History

In the 1970s and 1980s RAM was expensive and home computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer, like TIA in the Atari VCS gaming system, GTIA+ANTIC in the Atari 400/800 series, or Denise+Agnus in the Commodore Amiga series. The 1990s made 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the 3dfx Voodoo2, were used by the video game community to play 3D games. Some game engines, such as Quake, could instead use the SIMD capabilities of CPUs, such as the Intel MMX instruction set or AMD's 3DNow!. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the unified shader architecture, like in Nvidia Tesla (2006), ATI/AMD TeraScale (2007) or Intel GMA X3000 (2006), GPGPU frameworks like CUDA and OpenCL emerged and gained in popularity.

GPGPU

Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like OpenGL or DirectX, followed by the first GPGPU frameworks such as Sh/RapidMind or Brook, and finally CUDA and OpenCL.

Khronos OpenCL

OpenCL, specified by the Khronos Group, is widely adopted across all kinds of hardware accelerators from different vendors.

List of OpenCL Conformant Products

OpenCL 3.0 Specifications

AMD

AMD supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With ROCm it offers its own parallel compute platform.

Nvidia

CUDA is the parallel computing platform by Nvidia. It supports language frontends like C, C++, Fortran and OpenCL, and offload directives via OpenACC and OpenMP.

Further

Metal (Apple)
oneAPI (Intel)
C++ AMP (Microsoft)
DirectCompute (Microsoft)
OpenACC (offload directives)
OpenMP (offload directives)

SIMT and SIMD on GPU

GPUs run multiple threads in SIMT fashion and are able to hide memory latencies by running a multitude of SIMT waves on the same SIMD unit.

Memory Model

OpenCL offers the following memory model for the programmer:

__private - usually registers, accessible only by a single work-item (thread in CUDA terms).
__local - scratch-pad memory shared across the work-items of a work-group (threads of a block in CUDA terms).
__constant - read-only variable.
__global - usually VRAM, accessible by all work-items (threads in CUDA terms).

Here is the data for the Nvidia GeForce GTX 580 (Fermi) as an example: ^[2]

128 KiB private memory per compute unit
48 KiB (16 KiB) local memory per compute unit (configurable)
64 KiB constant memory
8 KiB constant cache per compute unit
16 KiB (48 KiB) L1 cache per compute unit (configurable)
768 KiB L2 cache
1.5 GiB to 3 GiB global memory

Here is the data for the AMD Radeon HD 7970 (GCN) as an example: ^[3]

256 KiB private memory per compute unit
64 KiB local memory per compute unit
64 KiB constant memory
16 KiB constant cache per four compute units
16 KiB L1 cache per compute unit
768 KiB L2 cache
3 GiB to 6 GiB global memory

Architectures and Physical Hardware

The market is split into two categories, integrated and discrete GPUs. The first is the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.

AMD

AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.

List of AMD graphics processing units on Wikipedia

CDNA

The CDNA architecture, used in the MI100 HPC GPU with Matrix Cores, was unveiled in November 2020.

AMD CDNA Whitepaper

Navi 2X RDNA 2.0

RDNA 2.0 cards were unveiled on October 28, 2020.

RDNA 2 Instruction Set Architecture

Navi RDNA 1.0

RDNA 1.0 cards were unveiled on July 7, 2019.

Vega GCN 5th gen

Vega cards were unveiled on August 14, 2017.

Architecture Whitepaper

Polaris GCN 4th gen

Polaris cards were first released in 2016.

Architecture Whitepaper

Apple

M1

Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.

Apple M1 on Wikipedia

ARM Mali

The Mali GPU variants can be found on various systems on chips (SoCs) from different vendors. OpenCL support has been offered since Midgard (2012) with its unified shader model.

Bifrost (2016) and Valhall (2019)

Bifrost and Valhall OpenCL Developer Guide

Midgard (2012)

Midgard OpenCL Developer Guide

Intel

Intel Xe 'Gen12'

Intel's Xe line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).

List of Intel Gen12 GPUs on Wikipedia

Nvidia

Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.

List of Nvidia graphics processing units on Wikipedia

Ampere Architecture

The Ampere microarchitecture was announced on May 14, 2020 ^[4]. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 ^[5].

Turing Architecture

Turing cards were first released in 2018. They are the first consumer cards to launch with RTX features for raytracing. They are also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate convolutional neural networks. The Turing GTX line of chips does not offer RTX or TensorCores.

Architectural Whitepaper

Volta Architecture

Volta cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate convolutional neural networks.

Architecture Whitepaper

Pascal Architecture

Pascal cards were first released in 2016.

Architecture Whitepaper

Maxwell Architecture

Maxwell cards were first released in 2014.

Architecture Whitepaper on archive.org

PowerVR - Imagination Technologies

Imagination Technologies licenses PowerVR IP to third parties (most notably Apple) for use in system on a chip (SoC) designs. OpenCL support via licensees has been available since Series5 SGX.

PowerVR Series5 SGX

PowerVR Series5 SGX on Wikipedia

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's TeraScale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

Integer Instruction Throughput

INT32

Depending on the architecture and the operation, 32 bit integer performance can be lower than 32 bit floating point or 24 bit integer performance.

INT64

Current GPU registers and Vector-ALUs are 32 bit wide and have to emulate 64 bit integer operations.
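
As an illustration of what such emulation involves, here is a minimal sketch in Python of a 64 bit addition built from 32 bit words with an explicit carry. The function names are hypothetical and the code models only the arithmetic, not any particular GPU instruction sequence.

```python
MASK32 = 0xFFFFFFFF  # keeps a value inside one 32 bit register


def split64(x):
    """Split a 64 bit unsigned integer into (low, high) 32 bit words."""
    return x & MASK32, (x >> 32) & MASK32


def join64(lo, hi):
    """Combine (low, high) 32 bit words into one 64 bit unsigned integer."""
    return (hi << 32) | lo


def add64_emulated(a_lo, a_hi, b_lo, b_hi):
    """Add two 64 bit integers using only 32 bit wide operations.

    The low words are added first. If the 32 bit result wrapped around,
    it is smaller than either operand, which signals a carry into the
    high word.
    """
    lo = (a_lo + b_lo) & MASK32
    carry = 1 if lo < a_lo else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


if __name__ == "__main__":
    a, b = 0x00000001FFFFFFFF, 0x0000000000000001
    lo, hi = add64_emulated(*split64(a), *split64(b))
    print(hex(join64(lo, hi)))
```

One 64 bit addition here costs two 32 bit additions plus the carry handling, which is why 64 bit integer throughput is lower than 32 bit throughput on such hardware.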

INT8

Some architectures offer higher throughput with lower precision. They quadruple the INT8 or octuple the INT4 throughput.
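
The text does not state how the fourfold rate comes about. One plausible mechanism, assumed here purely for illustration, is that four 8 bit values fit into one 32 bit register, so a single instruction can process four of them at once. The following Python sketch models such a packed dot product with accumulation; all function names are hypothetical.

```python
def pack_int8x4(values):
    """Pack four signed 8 bit integers into one 32 bit word, lowest byte first."""
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    word = 0
    for i, v in enumerate(values):
        if not -128 <= v <= 127:
            raise ValueError("value does not fit into a signed 8 bit integer")
        word |= (v & 0xFF) << (8 * i)
    return word


def unpack_int8x4(word):
    """Unpack a 32 bit word into four signed 8 bit integers."""
    out = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        out.append(byte - 256 if byte >= 128 else byte)
    return out


def dot4_accumulate(a_word, b_word, acc):
    """Four-way INT8 dot product of two packed words, added to an accumulator."""
    a = unpack_int8x4(a_word)
    b = unpack_int8x4(b_word)
    return acc + sum(x * y for x, y in zip(a, b))


if __name__ == "__main__":
    a = pack_int8x4([1, -2, 3, -4])
    b = pack_int8x4([5, 6, -7, 8])
    print(dot4_accumulate(a, b, 10))
```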

Floating Point Instruction Throughput

FP32

Consumer GPU performance is usually measured in single-precision (32 bit) floating point FMA (fused multiply-add) throughput.

FP64

Consumer GPUs in general have lower double-precision (64 bit) floating point throughput relative to FP32 than server brand GPUs, with FP32:FP64 ratios of 4:1 down to 32:1 compared to 2:1 to 4:1.

FP16

Some GPGPU architectures offer half-precision (16 bit) floating point operation throughput with an FP32:FP16 ratio of 1:2. Older architectures might not support FP16 at all, or might support it only at the same rate as FP32 or at very low rates.

Tensors

Nvidia TensorCores

TensorCores were introduced with the Nvidia Volta series. They offer FP16xFP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.^[6] Turing's 2nd gen TensorCores add FP16, INT8 and INT4 optimized computation.^[7] Ampere's 3rd gen adds support for BF16, TF32, FP64 and sparsity acceleration.^[8]
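
To make the FP16xFP16+FP32 notation concrete, the following Python sketch models a matrix multiply-accumulate D = A*B + C in which the inputs A and B are rounded to half precision and the running sum is kept in single precision. It is a numerical model only, under the assumption that each partial sum is rounded to FP32; it says nothing about how the hardware orders or rounds its operations, and the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value.

    struct raises OverflowError for magnitudes beyond the FP16 range.
    """
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mma_fp16_fp32(a, b, c):
    """Compute D = A*B + C with FP16 inputs A, B and FP32 accumulation.

    a is an n x k matrix, b a k x m matrix and c an n x m matrix,
    all given as lists of rows.
    """
    n, k, m = len(a), len(b), len(b[0])
    d = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = to_fp32(c[i][j])
            for p in range(k):
                # The product of two FP16 values is exactly representable in FP32.
                prod = to_fp16(a[i][p]) * to_fp16(b[p][j])
                acc = to_fp32(acc + prod)
            d[i][j] = acc
    return d


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[1.0, 0.0], [0.0, 1.0]]
    c = [[0.5, 0.5], [0.5, 0.5]]
    print(mma_fp16_fp32(a, b, c))
    print(to_fp16(0.1))  # shows the precision lost when rounding inputs to FP16
```

Keeping the accumulator wider than the inputs limits the growth of rounding error when many products are summed, which is the usual motivation for such mixed-precision units.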

AMD Matrix Cores

In 2020 AMD released its server-class CDNA architecture with Matrix Cores, which support MFMA (matrix fused multiply-add) operations on various data types like INT8, FP16, BF16 and FP32.

Intel XMX Cores

Intel plans XMX, Xe Matrix eXtensions, for its upcoming Xe discrete GPU series.

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit ^[9]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Max theoretical ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element ^[10]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Max theoretical ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
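
The two ADD figures above follow from multiplying operations per clock, number of units and clock rate. As a small sketch, the same calculation can be applied to the other listed operations; the helper name and the dictionary layout are hypothetical, while the per-clock rates, unit counts and clock rates are the ones given above.

```python
def peak_gigaops(ops_per_clock_per_unit, units, clock_mhz):
    """Theoretical peak throughput in GigaOps/sec.

    ops_per_clock_per_unit: operations per clock cycle per unit
    units: number of compute units or processing elements
    clock_mhz: clock rate in MHz
    """
    return ops_per_clock_per_unit * units * clock_mhz / 1000.0


# 32 bit integer rates as listed above.
GTX_580 = {"units": 16, "clock_mhz": 1544,
           "rates": {"MAD": 16, "MUL": 16, "ADD": 32, "Bit-shift": 16, "Bitwise XOR": 32}}
HD_7970 = {"units": 2048, "clock_mhz": 925,
           "rates": {"MAD": 0.25, "MUL": 0.25, "ADD": 1, "Bit-shift": 1, "Bitwise XOR": 1}}

if __name__ == "__main__":
    for name, gpu in (("GTX 580", GTX_580), ("HD 7970", HD_7970)):
        for op, rate in gpu["rates"].items():
            gops = peak_gigaops(rate, gpu["units"], gpu["clock_mhz"])
            print(f"{name} {op}: {gops:.3f} GigaOps/sec")
```

These are upper bounds that assume every unit issues the operation on every clock cycle; measured throughput will be lower.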

Host-Device Latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency, also known as kernel-launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds ^[11] up to 100s of microseconds ^[12]. One solution to overcome this limitation is to group tasks into batches to be executed in one run ^[13].
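
The effect of batching can be sketched with a simple cost model: one launch latency is paid per kernel launch and is shared by all tasks in the batch. The model below is an assumption for illustration only. It takes the compute time per task as a fixed, hypothetical value and ignores data transfer and any parallel speedup within a batch.

```python
def effective_time_per_task_us(launch_latency_us, compute_per_task_us, batch_size):
    """Average time per task when batch_size tasks share one kernel launch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return launch_latency_us / batch_size + compute_per_task_us


def launch_overhead_share(launch_latency_us, compute_per_task_us, batch_size):
    """Fraction of the total batch time that is spent on launch latency."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    total = launch_latency_us + batch_size * compute_per_task_us
    return launch_latency_us / total


if __name__ == "__main__":
    compute_us = 1.0  # hypothetical compute time per task
    for latency_us in (5.0, 100.0):  # range of null-kernel latencies quoted above
        for batch in (1, 16, 256):
            t = effective_time_per_task_us(latency_us, compute_us, batch)
            share = launch_overhead_share(latency_us, compute_us, batch)
            print(f"latency {latency_us} us, batch {batch}: "
                  f"{t:.3f} us per task, {share:.1%} overhead")
```

Under these assumptions the overhead per task shrinks in proportion to the batch size, which is why batching helps most when single tasks are short compared to the launch latency.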

Deep Learning

GPUs are much better suited than CPUs to implementing and training Convolutional Neural Networks (CNN), and were therefore also responsible for the deep learning boom. This also affected game playing programs combining CNN with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and the open source projects Leela Zero, headed by Gian-Carlo Pascutto, for Go and its Leela Chess Zero adaptation.

Chess Engines

Category:GPU

Publications

1986

W. Daniel Hillis, Guy L. Steele, Jr. (1986). Data parallel algorithms. Communications of the ACM, Vol. 29, No. 12, Special Issue on Parallelism

1990

Guy E. Blelloch (1990). Vector Models for Data-Parallel Computing. MIT Press, pdf

2008 ...

Vlad Stamate (2008). Real Time Photon Mapping Approximation on the GPU. in ShaderX6 - Advanced Rendering Techniques ^[14]
Ren Wu, Bin Zhang, Meichun Hsu (2009). Clustering billions of data points using GPUs. ACM International Conference on Computing Frontiers
Mark Govett, Craig Tierney, Jacques Middlecoff, Tom Henderson (2009). Using Graphical Processing Units (GPUs) for Next Generation Weather and Climate Prediction Models. CAS2K9 Workshop
Hank Dietz, Bobby Dalton Young (2009). MIMD Interpretation on a GPU. LCPC 2009, pdf, slides.pdf
Sander van der Maar, Joost Batenburg, Jan Sijbers (2009). Experiences with Cell-BE and GPU for Tomography. SAMOS 2009 ^[15]

2010...

Avi Bleiweiss (2010). Playing Zero-Sum Games on the GPU. NVIDIA Corporation, GPU Technology Conference 2010, slides as pdf
Mark Govett, Jacques Middlecoff, Tom Henderson (2010). Running the NIM Next-Generation Weather Model on GPUs. CCGRID 2010
John Nickolls, William J. Dally (2010). The GPU Computing Era. IEEE Micro.

2011

Mark Govett, Jacques Middlecoff, Tom Henderson, Jim Rosinski, Craig Tierney (2011). Parallelization of the NIM Dynamical Core for GPUs. slides as pdf
Ľubomír Lackovič (2011). Parallel Game Tree Search Using GPU. Institute of Informatics and Software Engineering, Faculty of Informatics and Information Technologies, Slovak University of Technology in Bratislava, pdf
Dan Anthony Feliciano Alcantara (2011). Efficient Hash Tables on the GPU. Ph. D. thesis, University of California, Davis, pdf » Hash Table
Damian Sulewski (2011). Large-Scale Parallel State Space Search Utilizing Graphics Processing Units and Solid State Disks. Ph.D. thesis, University of Dortmund, pdf
Damjan Strnad, Nikola Guid (2011). Parallel Alpha-Beta Algorithm on the GPU. CIT. Journal of Computing and Information Technology, Vol. 19, No. 4 » Parallel Search, Reversi
Balázs Jákó (2011). Fast Hydraulic and Thermal Erosion on GPU. M.Sc. thesis, Supervisor Balázs Tóth, Eurographics 2011, pdf

2012

Liang Li, Hong Liu, Peiyu Liu, Taoying Liu, Wei Li, Hao Wang (2012). A Node-based Parallel Game Tree Algorithm Using GPUs. CLUSTER 2012 » Parallel Search

2013

S. Ali Mirsoleimani, Ali Karami, Farshad Khunjush (2013). A parallel memetic algorithm on GPU to solve the task scheduling problem in heterogeneous environments. GECCO '13
Ali Karami, S. Ali Mirsoleimani, Farshad Khunjush (2013). A statistical performance prediction model for OpenCL kernels on NVIDIA GPUs. CADS 2013
Diego Rodríguez-Losada, Pablo San Segundo, Miguel Hernando, Paloma de la Puente, Alberto Valero-Gomez (2013). GPU-Mapping: Robotic Map Building with Graphical Multiprocessors. IEEE Robotics & Automation Magazine, Vol. 20, No. 2, pdf

2014

Qingqing Dang, Shengen Yan, Ren Wu (2014). A fast integral image generation algorithm on GPUs. ICPADS 2014
S. Ali Mirsoleimani, Ali Karami, Farshad Khunjush (2014). A Two-Tier Design Space Exploration Algorithm to Construct a GPU Performance Predictor. ARCS 2014, Lecture Notes in Computer Science, Vol. 8350, Springer
Steinar H. Gunderson (2014). Movit: High-speed, high-quality video filters on the GPU. FOSDEM 2014, pdf

2015 ...

Peter H. Jin, Kurt Keutzer (2015). Convolutional Monte Carlo Rollouts in Go. arXiv:1512.03375 » Deep Learning, Go, MCTS
Liang Li, Hong Liu, Hao Wang, Taoying Liu, Wei Li (2015). A Parallel Algorithm for Game Tree Search Using GPGPU. IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 8 » Parallel Search

2016

Sean Sheen (2016). Astro - A Low-Cost, Low-Power Cluster for CPU-GPU Hybrid Computing using the Jetson TK1. Master's thesis, California Polytechnic State University, pdf ^[16] ^[17]
Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, Robert Hundt (2016). gpucc: an open-source GPGPU compiler. CGO 2016
David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, Demis Hassabis (2016). Mastering the game of Go with deep neural networks and tree search. Nature, Vol. 529 » AlphaGo
Balázs Jákó (2016). Hardware accelerated hybrid rendering on PowerVR GPUs. ^[18] IEEE 20th Jubilee International Conference on Intelligent Engineering Systems
Diogo R. Ferreira, Rui M. Santos (2016). Parallelization of Transition Counting for Process Mining on Multi-core CPUs and GPUs. BPM 2016
Ole Schütt, Peter Messmer, Jürg Hutter, Joost VandeVondele (2016). GPU Accelerated Sparse Matrix–Matrix Multiplication for Linear Scaling Density Functional Theory. pdf ^[19]

Chapter 8 in Ross C. Walker, Andreas W. Götz (2016). Electronic Structure Calculations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics. John Wiley & Sons

2017

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815 » AlphaZero
Tristan Cazenave (2017). Residual Networks for Computer Go. IEEE Transactions on Computational Intelligence and AI in Games, Vol. PP, No. 99, pdf
Jayvant Anantpur, Nagendra Gulur Dwarakanath, Shivaram Kalyanakrishnan, Shalabh Bhatnagar, R. Govindarajan (2017). RLWS: A Reinforcement Learning based GPU Warp Scheduler. arXiv:1712.04303

2018

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis (2018). A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, Vol. 362, No. 6419

Forum Posts

2005 ...

Hardware assist by Nicolai Czempin, Winboard Forum, August 27, 2006
Monte carlo on a NVIDIA GPU ? by Marco Costalba, CCC, August 01, 2008

2010 ...

Using the GPU by Louis Zulli, CCC, February 19, 2010

2011

GPGPU and computer chess by Wim Sjoho, CCC, February 09, 2011
Possible Board Presentation and Move Generation for GPUs? by Srdja Matovic, CCC, March 19, 2011

Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011

Zeta plays chess on a gpu by Srdja Matovic, CCC, June 23, 2011 » Zeta
GPU Search Methods by Joshua Haglund, CCC, July 04, 2011

2012

Possible Search Algorithms for GPUs? by Srdja Matovic, CCC, January 07, 2012 ^[20] ^[21]
uct on gpu by Daniel Shawul, CCC, February 24, 2012 » UCT
Is there such a thing as branchless move generation? by John Hamlen, CCC, June 07, 2012 » Move Generation
Choosing a GPU platform: AMD and Nvidia by John Hamlen, CCC, June 10, 2012
Nvidias K20 with Recursion by Srdja Matovic, CCC, December 04, 2012 ^[22]

2013

Kogge Stone, Vector Based by Srdja Matovic, CCC, January 22, 2013 » Kogge-Stone Algorithm ^[23] ^[24]
GPU chess engine by Samuel Siltanen, CCC, February 27, 2013
Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013 » Perft, Kogge-Stone Algorithm ^[25]

2015 ...

GPU chess update, local memory... by Srdja Matovic, CCC, June 06, 2016
Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016 » Astro
Pigeon is now running on the GPU by Stuart Riffle, CCC, November 02, 2016 » Pigeon

2017

Back to the basics, generating moves on gpu in parallel... by Srdja Matovic, CCC, March 05, 2017 » Move Generation
Re: Perft(15): comparison of estimates with Ankan's result by Ankan Banerjee, CCC, August 26, 2017 » Perft(15)
Chess Engine and GPU by Fishpov, Rybka Forum, October 09, 2017
To TPU or not to TPU... by Srdja Matovic, CCC, December 16, 2017 » Deep Learning ^[26]

2018

Announcing lczero by Gary, CCC, January 09, 2018 » Leela Chess Zero
GPU ANN, how to deal with host-device latencies? by Srdja Matovic, CCC, May 06, 2018 » Neural Networks
GPU contention by Ian Kennedy, CCC, May 07, 2018 » Leela Chess Zero
How good is the RTX 2080 Ti for Leela? by Hai, September 15, 2018 » Leela Chess Zero ^[27]

Re: How good is the RTX 2080 Ti for Leela? by Ankan Banerjee, CCC, September 16, 2018

My non-OC RTX 2070 is very fast with Lc0 by Kai Laskos, CCC, November 19, 2018 » Leela Chess Zero
LC0 using 4 x 2080 Ti GPU's on Chess.com tourney? by M. Ansari, CCC, December 28, 2018 » Leela Chess Zero

2019

Generate EGTB with graphics cards? by Nguyen Pham, CCC, January 01, 2019 » Endgame Tablebases
LCZero FAQ is missing one important fact by Jouni Uski, CCC, January 01, 2019 » Leela Chess Zero
Michael Larabel benches lc0 on various GPUs by Warren D. Smith, LCZero Forum, January 14, 2019 » Lc0 ^[28]
Using LC0 with one or two GPUs - a guide by Srdja Matovic, CCC, March 30, 2019 » Lc0
Wouldn't it be nice if C++ GPU by Chris Whittington, CCC, April 25, 2019 » C++
Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search by Percival Tiglao, CCC, June 06, 2019
My home-made CUDA kernel for convolutions by Rémi Coulom, Game-AI Forum, November 09, 2019 » Deep Learning
GPU rumors 2020 by Srdja Matovic, CCC, November 13, 2019

2020 ...

AB search with NN on GPU... by Srdja Matovic, CCC, August 13, 2020 » Neural Networks ^[29]
I stumbled upon this article on the new Nvidia RTX GPUs by Kai Laskos, CCC, September 10, 2020
Will AMD RDNA2 based Radeon RX 6000 series kick butt with Lc0? by Srdja Matovic, CCC, November 01, 2020
Zeta with NNUE on GPU? by Srdja Matovic, CCC, March 31, 2021 » Zeta, NNUE

External Links

OpenCL

CUDA

CUDA from Wikipedia
CUDA Zone | NVIDIA Developer
Nvidia CUDA Compiler (NVCC) from Wikipedia
Compiling CUDA with clang — LLVM Clang documentation
CppCon 2016: "Bringing Clang and C++ to GPUs: An Open-Source, CUDA-Compatible GPU C++ Compiler" by Justin Lebar, YouTube Video ^[30]

Deep Learning

Deep Learning | NVIDIA Developer » Deep Learning
NVIDIA cuDNN | NVIDIA Developer
Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster
Deep Learning in a Nutshell: Core Concepts by Tim Dettmers, Parallel Forall, November 3, 2015
Deep Learning in a Nutshell: History and Training by Tim Dettmers, Parallel Forall, December 16, 2015
Deep Learning in a Nutshell: Sequence Learning by Tim Dettmers, Parallel Forall, March 7, 2016
Deep Learning in a Nutshell: Reinforcement Learning by Tim Dettmers, Parallel Forall, September 8, 2016
Faster deep learning with GPUs and Theano
Theano (software) from Wikipedia
TensorFlow from Wikipedia

Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper

Chess Programming

References

[1] Graphics processing unit - Wikimedia Commons
[2] CUDA C Programming Guide v7.0, Appendix G. Compute Capabilities
[3] AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
[4] NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog by Ronny Krashinsky, Olivier Giroux, Stephen Jones, Nick Stam and Sridhar Ramaswamy, May 14, 2020
[5] CUDA 11 Features Revealed | NVIDIA Developer Blog by Pramod Ramarao, May 14, 2020
[6] INSIDE VOLTA
[7] AnandTech - Nvidia Turing Deep Dive page 6
[8] Wikipedia - Ampere microarchitecture
[9] CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
[10] AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
[11] host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
[12] host-device latencies? by Srdja Matovic, AMD Developer Community, Feb 28, 2019
[13] Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
[14] Photon mapping from Wikipedia
[15] Cell (microprocessor) from Wikipedia
[16] Jetson TK1 Embedded Development Kit | NVIDIA
[17] Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
[18] PowerVR from Wikipedia
[19] Density functional theory from Wikipedia
[20] Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
[21] Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
[22] Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
[23] Parallel Thread Execution from Wikipedia
[24] NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
[25] ankan-ban/perft_gpu · GitHub
[26] Tensor processing unit from Wikipedia
[27] GeForce 20 series from Wikipedia
[28] Phoronix Test Suite from Wikipedia
[29] kernel launch latency - CUDA / CUDA Programming and Performance - NVIDIA Developer Forums by LukeCuda, June 18, 2018
[30] Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
[31] Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013

@@ Line 4: / Line 4: @@
 '''GPU''' (Graphics Processing Unit),<br/>
-a specialized processor primarily intended to fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and massive parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.
+a specialized processor primarily intended to fast [https://en.wikipedia.org/wiki/Image_processing image processing]. GPUs may have more raw computing power than general purpose [https://en.wikipedia.org/wiki/Central_processing_unit CPUs] but need a specialized and parallelized way of programming. [[Leela Chess Zero]] has proven that a [[Best-First|Best-first]] [[Monte-Carlo Tree Search|Monte-Carlo Tree Search]] (MCTS) with [[Deep Learning|deep learning]] methodology will work with GPU architectures.
+=History=
+In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer, like  [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA]in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines, such as [https://en.wikipedia.org/wiki/Quake_(video_game) Quake], could use instead the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like CUDA and OpenCL emerged and gained in popularity.
 =GPGPU=
-The traditional job of a GPU is to take the [https://en.wikipedia.org/wiki/Three-dimensional_space x,y,z coordinates] of [https://en.wikipedia.org/wiki/Triangle_strip triangles], and [https://en.wikipedia.org/wiki/3D_projection map] these triangles to [https://en.wikipedia.org/wiki/Glossary_of_computer_graphics#screen_space screen space] through a [https://en.wikipedia.org/wiki/Matrix_multiplication matrix multiplication]. As video game graphics grew more sophisticated, the number of triangles per scene grew larger. GPUs similarly grew in size to massively parallel behemoths capable of performing billions of transformations hundreds of times per second.
+Early efforts to leverage a GPU for general-purpose computing required reformulating computational problems in terms of graphics primitives via graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirextX], followed by first GPGPU frameworks such as [https://en.wikipedia.org/wiki/Lib_Sh Sh/RapidMind] or [https://en.wikipedia.org/wiki/BrookGPU Brook] and finally [https://en.wikipedia.org/wiki/CUDA CUDA] and [https://www.chessprogramming.org/OpenCL OpenCL].
-These lists of triangles were specified in Graphics APIs like [https://en.wikipedia.org/wiki/OpenGL OpenGL] or [https://en.wikipedia.org/wiki/DirectX DirectX]. But video game programmers demanded more flexibility from their hardware: such as lighting, transparency, and reflections. This flexibility was granted with specialized programming languages, called [https://en.wikipedia.org/wiki/Shader#Vertex_shaders vertex shaders] or [https://en.wikipedia.org/wiki/Shader#Pixel_shaders pixel shaders]. GPUs evolved to accelerate general purpose compute from pixel shader and vertex shader programmers, and even merged the functionality into "universal" shaders (which can perform either vertex shading or pixel shading).
-Today, these universal shaders are flexible enough to provide General Purpose compute for GPUs (GPGPU). GPGPU languages, such as OpenCL or CUDA, is how the programmer can access this capability.
 == Khronos OpenCL ==
+OpenCL specified by the [https://en.wikipedia.org/wiki/Khronos_Group Khronos Group] is widely adopted across all kind of hardware accelerators from different vendors.
-The [https://en.wikipedia.org/wiki/Khronos_Group Khronos group] is a committee formed to oversee the [https://en.wikipedia.org/wiki/OpenGL OpenGL], [[OpenCL]], and [https://en.wikipedia.org/wiki/Vulkan_(API) Vulkan] standards. Although compute shaders exist in all languages, OpenCL is the designated general purpose compute language.
+* [https://www.khronos.org/conformance/adopters/conformant-products/opencl List of OpenCL Conformant Products]
-OpenCL 1.2 is widely supported by [[AMD]], [[Nvidia]], and [[Intel]]. OpenCL 2.0, although specified in 2013, has had a slow rollout, and the specific features aren't necessarily widespread in modern GPUs yet. AMD continues to target OpenCL 2.0 support in their ROCm environment, while Nvidia has implemented some OpenCL 2.0 features.
 * [https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf OpenCL 1.2 Specification]
@@ Line 27: / Line 25: @@
 * [https://www.khronos.org/registry/OpenCL//sdk/2.0/docs/man/xhtml/ OpenCL 2.0 Reference]
-== Nvidia Software overview ==
+* [https://www.khronos.org/registry/OpenCL/specs/3.0-unified/pdf/ OpenCL 3.0 Specifications]
-[[Nvidia]] [https://en.wikipedia.org/wiki/CUDA CUDA] is their general purpose compute framework. CUDA has a [[Cpp|C++]] compiler based on [https://en.wikipedia.org/wiki/LLVM LLVM] / [https://en.wikipedia.org/wiki/Clang clang], which compiles into an assembly-like language called [https://en.wikipedia.org/wiki/Parallel_Thread_Execution PTX]. Nvidia device drivers take PTX and compile that down to the final machine code (called Nvidia SASS). Nvidia keeps PTX portable between its GPUs, while its SASS assembly language may change from year-to-year as Nvidia releases new GPUs. A defining feature of CUDA was the "single source" C++ compiler, the same compiler would work with both CPU host-code and GPU device-code. This meant that the data-structures and even pointers from the CPU can be shared directly with the GPU code.
+== AMD ==
-* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]
+[[AMD]] supports language frontends like OpenCL, HIP and C++ AMP, as well as OpenMP offload directives. With [https://rocmdocs.amd.com/en/latest/ ROCm] it offers its own parallel compute platform.
-* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
-* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
-== AMD Software Overview ==
-[[AMD|AMD's]] original software stack, called [https://en.wikipedia.org/wiki/AMDGPU AMDGPU-pro], provides OpenCL 1.2 and 2.0 capabilities on [[Linux]] and [[Windows]]. However, most of AMD's efforts today are focused on an experimental framework called [https://en.wikipedia.org/wiki/OpenCL#Implementations ROCm]. ROCm is AMD's open source compiler and device driver stack intended for general purpose compute. ROCm supports two languages: [https://en.wikipedia.org/wiki/GPUOpen#AMD_Boltzmann_Initiative HIP] (a CUDA-like single-source C++ compiler also based on LLVM/clang), and OpenCL 2.0. ROCm only works on Linux machines supporting modern hardware, such as [https://en.wikipedia.org/wiki/PCI_Express#3.0 PCIe 3.0] and relatively recent GPUs (such as the [https://en.wikipedia.org/wiki/AMD_Radeon_500_series RX 580], and [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] GPUs).
-AMD regularly publishes the assembly language details of their architectures. Their "GCN Assembly" changes slightly from generation to generation, but the fundamental principles have remained the same.
-AMD's OpenCL documentation, especially the "OpenCL Programming Guide" and the "Optimization Guide" are good places to start for beginners looking to program their GPUs. For Linux developers, the ROCm environment is under active development and has enough features to get code working well.
 * [https://rocm.github.io/ ROCm Homepage]
@@ Line 49: / Line 37: @@
 * [https://developer.amd.com/wp-content/resources/Vega_Shader_ISA_28July2017.pdf Vega Instruction Set]
-== Other 3rd party tools ==
+== Nvidia ==
-* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (GPGPU API by Microsoft)
+[https://en.wikipedia.org/wiki/CUDA CUDA] is the parallel computing platform by [[Nvidia]]. It supports language frontends like C, C++, Fortran and OpenCL, as well as offload directives via [https://en.wikipedia.org/wiki/OpenACC OpenACC] and [https://en.wikipedia.org/wiki/OpenMP OpenMP].
-* [https://en.wikipedia.org/wiki/OpenMP OpenMP] Device Offload
-* [https://en.wikipedia.org/wiki/OpenACC OpenACC] Device Offload
-* [https://en.wikipedia.org/wiki/Metal_(API) Metal] (GPU and GPGPU API by Apple)
-* [https://en.wikipedia.org/wiki/OneAPI_(programming_model) oneAPI] (Data Parallel C++ by Intel)
-=The SIMT Programming Model=
+* [https://developer.nvidia.com/cuda-zone Nvidia CUDA Zone]
+* [https://docs.nvidia.com/cuda/parallel-thread-execution/index.html Nvidia PTX ISA]
-CUDA, OpenCL, ROCm HIP, all have the same model of implicitly parallel programming. All threads are given an identifier: a threadIdx in CUDA or local_id in OpenCL. Aside from this index, all threads of a kernel will execute the same code. The only way to alter the behavior of code is to use this threadIdx to access different data.
+* [https://docs.nvidia.com/cuda/index.html Nvidia CUDA Toolkit Documentation]
-The executed code is always implicitly [[SIMD]]. Instead of thinking of SIMD-lanes, each lane is considered its own thread. The smallest group of threads is called a Warp (Nvidia) or Wavefront (AMD). Nvidia GPUs execute 32 threads per warp, while AMD GCN GPUs execute 64 threads per wavefront. All threads within a Warp or Wavefront share an instruction pointer. Consider the following CUDA code:
-    if(threadIdx.x == 0){
-        doA();
-    } else {
-        doB();
-    }
-While there is only one thread in the warp that has threadIdx == 0, all 32 threads of the warp will have their shared instruction pointer execute doA() together. To keep the code semantically correct, threads #1 through #31 will have their Nvidia Predicate-register cleared (or AMD Execution Mask cleared), which means the thread will throw away the work after executing a specific statement. For those familiar with x64 AVX code, a GPU thread is comparable to a SIMD-lane in AVX. All lanes of an AVX instruction will execute any particular instruction, but you may throw away the results of some registers using mask or comparison instructions.
-Once doA() is complete, the machine will continue with doB(). In this case, thread #0 will have its execution mask cleared, while threads #1 through #31 will actually compute the results of doB().
-This highlights the fundamental trade off of the GPU platform. GPUs have many threads of execution, but they are forced to execute with their warps or wavefronts. In complicated loops or trees of if-statements, this thread divergence problem can cause your code to potentially leave many hardware threads idle. In the above example code, 97% of the threads will be effectively idle during doA(), while 3% of the threads will be idle during doB().
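The idle percentages quoted above can be checked with a small host-side model. The following is only a sketch in plain C++, not GPU code: the names `run_divergent_branch`, `DivergenceStats` and `idle_percent` are hypothetical, and the 32-lane warp with a boolean execution mask is an assumed simplification of what the hardware does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical model of one 32-lane warp; this is not a GPU API.
constexpr std::size_t kWarpSize = 32;

struct DivergenceStats {
    std::size_t active_in_a;  // lanes whose results are kept while doA() issues
    std::size_t active_in_b;  // lanes whose results are kept while doB() issues
};

// Models `if (lane == 0) doA(); else doB();` on one warp: all lanes step
// through both branches, and the execution mask decides whose work counts.
DivergenceStats run_divergent_branch() {
    std::array<bool, kWarpSize> mask{};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
        mask[lane] = (lane == 0);

    DivergenceStats stats{0, 0};
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
        if (mask[lane]) ++stats.active_in_a;    // pass 1: doA()
    for (std::size_t lane = 0; lane < kWarpSize; ++lane)
        if (!mask[lane]) ++stats.active_in_b;   // pass 2: doB(), mask inverted
    return stats;
}

// Percentage of the warp's lanes that sit idle for a given active-lane count.
double idle_percent(std::size_t active) {
    assert(active <= kWarpSize);
    return 100.0 * static_cast<double>(kWarpSize - active) / kWarpSize;
}
```

With one active lane during doA() the model gives 31/32 idle lanes (about 97%), and with 31 active lanes during doB() it gives 1/32 (about 3%).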
+== Further ==
-== Blocks and Workgroups ==
+* [https://en.wikipedia.org/wiki/Metal_(API) Metal] (Apple)
+* [https://en.wikipedia.org/wiki/OneAPI_(programming_model) oneAPI] (Intel)
+* [https://en.wikipedia.org/wiki/C%2B%2B_AMP C++ AMP] (Microsoft)
+* [https://en.wikipedia.org/wiki/DirectCompute DirectCompute] (Microsoft)
+* [https://en.wikipedia.org/wiki/OpenACC OpenACC] (offload directives)
+* [https://en.wikipedia.org/wiki/OpenMP OpenMP] (offload directives)
-Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (Nvidia), sharing L1 cache, shared memory and other resources. Because of the tight coupling of L1 cache and Shared Memory, these 1024 threads can communicate extremely efficiently. Case in point: both Nvidia PTX and AMD GCN implement thread barriers as a single assembly language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
+=SIMT and SIMD on GPU=
-== Grids and NDRange ==
+GPUs run multiple threads in [https://en.wikipedia.org/wiki/Single_instruction,_multiple_threads SIMT] fashion and are able to hide memory latencies by running a multitude of SIMT waves on the same [https://en.wikipedia.org/wiki/SIMD SIMD] unit.
-While warps, blocks, wavefronts and workgroups are concepts that the machine executes... Grids and NDRanges are the scope of the problem specified by a programmer. For example, the 1920x1080 screen could be defined as a Grid with 2073600 threads to execute (likely organized as a 2-dimensional 1920x1080 grid for convenience). Specifying these 2,073,600 work items is the purpose of a CUDA Grid or OpenCL NDRange.
+=Memory Model=
-The programmer may choose to cut up the 1920x1080 screen into blocks of size 32x32 pixels. Or maybe an algorithm is horizontal in nature, and it may be more convenient to work with blocks of 1x1024 pixels instead. Or maybe the block-sizes have been set to some video standards, and maybe 8x8 blocks (64-threads) are the biggest you can practically work with (say MPEG-2 decoder 8x8 macroblocks). Regardless, the programmer chooses a block size which is most convenient and optimized for their purposes. To complete this hypothetical example, a 1920x1080 screen could be split up into 60x34 CUDA Blocks (or OpenCL Workgroups), each covering 32x32 pixels with 1024 CUDA Threads (or OpenCL Workitems) each.
+OpenCL offers the following memory model for the programmer:
-These blocks and workgroups will execute with as much parallel processing as the underlying hardware can support: roughly 150 CUDA Blocks or OpenCL Workgroups at a time on a typical midrange GPU circa 2019 (such as an Nvidia 2060 Super or AMD 5700). The most important note is that blocks within a grid (or workgroups within an NDRange) are not guaranteed to execute concurrently with each other. Some degree of sequential processing may happen. If thread #0 creates a Spinlock waiting for thread #1000000 to communicate with it, modern hardware will probably never have the two threads executing concurrently with each other, and the code would likely time out. In practice, the easiest mechanism for Grid or NDRange sized synchronization is to wait for the kernel to finish executing: to have the CPU wait and process the results between Grids or NDRanges.
+* __private - usually registers, accessible only by a single work-item (OpenCL) or thread (CUDA).
+* __local - scratch-pad memory shared across the work-items of a work-group (OpenCL) or the threads of a block (CUDA).
-For example: LeelaZero will schedule an NDRange for each [https://github.com/leela-zero/leela-zero/blob/next/src/kernels/convolve1.opencl Convolve operation], as well as merge and other primitives. The convolve operation is over a 3-dimensional NDRange for <channel, output, row_batch>. To build up a full CNN operation, the CPU will schedule different operations for the GPU: convolve, merge, transform and more.
+* __constant - read-only variable.
+* __global - usually VRAM, accessible by all work-items (OpenCL) or threads (CUDA).
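The 1920x1080 decomposition into 32x32 blocks discussed for Grids and NDRanges is a round-up division per axis. The sketch below does that arithmetic in plain C++; `Dim2`, `blocks_needed`, `grid_for` and `padding_threads` are hypothetical helper names, not part of CUDA or OpenCL.

```cpp
#include <cassert>
#include <cstddef>

struct Dim2 { std::size_t x, y; };

// Blocks needed to cover `extent` items with blocks of `block` items (rounds up).
std::size_t blocks_needed(std::size_t extent, std::size_t block) {
    assert(block > 0);
    return (extent + block - 1) / block;
}

// Grid dimensions (in blocks) that cover the whole image.
Dim2 grid_for(Dim2 image, Dim2 block) {
    return { blocks_needed(image.x, block.x), blocks_needed(image.y, block.y) };
}

// Threads launched beyond the image edge; a kernel has to mask these off
// with a bounds check on its global index.
std::size_t padding_threads(Dim2 image, Dim2 block) {
    Dim2 g = grid_for(image, block);
    return g.x * block.x * g.y * block.y - image.x * image.y;
}
```

For a 1920x1080 image and 32x32 blocks this yields a 60x34 grid. Because 1080 is not a multiple of 32, the last row of blocks overhangs the image by 8 pixel rows, so 15360 of the launched threads have no pixel to process.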
-==Memory Model==
-OpenCL, CUDA, ROCM, and other GPU-languages all have a similar memory model.
-* __device__ (CUDA) or __global (OpenCL) memory -- OpenCL __global and CUDA __device__ memory exists on the GPU's VRAM. Any threads can access any part of __device__ or __global memory, although memory-ordering and caching details can get quite complicated if multiple threads simultaneously read and write to a particular memory location. Proper memory ordering with __threadfence() (CUDA) or mem_fence() (OpenCL) is essential to preventing memory-consistency issues.
-* __constant__ (CUDA) or __constant (OpenCL) memory -- Constants are not allowed to change during the execution of a particular kernel. Historically, this was used by Pixel Shaders as they read texture data. The texture-data could be computed and loaded onto the GPU, but the data was not allowed to change during the Pixel Shader's execution. Both Nvidia and AMD GPUs have special caches (and in AMD's case: special registers called sGPRs) which accelerate constant-data. The cache associated with this memory space is sometimes called K$ (Konstant-cache), and has to be independently flushed if its data ever changes. The main benefit in both AMD and Nvidia systems is that K$ values are broadcast extremely efficiently to all threads in a wavefront, but only if all threads in a wavefront are reading from the same memory location. Instead of having 32 memory reads (Nvidia) or 64 memory reads (AMD GCN), a read from K$ can be optimized into a single read, broadcast to all 32 or 64 threads of a Warp or Wavefront.
-* __shared__ (CUDA) or __local (OpenCL) memory -- These are highly accelerated memory regions designed for threads to exchange data within a CUDA Block or OpenCL Workgroup. On AMD Systems, there is more Local "LDS" memory than even L1 Cache (GCN) or L0 Cache (RDNA).
-* Default (CUDA) or __private (OpenCL) Memory -- Private memory typically maps to a GPU-register, and is inaccessible to other threads. If a kernel requires more memory than what can exist in GPU-registers, the data will automatically spill over into global VRAM (with an associated performance penalty). In practice, this spillover is well interleaved, well-optimized, and reduced to as small a subset as possible through compiler optimizations.
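How the memory spaces listed above cooperate can be illustrated with a per-work-group sum. This is a sequential host-side sketch in plain C++ under stated assumptions, not a kernel: `workgroup_sums` is a hypothetical name, the vector `local` stands in for __local / __shared__ memory, `priv` for a __private register, and the input and output vectors for __global memory. A real kernel would need a barrier between the steps marked below.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `global_in` one work-group at a time and returns one sum per group.
std::vector<int> workgroup_sums(const std::vector<int>& global_in,
                                std::size_t group_size) {
    // The tree reduction below assumes a power-of-two group size.
    assert(group_size > 0 && (group_size & (group_size - 1)) == 0);
    assert(global_in.size() % group_size == 0);

    std::vector<int> global_out;
    std::vector<int> local(group_size);  // models the scratch-pad memory
    for (std::size_t base = 0; base < global_in.size(); base += group_size) {
        // Step 1: every work-item loads one value from global memory into a
        // private variable and publishes it in local memory.
        for (std::size_t lid = 0; lid < group_size; ++lid) {
            int priv = global_in[base + lid];
            local[lid] = priv;
        }
        // Step 2: tree reduction. Each pass of the outer loop is one
        // barrier-separated step on real hardware.
        for (std::size_t stride = group_size / 2; stride > 0; stride /= 2)
            for (std::size_t lid = 0; lid < stride; ++lid)
                local[lid] += local[lid + stride];
        // Step 3: work-item 0 writes the group's result back to global memory.
        global_out.push_back(local[0]);
    }
    return global_out;
}
```

The point of the pattern is that only the first load and the final store touch global memory; all intermediate traffic stays in the work-group's scratch-pad.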
@@ Line 121: / Line 87: @@
 = Architectures and Physical Hardware =
-The market is split into two categories, integrated and discrete GPUs. The first is the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for 3D CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in drivers, VRAM, or computation abilities.
+The market is split into two categories, integrated and discrete GPUs. The first is the most important by quantity, the second by performance. Discrete GPUs are divided into consumer brands for playing 3D games, professional brands for CAD/CGI programs and server brands for big-data and number-crunching workloads. Each brand offers different feature sets in driver, VRAM, or computation abilities.
 == AMD ==
 AMD's line of discrete GPUs is branded as Radeon for consumer, Radeon Pro for professional and Radeon Instinct for server.
+* [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units on Wikipedia]
 === CDNA ===
@@ Line 132: / Line 100: @@
 === Navi 2X RDNA 2.0 ===
-* [https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2 from Wikipedia]
+[https://en.wikipedia.org/wiki/RDNA_(microarchitecture)#RDNA_2 RDNA 2.0] cards were unveiled on October 28, 2020.
 * [https://developer.amd.com/wp-content/resources/RDNA2_Shader_ISA_November2020.pdf RDNA 2 Instruction Set Architecture]
-RDNA 2.0 cards were unveiled on October 28, 2020.
-* [https://en.wikipedia.org/wiki/Radeon_RX_6000_series Radeon RX 6000 series from Wikipedia]
 === Navi RDNA 1.0 ===
+[https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA 1.0] cards were unveiled on July 7, 2019.
 * [https://www.amd.com/system/files/documents/rdna-whitepaper.pdf RDNA Whitepaper]
 * [https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
-* [https://en.wikipedia.org/wiki/RDNA_(microarchitecture) RDNA (microarchitecture) from Wikipedia]
-RDNA cards were first released in 2019. RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads. Compute Units have 2x32 wide SIMD units, each of which executes 32 threads per clock tick. A Wave64 workgroup will execute on a single SIMD unit, but over two clock ticks. It should be noted that these Wave32 still have 5 cycles of latency before registers can be reused, so a Wave64 executing over two clock ticks will have fewer stalls than a Wave32.
-* [https://en.wikipedia.org/wiki/Radeon_RX_5000_series Radeon RX 5000 series from Wikipedia]
-* Radeon 5700 XT
-* Radeon 5700
 === Vega GCN 5th gen ===
-[https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
+[https://en.wikipedia.org/wiki/Radeon_RX_Vega_series Vega] cards were unveiled on August 14, 2017.
-Vega cards were first released in 2017. Vega is the last in the line of the GCN Architecture: 64 threads per wavefront. Each compute unit contains 4x SIMD units, supporting a total of 40 wavefronts per compute unit (a queue of 10-wavefronts per SIMD Unit). Each SIMD unit contains 16 vALUs for general compute + 1 sALU for branching and constant logic. Each SIMD unit executes the same instruction over four clock ticks (16 vALUs x 4 clock ticks == 64 threads per Wavefront).
-Vega specifically added Packed FP16 instructions, such as dot-product and packed add and packed multiply. From a programming level, these packed FP16 instructions are SIMD-within-SIMD, each SIMD thread could operate its own SIMD FP16 instruction akin to AVX or SSE from the x64 architecture.
-* Radeon VII
+* [https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
-* Vega64
-* Vega56
 === Polaris GCN 4th gen ===
-Polaris cards were first released in 2016 under the AMD Radeon 400 series name.
+[https://en.wikipedia.org/wiki/Graphics_Core_Next#Graphics_Core_Next_4 Polaris] cards were first released in 2016.
-[https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
+* [https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
-* RX 580
-* RX 570
-* RX 560
 == Apple ==
@@ Line 177: / Line 126: @@
 === M1 ===
-Apple released its M1 SoC with integrated GPU for desktops and notebooks in 2020.
+Apple released its M1 SoC (system on a chip) with integrated GPU for desktops and notebooks in 2020.
 * [https://en.wikipedia.org/wiki/Apple_M1 Apple M1 on Wikipedia]
@@ Line 197: / Line 146: @@
 The [https://en.wikipedia.org/wiki/Intel_Xe Intel Xe] line of GPUs (released since 2020) is divided into Xe-LP (low-power), Xe-HPG (high-performance-gaming), Xe-HP (high-performance) and Xe-HPC (high-performance-computing).
-* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Xe 'Gen12' GPUs on Wikipedia]
+* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units#Gen12 List of Intel Gen12 GPUs on Wikipedia]
 ==Nvidia==
 Nvidia's line of discrete GPUs is branded as GeForce for consumer, Quadro for professional and Tesla for server.
+* [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units on Wikipedia]
 === Ampere Architecture ===
 The [https://en.wikipedia.org/wiki/Ampere_(microarchitecture) Ampere microarchitecture] was announced on May 14, 2020 <ref>[https://devblogs.nvidia.com/nvidia-ampere-architecture-in-depth/ NVIDIA Ampere Architecture In-Depth | NVIDIA Developer Blog] by [https://people.csail.mit.edu/ronny/ Ronny Krashinsky], [https://cppcast.com/guest/ogiroux/ Olivier Giroux], [https://blogs.nvidia.com/blog/author/stephenjones/ Stephen Jones], [https://blogs.nvidia.com/blog/author/nick-stam/ Nick Stam] and [https://en.wikipedia.org/wiki/Sridhar_Ramaswamy Sridhar Ramaswamy], May 14, 2020</ref>. The Nvidia A100 GPU based on the Ampere architecture delivers a generational leap in accelerated computing in conjunction with CUDA 11 <ref>[https://devblogs.nvidia.com/cuda-11-features-revealed/ CUDA 11 Features Revealed | NVIDIA Developer Blog] by [https://devblogs.nvidia.com/author/pramarao/ Pramod Ramarao], May 14, 2020</ref>.
-* DGX A100
+=== Turing Architecture ===
-* HGX A100
+[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They were the first consumer cards to launch with RTX features for [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing]. They were also the first consumer cards to launch with TensorCores, used for matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]]. The Turing GTX line of chips does not offer RTX or TensorCores.
-=== Turing Architecture ===
 [https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]
-[https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] cards were first released in 2018. They are the first consumer cores to launch with RTX, or [https://en.wikipedia.org/wiki/Ray_tracing_(graphics) raytracing], features. RTX instructions will more quickly traverse an [https://en.wikipedia.org/wiki/Minimum_bounding_box#Axis-aligned_minimum_bounding_box aabb] [https://en.wikipedia.org/wiki/Bounding_volume_hierarchy tree] to discover ray-intersections with lists of bounding-boxes, accelerating raytracing performance. These are also the first consumer cards to launch with Tensor cores, 4x4 matrix multiplication FP16 instructions to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
-* RTX 2080 Ti
-* RTX 2080
-* RTX 2070 Ti
-* RTX 2070 Super
-* RTX 2070
-* RTX 2060 Super
-* RTX 2060
-* GTX 1660 -- Low-end GPU without Tensor cores or RTX Cores.
 === Volta Architecture ===
+[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. They were the first cards to launch with TensorCores, supporting matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
 [https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]
-[https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] cards were released in 2017. Only Tesla and Titan cards were produced in this generation, aiming only for the most expensive end of the market. They were the first cards to launch with Tensor cores, supporting 4x4 FP16 matrix multiplications to accelerate [[Neural Networks#Convolutional|convolutional neural networks]].
-* Tesla V100
-* Titan V
 === Pascal Architecture ===
+[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.
 [https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]
-[https://en.wikipedia.org/wiki/Pascal_(microarchitecture) Pascal] cards were first released in 2016.
-* Tesla P100
-* Titan Xp
-* GTX 1080 Ti
-* GTX 1080
-* GTX 1070 Ti
-* GTX 1060
-* GTX 1050
-* GTX 1030
 === Maxwell Architecture ===
+[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.
 [https://web.archive.org/web/20170721113746/http://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF Architecture Whitepaper on archive.org]
-[https://en.wikipedia.org/wiki/Maxwell_(microarchitecture) Maxwell] cards were first released in 2014.
 == PowerVR - Imagination Technologies ==
@@ Line 260: / Line 184: @@
 =Instruction Throughput=
-GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 Terascale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.
+GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 TeraScale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.
 ==Integer Instruction Throughput==
@@ Line 317: / Line 241: @@
 =Deep Learning=
-GPUs were originally intended to process matrix multiplications for graphical transformations and rendering. [[Neural Networks#Convolutional|Convolutional Neural Networks]] can have their operations interpreted as a series of matrix multiplications. GPUs are therefore a natural fit to parallelize and process CNNs.
+GPUs are much better suited than CPUs to implementing and training [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom. This boom also affected game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source projects [[Leela Zero]], headed by [[Gian-Carlo Pascutto]] for [[Go]], and its [[Leela Chess Zero]] adaptation.
-GPUs traditionally operated on 32-bit floating point numbers. However, CNNs can make do with 16-bit half floats (FP16), or even 8-bit or 4-bit numbers. One thousand single-precision floats will take up 4kB of space, while one thousand FP16 values will take up 2kB of space. A half-float uses half the memory, eats only half the memory bandwidth, and takes only half the space in caches. As such, GPUs such as AMD Vega or Nvidia Volta added support for FP16 processing.
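The storage figures above follow directly from the bit width per value. A minimal sketch of that arithmetic in plain C++, with `buffer_bytes` as a hypothetical helper name and "kB" read as 1000 bytes, as in the text:

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed to store `count` values of `bits_per_value` bits each,
// rounded up to whole bytes (relevant for sub-byte formats such as 4-bit).
std::size_t buffer_bytes(std::size_t count, std::size_t bits_per_value) {
    assert(bits_per_value > 0);
    return (count * bits_per_value + 7) / 8;
}
```

One thousand values need 4000 bytes at 32 bits, 2000 bytes at 16 bits, 1000 bytes at 8 bits and 500 bytes at 4 bits, so each halving of precision also halves the bandwidth and cache footprint of the same tensor.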
-Specialized units, such as Nvidia Volta's "Tensor cores", can perform an entire 4x4 block of FP16 matrix multiplications in just one PTX assembly language statement. It is with these instructions that CNN operations are accelerated.
-GPUs are much more suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom,
-also affecting game playing programs combining CNN with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and the open source projects [[Leela Zero]] headed by [[Gian-Carlo Pascutto]] for [[Go]] and its [[Leela Chess Zero]] adaptation.
-=History=
-In the 1970s and 1980s RAM was expensive and Home Computers used custom graphics chips to operate directly on registers/memory without a dedicated frame buffer, like [https://en.wikipedia.org/wiki/Television_Interface_Adaptor TIA] in the [[Atari 8-bit|Atari VCS]] gaming system, [https://en.wikipedia.org/wiki/CTIA_and_GTIA GTIA]+[https://en.wikipedia.org/wiki/ANTIC ANTIC] in the [[Atari 8-bit|Atari 400/800]] series, or [https://en.wikipedia.org/wiki/Original_Chip_Set#Denise Denise]+[https://en.wikipedia.org/wiki/Original_Chip_Set#Agnus Agnus] in the [[Amiga|Commodore Amiga]] series. The 1990s would make 3D graphics and 3D modeling more popular, especially for video games. Cards specifically designed to accelerate 3D math, such as the [https://en.wikipedia.org/wiki/Voodoo2 3dfx Voodoo2], were used by the video game community to play 3D graphics. Some game engines, such as [https://en.wikipedia.org/wiki/Quake_(video_game) Quake], could instead use the [[SIMD and SWAR Techniques|SIMD-capabilities]] of CPUs such as the [[Intel]] [[MMX]] instruction set or [[AMD|AMD's]] [[X86#3DNow!|3DNow!]]. Sony's 3D capable chip used in the PlayStation (1994) and Nvidia's 2D/3D combi chips like NV1 (1995) coined the term GPU for 3D graphics hardware acceleration. With the advent of the [https://en.wikipedia.org/wiki/Unified_shader_model unified shader architecture], like in Nvidia [https://en.wikipedia.org/wiki/Tesla_(microarchitecture) Tesla] (2006), ATI/AMD [https://en.wikipedia.org/wiki/TeraScale_(microarchitecture) TeraScale] (2007) or Intel [https://en.wikipedia.org/wiki/Intel_GMA#GMA_X3000 GMA X3000] (2006), GPGPU frameworks like CUDA and OpenCL emerged and gained in popularity.
-The large number of regular [https://en.wikipedia.org/wiki/Matrix_multiplication matrix multiplications] led to natural SIMD-style algorithms. The 3D graphics community drew upon the rich history of vector-compute and SIMD-compute from the supercomputers of the 1970s and 1980s. As such, many publications relating to [[Cray X-MP|Cray-vector supercomputers]] or the [[Connection Machine]] supercomputer easily apply to modern GPUs. For example, all the algorithms described in the 1986 publication ''Data Parallel Algorithms'' <ref>[[Mathematician#Hillis|W. Daniel Hillis]], [[Mathematician#GSteele|Guy L. Steele, Jr.]] ('''1986'''). ''[https://dl.acm.org/citation.cfm?id=7903 Data parallel algorithms]''. [[ACM#Communications|Communications of the ACM]], Vol. 29, No. 12, Special Issue on Parallelism</ref> can be efficiently executed on a modern GPU workgroup (roughly 256 GPU threads). The ''Data Parallel Algorithms'' paper is a beginner-level algorithms paper, demonstrating simple and efficient [[Parallel Prefix Algorithms|parallel-prefix sum]], parallel-linked list traversal, and parallel RegEx matching on the 4096x parallel Connection Machine-2 supercomputer.
-Modern papers on GPUs, such as Nvidia's excellent ''Parallel Prefix Sum (Scan) with CUDA (GPU Gems 3)'' <ref>[https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html Chapter 39. Parallel Prefix Sum (Scan) with CUDA (GPU Gems 3)]</ref>, are built on top of these papers from the 1980s or 1990s. As such, the beginner will find it far easier to read the papers from the 1980s or 90s before attempting to read a modern piece like GPU Gems 3.
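As an illustration of the parallel-prefix sum mentioned above, here is a minimal sketch of the doubling-offset inclusive scan commonly attributed to Hillis and Steele. It is a sequential simulation in plain C++ under stated assumptions: `hillis_steele_scan` is a hypothetical name, each inner-loop iteration stands for one GPU thread, and the buffer swap stands for the barrier between rounds. It is not the work-efficient variant described in GPU Gems 3.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive prefix sum in about log2(n) rounds. In the round with offset d,
// element i (for i >= d) adds the previous round's value at i - d.
std::vector<int> hillis_steele_scan(std::vector<int> data) {
    std::vector<int> next(data.size());
    for (std::size_t offset = 1; offset < data.size(); offset *= 2) {
        // Reads come only from `data`, writes go only to `next`, which is
        // what makes every i independent within a round.
        for (std::size_t i = 0; i < data.size(); ++i)
            next[i] = (i >= offset) ? data[i] + data[i - offset] : data[i];
        data.swap(next);  // plays the role of the barrier between rounds
        assert(next.size() == data.size());  // double buffers stay the same size
    }
    return data;
}
```

The scan performs more additions in total than a sequential loop, but every round is fully parallel, which is the trade such data-parallel algorithms make.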
 =Chess Engines=
@@ Line 460: / Line 370: @@
 * [https://en.wikipedia.org/wiki/General-purpose_computing_on_graphics_processing_units General-purpose computing on graphics processing units (GPGPU) from Wikipedia]
 * [https://en.wikipedia.org/wiki/List_of_AMD_graphics_processing_units List of AMD graphics processing units from Wikipedia]
+* [https://en.wikipedia.org/wiki/List_of_Intel_graphics_processing_units List of Intel graphics processing units from Wikipedia]
 * [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units List of Nvidia graphics processing units from Wikipedia]
 * [https://developer.nvidia.com/ NVIDIA Developer]

Difference between revisions of "GPU"

Revision as of 11:43, 18 April 2021
