Revision as of 16:06, 9 August 2019


GPU (Graphics Processing Unit),
a specialized processor primarily intended for fast image processing. GPUs may have more raw computing power than general-purpose CPUs, but they require a specialized, massively parallel style of programming. Leela Chess Zero has proven that a best-first Monte-Carlo Tree Search (MCTS) with a deep learning methodology can work well on GPU architectures.

GPGPU

The traditional job of a GPU is to take the x,y,z coordinates of triangles, and map these triangles to screen space through a matrix multiplication. As video game graphics grew more sophisticated, the number of triangles per scene grew larger. GPUs similarly grew in size to massively parallel behemoths capable of performing billions of transformations hundreds of times per second.

These lists of triangles were specified in Graphics APIs like DirectX. But video game programmers demanded more flexibility from their hardware: such as lighting, transparency, and reflections. This flexibility was granted with specialized programming languages, called vertex shaders or pixel shaders.

Eventually, the fixed functionality of GPUs disappeared, and GPUs became primarily massively parallel general-purpose computers. Instead of using vertex shaders inside of DirectX, general compute languages were designed to make sense outside of a graphical setting.

Khronos OpenCL

The Khronos group is a committee formed to oversee the OpenGL, OpenCL, and Vulkan standards. Although compute shaders exist in all of these standards, OpenCL is the designated general-purpose compute language.

OpenCL 1.2 is widely supported by AMD, NVidia, and Intel. OpenCL 2.0, although specified in 2013, has had a slow rollout, and the specific features aren't necessarily widespread in modern GPUs yet. AMD continues to target OpenCL 2.0 support in their ROCm environment, while NVidia has implemented some OpenCL 2.0 features.

NVidia Software overview

NVidia CUDA is their general purpose compute framework. CUDA has a C++ compiler based on LLVM / clang, which compiles into an assembly-like language called PTX. NVidia device drivers take PTX and compile it down to the final machine code (called NVidia SASS). NVidia keeps PTX portable between its GPUs, while its SASS assembly language may change from year to year as NVidia releases new GPUs. A defining feature of CUDA is its "single source" C++ compiler: the same compiler works with both CPU host-code and GPU device-code. This means that data structures and even pointers from the CPU can be shared directly with the GPU code.

AMD Software Overview

AMD's original software stack, called AMDGPU-pro, provides OpenCL 1.2 and 2.0 capabilities on Linux and Windows. However, most of AMD's efforts today are on an experimental framework called ROCm. ROCm is AMD's open source compiler and device driver stack intended for general purpose compute. ROCm supports two languages: HIP (a CUDA-like single-source C++ compiler, also based on LLVM/clang) and OpenCL 2.0. ROCm only works on Linux machines with modern hardware, such as PCIe 3.0 and relatively recent GPUs (such as the RX 580 and Vega GPUs).

AMD regularly publishes the assembly language details of their architectures. Their "GCN Assembly" changes slightly from generation to generation, but the fundamental principles have remained the same.

AMD's OpenCL documentation, especially the "OpenCL Programming Guide" and the "Optimization Guide" are good places to start for beginners looking to program their GPUs. For Linux developers, the ROCm environment is under active development and has enough features to get code working well.

Other 3rd party tools

  • DirectCompute (GPGPU API by Microsoft)
  • OpenMP 4.5 Device Offload

The SIMT Programming Model

CUDA, OpenCL, and ROCm HIP all share the same model of implicitly parallel programming. All threads are given an identifier: a threadIdx in CUDA or local_id in OpenCL. Aside from this index, all threads of a kernel execute the same code. The only way to alter the behavior of code is to use this index to access different data.

The executed code is always implicitly SIMD. Instead of thinking of SIMD-lanes, each lane is considered its own thread. The smallest group of threads is called a CUDA Warp, or OpenCL Wavefront. NVidia GPUs execute 32 threads per warp, while AMD GCN GPUs execute 64 threads per wavefront. All threads within a Warp or Wavefront share an instruction pointer. Consider the following CUDA code:

   if(threadIdx.x == 0){
       doA(); 
   } else {
       doB(); 
   }

While only one thread in the warp has threadIdx.x == 0, all 32 threads of the warp will have their shared instruction pointer execute doA() together. To keep the code semantically correct, threads #1 through #31 will have their NVidia predicate register (or AMD execution mask) cleared, which means those threads throw away the work after executing each statement. For those familiar with x64 AVX code, a GPU thread is comparable to a SIMD lane in AVX: all lanes of an AVX instruction execute any particular instruction, but you may throw away the results of some lanes using mask or comparison instructions.

Once doA() is complete, the machine will continue with doB(). In this case, thread #0 will have its execution mask cleared, while threads #1 through #31 will actually commit the results of doB().

This highlights the fundamental trade off of the GPU platform. GPUs have many threads of execution, but they are forced to execute with their warps or wavefronts. In complicated loops or trees of if-statements, this thread divergence problem can cause your code to potentially leave many hardware threads idle.

Blocks and Workgroups

Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. 1024 threads can work together on a modern GPU Compute Unit (AMD) or Streaming Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.

Grids and NDRange

CUDA Grids and OpenCL NDRanges are the top of the scaling hierarchy in the programming model. Many blocks can be specified in a CUDA Grid, while many workgroups operate over an OpenCL NDRange.

The underlying hardware supports running many workgroups in parallel, across different compute units. An AMD Vega64 has 64 compute units, for example, while an NVidia RTX 2070 has 40 streaming multiprocessors. The hardware scheduler can fit many blocks and workgroups per compute unit. The exact number depends on the number of registers, the amount of memory, and the number of wavefronts a particular workgroup uses.

CUDA Grids and OpenCL NDRanges may operate in parallel, or may be traversed sequentially if the GPU doesn't have enough parallel resources.

Architectures and Physical Hardware

Each generation, the manufacturers create a series of cards with set amounts of VRAM and SIMD cores. The market is split into three categories: server, professional, and consumer. Consumer cards are the cheapest and are primarily targeted at the video game market. Professional cards have better driver support for 3D programs like AutoCAD. Finally, server cards provide virtualization services, allowing cloud companies to virtually split their cards between customers.

While server and professional cards have more vRAM, consumer cards are more than adequate starting points for GPU Programmers.

GPUs use high-bandwidth RAM, such as GDDR6 or HBM2. These specialized RAM technologies are designed for the extremely parallel nature of GPUs and can provide 200 GBps to 1000 GBps of throughput. In comparison, a typical DDR4 channel provides about 20 GBps, and a dual-channel desktop will typically have under 50 GBps of bandwidth to DDR4 main memory.

NVidia

NVidia's consumer line of cards is Geforce, branded with RTX or GTX labels. Nvidia's professional line of cards is Quadro. Finally, Tesla cards constitute NVidia's server line.

NVidia's "Titan" line of Geforce cards use consumer drivers, but use professional or server class chips. As such, the Titan line can cost anywhere from $1000 to $3000 per card.

Turing Architecture

Architectural Whitepaper: https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf

Turing cards were first released in 2018. They are the first consumer cards to launch with RTX, or raytracing, features. RTX instructions more quickly traverse an AABB tree to discover ray intersections with lists of objects. These are also the first consumer cards to launch with Tensor cores: 4x4 FP16 matrix multiplication instructions to accelerate convolutional neural networks.

  • RTX 2080 Ti
  • RTX 2080
  • RTX 2070 Super
  • RTX 2070
  • RTX 2060 Super
  • RTX 2060
  • GTX 1660

Volta Architecture

Architecture Whitepaper: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

Volta cards were first released in 2017. Only Tesla and Titan cards were produced in this generation, aiming only at the most expensive end of the market. They were the first cards to launch with Tensor cores, supporting 4x4 FP16 matrix multiplications to accelerate convolutional neural networks.

  • Tesla V100
  • Titan V

Pascal Architecture

Architecture Whitepaper: https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf

Pascal cards were first released in 2016.

  • Tesla P100
  • Titan Xp
  • GTX 1080 Ti
  • GTX 1080
  • GTX 1070 Ti
  • GTX 1060
  • GTX 1050
  • GT 1030

AMD

RDNA 1.0

Architecture Slide Deck: https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf

RDNA cards were first released in 2019. RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads.

  • 5700 XT
  • 5700

Vega GCN 5th gen

Architecture Whitepaper: https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf

Vega cards were first released in 2017.

  • Radeon VII
  • Vega64
  • Vega56

Polaris GCN 4th gen

Architecture Whitepaper: https://www.amd.com/system/files/documents/polaris-whitepaper.pdf

  • RX 580
  • RX 570
  • RX 560

Inside

Modern GPUs consist of up to hundreds of SIMD or vector units, grouped into compute units. Each compute unit processes multiple Warps (Nvidia terminology) resp. Wavefronts (AMD terminology) in SIMT fashion. Each Warp resp. Wavefront runs n (32 or 64) threads simultaneously.

The Nvidia GeForce GTX 580, for example, is able to run 32 threads in one Warp, and a total of 24,576 threads, spread over 16 compute units with a total of 512 cores. [2] The AMD Radeon HD 7970 is able to run 64 threads in one Wavefront, and a total of 81,920 threads, spread over 32 compute units with a total of 2048 cores. [3] In practice, register and shared memory sizes limit the total number of threads.

Memory

The memory hierarchy of a GPU consists mainly of private memory (registers, accessed by a single thread resp. work-item), local memory (shared by the threads of a block resp. the work-items of a work-group), constant memory, different types of cache, and global memory. Size, latency, and bandwidth vary between vendors and architectures.

Here is the data for the Nvidia GeForce GTX 580 (Fermi) as an example: [4]

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache
  • 1.5 GiB to 3 GiB global memory

Here is the data for the AMD Radeon HD 7970 (GCN) as an example: [5]

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache
  • 3 GiB to 6 GiB global memory

Instruction Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN, RDNA), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, Radeon Pro, Radeon Instinct) and the specific model.

  • 32 bit Integer Performance
Depending on the architecture and operation, 32 bit integer throughput can be lower than 32 bit floating-point or 24 bit integer throughput.
  • 64 bit Integer Performance
Current GPU registers and Vector-ALUs are 32 bit wide and have to emulate 64 bit integer operations.[6] [7]
  • Mixed Precision Support
Newer architectures like Nvidia Turing and AMD Vega have mixed precision support. Vega doubles the FP16 and quadruples the INT8 throughput.[8] Turing doubles the FP16 throughput of its FPUs.[9]
  • TensorCores
With the Nvidia Volta series, TensorCores were introduced. They offer FP16×FP16+FP32 matrix-multiply-accumulate units, used to accelerate neural networks.[10] Turing's 2nd gen TensorCores add FP16, INT8, and INT4 optimized computation.[11]

Throughput Examples

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit [12]

   MAD 16
   MUL 16
   ADD 32
   Bit-shift 16
   Bitwise XOR 32

Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element [13]

   MAD 1/4
   MUL 1/4
   ADD 1
   Bit-shift 1
   Bitwise XOR 1

Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

Host-Device Latencies

One reason GPUs are not used as accelerators for chess engines is the host-device latency, a.k.a. kernel launch overhead. Nvidia and AMD have not published official numbers, but in practice there is a measurable latency for null-kernels of 5 microseconds [14] up to 100s of microseconds [15]. One solution to overcome this limitation is to batch tasks together and execute them in one run [16].

Deep Learning

GPUs were originally intended to process matrix multiplications for graphical transformations and rendering. Convolutional Neural Networks can have their operations interpreted as a series of matrix multiplications. GPUs are therefore a natural fit to parallelize and process CNNs.

GPUs traditionally operated on 32-bit floating point numbers. However, CNNs can make do with 16-bit half floats (FP16), or even 8-bit or 4-bit numbers. One thousand single-precision floats take up 4 kB of space, while one thousand FP16 values take up 2 kB. A half float uses half the memory, consumes only half the memory bandwidth, and takes only half the space in caches. As such, GPUs such as AMD Vega or NVidia Volta added support for FP16 processing.

Specialized units, such as NVidia Volta's "Tensor cores", can perform an entire 4x4 block of FP16 matrix multiplications in just one PTX assembly language statement. It is with these instructions that CNN operations are accelerated.

GPUs are much better suited than CPUs to implement and train Convolutional Neural Networks (CNNs), and were therefore largely responsible for the deep learning boom. The boom also affected game-playing programs that combine CNNs with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero in Go, Shogi, and Chess using TPUs, and by the open source projects Leela Zero for Go, headed by Gian-Carlo Pascutto, and its chess adaptation Leela Chess Zero.


Forum Posts


2011

Re: Possible Board Presentation and Move Generation for GPUs by Steffan Westcott, CCC, March 20, 2011


2018

Re: How good is the RTX 2080 Ti for Leela? by Ankan Banerjee, CCC, September 16, 2018


External Links


Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper


References

  1. Graphics processing unit - Wikimedia Commons
  2. CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability
  3. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  4. CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES
  5. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  6. AMD Vega White Paper
  7. Nvidia Turing White Paper
  8. Vega (GCN 5th generation) from Wikipedia
  9. AnandTech - Nvidia Turing Deep Dive page 4
  10. INSIDE VOLTA
  11. AnandTech - Nvidia Turing Deep Dive page 6
  12. CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
  13. AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
  14. host-device latencies? by Srdja Matovic, Nvidia CUDA ZONE, Feb 28, 2019
  15. host-device latencies? by Srdja Matovic, AMD Developer Community, Feb 28, 2019
  16. Re: GPU ANN, how to deal with host-device latencies? by Milos Stanisavljevic, CCC, May 06, 2018
  17. Jetson TK1 Embedded Development Kit | NVIDIA
  18. Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
  19. Yaron Shoham, Sivan Toledo (2002). Parallel Randomized Best-First Minimax Search. Artificial Intelligence, Vol. 137, Nos. 1-2
  20. Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence, Vol. 140, Nos. 1-2
  21. Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
  22. Parallel Thread Execution from Wikipedia
  23. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
  24. ankan-ban/perft_gpu · GitHub
  25. Tensor processing unit from Wikipedia
  26. GeForce 20 series from Wikipedia
  27. Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
  28. Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013
