GPU

From Chessprogramming wiki
Jump to: navigation, search

Home * Hardware * GPU

GPU (Graphics Processing Unit),
a specialized processor primarily intended for graphic cards to rapidly manipulate and alter memory for fast image processing, usually but not necessarily mapped to a framebuffer of a display. GPUs have more raw computing power than general purpose CPUs but need a limited, specialized and massive parallelized way of programming, not that conform with the serial nature of alpha-beta if it is about a massive parallel search in chess. Instead, Best-first Monte-Carlo Tree Search (MCTS) approaches in conjunction with deep learning proved a successful way to go on GPU architectures.

GPGPU

There are various frameworks for GPGPU, General Purpose computing on Graphics Processing Unit. Despite language wrappers and mobile devices with special APIs, there are in main three ways to make use of GPGPU.

Mapping to an API

Native Compilers

Intermediate Languages

Inside

Modern GPUs consist of up to hundreds of SIMD or Vector units, coupled to compute units. Each compute unit processes multiple Warps (Nvidia term) resp. Wavefronts (AMD term) in SIMT fashion. Each Warp resp. Wavefront runs n (32 or 64) threads simultaneously.

The Nvidia GeForce GTX 580, for example, is able to run 32 threads in one Warp, in total of 24576 threads, spread on 16 compute units with a total of 512 cores. [2] The AMD Radeon HD 7970 is able to run 64 threads in one Wavefront, in total of 81920 threads, spread on 32 compute units with a total of 2048 cores. [3]. In real life the register and shared memory size limits this total.

Memory

The memory hierarchy of an GPU consists in main of private memory (registers accessed by an single thread resp. work-item), local memory (shared by threads of an block resp. work-items of an work-group ), constant memory, different types of cache and global memory. Size, latency and bandwidth vary between vendors and architectures.

Here the data for the Nvidia GeForce GTX 580 (Fermi) as an example: [4]

  • 128 KiB private memory per compute unit
  • 48 KiB (16 KiB) local memory per compute unit (configurable)
  • 64 KiB constant memory
  • 8 KiB constant cache per compute unit
  • 16 KiB (48 KiB) L1 cache per compute unit (configurable)
  • 768 KiB L2 cache
  • 1.5 GiB to 3 GiB global memory

Here the data for the AMD Radeon HD 7970 (GCN) as an example: [5]

  • 256 KiB private memory per compute unit
  • 64 KiB local memory per compute unit
  • 64 KiB constant memory
  • 16 KiB constant cache per four compute units
  • 16 KiB L1 cache per compute unit
  • 768 KiB L2 cache
  • 3 GiB to 6 GiB global memory

Integer Throughput

GPUs are used in HPC environments because of their good FLOP/Watt ratio. The 32 bit integer performance can be less than 32 bit FLOP or 24 bit integer performance. The instruction throughput depends on the architecture (like Nvidia's Tesla, Fermi, Kepler, Maxwell or AMD's Terascale, GCN), the brand (like Nvidia GeForce, Quadro, Tesla or AMD Radeon, FirePro, FireStream) and the specific model.

As an example, here the 32 bit integer performance of the Nvidia GeForce GTX 580 (Fermi, CC 2.0) and AMD Radeon HD 7970 (GCN 1.0):

Nvidia GeForce GTX 580 - 32 bit integer operations/clock cycle per compute unit [6]

  • MAD 16
  • MUL 16
  • ADD 32
  • Bit-shift 16
  • Bitwise XOR 32

Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 - 32 bit integer operations/clock cycle per processing element [7]

  • MAD 1/4
  • MUL 1/4
  • ADD 1
  • Bit-shift 1
  • Bitwise XOR 1

Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

Deep Learning

GPUs are much more suited than CPUs to implement and train Convolutional Neural Networks (CNN), and were therefore also responsible for the deep learning boom, also affecting game playing programs combining CNN with MCTS, as pioneered by Google DeepMind's AlphaGo and AlphaZero entities in Go, Shogi and Chess using TPUs, and the open source projects Leela Zero headed by Gian-Carlo Pascutto for Go and its Leela Chess Zero adaption.

See also

Publications

2009

2010...

2015 ...

Forum Posts

2005 ...

2010 ...

2015 ...

External Links

OpenCL

CUDA

 :

Deep Learning

Game Programming

GitHub - gcp/leela-zero: Go engine with no human-provided knowledge, modeled after the AlphaGo Zero paper

Chess Programming

References

  1. Graphics processing unit - Wikimedia Commons
  2. CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability
  3. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  4. CUDA C Programming Guide v7.0, Appendix G.COMPUTE CAPABILITIES
  5. AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices
  6. CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions
  7. AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths
  8. Jetson TK1 Embedded Development Kit | NVIDIA
  9. Jetson GPU architecture by Dann Corbit, CCC, October 18, 2016
  10. Yaron Shoham, Sivan Toledo (2001). Parallel randomized best-first minimax search. School of Computer Science, Tel-Aviv University, pdf
  11. Alberto Maria Segre, Sean Forman, Giovanni Resta, Andrew Wildenberg (2002). Nagging: A Scalable Fault-Tolerant Paradigm for Distributed Search. Artificial Intelligence 140, pdf, pdf
  12. Tesla K20 GPU Compute Processor Specifications Released | techPowerUp
  13. Parallel Thread Execution from Wikipedia
  14. NVIDIA Compute PTX: Parallel Thread Execution, ISA Version 1.4, March 31, 2009, pdf
  15. ankan-ban/perft_gpu · GitHub
  16. Tensor processing unit from Wikipedia
  17. Re: Generate EGTB with graphics cards? by Graham Jones, CCC, January 01, 2019
  18. Fast perft on GPU (upto 20 Billion nps w/o hashing) by Ankan Banerjee, CCC, June 22, 2013

Up one Level