GPU

The Nvidia [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_Series GeForce GTX 580], for example, runs 32 threads in one Warp, up to 24576 threads in total, spread over 16 compute units with a total of 512 cores. <ref>CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability</ref>
The AMD [https://en.wikipedia.org/wiki/Radeon_HD_7000_Series#Radeon_HD_7900 Radeon HD 7970] runs 64 threads in one Wavefront, up to 81920 threads in total, spread over 32 compute units with a total of 2048 cores. <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref> In practice, register and shared memory usage limits the total number of threads.
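To make the thread grouping concrete, here is a minimal CUDA sketch (an illustration of the common vector-add launch pattern, not code from the cited guides): the launch configuration spreads the threads over blocks, and the hardware executes each block in Warps of 32 threads on Nvidia devices (Wavefronts of 64 on AMD GCN).
<pre>
// Minimal CUDA sketch: launch enough threads that the hardware groups
// them into Warps and schedules them across the compute units.
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 24576;              // e.g. the GTX 580 resident-thread maximum
    const int threadsPerBlock = 256;  // 8 Warps of 32 threads per block
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *a, *b, *c;                 // buffers left uninitialized for brevity
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
</pre>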
=Memory=
* 3 GiB to 6 GiB global memory
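The global memory a device actually offers can be queried at runtime; a minimal sketch using the standard CUDA runtime call cudaGetDeviceProperties:
<pre>
// Query the global memory size of device 0 via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %.1f GiB global memory\n",
           prop.name, prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
</pre>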
=Integer Instruction Throughput=
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 Terascale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/AMD_FirePro FirePro], [https://en.wikipedia.org/wiki/AMD_FireStream FireStream]) and the specific model.
* 32 bit Integer Performance: The 32 bit integer performance can be, depending on the architecture, less than the 32 bit FLOP or 24 bit integer performance.

As an example, here is the 32 bit integer performance of the Nvidia GeForce GTX 580 (Fermi, CC 2.0) and the AMD Radeon HD 7970 (GCN 1.0):

Nvidia GeForce GTX 580 - 32 bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1 Arithmetic Instructions</ref>
* MAD 16
* MUL 16
* ADD 32
* Bit-shift 16
* Bitwise XOR 32
Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 - 32 bit integer operations/clock cycle per processing element <ref>AMD OpenCL Programming Optimization Guide rev 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
* MAD 1/4
* MUL 1/4
* ADD 1
* Bit-shift 1
* Bitwise XOR 1
Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

* 64 bit Integer Performance: Current GPU architectures have no native 64 bit ALUs and have to emulate 64 bit integer operations (see the sketch below).
* Mixed Precision Support: Newer architectures like Nvidia [https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] and AMD [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] have mixed precision support, which doubles the [https://en.wikipedia.org/wiki/Half-precision_floating-point_format fp16] throughput and quadruples the int8 throughput; this can boost neural networks significantly.
* TensorCores: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] and Turing series, TensorCores were introduced: fp16*fp16+fp32 [https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf matrix-matrix-multiplication units] used to accelerate neural networks.
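As a sketch of the 64 bit emulation mentioned in the list above: a 64 bit addition can be assembled from two 32 bit additions plus a carry. The helper name add64 is hypothetical and for illustration only; real GPUs typically expose add-with-carry instructions for this.
<pre>
// Illustrative CUDA device function (not vendor code): emulate a
// 64 bit unsigned add using two 32 bit halves and an explicit carry.
__device__ void add64(unsigned int a_lo, unsigned int a_hi,
                      unsigned int b_lo, unsigned int b_hi,
                      unsigned int *r_lo, unsigned int *r_hi) {
    unsigned int lo = a_lo + b_lo;
    // unsigned wrap-around on the low half signals a carry
    unsigned int carry = (lo < a_lo) ? 1u : 0u;
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}
</pre>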
=Deep Learning=
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom.
They also affect game playing programs that combine CNNs with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source projects [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]] and its [[Leela Chess Zero]] adaptation.
=See also=