GPU

The Nvidia [https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#GeForce_500_Series GeForce GTX 580], for example, runs 32 threads in one Warp, up to 24576 threads in total, spread over 16 compute units with a total of 512 cores. <ref>CUDA C Programming Guide v7.0, Appendix G. COMPUTE CAPABILITIES, Table 12 Technical Specifications per Compute Capability</ref>
The AMD [https://en.wikipedia.org/wiki/Radeon_HD_7000_Series#Radeon_HD_7900 Radeon HD 7970] runs 64 threads in one Wavefront, up to 81920 threads in total, spread over 32 compute units with a total of 2048 cores. <ref>AMD Accelerated Parallel Processing OpenCL Programming Guide rev2.7, Appendix D Device Parameters, Table D.1 Parameters for 7xxx Devices</ref> In practice, register and shared memory usage limits the total number of threads.
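To make the thread grouping concrete, here is a minimal CUDA sketch (an illustration of the common vector-add launch pattern, not code from the cited guides): the launch configuration spreads the threads over blocks, and the hardware executes each block in Warps of 32 threads on Nvidia devices (Wavefronts of 64 on AMD GCN).
<pre>
// Minimal CUDA sketch: launch enough threads that the hardware groups
// them into Warps and schedules them across the compute units.
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 24576;              // e.g. the GTX 580 resident-thread maximum
    const int threadsPerBlock = 256;  // 8 Warps of 32 threads per block
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float *a, *b, *c;                 // buffers left uninitialized for brevity
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMalloc(&c, n * sizeof(float));

    vecAdd<<<blocks, threadsPerBlock>>>(a, b, c, n);
    cudaDeviceSynchronize();

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
</pre>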
=Memory=
* 3 GiB to 6 GiB global memory
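The global memory a device actually offers can be queried at runtime; a minimal sketch using the standard CUDA runtime call cudaGetDeviceProperties:
<pre>
// Query the global memory size of device 0 via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %.1f GiB global memory\n",
           prop.name, prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}
</pre>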
=Integer Instruction Throughput=
GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 Terascale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/AMD_FirePro FirePro], [https://en.wikipedia.org/wiki/AMD_FireStream FireStream]) and the specific model.
* 32 bit Integer Performance: The 32 bit integer performance can be, depending on the architecture, less than the 32 bit FLOP or 24 bit integer performance.

As an example, here is the 32 bit integer performance of the Nvidia GeForce GTX 580 (Fermi, CC 2.0) and the AMD Radeon HD 7970 (GCN 1.0):

Nvidia GeForce GTX 580 - 32 bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1 Arithmetic Instructions</ref>
* MAD 16
* MUL 16
* ADD 32
* Bit-shift 16
* Bitwise XOR 32
Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 - 32 bit integer operations/clock cycle per processing element <ref>AMD OpenCL Programming Optimization Guide rev 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>
* MAD 1/4
* MUL 1/4
* ADD 1
* Bit-shift 1
* Bitwise XOR 1
Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

* 64 bit Integer Performance: Current GPU architectures have no native 64 bit ALUs and have to emulate 64 bit integer operations (see the sketch below).
* Mixed Precision Support: Newer architectures like Nvidia [https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] and AMD [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] have mixed precision support, which doubles the [https://en.wikipedia.org/wiki/Half-precision_floating-point_format fp16] throughput and quadruples the int8 throughput; this can boost neural networks significantly.
* TensorCores: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] and Turing series, TensorCores were introduced: fp16*fp16+fp32 [https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf matrix-matrix-multiplication units] used to accelerate neural networks.
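As a sketch of the 64 bit emulation mentioned in the list above: a 64 bit addition can be assembled from two 32 bit additions plus a carry. The helper name add64 is hypothetical and for illustration only; real GPUs typically expose add-with-carry instructions for this.
<pre>
// Illustrative CUDA device function (not vendor code): emulate a
// 64 bit unsigned add using two 32 bit halves and an explicit carry.
__device__ void add64(unsigned int a_lo, unsigned int a_hi,
                      unsigned int b_lo, unsigned int b_hi,
                      unsigned int *r_lo, unsigned int *r_hi) {
    unsigned int lo = a_lo + b_lo;
    // unsigned wrap-around on the low half signals a carry
    unsigned int carry = (lo < a_lo) ? 1u : 0u;
    *r_lo = lo;
    *r_hi = a_hi + b_hi + carry;
}
</pre>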
=Deep Learning=
GPUs are much better suited than CPUs to implement and train [[Neural Networks#Convolutional|Convolutional Neural Networks]] (CNN), and were therefore also responsible for the [[Deep Learning|deep learning]] boom.
They also affect game playing programs that combine CNNs with [[Monte-Carlo Tree Search|MCTS]], as pioneered by [[Google]] [[DeepMind|DeepMind's]] [[AlphaGo]] and [[AlphaZero]] entities in [[Go]], [[Shogi]] and [[Chess]] using [https://en.wikipedia.org/wiki/Tensor_processing_unit TPUs], and by the open source projects [[Leela Zero]], headed by [[Gian-Carlo Pascutto]], for [[Go]] and its [[Leela Chess Zero]] adaptation.
=See also=