Changes


GPU

1,059 bytes added, 13:47, 17 February 2019
update on instruction throughput section
* 64 bit Integer Performance
: Current GPU [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] are 32 bit wide, not 64 bit, and have to emulate 64 bit integer operations.<ref>[https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf AMD Vega White Paper]</ref> <ref>[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Nvidia Turing White Paper]</ref>
* Mixed Precision Support
: Newer architectures like Nvidia [https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] and AMD [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] have mixed precision support, which can double the [https://en.wikipedia.org/wiki/Half-precision_floating-point_format fp16] throughput and quadruple the int8 throughput, and can speed up neural networks significantly.
* TensorCores
: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] and Turing series, TensorCores were introduced. They offer fp16*fp16+fp32 matrix-multiplication units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf Inside Volta]</ref>

==Throughput Examples==
Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>

 MAD          16
 MUL          16
 ADD          32
 Bit-shift    16
 Bitwise XOR  32

Max theoretic ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>

 MAD          1/4
 MUL          1/4
 ADD          1
 Bit-shift    1
 Bitwise XOR  1

Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
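
The peak figures above follow from a single product: operations per clock per unit, times the number of units, times the clock rate. A minimal Python sketch of that arithmetic is given here; the function name <code>peak_gigaops</code> and the dictionary layout are illustrative assumptions, not part of any vendor tool, and the results are theoretical upper bounds rather than measured performance.

```python
from fractions import Fraction

def peak_gigaops(ops_per_clock, units, clock_mhz):
    """Theoretical peak in GigaOps/sec.

    ops_per_clock: operations per clock cycle per unit (may be a Fraction)
    units:         number of compute units or processing elements
    clock_mhz:     clock rate in MHz
    """
    # ops/clock/unit * units * MHz gives MegaOps/sec; divide by 1000 for Giga.
    return float(Fraction(ops_per_clock) * units * clock_mhz / 1000)

# Figures copied from the two throughput lists (32 bit integer ops per clock).
# Hypothetical data layout, chosen only for this example.
DEVICES = {
    "GeForce GTX 580": {
        "units": 16, "clock_mhz": 1544,
        "ops": {"MAD": 16, "MUL": 16, "ADD": 32,
                "Bit-shift": 16, "Bitwise XOR": 32},
    },
    "Radeon HD 7970": {
        "units": 2048, "clock_mhz": 925,
        "ops": {"MAD": Fraction(1, 4), "MUL": Fraction(1, 4), "ADD": 1,
                "Bit-shift": 1, "Bitwise XOR": 1},
    },
}

if __name__ == "__main__":
    for name, dev in DEVICES.items():
        for op, rate in dev["ops"].items():
            gops = peak_gigaops(rate, dev["units"], dev["clock_mhz"])
            print(f"{name:16s} {op:12s} {gops:9.3f} GigaOps/sec")
```

Using <code>Fraction</code> for the per-clock rate keeps entries such as 1/4 exact until the final conversion to a floating point number.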
=Deep Learning=