Changes

GPU

1,059 bytes added, 13:47, 17 February 2019

update on instruction throughput section

* 64 bit Integer Performance

: Current GPU [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] are 32 bit wide and have to emulate 64 bit integer operations.<ref>[https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf AMD Vega White Paper]</ref> <ref>[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Nvidia Turing White Paper]</ref>
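To illustrate what emulating a 64 bit operation on 32 bit hardware involves, here is a minimal sketch of a 64 bit addition built from two 32 bit additions plus a carry. It is written in plain Python with explicit masking to mimic 32 bit registers. The helper names (<code>add64</code>, <code>split64</code>, <code>join64</code>) are hypothetical and do not come from any vendor toolkit, and real compilers may emit a dedicated add-with-carry instruction instead.

```python
MASK32 = 0xFFFFFFFF  # keeps a Python int inside one 32 bit "register"


def split64(x):
    """Split a 64 bit value into (low, high) 32 bit halves."""
    return x & MASK32, (x >> 32) & MASK32


def join64(lo, hi):
    """Recombine (low, high) 32 bit halves into one 64 bit value."""
    return (hi << 32) | lo


def add64(a_lo, a_hi, b_lo, b_hi):
    """64 bit add (wrapping at 2**64) using only 32 bit wide operations."""
    lo = (a_lo + b_lo) & MASK32
    # If the low word wrapped around, it is now smaller than either input.
    carry = 1 if lo < a_lo else 0
    hi = (a_hi + b_hi + carry) & MASK32
    return lo, hi


if __name__ == "__main__":
    a, b = 0x00000000FFFFFFFF, 0x0000000000000001
    lo, hi = add64(*split64(a), *split64(b))
    print(hex(join64(lo, hi)))
```

One 64 bit add therefore costs at least two dependent 32 bit operations, which is one reason 64 bit integer throughput sits below the 32 bit figures.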

* Mixed Precision Support

: Newer architectures like Nvidia [https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] and AMD [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] have mixed precision support, which can double the [https://en.wikipedia.org/wiki/Half-precision_floating-point_format fp16] throughput or quadruple the int8 throughput, and can speed up neural networks significantly.
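One way to picture the possible fourfold int8 gain: four 8 bit values fit into one 32 bit register, so a single packed instruction can perform four multiply-adds at once. The sketch below models such a packed dot product in plain Python. The function names are hypothetical, the lane order is an assumption, and the code illustrates the data layout only, not any specific hardware instruction.

```python
def pack_int8x4(values):
    """Pack four signed 8 bit integers into one 32 bit word (lane 0 = low byte)."""
    assert len(values) == 4
    word = 0
    for lane, v in enumerate(values):
        assert -128 <= v <= 127
        word |= (v & 0xFF) << (8 * lane)
    return word


def unpack_lane(word, lane):
    """Extract one lane and sign-extend it back to a Python int."""
    byte = (word >> (8 * lane)) & 0xFF
    return byte - 256 if byte >= 128 else byte


def dot4_accumulate(a_word, b_word, acc):
    """Four int8 multiply-adds on two packed words, added to an accumulator."""
    for lane in range(4):
        acc += unpack_lane(a_word, lane) * unpack_lane(b_word, lane)
    return acc


if __name__ == "__main__":
    a = pack_int8x4([1, -2, 3, -4])
    b = pack_int8x4([5, 6, -7, 8])
    print(dot4_accumulate(a, b, 10))
```

The accumulator is kept wider than the inputs, mirroring how low precision products are typically summed at higher precision.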

* TensorCores

: With the Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series, TensorCores were introduced. They offer fp16*fp16+fp32 matrix-matrix-multiplication units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf]</ref>

==Throughput Examples==

Nvidia GeForce GTX 580 (Fermi, CC 2.0) - 32 bit integer operations/clock cycle per compute unit <ref>CUDA C Programming Guide v7.0, Chapter 5.4.1. Arithmetic Instructions</ref>

: MAD 16
: MUL 16
: ADD 32
: Bit-shift 16
: Bitwise XOR 32

Max theoretical ADD operation throughput: 32 Ops * 16 CUs * 1544 MHz = 790.528 GigaOps/sec

AMD Radeon HD 7970 (GCN 1.0) - 32 bit integer operations/clock cycle per processing element <ref>AMD_OpenCL_Programming_Optimization_Guide.pdf 3.0beta, Chapter 2.7.1 Instruction Bandwidths</ref>

: MAD 1/4
: MUL 1/4
: ADD 1
: Bit-shift 1
: Bitwise XOR 1

Max theoretical ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec
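The peak figures above follow one formula: operations per clock per unit, times the number of units, times the clock rate. The following sketch reproduces the two ADD results and applies the same formula to the MUL rates from the tables. The function name <code>peak_gigaops</code> is hypothetical, and the results are theoretical upper bounds, not measured performance.

```python
def peak_gigaops(ops_per_clock_per_unit, units, clock_mhz):
    """Theoretical peak in GigaOps/sec: ops/clock/unit * units * MHz / 1000."""
    return ops_per_clock_per_unit * units * clock_mhz / 1000.0


# Nvidia GeForce GTX 580: 16 compute units at 1544 MHz
gtx580_add = peak_gigaops(32, 16, 1544)    # ADD: 32 ops/clock per CU
gtx580_mul = peak_gigaops(16, 16, 1544)    # MUL: 16 ops/clock per CU

# AMD Radeon HD 7970: 2048 processing elements at 925 MHz
hd7970_add = peak_gigaops(1, 2048, 925)    # ADD: 1 op/clock per PE
hd7970_mul = peak_gigaops(0.25, 2048, 925) # MUL: 1/4 op/clock per PE

if __name__ == "__main__":
    for name, value in [
        ("GTX 580 ADD", gtx580_add),
        ("GTX 580 MUL", gtx580_mul),
        ("HD 7970 ADD", hd7970_add),
        ("HD 7970 MUL", hd7970_mul),
    ]:
        print(f"{name}: {value:.3f} GigaOps/sec")
```

Note that the two vendors count per different units (compute unit versus processing element), so only the final products are directly comparable.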

=Deep Learning=

Smatovic, 422 edits
