Changes

GPU

598 bytes added, 11:25, 8 August 2020

→‎Instruction Throughput: added FLOP

GPUs are used in [https://en.wikipedia.org/wiki/High-performance_computing HPC] environments because of their good [https://en.wikipedia.org/wiki/FLOP FLOP]/Watt ratio. The instruction throughput in general depends on the architecture (like Nvidia's [https://en.wikipedia.org/wiki/Tesla_%28microarchitecture%29 Tesla], [https://en.wikipedia.org/wiki/Fermi_%28microarchitecture%29 Fermi], [https://en.wikipedia.org/wiki/Kepler_%28microarchitecture%29 Kepler], [https://en.wikipedia.org/wiki/Maxwell_%28microarchitecture%29 Maxwell] or AMD's [https://en.wikipedia.org/wiki/TeraScale_%28microarchitecture%29 Terascale], [https://en.wikipedia.org/wiki/Graphics_Core_Next GCN], [https://en.wikipedia.org/wiki/AMD_RDNA_Architecture RDNA]), the brand (like Nvidia [https://en.wikipedia.org/wiki/GeForce GeForce], [https://en.wikipedia.org/wiki/Nvidia_Quadro Quadro], [https://en.wikipedia.org/wiki/Nvidia_Tesla Tesla] or AMD [https://en.wikipedia.org/wiki/Radeon Radeon], [https://en.wikipedia.org/wiki/Radeon_Pro Radeon Pro], [https://en.wikipedia.org/wiki/Radeon_Instinct Radeon Instinct]) and the specific model.

==Integer Instruction Throughput==* ~~32 bit Integer Performance~~ INT32

: The 32 bit integer performance can be architecture and operation depended less than 32 bit FLOP or 24 bit integer performance.

* ~~64 bit Integer Performance~~INT64

: Current GPU [https://en.wikipedia.org/wiki/Processor_register registers] and Vector-[https://en.wikipedia.org/wiki/Arithmetic_logic_unit ALUs] are 32 bit wide and have to emulate 64 bit integer operations.<ref>[https://en.wikichip.org/w/images/a/a1/vega-whitepaper.pdf AMD Vega White Paper]</ref> <ref>[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Nvidia Turing White Paper]</ref>

* INT8/Mixed Precision Support

: Newer architectures like Nvidia [https://en.wikipedia.org/wiki/Turing_(microarchitecture) Turing] and AMD [https://en.wikipedia.org/wiki/AMD_RX_Vega_series Vega] have mixed precision support. Vega doubles the [https://en.wikipedia.org/wiki/FP16 FP16] and quadruples the [https://en.wikipedia.org/wiki/Integer_(computer_science)#Common_integral_data_types INT8] throughput.<ref>[https://en.wikipedia.org/wiki/Graphics_Core_Next#fifth Vega (GCN 5th generation) from Wikipedia]</ref>Turing doubles the FP16 throughput of its [https://en.wikipedia.org/wiki/Floating-point_unit FPUs].<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/4 AnandTech - Nvidia Turing Deep Dive page 4]</ref>

==Floating Point Instruction Throughput== * ~~TensorCores~~FP32: Consumer GPU performance is measured usually in single precision (32 bit) floating point FMA, fused-multiply-add, throughput. * FP64: Consumer GPUs have in general a lower ratio (FP32:FP64) for double-precision (64 bit) floating point operations than server brand GPUs, like 4:1 down to 32:1 compared to 2:1 to 4:1. * FP16: Newer GPGPU architectures offer half-precision (16 bit) floating point operation throughput with an FP32:FP16 ratio of 1:2. Older architectures migth not support FP16 at all, at the same rate as FP32, or at very low rates. ==Tensors==

: With Nvidia [https://en.wikipedia.org/wiki/Volta_(microarchitecture) Volta] series TensorCores were introduced. They offer fp16*fp16+fp32, matrix-multiplication-accumulate-units, used to accelerate neural networks.<ref>[https://on-demand.gputechconf.com/gtc/2017/presentation/s7798-luke-durant-inside-volta.pdf INSIDE VOLTA]</ref> Turing's 2nd gen TensorCores add FP16, INT8, INT4 optimized computation.<ref>[https://www.anandtech.com/show/13282/nvidia-turing-architecture-deep-dive/6 AnandTech - Nvidia Turing Deep Dive page 6]</ref> Amperes's 3rd gen adds support for bfloat16, TensorFloat-32 (TF32), FP64 and sparsity acceleration.<ref>[https://en.wikipedia.org/wiki/Ampere_(microarchitecture)#Details Wikipedia - Ampere microarchitecture]</ref>

Bitwise XOR 1

Max theoretic ADD operation throughput: 1 Op * 2048 PEs * 925 MHz = 1894.4 GigaOps/sec

=Host-Device Latencies=

Smatovic

422

edits

Changes

GPU

Navigation menu

Personal tools

Namespaces

Variants

Views

More

Search

Navigation

Tools