GPU

57 bytes removed, 17:06, 9 August 2019
This highlights the fundamental trade off of the GPU platform. GPUs have many threads of execution, but they are forced to execute with their warps or wavefronts. In complicated loops or trees of if-statements, this thread divergence problem can cause your code to potentially leave many hardware threads idle.
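The cost of divergence can be sketched with a toy lockstep model (a hypothetical simulation, not any vendor's real scheduler): when lanes of a warp disagree on a branch, the hardware runs each side serially with the non-participating lanes masked off, so a fully divergent if/else pays for both bodies.

```python
# Toy model of SIMD branch divergence (illustrative only; real GPU
# schedulers and masking hardware are far more sophisticated).
WARP_SIZE = 32

def issue_slots(branch_bodies, taken_mask):
    """Count serialized issue slots a warp needs for an if/else.

    branch_bodies: (instructions_in_then, instructions_in_else)
    taken_mask: list of WARP_SIZE booleans, True = lane takes 'then'.
    """
    then_len, else_len = branch_bodies
    slots = 0
    if any(taken_mask):        # at least one lane executes the 'then' side
        slots += then_len
    if not all(taken_mask):    # at least one lane executes the 'else' side
        slots += else_len
    return slots

# All 32 lanes agree: only one side of the branch is executed.
uniform = [True] * WARP_SIZE
# A single lane disagrees: both sides are executed back to back.
divergent = [True] * (WARP_SIZE - 1) + [False]

print(issue_slots((10, 10), uniform))    # 10 slots
print(issue_slots((10, 10), divergent))  # 20 slots, many lanes idle each pass
```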
== Blocks and Workgroups ==
The GPU hardware will execute entire warps or wavefronts at a time. Anything less than 32 threads will force some SIMD lanes to idle. As such, high-performance programmers should try to schedule as many full warps or wavefronts as possible.

Programmers can group warps or wavefronts together into larger clusters, called CUDA Blocks or OpenCL Workgroups. 1024 threads can work together on a modern GPU Compute Unit (AMD) or Symmetric Multiprocessor (NVidia), sharing L1 cache, shared memory, and other resources. Because of the tight coupling of L1 cache and shared memory, these 1024 threads can communicate extremely efficiently. Case in point: both NVidia PTX and AMD GCN implement thread barriers as a single assembly-language instruction, as long as those threads are within the same workgroup. Atomic operations, memory fences, and other synchronization primitives are extremely fast and well optimized in these cases.
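The barrier-synchronized, shared-memory style of cooperation within a workgroup can be mimicked on the CPU with Python threads, using `threading.Barrier` as a stand-in for a GPU barrier instruction (a sketch of the tree-reduction pattern, not real GPU code):

```python
import threading

GROUP_SIZE = 8                            # stand-in for a small workgroup
shared = list(range(GROUP_SIZE))          # stand-in for shared/local memory
barrier = threading.Barrier(GROUP_SIZE)   # stand-in for a thread barrier

def reduce_thread(tid):
    # Classic tree reduction: halve the number of active threads each step.
    stride = GROUP_SIZE // 2
    while stride > 0:
        if tid < stride:
            shared[tid] += shared[tid + stride]
        barrier.wait()   # every thread syncs before the next step
        stride //= 2

threads = [threading.Thread(target=reduce_thread, args=(t,))
           for t in range(GROUP_SIZE)]
for t in threads: t.start()
for t in threads: t.join()

print(shared[0])  # 0+1+...+7 = 28
```

On a real GPU this pattern is cheap precisely because, as noted above, the barrier is a single instruction among threads of the same workgroup; here it only illustrates the dataflow.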
== Grids and NDRange ==
Workgroups are not the end of scaling: many blocks can be specified in a CUDA Grid, while many workgroups operate over an OpenCL NDRange. AMD Vega Compute Units (CUs) can schedule 40 wavefronts per CU (although each only physically executes 4 wavefronts concurrently), and 64 CUs are available on a Vega64 GPU. AMD Vega64 (Vega) Summary: 64 Threads per Wavefront, 1 to 16 Wavefronts per Workgroup. With 64 CUs each supporting 40 wavefronts, a total of 2560 wavefronts (163,840 threads) can be loaded onto an AMD Vega64.
NVidia has a similar mechanism: the underlying hardware supports running many blocks or workgroups in parallel, across different compute units. NVidia Symmetric Multiprocessors (SMs) can schedule 32 warps per SM (although each can only physically execute 1 warp at a time), and 40 SMs are available on an RTX 2070. NVidia RTX 2070 (Turing) Summary: 32 Threads per Warp, 1 to 32 Warps per Block. With 40 SMs each supporting 32 warps, a total of 1280 warps (40,960 threads) can be scheduled per RTX 2070. The hardware scheduler can fit many blocks and workgroups per compute unit; the exact number depends on the amount of registers, memory, and wavefronts a particular workgroup uses.
The challenge of GPU compute languages is to provide the programmer the flexibility to take advantage of memory optimizations at the CUDA Block or OpenCL Workgroup level (~1024 threads), while still being able to specify the tens-of-thousands of physical threads working on the typical GPU. Grids and NDRanges may operate in parallel, or may be traversed sequentially if the GPU doesn't have enough parallel resources.
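The occupancy figures quoted above reduce to simple arithmetic, sketched here with the numbers from this page:

```python
# Maximum resident threads = compute units x groups/unit x threads/group.
def max_resident_threads(compute_units, groups_per_unit, threads_per_group):
    return compute_units * groups_per_unit * threads_per_group

# AMD Vega64: 64 CUs x 40 wavefronts/CU x 64 threads/wavefront
vega64 = max_resident_threads(64, 40, 64)
print(vega64)    # 163840 threads (2560 wavefronts)

# NVidia RTX 2070: 40 SMs x 32 warps/SM x 32 threads/warp
rtx2070 = max_resident_threads(40, 32, 32)
print(rtx2070)   # 40960 threads (1280 warps)
```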
= Architectures and Physical Hardware =
== NVidia ==
NVidia's consumer line of cards is Geforce, branded with RTX or GTX labels. NVidia's professional line of cards is Quadro. Finally, Tesla cards constitute NVidia's server line.
NVidia's "Titan" line of Geforce cards use consumer drivers, but internally use professional or server class chips. As such, the Titan line can cost anywhere from $1000 to $3000 per card.
=== Turing Architecture ===
 
[https://www.nvidia.com/content/dam/en-zz/Solutions/design-visualization/technologies/turing-architecture/NVIDIA-Turing-Architecture-Whitepaper.pdf Architectural Whitepaper]
Turing cards were first released in 2018. They are the first consumer cards to launch with RTX, or raytracing, features. RTX instructions accelerate the traversal of an AABB (axis-aligned bounding box) tree to discover ray intersections with lists of objects. These are also the first consumer cards to launch with Tensor cores: 4x4 FP16 matrix-multiplication instructions to accelerate convolutional neural networks.
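The basic Tensor core operation is a fused multiply-accumulate over small matrix tiles, D = A*B + C on 4x4 operands. A plain-Python sketch of that operation (illustrative only; real Tensor cores take FP16 inputs with FP16/FP32 accumulation and execute per warp):

```python
def mma_4x4(A, B, C):
    """D = A @ B + C for 4x4 matrices, the shape of one tensor-core op."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j]
             for j in range(4)]
            for i in range(4)]

I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # identity
Z = [[0] * 4 for _ in range(4)]                                 # zeros
A = [[i + j for j in range(4)] for i in range(4)]

print(mma_4x4(A, I, Z) == A)  # A * I + 0 == A -> True
```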
=== Volta Architecture ===
 
[https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf Architecture Whitepaper]
Volta cards were first released in 2017. Only Tesla and Titan cards were produced in this generation, aiming only for the most expensive end of the market. They were the first cards to launch with Tensor cores, supporting 4x4 FP16 matrix multiplications to accelerate convolutional neural networks.
=== Pascal Architecture ===
 
[https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf Architecture Whitepaper]
Pascal cards were first released in 2016.
== AMD ==
=== RDNA 1.0 ===
 
[https://gpuopen.com/wp-content/uploads/2019/08/RDNA_Architecture_public.pdf Architecture Slide Deck]
RDNA cards were first released in 2019. RDNA is a major change for AMD cards: the underlying hardware supports both Wave32 and Wave64 gangs of threads.
* 5700
=== Vega GCN 5th gen ===
 
[https://www.techpowerup.com/gpu-specs/docs/amd-vega-architecture.pdf Architecture Whitepaper]
Vega cards were first released in 2017.
* Vega56
=== Polaris GCN 4th gen ===
 
[https://www.amd.com/system/files/documents/polaris-whitepaper.pdf Architecture Whitepaper]
* RX 580
