Advanced Vector Extensions (AVX) is a 256-bit extension to the x86 and x86-64 SSE, SSE2, SSE3, SSSE3, and SSE4 SIMD instruction sets, announced by Intel in March 2008 and first released in January 2011 with Intel's Sandy Bridge architecture. With the Bulldozer microarchitecture, AVX is also available on AMD processors, along with AMD's own XOP extension on Bulldozer only.
AVX supports 256-bit wide SIMD registers (YMM0-YMM7 in operating modes that are 32-bit or less, YMM0-YMM15 in 64-bit mode), each of which can hold a floating-point vector of either 8 floats or 4 doubles. The lower 128 bits of the YMM registers are aliased to the respective 128-bit XMM registers. AVX employs an instruction encoding scheme using a new VEX prefix, allowing a three-operand SIMD instruction format in which the destination register can be distinct from the two source operands.
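The register layout described above can be sketched with C intrinsics instead of assembly. This is only an illustration: the function names (`add8_floats`, `add4_doubles`, `low_half`) are hypothetical, and the `__attribute__((target("avx")))` annotation assumes GCC or Clang running on an AVX-capable CPU.

```c
#include <assert.h>
#include <immintrin.h>

/* Assumption: GCC/Clang; target("avx") enables AVX for these functions only. */

/* One YMM register holds 8 single-precision floats. */
__attribute__((target("avx")))
void add8_floats(const float *a, const float *b, float *out)
{
    assert(a && b && out);
    __m256 va = _mm256_loadu_ps(a);          /* unaligned 32-byte load */
    __m256 vb = _mm256_loadu_ps(b);
    _mm256_storeu_ps(out, _mm256_add_ps(va, vb));
}

/* The same register width holds 4 double-precision floats. */
__attribute__((target("avx")))
void add4_doubles(const double *a, const double *b, double *out)
{
    assert(a && b && out);
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    _mm256_storeu_pd(out, _mm256_add_pd(va, vb));
}

/* The low 128 bits of a YMM register are the matching XMM register,
   so viewing them is a cast rather than a data movement. */
__attribute__((target("avx")))
void low_half(const float *in8, float *out4)
{
    assert(in8 && out4);
    __m256 v  = _mm256_loadu_ps(in8);
    __m128 lo = _mm256_castps256_ps128(v);   /* elements 0..3 */
    _mm_storeu_ps(out4, lo);
}
```

The functions take pointers rather than `__m256` values so that callers compiled without AVX enabled can still call them.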
Advantages of AVX
AVX introduces expanded 256-bit versions of floating point instructions, which are typically not useful for chess programming. Though it does not yet expand the integer instructions to 256-bit, AVX does provide VEX-encoded versions of existing SSE 128-bit instructions. For instance, bitwise logical and:
|SSE2||pand xmm1, xmm2/m128||xmm1 := xmm1 & xmm2|
|AVX||vpand xmm1, xmm2, xmm3/m128||xmm1 := xmm2 & xmm3|
Even without 256-bit integer operations, AVX offers some benefits. The three-operand form can be used to eliminate many "move" instructions, which otherwise can take up significant execution resources.
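A hedged illustration of where the three-operand form matters: in the sketch below both inputs stay live after the AND, because the OR needs them again. The function name `and_or_128` is hypothetical and the target attribute assumes GCC or Clang. With the two-operand SSE2 encoding a compiler generally has to copy one source before `pand` overwrites it, whereas with the VEX encoding it can write the result to a third register. The exact instructions emitted depend on the compiler.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Computes a & b and a | b on two packed 64-bit bitboards per register.
   Assumption: GCC/Clang; target("avx") requests VEX-encoded 128-bit code. */
__attribute__((target("avx")))
void and_or_128(const uint64_t a[2], const uint64_t b[2],
                uint64_t and_out[2], uint64_t or_out[2])
{
    assert(a && b && and_out && or_out);
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);

    /* va and vb are both reused below, so neither may be destroyed here.
       VEX form: vpand dst, src1, src2 (no extra register copy needed). */
    __m128i vand = _mm_and_si128(va, vb);
    __m128i vor  = _mm_or_si128(va, vb);

    _mm_storeu_si128((__m128i *)and_out, vand);
    _mm_storeu_si128((__m128i *)or_out, vor);
}
```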
Additionally, when using xmm registers numbered 8 and higher, the AVX encoding of an SSE instruction is often one byte smaller, due to the more compact nature of the VEX encoding scheme. Finally, the ymm registers offer double the register space: even if the top halves aren't used for computation, they might be suitable as temporary storage space, avoiding the use of a scratch buffer or the stack.
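As a rough sketch of the temporary-storage idea, the intrinsics below park a second 128-bit value in the upper half of a 256-bit value and read it back later. The name `stash_and_restore` is hypothetical, the target attribute assumes GCC or Clang, and whether the value really stays in a register rather than being spilled is up to the compiler.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang with per-function AVX enablement. */
__attribute__((target("avx")))
void stash_and_restore(const uint64_t low_in[2], const uint64_t spare_in[2],
                       uint64_t low_out[2], uint64_t spare_out[2])
{
    assert(low_in && spare_in && low_out && spare_out);
    __m128i low   = _mm_loadu_si128((const __m128i *)low_in);
    __m128i spare = _mm_loadu_si128((const __m128i *)spare_in);

    /* Keep `low` in bits 0..127 and park `spare` in bits 128..255
       (maps to vinsertf128). */
    __m256i packed =
        _mm256_insertf128_si256(_mm256_castsi128_si256(low), spare, 1);

    /* Read both halves back; the upper half maps to vextractf128. */
    _mm_storeu_si128((__m128i *)low_out, _mm256_castsi256_si128(packed));
    _mm_storeu_si128((__m128i *)spare_out,
                     _mm256_extractf128_si256(packed, 1));
}
```

In hand-written assembly the parked value survives only as long as no VEX-encoded 128-bit instruction writes the same register, since that encoding clears the upper half.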
While AVX can do 32-byte loads and stores, no CPU (as of Sandy Bridge) actually has a 32-byte load or store unit; such loads and stores are done simply by doing two separate 16-byte memory operations internally. Thus, AVX is no faster for memory operations (yet).
AVX on non-Intel CPUs
AMD's Bulldozer does not benefit from 3-operand instructions in the same way that Intel's AVX-supporting CPU, Sandy Bridge, does. Bulldozer has a "move elimination" feature that resolves SIMD move instructions separately from the main execution pipeline. On Bulldozer, 3-operand support can still help reduce code size and reduce dispatch bottlenecks, but usually does not help performance much.
Additionally, Bulldozer only has a 128-bit floating-point execution unit, so 256-bit floating point operations are no faster than 128-bit ones, and sometimes actually slower. Nevertheless, some functions might still benefit from the extra register space.
Mixing AVX and SSE
Besides 3-operand support, the primary difference between the AVX and SSE encodings of an SSE instruction is that the AVX version clears the unused portion of the ymm register (the top 128 bits), while the SSE version does not modify it. Intel strongly advises against mixing SSE 128-bit instructions and AVX 256-bit instructions, as this "mode-switching" can cost upwards of 70 clock cycles. However, mixing SSE 128-bit and AVX 128-bit is okay, as is mixing AVX 128-bit and AVX 256-bit.
In order to safely switch modes, Intel recommends using vzeroupper after using 256-bit AVX instructions and before using 128-bit SSE instructions, if the two are being used in the same program.
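A minimal sketch of that recommendation using the `_mm256_zeroupper()` intrinsic, which maps to `vzeroupper`. The function name `sum8` is hypothetical and the target attribute assumes GCC or Clang. Compilers may already insert `vzeroupper` on their own at function boundaries, so the explicit call here is mainly for illustration.

```c
#include <assert.h>
#include <immintrin.h>

/* Sums 8 floats with 256-bit AVX, then clears the upper YMM halves so that
   legacy-encoded SSE code running afterwards avoids the transition penalty.
   Assumption: GCC/Clang with per-function AVX enablement. */
__attribute__((target("avx")))
float sum8(const float *p)
{
    assert(p);
    __m256 v  = _mm256_loadu_ps(p);
    __m128 lo = _mm256_castps256_ps128(v);     /* elements 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* elements 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);            /* four partial sums */

    float tmp[4];
    _mm_storeu_ps(tmp, s);
    float total = (tmp[0] + tmp[1]) + (tmp[2] + tmp[3]);

    _mm256_zeroupper();   /* done with 256-bit work: emit vzeroupper */
    return total;
}
```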
See the main article: AVX2.
- Introduction to Intel® Advanced Vector Extensions by Chris Lomont
- AMD64 Architecture Programmer’s Manual, Volume 4: 128-Bit and 256-Bit Media Instructions (pdf)
- Advanced Vector Extensions from Wikipedia
- VEX prefix from Wikipedia
- Intel Software Development Emulator, which can be used to experiment with AVX and AVX2 on a CPU that doesn't support them.
- Intel Intrinsics Guide
- Using AVX Without Writing AVX Code