AVX-512



AVX-512 is an expansion of Intel's AVX and AVX2 instruction sets using the EVEX prefix encoding. It features 32 512-bit wide vector SIMD registers, zmm0 through zmm31, each holding eight doubles or eight 64-bit integer quad words such as bitboards, as well as eight dedicated mask registers, k0 through k7 (of which only k1 through k7 can be used as write masks), which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, the Nth element of the destination vector is overwritten with the result of the operation; otherwise, depending on the instruction, the element is either zeroed or taken from another source register (it remains unchanged if that source is the destination itself). A vector mask register can be set by vector compare instructions, by instructions moving contents from a general purpose register, or by a special subset of vector mask arithmetic instructions.
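
The masking behaviour can be sketched with intrinsics. A minimal example, assuming an AVX-512F capable CPU and compiler (the function and variable names are purely illustrative), derives a mask from a vector test and applies it with zero-masking:

#include <immintrin.h>

/* Minimal zero-masking sketch, assuming AVX-512F (e.g. compiled with -mavx512f).
 * For each of the eight bitboard lanes, keep (boards & attacks) only where the
 * board is nonempty, and zero the lane otherwise. */
__m512i masked_and(__m512i boards, __m512i attacks) {
    /* VPTESTMQ: mask bit i is set if lane i of boards is nonzero */
    __mmask8 nonempty = _mm512_test_epi64_mask(boards, boards);
    /* zero-masking VPANDQ: lanes whose mask bit is cleared become zero */
    return _mm512_maskz_and_epi64(nonempty, boards, attacks);
}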

Extensions

AVX-512 consists of multiple extensions, not all of which are meant to be supported by all AVX-512 capable processors. Only the core extension AVX-512F (AVX-512 Foundation) is required by all implementations [1]. AVX-512F and AVX-512CD were first implemented in the Xeon Phi processor and coprocessor known by the code name Knights Landing [2], launched on June 20, 2016. The table below lists the extensions with their CPUID leaf 7 feature bits; a runtime detection sketch follows the table.

Extension          | Description                                                  | Architecture    | CPUID 7 Reg:Bit [3]
-------------------+--------------------------------------------------------------+-----------------+--------------------
AVX-512 F          | Foundation                                                   | Knights Landing | EBX:16
AVX-512 CD         | Conflict Detection Instructions                              |                 | EBX:28
AVX-512 ER         | Exponential and Reciprocal Instructions                      |                 | EBX:27
AVX-512 PF         | Prefetch Instructions                                        |                 | EBX:26
AVX-512 BW         | Byte and Word Instructions                                   | Skylake X       | EBX:30
AVX-512 DQ         | Doubleword and Quadword Instructions                         |                 | EBX:17
AVX-512 VL         | Vector Length Extensions                                     |                 | EBX:31
AVX-512 IFMA       | Integer Fused Multiply Add                                   | Cannonlake      | EBX:21
AVX-512 VBMI       | Vector Byte Manipulation Instructions                        |                 | ECX:01
AVX-512 VPOPCNTDQ  | Vector Population Count                                      | Knights Mill    | ECX:14
AVX-512-4VNNIW     | Vector Neural Network Instructions, Word variable precision  |                 | EDX:02
AVX-512-4FMAPS     | Fused Multiply Accumulation, Packed Single precision         |                 | EDX:03
AVX-512 VNNI       | Vector Neural Network Instructions,                          | Ice Lake        | ECX:11
                   | Vector Instructions for Deep Learning                        |                 |
AVX-512 VBMI2      | Vector Byte Manipulation Instructions 2,                     |                 |
                   | Byte/Word Load, Store and Concatenation with Shift           |                 |
AVX-512 BITALG     | Bit Algorithms,                                              |                 |
                   | Byte/Word Bit Manipulation Instructions expanding VPOPCNTDQ  |                 |
AVX-512 GFNI       | Galois Field New Instructions,                               |                 |
                   | Vector Instructions for calculating Galois Field GF(2^8)     |                 |
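
Since not every AVX-512 capable processor implements all extensions, the CPUID bits listed above should be checked at runtime. A minimal sketch using the GCC/Clang built-in __builtin_cpu_supports (an assumption; MSVC code would query CPUID leaf 7 via __cpuidex instead):

#include <stdio.h>

/* Report a few AVX-512 extensions of interest for chess programs;
 * the feature strings follow the GCC/Clang naming of the CPUID bits above. */
int main(void) {
    printf("AVX-512 F:         %d\n", __builtin_cpu_supports("avx512f"));
    printf("AVX-512 VPOPCNTDQ: %d\n", __builtin_cpu_supports("avx512vpopcntdq"));
    printf("AVX-512 VNNI:      %d\n", __builtin_cpu_supports("avx512vnni"));
    return 0;
}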

Selected Instructions

VPTERNLOG

AVX-512 F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise ternary logic, for instance to operate on vectors of bitboards. Three input vectors are combined bitwise by an operation determined by an immediate byte operand (imm8), whose 256 possible values each correspond to the boolean output column of the truth table over all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below [4] [5]:

imm8       0x00   0x01      0x16          0x17          0x28     0x80   0x88  0x96   0xca   0xe8          0xfe   0xff
# a b c    false  ~(a|b|c)  a?~(b|c):b^c  minor(a,b,c)  c&(a^b)  a&b&c  b&c   a^b^c  a?b:c  major(a,b,c)  a|b|c  true
0 0 0 0    0      1         0             1             0        0      0     0      0      0             0      1
1 0 0 1    0      0         1             1             0        0      0     1      1      0             1      1
2 0 1 0    0      0         1             1             0        0      0     1      0      0             1      1
3 0 1 1    0      0         0             0             1        0      1     0      1      1             1      1
4 1 0 0    0      0         1             1             0        0      0     1      0      0             1      1
5 1 0 1    0      0         0             0             1        0      0     0      0      1             1      1
6 1 1 0    0      0         0             0             0        0      0     0      1      1             1      1
7 1 1 1    0      0         0             0             0        1      1     1      1      1             1      1

The following VPTERNLOGQ intrinsics are declared, where the maskz version zeroes destination quad word elements whose mask bit is not set, while the mask version leaves them as the corresponding elements of src:

__m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_maskz_ternarylogic_epi64(__mmask8 k, __m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_mask_ternarylogic_epi64(__m256i src, __mmask8 k, __m256i a, __m256i b, int imm8);
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64(__mmask8 k, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i src, __mmask8 k, __m512i a, __m512i b, int imm8);
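
As a usage sketch (assuming AVX-512F; the function name is illustrative), imm8 = 0xe8 from the table above, the majority function, combines three vectors of bitboards with a single instruction, for instance to find squares attacked by at least two of three attack sets:

#include <immintrin.h>

/* Squares set in at least two of three bitboard vectors: major(a,b,c) = imm8 0xe8 */
__m512i attacked_at_least_twice(__m512i a, __m512i b, __m512i c) {
    return _mm512_ternarylogic_epi64(a, b, c, 0xe8);
}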

VPLZCNT

AVX-512 CD features a vector leading zero count: VPLZCNTQ counts the leading zeroes of eight bitboards in parallel [6]. The following intrinsics are declared [7], where the maskz version zeroes destination elements whose mask bit is not set, while the mask version copies them from src:

__m256i _mm256_lzcnt_epi64(__m256i a);
__m256i _mm256_maskz_lzcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_mask_lzcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i src, __mmask8 k, __m512i a);
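
A possible bitboard use, sketched under the assumption of AVX-512CD plus AVX-512F (function name illustrative): the bit index of the most significant one bit of each bitboard is 63 - lzcnt, with empty bitboards yielding -1 since VPLZCNTQ returns 64 for an all-zero quad word:

#include <immintrin.h>

/* Index (0..63) of the most significant one bit per bitboard; -1 for empty bitboards. */
__m512i msb_indices(__m512i bitboards) {
    return _mm512_sub_epi64(_mm512_set1_epi64(63), _mm512_lzcnt_epi64(bitboards));
}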

VPOPCNT

The AVX-512 VPOPCNTDQ extension provides vector population count instructions which count the one bits of either sixteen 32-bit double words (VPOPCNTD) or eight 64-bit quad words aka bitboards (VPOPCNTQ) in parallel [8] [9] [10]. The following intrinsics are declared:

__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi32(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi32(__m128i a);
__m256i _mm256_mask_popcnt_epi32(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi32(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi32(__m256i a);
__m512i _mm512_mask_popcnt_epi32(__m512i src, __mmask16 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi32(__mmask16 k, __m512i a);
__m512i _mm512_popcnt_epi32(__m512i a);

__m128i _mm_mask_popcnt_epi64(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi64(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi64(__m128i a);
__m256i _mm256_mask_popcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi64(__m256i a);
__m512i _mm512_mask_popcnt_epi64(__m512i src, __mmask8 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_popcnt_epi64(__m512i a);
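
As a small usage sketch, assuming AVX-512VPOPCNTDQ together with AVX-512F and an illustrative function name, the per-bitboard counts can be summed horizontally, for example as part of a SIMD mobility term:

#include <immintrin.h>
#include <stdint.h>

/* Per-bitboard population counts followed by a horizontal sum over the eight lanes. */
int64_t total_popcount(__m512i bitboards) {
    return _mm512_reduce_add_epi64(_mm512_popcnt_epi64(bitboards));
}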

VPDPBUSD

The AVX-512 VNNI extension features several instructions speeding up neural network and deep learning calculations on the CPU, for instance NNUE inference using uint8/int8 quantization. VPDPBUSD - Multiply and Add Unsigned and Signed Bytes [11] - executes on both port 0 and port 5 in one cycle [12].

 
__m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
{
  for (j = 0; j < 16; j++) {
    tmp1.word := Signed(ZeroExtend16(a.byte[4*j  ]) * SignExtend16(b.byte[4*j  ]));
    tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]));
    tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]));
    tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]));
    dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4;
  }
  return dst;
}
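
A sketch of how VPDPBUSD might appear in NNUE-like inference code, assuming AVX-512 VNNI plus AVX-512F (names and data layout are illustrative, not taken from any particular engine): 64 unsigned 8-bit activations are multiplied by 64 signed 8-bit weights and accumulated into sixteen 32-bit sums with one instruction:

#include <immintrin.h>
#include <stdint.h>

/* One VPDPBUSD step: acc.dword[j] += sum of four uint8*int8 products per dword lane. */
__m512i dot_step(__m512i acc, const uint8_t *activations, const int8_t *weights) {
    __m512i a = _mm512_loadu_si512(activations);   /* 64 unsigned bytes */
    __m512i w = _mm512_loadu_si512(weights);       /* 64 signed bytes   */
    return _mm512_dpbusd_epi32(acc, a, w);
}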

See also

SIMD

Publications

Manuals

Forum Posts

External Links

Blogs

Compiler Support

References
