Difference between revisions of "AVX-512"

From Chessprogramming wiki
Jump to: navigation, search
Line 318: Line 318:
 
=Forum Posts=
 
=Forum Posts=
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75049 AVX-512 and NNUE] by [[Gian-Carlo Pascutto]], [[CCC]], September 08, 2020 » [[NNUE]]
 
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75049 AVX-512 and NNUE] by [[Gian-Carlo Pascutto]], [[CCC]], September 08, 2020 » [[NNUE]]
 +
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=77246 VPOPCNTDQ and VBMI2] by [[Vivien Clauzon]], [[CCC]], May 04, 2021
  
 
=External Links=  
 
=External Links=  

Revision as of 08:37, 8 May 2021

Home * Hardware * x86-64 * AVX-512

AVX-512,
an expansion of Intel's the AVX and AVX2 instructions using the EVEX prefix, featuring 32 512-bit wide vector SIMD registers zmm0 through zmm31, keeping either eight doubles or integer quad words such as bitboards, and eight (seven) dedicated mask registers which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, then the Nth element of the destination vector is overridden with the result of the operation; otherwise, dependent of the instruction, the element is zeroed, or overridden by an element from another source register (remains unchanged if same source). A vector mask register can be set using vector compare instructions, instructions to move contents from a GP register, or a special subset of vector mask arithmetic instructions.

Extensions

AVX-512 consists of multiple extensions not all meant to be supported by all AVX-512 capable processors. Only the core extension AVX-512F (AVX-512 Foundation) is required by all implementations [1] AVX-512F and AVX-512CD were first implemented in the Xeon Phi processor and coprocessor known by the code name Knights Landing [2] , launched on June 20, 2016.

Extension Description Architecture CPUID 7

Reg:Bit [3]

AVX-512F Foundation Knights Landing EBX:16
AVX-512CD Conflict Detection Instructions EBX:28
AVX-512ER Exponential and Reciprocal Instructions EBX:27
AVX-512PF Prefetch Instructions EBX:26
AVX-512BW Byte and Word Instructions Skylake X EBX:30
AVX-512DQ Doubleword and Quadword Instructions EBX:17
AVX-512VL Vector Length Extensions EBX:31
AVX-512IFMA Integer Fused Multiply Add Cannonlake EBX:21
AVX-512VBMI Vector Byte Manipulation Instructions ECX:01
AVX-512VPOPCNTDQ Vector Population Count Knights Mill ECX:14
AVX-512-4VNNIW Vector Neural Network Instructions
Word variable precision
EDX:02
AVX-512-4FMAPS Fused Multiply Accumulation
Packed Single precision
EDX:03

Selected Instructions

VPTERNLOG

AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise ternary logic, for instance to operate on vectors of bitboards. Three input vectors are bitwise combined by an operation determined by an immediate byte operand (imm8), whose 256 possible values corresponds with the boolean output vector of the truth table for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below [4] [5] :

Input Output of Operations
imm8 0x00 0x01 0x16 0x17 0x28 0x80 0x88 0x96 0xca 0xe8 0xfe 0xff
# a b c C-exp false ~(a|b|c) a?~(b|c):b^c minor(a,b,c) c&(a^b) a&b&c b&c a^b^c a?b:c major(a,b,c) a|b|c true
0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1
1 0 0 1 0 0 1 1 0 0 0 1 1 0 1 1
2 0 1 0 0 0 1 1 0 0 0 1 0 0 1 1
3 0 1 1 0 0 0 0 1 0 1 0 1 1 1 1
4 1 0 0 0 0 1 1 0 0 0 1 0 0 1 1
5 1 0 1 0 0 0 0 1 0 0 0 0 1 1 1
6 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1
7 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1

Following VPTERNLOGQ intrinsics are declared, where the maskz version sets unmasked destination quad word elements to zero, while the mask version copies unmasked elements from s:

__m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_maskz_ternarylogic_epi64(__mmask8 k, __m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_mask_ternarylogic_epi64(__m256i src, __mmask8 k, __m256i a, __m256i b, int imm8);
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64( __mmask8 m, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i s, __mmask8 m, __m512i a, __m512i b, __m512i c, int imm8);

VPLZCNT

AVX-512CD has Vector Leading Zero Count - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel [6] - using following intrinsics [7], where the maskz version sets unmasked destination elements to zero, while the mask version copies unmasked elements from s:

__m256i _mm256_lzcnt_epi64(__m256i a);
__m256i _mm256_maskz_lzcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_mask_lzcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i src, __mmask8 k, __m512i a);

VPOPCNT

The AVX-512VPOPCNTDQ extension has a vector population count instruction to count one bits of either 16 32-bit double words (VPOPCNTD) or 8 64-bit quad words aka bitboards (VPOPCNTQ) in parallel [8] [9] [10].

__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi32(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi3 (__m128i a);
__m256i _mm256_mask_popcnt_epi32(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi32(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi32(__m256i a);
__m512i _mm512_mask_popcnt_epi32(__m512i src, __mmask16 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi32(__mmask16 k, __m512i a);
__m512i _mm512_popcnt_epi32(__m512i a);

__m128i _mm_mask_popcnt_epi64(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi64(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi64(__m128i a);
__m256i _mm256_mask_popcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi64(__m256i a);
__m512i _mm512_mask_popcnt_epi64(__m512i src, __mmask8 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_popcnt_epi64(__m512i a)

See also

SIMD

Manuals

Forum Posts

External Links

Blogs

Compiler Support

References

Up one Level