'''AVX-512''',<br/>
an expansion of [[Intel|Intel's]] [[AVX]] and [[AVX2]] instructions using the [https://en.wikipedia.org/wiki/EVEX_prefix EVEX prefix], featuring '''32''' 512-bit wide vector [[SIMD and SWAR Techniques|SIMD]] registers zmm0 through zmm31, each holding either eight [[Double|doubles]] or eight integer [[Quad Word|quad words]] such as [[Bitboards|bitboards]], and eight dedicated mask registers k0 through k7 (of which seven are freely usable, since k0 encodes "no masking") which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, the Nth element of the destination vector is overwritten with the result of the operation; otherwise, depending on the instruction, the element is either zeroed or taken from another source register (and remains unchanged if that source is the destination). A vector mask register can be set by vector compare instructions, by moving the contents of a GP register, or by a special subset of vector mask arithmetic instructions.
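The two masking modes can be sketched in portable C, modeling one 8 x 64-bit vector as an array (a minimal sketch for illustration; the lane loop stands in for what the hardware does in parallel, and the function name is hypothetical):

```c
#include <stdint.h>

/* Portable model of AVX-512 merge- and zero-masking on an 8 x 64-bit
   vector: lane i receives the new result only if bit i of mask k is set;
   otherwise it is zeroed (zero-masking) or merged from src. */
static void apply_mask(uint64_t dst[8], const uint64_t src[8],
                       const uint64_t result[8], uint8_t k, int zero_mask) {
    for (int i = 0; i < 8; i++) {
        if (k & (1u << i))
            dst[i] = result[i];              /* masked-in lane: take result */
        else
            dst[i] = zero_mask ? 0 : src[i]; /* zeroed, or merged from src  */
    }
}
```

In the intrinsics below this corresponds to the maskz (zero-masking) and mask (merge-masking) variants respectively.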
=Extensions=
{| class="wikitable"
|-
! Extension !! Instructions !! First Microarchitecture !! Reg:Bit <ref>[https://www.heise.de/ct/zcontent/17/16-hocmsmeta/1501873687265857/ct.1617.016-017.qxp_table_29578.html AVX512 table] from [https://en.wikipedia.org/wiki/Heinz_Heise Heise]</ref>
|-
| AVX-512F
| Foundation
| rowspan="4" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing]
| EBX:16
|-
| AVX-512CD
| Conflict Detection Instructions
| EBX:28
|-
| AVX-512ER
| Exponential and Reciprocal Instructions
| EBX:27
|-
| AVX-512PF
| Prefetch Instructions
| EBX:26
|-
| AVX-512BW
| [[Byte]] and [[Word]] Instructions
| rowspan="3" | [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake X]
| EBX:30
|-
| AVX-512DQ
| [[Double Word|Doubleword]] and [[Quad Word|Quadword]] Instructions
| EBX:17
|-
| AVX-512VL
| Vector Length Extensions
| EBX:31
|-
| AVX-512IFMA
| Integer Fused Multiply Add
| rowspan="2" | [https://en.wikipedia.org/wiki/Cannonlake Cannonlake]
| EBX:21
|-
| AVX-512VBMI
| Vector Byte Manipulation Instructions
| ECX:01
|-
| AVX-512VPOPCNTDQ
| Vector [[Population Count]]
| rowspan="3" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill Knights Mill]
| ECX:14
|-
| AVX-512 4VNNIW
| Vector Neural Network Instructions<br/>Word variable precision
| EDX:02
|-
| AVX-512 4FMAPS
| Fused Multiply Accumulation<br/>Packed Single precision
| EDX:03
|-
| AVX-512 VNNI
| Vector Neural Network Instructions <br/>Vector Instructions for [[Deep Learning]]
| rowspan="4" | [https://en.wikipedia.org/wiki/Ice_Lake_(microprocessor) Ice Lake]
| ECX:11
|-
| AVX-512 VBMI2
| Vector Byte Manipulation Instructions 2<br/>[[Byte]]/[[Word]] Load, Store and Concatenation with Shift
| ECX:06
|-
| AVX-512 BITALG
| Bit Algorithms<br/>Byte/Word Bit Manipulation Instructions expanding VPOPCNTDQ
| ECX:12
|-
| AVX-512 GFNI
| Galois Field New Instructions<br/>Vector Instructions for calculating in the [https://en.wikipedia.org/wiki/Finite_field Galois Field] GF(2^8)
| ECX:08
|}
=Selected Instructions=
==VPTERNLOG<span id="VPTERNLOG"></span>==
AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise [https://en.wikipedia.org/wiki/Ternary_operation ternary logic], for instance to [[General Setwise Operations|operate]] on vectors of [[Bitboards|bitboards]]. Three input vectors are bitwise [[Combinatorial Logic|combined]] by an operation determined by an immediate byte operand ('''imm8'''), whose 256 possible values correspond to the boolean output column of the [https://en.wikipedia.org/wiki/Truth_table truth table] for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below <ref>[http://0x80.pl/articles/avx512-ternary-functions.html AVX512: ternary functions evaluation] by [[Wojciech Muła]], March 03, 2015</ref> <ref>[https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE</ref> :
{| class="wikitable"
|-
! imm8 !! Operation
|-
| 0x00 || always 0
|-
| 0x80 || a AND b AND c
|-
| 0x96 || a XOR b XOR c
|-
| 0xCA || a ? b : c
|-
| 0xE8 || majority(a, b, c)
|-
| 0xFE || a OR b OR c
|-
| 0xFF || always 1
|}
The following VPTERNLOGQ intrinsics are declared, where the maskz version sets unmasked destination quad word elements to zero, while the mask version leaves unmasked elements of src unchanged:
<pre>
__m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_maskz_ternarylogic_epi64(__mmask8 k, __m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_mask_ternarylogic_epi64(__m256i src, __mmask8 k, __m256i a, __m256i b, int imm8);
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64(__mmask8 k, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i src, __mmask8 k, __m512i a, __m512i b, int imm8);
</pre>
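Since imm8 is just the output column of the truth table, the semantics are easy to model in portable C. The sketch below captures the per-bit rule for a single 64-bit lane (the function name is made up for illustration; it is not the intrinsic itself):

```c
#include <stdint.h>

/* Portable model of one VPTERNLOGQ lane: for every bit position, the
   three input bits form an index (a<<2)|(b<<1)|c, and that bit of imm8
   is the result. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((imm8 >> i) & 1) {
            /* select all bit positions whose (a,b,c) pattern equals i */
            uint64_t m = ((i & 4) ? a : ~a)
                       & ((i & 2) ? b : ~b)
                       & ((i & 1) ? c : ~c);
            r |= m;
        }
    }
    return r;
}
```

For instance, ternlog64(a, b, c, 0x96) yields the same result as a ^ b ^ c, replacing two dependent XOR instructions with one operation.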
<span id="VPLZCNT"></span>
==VPLZCNT==
AVX-512CD has Vector [[BitScan#LeadingZeroCount|Leading Zero Count]] - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel <ref>[https://www.google.com/patents/US9372692 Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search]</ref> - using the following intrinsics <ref>[https://hjlebbink.github.io/x86doc/html/VPLZCNTD_Q.html VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values]</ref>, where the maskz version sets unmasked destination elements to zero, while the mask version copies unmasked elements from src:
<pre>
__m256i _mm256_lzcnt_epi64(__m256i a);
__m256i _mm256_maskz_lzcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_mask_lzcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i src, __mmask8 k, __m512i a);
</pre>
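The per-lane behavior, including the all-zero case which yields 64, can be sketched in portable C (an illustrative model, not the instruction):

```c
#include <stdint.h>

/* Portable model of one VPLZCNTQ lane: number of leading zero bits
   of a 64-bit quad word; an all-zero word yields 64. */
static int lzcnt64(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    while (!(x >> 63)) {  /* shift left until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}
```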
<span id="VPOPCNT"></span>
==VPOPCNT==
The AVX-512VPOPCNTDQ extension has a vector [[Population Count|population count]] instruction to count the one bits of either 16 32-bit double words (VPOPCNTD) or eight 64-bit quad words aka bitboards (VPOPCNTQ) in parallel <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref><ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=VPOPCNTD&expand=4368 Intel® Intrinsics Guide VPOPCNTD]</ref>.
<pre>
__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi32(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi32(__m128i a);
__m256i _mm256_mask_popcnt_epi32(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi32(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi32(__m256i a);
__m512i _mm512_mask_popcnt_epi32(__m512i src, __mmask16 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi32(__mmask16 k, __m512i a);
__m512i _mm512_popcnt_epi32(__m512i a);
__m128i _mm_mask_popcnt_epi64(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi64(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi64(__m128i a);
__m256i _mm256_mask_popcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi64(__m256i a);
__m512i _mm512_mask_popcnt_epi64(__m512i src, __mmask8 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_popcnt_epi64(__m512i a);
</pre>
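A portable per-lane model of VPOPCNTQ, sketched with Kernighan's bit-clearing loop (one iteration per set bit), shows what each lane computes:

```c
#include <stdint.h>

/* Portable model of one VPOPCNTQ lane: count the set bits of a
   64-bit quad word by repeatedly clearing the lowest one bit. */
static int popcnt64(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

On a bitboard, this is the number of occupied squares; VPOPCNTQ performs it for eight bitboards at once.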
==VPDPBUSD==
The AVX-512 VNNI extension features several instructions speeding up [[Neural Networks|neural network]] and [[Deep Learning|deep learning]] calculations on the CPU, for instance [[NNUE]] inference using uint8/int8. VPDPBUSD - Multiply and Add Unsigned and Signed Bytes <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2168,2201&text=VPDPBUSD&avx512techs=AVX512_VNNI Intel® Intrinsics Guide VPDPBUSD]</ref>, executes on both port 0 and port 5 in one cycle <ref>[https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018</ref>.
<pre>
__m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
{
for (j=0; j < 16; j++) {
tmp1.word := Signed(ZeroExtend16(a.byte[4*j  ]) * SignExtend16(b.byte[4*j  ]));
tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]));
tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]));
tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]));
dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
}
return dst;
}
</pre>
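One dword lane of the pseudocode above can be modeled in portable C (a sketch with a made-up function name; the real instruction processes 16 such lanes in parallel):

```c
#include <stdint.h>

/* Portable model of one VPDPBUSD dword lane: four unsigned bytes of a
   are multiplied with the corresponding four signed bytes of b, and the
   four products are accumulated into the 32-bit value from src. */
static int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t acc = src;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

This fused multiply-accumulate over mixed uint8/int8 operands is exactly the inner dot product of an NNUE layer, which is why VNNI helps inference throughput.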
=See also=
* [[CFish#AVX2 Attacks|CFish - AVX2 Attacks]]
* [[DirGolem]]
* [[NNUE]]
* [[Stockfish NNUE]]
* [[SIMD and SWAR Techniques]]
 
=SIMD=
* [[AltiVec]]
* [[AVX]]
* [[AVX2]]
* [[NNUE]]
* [[SIMD and SWAR Techniques]]
* [[SSE2]]
* [[Stockfish NNUE]]
* [[XOP]]
 
=Publications=
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2018'''). ''[https://os.itec.kit.edu/21_3486.php Mechanism to Mitigate AVX-Induced Frequency Reduction]''. [https://arxiv.org/abs/1901.04982 arXiv:1901.04982]
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/97_3742.php Philipp Machauer], [https://os.itec.kit.edu/21_3571.php Yussuf Khalil], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2021'''). ''[https://www.usenix.org/conference/atc21/presentation/gottschlag Fair Scheduling for AVX2 and AVX-512 Workloads]''. [https://www.usenix.org/conference/atc21 USENIX ATC '21]
=Manuals=
* [https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf)
* [https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html Intel® 64 and IA-32 Architectures Optimization Reference Manual]
 
=Forum Posts=
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75049 AVX-512 and NNUE] by [[Gian-Carlo Pascutto]], [[CCC]], September 08, 2020 » [[NNUE]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=77246 VPOPCNTDQ and VBMI2] by [[Vivien Clauzon]], [[CCC]], May 04, 2021
=External Links=
* [https://en.wikichip.org/wiki/x86/avx512_vnni AVX-512 Vector Neural Network Instructions (VNNI) - x86 - WikiChip]
* [https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Overview]
* [https://software.intel.com/en-us/intel-isa-extensions Intel Instruction Set Architecture Extensions]
==Blog Posts==
* [https://software.intel.com/en-us/blogs/2013/avx-512-instructions AVX-512 instructions | Intel® Developer Zone] by [https://software.intel.com/en-us/user/335550 James Reinders], July 23, 2013
* [http://www.agner.org/optimize/blog/read.php?i=288 Future instruction set: AVX-512] by [http://www.agner.org/ Agner Fog], October 09, 2013
* [https://software.intel.com/en-us/blogs/2014/07/24/processing-arrays-of-bits-with-intel-advanced-vector-extensions-512-intel-avx-512 Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) | Intel® Developer Zone] by [https://software.intel.com/en-us/user/123920 Thomas Willhalm], July 24, 2014
* [https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/ AVX-512 May Be a “Hidden Gem” in Intel Xeon Scalable Processors] by [https://software.intel.com/en-us/user/335550 James Reinders], [https://www.hpcwire.com/ HPCwire], June 29, 2017
* [https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018
==Compiler Support==
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512 Intel Intrinsics Guide - AVX-512]
=References=
<references />
 
'''[[x86-64|Up one Level]]'''
