'''AVX-512''',<br/>
an expansion of [[Intel|Intel's]] [[AVX]] and [[AVX2]] instruction sets using the [https://en.wikipedia.org/wiki/EVEX_prefix EVEX prefix], featuring '''32''' 512-bit wide vector [[SIMD and SWAR Techniques|SIMD]] registers zmm0 through zmm31, each holding either eight [[Double|doubles]] or eight integer [[Quad Word|quad words]] such as [[Bitboards|bitboards]], and eight dedicated mask registers (seven of which are usable as write masks, since the encoding of k0 selects "no masking") which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, the Nth element of the destination vector is overwritten with the result of the operation; otherwise, depending on whether zeroing- or merging-masking is encoded, the element is zeroed or taken from another source register (it remains unchanged if destination and source coincide). A vector mask register can be set using vector compare instructions, instructions moving contents from a GP register, or a special subset of vector mask arithmetic instructions.
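The difference between zeroing- and merging-masking can be sketched in plain, scalar C. The following helpers (our names for illustration, not Intel intrinsics) mimic what a masked 512-bit add of eight quad words does per lane:

```c
#include <stdint.h>

enum { LANES = 8 };

/* Zeroing-masking: lanes whose mask bit is clear are set to zero. */
void maskz_add_epi64(uint64_t *dst, uint8_t k,
                     const uint64_t *a, const uint64_t *b) {
   for (int n = 0; n < LANES; n++)
      dst[n] = (k >> n & 1) ? a[n] + b[n] : 0;
}

/* Merging-masking: lanes whose mask bit is clear are copied from src. */
void mask_add_epi64(uint64_t *dst, const uint64_t *src, uint8_t k,
                    const uint64_t *a, const uint64_t *b) {
   for (int n = 0; n < LANES; n++)
      dst[n] = (k >> n & 1) ? a[n] + b[n] : src[n];
}
```

On real hardware the mask lives in one of the k registers and both forms execute as a single instruction; the two C loops only model the per-lane selection.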
=Extensions=
{| class="wikitable"
|-
! Extension
! Description
! Microarchitecture
! Reg:Bit <ref>[https://www.heise.de/ct/zcontent/17/16-hocmsmeta/1501873687265857/ct.1617.016-017.qxp_table_29578.html AVX512 table] from [https://en.wikipedia.org/wiki/Heinz_Heise Heise]</ref>
|-
| AVX-512F
| Foundation
| rowspan="4" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing]
| EBX:16
|-
| AVX-512CD
| Conflict Detection Instructions
| EBX:28
|-
| AVX-512ER
| Exponential and Reciprocal Instructions
| EBX:27
|-
| AVX-512PF
| Prefetch Instructions
| EBX:26
|-
| AVX-512BW
| [[Byte]] and [[Word]] Instructions
| rowspan="3" | [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake X]
| EBX:30
|-
| AVX-512DQ
| [[Double Word|Doubleword]] and [[Quad Word|Quadword]] Instructions
| EBX:17
|-
| AVX-512VL
| Vector Length Extensions
| EBX:31
|-
| AVX-512IFMA
| Integer Fused Multiply Add
| rowspan="2" | [https://en.wikipedia.org/wiki/Cannonlake Cannonlake]
| EBX:21
|-
| AVX-512VBMI
| Vector Byte Manipulation Instructions
| ECX:01
|-
| AVX-512VPOPCNTDQ
| Vector [[Population Count]]
| rowspan="3" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill Knights Mill]
| ECX:14
|-
| AVX-512_4VNNIW
| Vector Neural Network Instructions Word variable precision
| EDX:02
|-
| AVX-512_4FMAPS
| Fused Multiply Accumulation Packed Single precision
| EDX:03
|-
| AVX-512VNNI
| Vector Neural Network Instructions <br/>Vector Instructions for [[Deep Learning]]
| rowspan="4" | [https://en.wikipedia.org/wiki/Ice_Lake_(microprocessor) Ice Lake]
| ECX:11
|-
| AVX-512VBMI2
| Vector Byte Manipulation Instructions 2<br/>[[Byte]]/[[Word]] Load, Store and Concatenation with Shift
| ECX:06
|-
| AVX-512BITALG
| Bit Algorithms<br/>Byte/Word Bit Manipulation Instructions expanding VPOPCNTDQ
| ECX:12
|-
| AVX-512GFNI
| Galois Field New Instructions<br/>Vector Instructions for calculating [https://en.wikipedia.org/wiki/Finite_field Galois Field] GF(2^8)
| ECX:08
|}
=Selected Instructions=
==VPTERNLOG<span id="VPTERNLOG"></span>==
AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise [https://en.wikipedia.org/wiki/Ternary_operation ternary logic], for instance to [[General Setwise Operations|operate]] on vectors of [[Bitboards|bitboards]]. Three input vectors are bitwise [[Combinatorial Logic|combined]] by an operation determined by an immediate byte operand ('''imm8'''), whose 256 possible values correspond to the output column of the [https://en.wikipedia.org/wiki/Truth_table truth table] for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below <ref>[http://0x80.pl/articles/avx512-ternary-functions.html AVX512: ternary functions evaluation] by [[Wojciech Muła]], March 03, 2015</ref> <ref>[https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE</ref> :
{| class="wikitable"
|-
! imm8
! operation
|-
| 0x80
| a ∧ b ∧ c
|-
| 0x96
| a ⊕ b ⊕ c
|-
| 0xCA
| a ? b : c
|-
| 0xE8
| majority(a, b, c)
|-
| 0xFE
| a ∨ b ∨ c
|}
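The imm8 lookup can be emulated bit-parallel on a single quad word in plain C. The helper below (our name, not an Intel intrinsic) builds the result from the minterms selected by imm8: for every bit position, the three input bits a, b, c form the index (a<<2)|(b<<1)|c into imm8, and the indexed imm8 bit becomes the result bit:

```c
#include <stdint.h>

/* Bit-parallel emulation of one 64-bit lane of VPTERNLOGQ.
   For each set bit i of imm8, OR in the minterm where a,b,c
   match the bit pattern of i (a = bit 2, b = bit 1, c = bit 0). */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
   uint64_t r = 0;
   for (int i = 0; i < 8; i++) {
      if (imm8 >> i & 1)
         r |= (i & 4 ? a : ~a) & (i & 2 ? b : ~b) & (i & 1 ? c : ~c);
   }
   return r;
}
```

With imm8 = 0x96 this computes a ^ b ^ c, with 0xE8 the bitwise majority, and with 0xCA the bitwise select (a & b) | (~a & c), matching the table above.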
<span id="VPLZCNT"></span>
==VPLZCNT==
AVX-512CD has Vector [[BitScan#LeadingZeroCount|Leading Zero Count]] - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel <ref>[https://www.google.com/patents/US9372692 Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search]</ref> - using the following intrinsics <ref>[https://hjlebbink.github.io/x86doc/html/VPLZCNTD_Q.html VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values]</ref>, where the maskz version zeroes destination elements whose mask bit is clear, while the mask version copies such elements from s:
<pre>
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i s, __mmask8 k, __m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
</pre>
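Per lane, the instruction behaves like the following scalar sketch (our helper, not an intrinsic); note that a zero input yields 64, matching the instruction's definition:

```c
#include <stdint.h>

/* Leading zero count of one 64-bit quad word; lzcnt64(0) == 64. */
int lzcnt64(uint64_t bb) {
   if (bb == 0) return 64;
   int n = 0;
   while (!(bb & 0x8000000000000000ULL)) {
      bb <<= 1;
      n++;
   }
   return n;
}
```

For a non-empty bitboard, 63 - lzcnt64(bb) is the index of its most significant one bit, i.e. a [[BitScan|bitscan reverse]].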
<span id="VPOPCNT"></span>
==VPOPCNT==
The AVX-512VPOPCNTDQ extension has a vector [[Population Count|population count]] instruction counting the one bits of either 16 32-bit double words (VPOPCNTD) or 8 64-bit quad words aka bitboards (VPOPCNTQ) in parallel <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref> <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=VPOPCNTD&expand=4368 Intel® Intrinsics Guide VPOPCNTD]</ref>.
<pre>
__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m512i _mm512_popcnt_epi64(__m512i a);
</pre>
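Where VPOPCNTQ is not available, one lane can be computed with the classical [[SIMD and SWAR Techniques|SWAR]] population count, which the following sketch shows for a single bitboard:

```c
#include <stdint.h>

/* SWAR population count of one 64-bit bitboard:
   pairwise, nibble-wise and byte-wise partial sums,
   then a multiply to accumulate all byte counts into the top byte. */
int popcount64(uint64_t bb) {
   bb = bb - ((bb >> 1) & 0x5555555555555555ULL);
   bb = (bb & 0x3333333333333333ULL) + ((bb >> 2) & 0x3333333333333333ULL);
   bb = (bb + (bb >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
   return (int)((bb * 0x0101010101010101ULL) >> 56);
}
```

VPOPCNTQ performs this per-lane count for eight bitboards in a single instruction.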
==VPDPBUSD==
The AVX-512VNNI extension features several instructions speeding up [[Neural Networks|neural network]] and [[Deep Learning|deep learning]] calculations on the CPU, for instance [[NNUE]] inference using uint8/int8. VPDPBUSD - Multiply and Add Unsigned and Signed Bytes <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2168,2201&text=VPDPBUSD&avx512techs=AVX512_VNNI Intel® Intrinsics Guide VPDPBUSD]</ref>, executes on both port 0 and port 5 in one cycle <ref>[https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018</ref>.
<pre>
__m512i _mm512_dpbusd_epi32(__m512i src, __m512i a, __m512i b)
{
   for (j = 0; j < 16; j++) {
      tmp1.word := Signed(ZeroExtend16(a.byte[4*j  ]) * SignExtend16(b.byte[4*j  ]))
      tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]))
      tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]))
      tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]))
      dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
   }
   return dst;
}
</pre>
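The pseudocode above reduces, per dword lane, to a dot product of four unsigned bytes with four signed bytes accumulated into a signed 32-bit sum. A scalar C reference of one lane (our helper for illustration, not an intrinsic) makes the widening explicit:

```c
#include <stdint.h>

/* One dword lane of VPDPBUSD: four u8 * s8 products, widened to
   32 bits and accumulated into src. VPDPBUSD itself does not
   saturate (the saturating variant is VPDPBUSDS). */
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
   int32_t sum = src;
   for (int i = 0; i < 4; i++)
      sum += (int32_t)a[i] * (int32_t)b[i];
   return sum;
}
```

The full instruction performs this for all 16 dword lanes of a zmm register at once, which is why an NNUE layer with uint8 activations and int8 weights maps onto it so well.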
=See also=
* [[CFish#AVX2 Attacks|CFish - AVX2 Attacks]]
* [[SSE2]]
* [[XOP]]
 
=Publications=
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2018'''). ''[https://os.itec.kit.edu/21_3486.php Mechanism to Mitigate AVX-Induced Frequency Reduction]''. [https://arxiv.org/abs/1901.04982 arXiv:1901.04982]
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/97_3742.php Philipp Machauer], [https://os.itec.kit.edu/21_3571.php Yussuf Khalil], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2021'''). ''[https://www.usenix.org/conference/atc21/presentation/gottschlag Fair Scheduling for AVX2 and AVX-512 Workloads]''. [https://www.usenix.org/conference/atc21 USENIX ATC '21]
=Manuals=
