'''AVX-512''',<br/>
an expansion of [[Intel|Intel's]] [[AVX]] and [[AVX2]] instruction sets using the [https://en.wikipedia.org/wiki/EVEX_prefix EVEX prefix], featuring '''32''' 512-bit wide vector [[SIMD and SWAR Techniques|SIMD]] registers zmm0 through zmm31, each holding either eight [[Double|doubles]] or eight integer [[Quad Word|quad words]] such as [[Bitboards|bitboards]], and eight dedicated mask registers (seven of which are usable as write masks, since the encoding of k0 selects "no masking") which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, the Nth element of the destination vector is overwritten with the result of the operation; otherwise, depending on whether zeroing- or merging-masking is encoded, the element is zeroed or taken from another source register (it remains unchanged if destination and source coincide). A vector mask register can be set using vector compare instructions, instructions moving contents from a GP register, or a special subset of vector mask arithmetic instructions.
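The difference between zeroing- and merging-masking can be sketched in plain, scalar C. The following helpers (our names for illustration, not Intel intrinsics) mimic what a masked 512-bit add of eight quad words does per lane:

```c
#include <stdint.h>

enum { LANES = 8 };

/* Zeroing-masking: lanes whose mask bit is clear are set to zero. */
void maskz_add_epi64(uint64_t *dst, uint8_t k,
                     const uint64_t *a, const uint64_t *b) {
   for (int n = 0; n < LANES; n++)
      dst[n] = (k >> n & 1) ? a[n] + b[n] : 0;
}

/* Merging-masking: lanes whose mask bit is clear are copied from src. */
void mask_add_epi64(uint64_t *dst, const uint64_t *src, uint8_t k,
                    const uint64_t *a, const uint64_t *b) {
   for (int n = 0; n < LANES; n++)
      dst[n] = (k >> n & 1) ? a[n] + b[n] : src[n];
}
```

On real hardware the mask lives in one of the k registers and both forms execute as a single instruction; the two C loops only model the per-lane selection.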
=Extensions=
{| class="wikitable"
|-
! Extension
! Description
! Microarchitecture
! Reg:Bit <ref>[https://www.heise.de/ct/zcontent/17/16-hocmsmeta/1501873687265857/ct.1617.016-017.qxp_table_29578.html AVX512 table] from [https://en.wikipedia.org/wiki/Heinz_Heise Heise]</ref>
|-
| AVX-512F
| Foundation
| rowspan="4" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing]
| EBX:16
|-
| AVX-512CD
| Conflict Detection Instructions
| EBX:28
|-
| AVX-512ER
| Exponential and Reciprocal Instructions
| EBX:27
|-
| AVX-512PF
| Prefetch Instructions
| EBX:26
|-
| AVX-512BW
| [[Byte]] and [[Word]] Instructions
| rowspan="3" | [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake X]
| EBX:30
|-
| AVX-512DQ
| [[Double Word|Doubleword]] and [[Quad Word|Quadword]] Instructions
| EBX:17
|-
| AVX-512VL
| Vector Length Extensions
| EBX:31
|-
| AVX-512IFMA
| Integer Fused Multiply Add
| rowspan="2" | [https://en.wikipedia.org/wiki/Cannonlake Cannonlake]
| EBX:21
|-
| AVX-512VBMI
| Vector Byte Manipulation Instructions
| ECX:01
|-
| AVX-512VPOPCNTDQ
| Vector [[Population Count]]
| rowspan="3" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill Knights Mill]
| ECX:14
|-
| AVX-512_4VNNIW
| Vector Neural Network Instructions Word variable precision
| EDX:02
|-
| AVX-512_4FMAPS
| Fused Multiply Accumulation Packed Single precision
| EDX:03
|-
| AVX-512VNNI
| Vector Neural Network Instructions <br/>Vector Instructions for [[Deep Learning]]
| rowspan="4" | [https://en.wikipedia.org/wiki/Ice_Lake_(microprocessor) Ice Lake]
| ECX:11
|-
| AVX-512VBMI2
| Vector Byte Manipulation Instructions 2<br/>[[Byte]]/[[Word]] Load, Store and Concatenation with Shift
| ECX:06
|-
| AVX-512BITALG
| Bit Algorithms<br/>Byte/Word Bit Manipulation Instructions expanding VPOPCNTDQ
| ECX:12
|-
| AVX-512GFNI
| Galois Field New Instructions<br/>Vector Instructions for calculating [https://en.wikipedia.org/wiki/Finite_field Galois Field] GF(2^8)
| ECX:08
|}
=Selected Instructions=
==VPTERNLOG<span id="VPTERNLOG"></span>==
AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise [https://en.wikipedia.org/wiki/Ternary_operation ternary logic], for instance to [[General Setwise Operations|operate]] on vectors of [[Bitboards|bitboards]]. Three input vectors are bitwise [[Combinatorial Logic|combined]] by an operation determined by an immediate byte operand ('''imm8'''), whose 256 possible values correspond to the output column of the [https://en.wikipedia.org/wiki/Truth_table truth table] for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below <ref>[http://0x80.pl/articles/avx512-ternary-functions.html AVX512: ternary functions evaluation] by [[Wojciech Muła]], March 03, 2015</ref> <ref>[https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE</ref> :
{| class="wikitable"
|-
! imm8
! operation
|-
| 0x80
| a ∧ b ∧ c
|-
| 0x96
| a ⊕ b ⊕ c
|-
| 0xCA
| a ? b : c
|-
| 0xE8
| majority(a, b, c)
|-
| 0xFE
| a ∨ b ∨ c
|}
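The imm8 lookup can be emulated bit-parallel on a single quad word in plain C. The helper below (our name, not an Intel intrinsic) builds the result from the minterms selected by imm8: for every bit position, the three input bits a, b, c form the index (a<<2)|(b<<1)|c into imm8, and the indexed imm8 bit becomes the result bit:

```c
#include <stdint.h>

/* Bit-parallel emulation of one 64-bit lane of VPTERNLOGQ.
   For each set bit i of imm8, OR in the minterm where a,b,c
   match the bit pattern of i (a = bit 2, b = bit 1, c = bit 0). */
uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
   uint64_t r = 0;
   for (int i = 0; i < 8; i++) {
      if (imm8 >> i & 1)
         r |= (i & 4 ? a : ~a) & (i & 2 ? b : ~b) & (i & 1 ? c : ~c);
   }
   return r;
}
```

With imm8 = 0x96 this computes a ^ b ^ c, with 0xE8 the bitwise majority, and with 0xCA the bitwise select (a & b) | (~a & c), matching the table above.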
<span id="VPLZCNT"></span>
==VPLZCNT==
AVX-512CD has Vector [[BitScan#LeadingZeroCount|Leading Zero Count]] - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel <ref>[https://www.google.com/patents/US9372692 Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search]</ref> - using the following intrinsics <ref>[https://hjlebbink.github.io/x86doc/html/VPLZCNTD_Q.html VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values]</ref>, where the maskz version zeroes destination elements whose mask bit is clear, while the mask version copies such elements from s:
<pre>
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i s, __mmask8 k, __m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
</pre>
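Per lane, the instruction behaves like the following scalar sketch (our helper, not an intrinsic); note that a zero input yields 64, matching the instruction's definition:

```c
#include <stdint.h>

/* Leading zero count of one 64-bit quad word; lzcnt64(0) == 64. */
int lzcnt64(uint64_t bb) {
   if (bb == 0) return 64;
   int n = 0;
   while (!(bb & 0x8000000000000000ULL)) {
      bb <<= 1;
      n++;
   }
   return n;
}
```

For a non-empty bitboard, 63 - lzcnt64(bb) is the index of its most significant one bit, i.e. a [[BitScan|bitscan reverse]].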
<span id="VPOPCNT"></span>
==VPOPCNT==
The AVX-512VPOPCNTDQ extension has a vector [[Population Count|population count]] instruction counting the one bits of either 16 32-bit double words (VPOPCNTD) or 8 64-bit quad words aka bitboards (VPOPCNTQ) in parallel <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref> <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=VPOPCNTD&expand=4368 Intel® Intrinsics Guide VPOPCNTD]</ref>.
<pre>
__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m512i _mm512_popcnt_epi64(__m512i a);
</pre>
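Where VPOPCNTQ is not available, one lane can be computed with the classical [[SIMD and SWAR Techniques|SWAR]] population count, which the following sketch shows for a single bitboard:

```c
#include <stdint.h>

/* SWAR population count of one 64-bit bitboard:
   pairwise, nibble-wise and byte-wise partial sums,
   then a multiply to accumulate all byte counts into the top byte. */
int popcount64(uint64_t bb) {
   bb = bb - ((bb >> 1) & 0x5555555555555555ULL);
   bb = (bb & 0x3333333333333333ULL) + ((bb >> 2) & 0x3333333333333333ULL);
   bb = (bb + (bb >> 4)) & 0x0f0f0f0f0f0f0f0fULL;
   return (int)((bb * 0x0101010101010101ULL) >> 56);
}
```

VPOPCNTQ performs this per-lane count for eight bitboards in a single instruction.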
==VPDPBUSD==
The AVX-512VNNI extension features several instructions speeding up [[Neural Networks|neural network]] and [[Deep Learning|deep learning]] calculations on the CPU, for instance [[NNUE]] inference using uint8/int8. VPDPBUSD - Multiply and Add Unsigned and Signed Bytes <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2168,2201&text=VPDPBUSD&avx512techs=AVX512_VNNI Intel® Intrinsics Guide VPDPBUSD]</ref>, executes on both port 0 and port 5 in one cycle <ref>[https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018</ref>.
<pre>
__m512i _mm512_dpbusd_epi32(__m512i src, __m512i a, __m512i b)
{
   for (j = 0; j < 16; j++) {
      tmp1.word := Signed(ZeroExtend16(a.byte[4*j  ]) * SignExtend16(b.byte[4*j  ]))
      tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]))
      tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]))
      tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]))
      dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
   }
   return dst;
}
</pre>
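The pseudocode above reduces, per dword lane, to a dot product of four unsigned bytes with four signed bytes accumulated into a signed 32-bit sum. A scalar C reference of one lane (our helper for illustration, not an intrinsic) makes the widening explicit:

```c
#include <stdint.h>

/* One dword lane of VPDPBUSD: four u8 * s8 products, widened to
   32 bits and accumulated into src. VPDPBUSD itself does not
   saturate (the saturating variant is VPDPBUSDS). */
int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
   int32_t sum = src;
   for (int i = 0; i < 4; i++)
      sum += (int32_t)a[i] * (int32_t)b[i];
   return sum;
}
```

The full instruction performs this for all 16 dword lanes of a zmm register at once, which is why an NNUE layer with uint8 activations and int8 weights maps onto it so well.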
=See also=
* [[CFish#AVX2 Attacks|CFish - AVX2 Attacks]]
* [[SSE2]]
* [[XOP]]
 
=Publications=
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2018'''). ''[https://os.itec.kit.edu/21_3486.php Mechanism to Mitigate AVX-Induced Frequency Reduction]''. [https://arxiv.org/abs/1901.04982 arXiv:1901.04982]
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/97_3742.php Philipp Machauer], [https://os.itec.kit.edu/21_3571.php Yussuf Khalil], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2021'''). ''[https://www.usenix.org/conference/atc21/presentation/gottschlag Fair Scheduling for AVX2 and AVX-512 Workloads]''. [https://www.usenix.org/conference/atc21 USENIX ATC '21]
=Manuals=
