'''AVX-512''',<br/>
an expansion of [[Intel|Intel's]] [[AVX]] and [[AVX2]] instructions using the [https://en.wikipedia.org/wiki/EVEX_prefix EVEX prefix], featuring '''32''' 512-bit wide vector [[SIMD and SWAR Techniques|SIMD]] registers zmm0 through zmm31, each holding either eight [[Double|doubles]] or eight integer [[Quad Word|quad words]] such as [[Bitboards|bitboards]], and eight dedicated mask registers k0 through k7 (of which seven are freely usable, since k0 encodes "no masking") which specify which vector elements are operated on and written. If the Nth bit of a vector mask register is set, the Nth element of the destination vector is overwritten with the result of the operation; otherwise, depending on the instruction, the element is either zeroed or taken from another source register (and remains unchanged if that source is the destination). A vector mask register can be set by vector compare instructions, by moving the contents of a GP register, or by a special subset of vector mask arithmetic instructions.
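The two masking modes can be sketched in portable C, modeling one 8 x 64-bit vector as an array (a minimal sketch for illustration; the lane loop stands in for what the hardware does in parallel, and the function name is hypothetical):

```c
#include <stdint.h>

/* Portable model of AVX-512 merge- and zero-masking on an 8 x 64-bit
   vector: lane i receives the new result only if bit i of mask k is set;
   otherwise it is zeroed (zero-masking) or merged from src. */
static void apply_mask(uint64_t dst[8], const uint64_t src[8],
                       const uint64_t result[8], uint8_t k, int zero_mask) {
    for (int i = 0; i < 8; i++) {
        if (k & (1u << i))
            dst[i] = result[i];              /* masked-in lane: take result */
        else
            dst[i] = zero_mask ? 0 : src[i]; /* zeroed, or merged from src  */
    }
}
```

In the intrinsics below this corresponds to the maskz (zero-masking) and mask (merge-masking) variants respectively.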
=Extensions=
{| class="wikitable"
|-
! Extension !! Instructions !! First Microarchitecture !! Reg:Bit <ref>[https://www.heise.de/ct/zcontent/17/16-hocmsmeta/1501873687265857/ct.1617.016-017.qxp_table_29578.html AVX512 table] from [https://en.wikipedia.org/wiki/Heinz_Heise Heise]</ref>
|-
| AVX-512F
| Foundation
| rowspan="4" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Landing Knights Landing]
| EBX:16
|-
| AVX-512CD
| Conflict Detection Instructions
| EBX:28
|-
| AVX-512ER
| Exponential and Reciprocal Instructions
| EBX:27
|-
| AVX-512PF
| Prefetch Instructions
| EBX:26
|-
| AVX-512BW
| [[Byte]] and [[Word]] Instructions
| rowspan="3" | [https://en.wikipedia.org/wiki/Skylake_(microarchitecture) Skylake X]
| EBX:30
|-
| AVX-512DQ
| [[Double Word|Doubleword]] and [[Quad Word|Quadword]] Instructions
| EBX:17
|-
| AVX-512VL
| Vector Length Extensions
| EBX:31
|-
| AVX-512IFMA
| Integer Fused Multiply Add
| rowspan="2" | [https://en.wikipedia.org/wiki/Cannonlake Cannonlake]
| EBX:21
|-
| AVX-512VBMI
| Vector Byte Manipulation Instructions
| ECX:01
|-
| AVX-512VPOPCNTDQ
| Vector [[Population Count]]
| rowspan="3" | [https://en.wikipedia.org/wiki/Xeon_Phi#Knights_Mill Knights Mill]
| ECX:14
|-
| AVX-512 4VNNIW
| Vector Neural Network Instructions<br/>Word variable precision
| EDX:02
|-
| AVX-512 4FMAPS
| Fused Multiply Accumulation<br/>Packed Single precision
| EDX:03
|-
| AVX-512 VNNI
| Vector Neural Network Instructions <br/>Vector Instructions for [[Deep Learning]]
| rowspan="4" | [https://en.wikipedia.org/wiki/Ice_Lake_(microprocessor) Ice Lake]
| ECX:11
|-
| AVX-512 VBMI2
| Vector Byte Manipulation Instructions 2<br/>[[Byte]]/[[Word]] Load, Store and Concatenation with Shift
| ECX:06
|-
| AVX-512 BITALG
| Bit Algorithms<br/>Byte/Word Bit Manipulation Instructions expanding VPOPCNTDQ
| ECX:12
|-
| AVX-512 GFNI
| Galois Field New Instructions<br/>Vector Instructions for calculating in the [https://en.wikipedia.org/wiki/Finite_field Galois Field] GF(2^8)
| ECX:08
|}
=Selected Instructions=
==VPTERNLOG<span id="VPTERNLOG"></span>==
AVX-512F features the instruction VPTERNLOGQ (or VPTERNLOGD) to perform bitwise [https://en.wikipedia.org/wiki/Ternary_operation ternary logic], for instance to [[General Setwise Operations|operate]] on vectors of [[Bitboards|bitboards]]. Three input vectors are bitwise [[Combinatorial Logic|combined]] by an operation determined by an immediate byte operand ('''imm8'''), whose 256 possible values correspond to the boolean output column of the [https://en.wikipedia.org/wiki/Truth_table truth table] for all eight combinations of the three input bits, as demonstrated with some selected imm8 values in the table below <ref>[http://0x80.pl/articles/avx512-ternary-functions.html AVX512: ternary functions evaluation] by [[Wojciech Muła]], March 03, 2015</ref> <ref>[https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf) 5.3 TERNARY BIT VECTOR LOGIC TABLE</ref> :
{| class="wikitable"
|-
! imm8 !! Operation
|-
| 0x00 || always 0
|-
| 0x80 || a AND b AND c
|-
| 0x96 || a XOR b XOR c
|-
| 0xCA || a ? b : c
|-
| 0xE8 || majority(a, b, c)
|-
| 0xFE || a OR b OR c
|-
| 0xFF || always 1
|}
The following VPTERNLOGQ intrinsics are declared, where the maskz version sets unmasked destination quad word elements to zero, while the mask version leaves unmasked elements of src unchanged:
<pre>
__m256i _mm256_ternarylogic_epi64(__m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_maskz_ternarylogic_epi64(__mmask8 k, __m256i a, __m256i b, __m256i c, int imm8);
__m256i _mm256_mask_ternarylogic_epi64(__m256i src, __mmask8 k, __m256i a, __m256i b, int imm8);
__m512i _mm512_ternarylogic_epi64(__m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_maskz_ternarylogic_epi64(__mmask8 k, __m512i a, __m512i b, __m512i c, int imm8);
__m512i _mm512_mask_ternarylogic_epi64(__m512i src, __mmask8 k, __m512i a, __m512i b, int imm8);
</pre>
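Since imm8 is just the output column of the truth table, the semantics are easy to model in portable C. The sketch below captures the per-bit rule for a single 64-bit lane (the function name is made up for illustration; it is not the intrinsic itself):

```c
#include <stdint.h>

/* Portable model of one VPTERNLOGQ lane: for every bit position, the
   three input bits form an index (a<<2)|(b<<1)|c, and that bit of imm8
   is the result. */
static uint64_t ternlog64(uint64_t a, uint64_t b, uint64_t c, uint8_t imm8) {
    uint64_t r = 0;
    for (int i = 0; i < 8; i++) {
        if ((imm8 >> i) & 1) {
            /* select all bit positions whose (a,b,c) pattern equals i */
            uint64_t m = ((i & 4) ? a : ~a)
                       & ((i & 2) ? b : ~b)
                       & ((i & 1) ? c : ~c);
            r |= m;
        }
    }
    return r;
}
```

For instance, ternlog64(a, b, c, 0x96) yields the same result as a ^ b ^ c, replacing two dependent XOR instructions with one operation.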
<span id="VPLZCNT"></span>
==VPLZCNT==
AVX-512CD has Vector [[BitScan#LeadingZeroCount|Leading Zero Count]] - VPLZCNTQ counts leading zeroes on a vector of eight bitboards in parallel <ref>[https://www.google.com/patents/US9372692 Patent US9372692 - Methods, apparatus, instructions, and logic to provide permute controls with leading zero count functionality - Google Patent Search]</ref> - using the following intrinsics <ref>[https://hjlebbink.github.io/x86doc/html/VPLZCNTD_Q.html VPLZCNTD/Q—Count the Number of Leading Zero Bits for Packed Dword, Packed Qword Values]</ref>, where the maskz version sets unmasked destination elements to zero, while the mask version copies unmasked elements from src:
<pre>
__m256i _mm256_lzcnt_epi64(__m256i a);
__m256i _mm256_maskz_lzcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_mask_lzcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m512i _mm512_lzcnt_epi64(__m512i a);
__m512i _mm512_maskz_lzcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_mask_lzcnt_epi64(__m512i src, __mmask8 k, __m512i a);
</pre>
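The per-lane behavior, including the all-zero case which yields 64, can be sketched in portable C (an illustrative model, not the instruction):

```c
#include <stdint.h>

/* Portable model of one VPLZCNTQ lane: number of leading zero bits
   of a 64-bit quad word; an all-zero word yields 64. */
static int lzcnt64(uint64_t x) {
    if (x == 0) return 64;
    int n = 0;
    while (!(x >> 63)) {  /* shift left until the top bit is set */
        x <<= 1;
        n++;
    }
    return n;
}
```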
<span id="VPOPCNT"></span>
==VPOPCNT==
The AVX-512VPOPCNTDQ extension has a vector [[Population Count|population count]] instruction to count the one bits of either 16 32-bit double words (VPOPCNTD) or eight 64-bit quad words aka bitboards (VPOPCNTQ) in parallel <ref>[https://github.com/WojciechMula/sse-popcount/blob/master/popcnt-avx512-harley-seal.cpp sse-popcount/popcnt-avx512-harley-seal.cpp at master · WojciechMula/sse-popcount · GitHub]</ref> <ref>[[Wojciech Muła]], [http://dblp.uni-trier.de/pers/hd/k/Kurz:Nathan Nathan Kurz], [https://github.com/lemire Daniel Lemire] ('''2016'''). ''Faster Population Counts Using AVX2 Instructions''. [https://arxiv.org/abs/1611.07612 arXiv:1611.07612]</ref><ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=VPOPCNTD&expand=4368 Intel® Intrinsics Guide VPOPCNTD]</ref>.
<pre>
__m128i _mm_mask_popcnt_epi32(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi32(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi32(__m128i a);
__m256i _mm256_mask_popcnt_epi32(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi32(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi32(__m256i a);
__m512i _mm512_mask_popcnt_epi32(__m512i src, __mmask16 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi32(__mmask16 k, __m512i a);
__m512i _mm512_popcnt_epi32(__m512i a);
__m128i _mm_mask_popcnt_epi64(__m128i src, __mmask8 k, __m128i a);
__m128i _mm_maskz_popcnt_epi64(__mmask8 k, __m128i a);
__m128i _mm_popcnt_epi64(__m128i a);
__m256i _mm256_mask_popcnt_epi64(__m256i src, __mmask8 k, __m256i a);
__m256i _mm256_maskz_popcnt_epi64(__mmask8 k, __m256i a);
__m256i _mm256_popcnt_epi64(__m256i a);
__m512i _mm512_mask_popcnt_epi64(__m512i src, __mmask8 k, __m512i a);
__m512i _mm512_maskz_popcnt_epi64(__mmask8 k, __m512i a);
__m512i _mm512_popcnt_epi64(__m512i a);
</pre>
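A portable per-lane model of VPOPCNTQ, sketched with Kernighan's bit-clearing loop (one iteration per set bit), shows what each lane computes:

```c
#include <stdint.h>

/* Portable model of one VPOPCNTQ lane: count the set bits of a
   64-bit quad word by repeatedly clearing the lowest one bit. */
static int popcnt64(uint64_t x) {
    int n = 0;
    while (x) {
        x &= x - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}
```

On a bitboard, this is the number of occupied squares; VPOPCNTQ performs it for eight bitboards at once.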
==VPDPBUSD==
The AVX-512 VNNI extension features several instructions speeding up [[Neural Networks|neural network]] and [[Deep Learning|deep learning]] calculations on the CPU, for instance [[NNUE]] inference using uint8/int8. VPDPBUSD - Multiply and Add Unsigned and Signed Bytes <ref>[https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2168,2201&text=VPDPBUSD&avx512techs=AVX512_VNNI Intel® Intrinsics Guide VPDPBUSD]</ref>, executes on both port 0 and port 5 in one cycle <ref>[https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018</ref>.
<pre>
__m512i _mm512_dpbusd_epi32 (__m512i src, __m512i a, __m512i b)
{
for (j=0; j < 16; j++) {
tmp1.word := Signed(ZeroExtend16(a.byte[4*j  ]) * SignExtend16(b.byte[4*j  ]));
tmp2.word := Signed(ZeroExtend16(a.byte[4*j+1]) * SignExtend16(b.byte[4*j+1]));
tmp3.word := Signed(ZeroExtend16(a.byte[4*j+2]) * SignExtend16(b.byte[4*j+2]));
tmp4.word := Signed(ZeroExtend16(a.byte[4*j+3]) * SignExtend16(b.byte[4*j+3]));
dst.dword[j] := src.dword[j] + tmp1 + tmp2 + tmp3 + tmp4
}
return dst;
}
</pre>
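One dword lane of the pseudocode above can be modeled in portable C (a sketch with a made-up function name; the real instruction processes 16 such lanes in parallel):

```c
#include <stdint.h>

/* Portable model of one VPDPBUSD dword lane: four unsigned bytes of a
   are multiplied with the corresponding four signed bytes of b, and the
   four products are accumulated into the 32-bit value from src. */
static int32_t dpbusd_lane(int32_t src, const uint8_t a[4], const int8_t b[4]) {
    int32_t acc = src;
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

This fused multiply-accumulate over mixed uint8/int8 operands is exactly the inner dot product of an NNUE layer, which is why VNNI helps inference throughput.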
=See also=
* [[CFish#AVX2 Attacks|CFish - AVX2 Attacks]]
* [[DirGolem]]
* [[NNUE]]
* [[Stockfish NNUE]]
* [[SIMD and SWAR Techniques]]
 
=SIMD=
* [[AltiVec]]
* [[AVX]]
* [[AVX2]]
* [[NNUE]]
* [[SIMD and SWAR Techniques]]
* [[SSE2]]
* [[Stockfish NNUE]]
* [[XOP]]
 
=Publications=
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2018'''). ''[https://os.itec.kit.edu/21_3486.php Mechanism to Mitigate AVX-Induced Frequency Reduction]''. [https://arxiv.org/abs/1901.04982 arXiv:1901.04982]
* [https://os.itec.kit.edu/21_3247.php Mathias Gottschlag], [https://os.itec.kit.edu/97_3742.php Philipp Machauer], [https://os.itec.kit.edu/21_3571.php Yussuf Khalil], [https://os.itec.kit.edu/21_31.php Frank Bellosa] ('''2021'''). ''[https://www.usenix.org/conference/atc21/presentation/gottschlag Fair Scheduling for AVX2 and AVX-512 Workloads]''. [https://www.usenix.org/conference/atc21 USENIX ATC '21]
=Manuals=
* [https://software.intel.com/sites/default/files/managed/c5/15/architecture-instruction-set-extensions-programming-reference.pdf Intel® Architecture Instruction Set Extensions Programming Reference] (pdf)
* [https://software.intel.com/content/www/us/en/develop/download/intel-64-and-ia-32-architectures-optimization-reference-manual.html Intel® 64 and IA-32 Architectures Optimization Reference Manual]
 
=Forum Posts=
* [http://www.talkchess.com/forum3/viewtopic.php?f=2&t=75049 AVX-512 and NNUE] by [[Gian-Carlo Pascutto]], [[CCC]], September 08, 2020 » [[NNUE]]
* [http://www.talkchess.com/forum3/viewtopic.php?f=7&t=77246 VPOPCNTDQ and VBMI2] by [[Vivien Clauzon]], [[CCC]], May 04, 2021
=External Links=
* [https://en.wikichip.org/wiki/x86/avx512_vnni AVX-512 Vector Neural Network Instructions (VNNI) - x86 - WikiChip]
* [https://www.intel.com/content/www/us/en/architecture-and-technology/avx-512-overview.html Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Overview]
* [https://software.intel.com/en-us/intel-isa-extensions Intel Instruction Set Architecture Extensions]
==Blog Posts==
* [https://software.intel.com/en-us/blogs/2013/avx-512-instructions AVX-512 instructions | Intel® Developer Zone] by [https://software.intel.com/en-us/user/335550 James Reinders], July 23, 2013
* [http://www.agner.org/optimize/blog/read.php?i=288 Future instruction set: AVX-512] by [http://www.agner.org/ Agner Fog], October 09, 2013
* [https://software.intel.com/en-us/blogs/2014/07/24/processing-arrays-of-bits-with-intel-advanced-vector-extensions-512-intel-avx-512 Processing Arrays of Bits with Intel® Advanced Vector Extensions 512 (Intel® AVX-512) | Intel® Developer Zone] by [https://software.intel.com/en-us/user/123920 Thomas Willhalm], July 24, 2014
* [https://www.hpcwire.com/2017/06/29/reinders-avx-512-may-hidden-gem-intel-xeon-scalable-processors/ AVX-512 May Be a “Hidden Gem” in Intel Xeon Scalable Processors] by [https://software.intel.com/en-us/user/335550 James Reinders], [https://www.hpcwire.com/ HPCwire], June 29, 2017
* [https://software.intel.com/content/www/us/en/develop/articles/lower-numerical-precision-deep-learning-inference-and-training.html Lower Numerical Precision Deep Learning Inference and Training] by [https://community.intel.com/t5/user/viewprofilepage/user-id/134067 Andres Rodriguez] et al., January 19, 2018
==Compiler Support==
* [https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX_512 Intel Intrinsics Guide - AVX-512]
=References=
<references />
 
'''[[x86-64|Up one Level]]'''
